Colombia Property Rent 2015 Descriptive Data Analysis
Dataset Source:
BigQuery Default Datasets: “properati_properties_co” Database
Data retrieval of all the 2015 tables were united with the following Query:
WITH Colombia_properties_rent_2015 AS (
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201501`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201502`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201503`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201504`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201505`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201506`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201507`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201508`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201509`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201510`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201511`
UNION ALL
SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201512`
)
SELECT created_on AS Creation_Date,
property_type AS Type,
place_name AS City,
state_name AS State,
country_name AS Country,
lat AS Latitude,
lon AS Longitude,
price_aprox_local_currency AS Price_COP,
price_aprox_usd AS Price_USD,
surface_covered_in_m2 AS Surface,
price_usd_per_m2 AS Price_m2_USD,
floor AS Floor,
rooms AS Rooms
FROM Colombia_properties_rent_2015
The subset was saved on the local machine as "Colombia_properties_rent_2015.CSV"
# Importing relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Assigning the data set to a variable
df = pd.read_csv('Colombia_properties_rent_2015.CSV')
df
Creation_Date | Type | City | State | Country | Latitude | Longitude | Price_COP | Price_USD | Surface | Price_m2_USD | Floor | Rooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-08-31 | house | Paipa | Boyacá | Colombia | 5.784000 | -73.116000 | 3821455.54 | 1307.71 | 130.0 | 10.059308 | 2.0 | NaN |
1 | 2015-10-27 | house | Cartagena | Bolívar | Colombia | 10.424000 | -75.549000 | 8547990.36 | 2925.14 | 300.0 | 9.750467 | 2.0 | NaN |
2 | 2015-07-04 | house | Armenia | Quindío | Colombia | 4.461000 | -75.667000 | 1508465.45 | 516.20 | 41.0 | NaN | NaN | NaN |
3 | 2015-08-05 | house | Retiro | Antioquia | Colombia | NaN | NaN | 25141168.76 | 8603.36 | 700.0 | NaN | 1.0 | NaN |
4 | 2015-06-01 | apartment | Antioquia | Antioquia | Colombia | 6.057394 | -75.502792 | 653678.10 | 223.69 | NaN | NaN | NaN | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5094 | 2015-05-07 | house | Ríomar | Atlántico | Colombia | 11.008000 | -74.822000 | 4499743.79 | 1548.60 | 45.0 | 34.413333 | 1.0 | NaN |
5095 | 2015-05-07 | house | Ríomar | Atlántico | Colombia | 11.012000 | -74.825000 | 5999658.38 | 2064.80 | NaN | 4.040705 | NaN | NaN |
5096 | 2015-05-08 | house | Ríomar | Atlántico | Colombia | 11.015000 | -74.823000 | 3999762.57 | 1376.53 | 180.0 | NaN | NaN | 3.0 |
5097 | 2015-05-08 | house | Puerto Colombia | Atlántico | Colombia | 11.023000 | -74.885000 | 1299916.29 | 447.37 | 85.0 | 5.263176 | NaN | 3.0 |
5098 | 2015-05-08 | house | San Antonio del Tequendama | Cundinamarca | Colombia | 4.616000 | -74.352000 | 4499743.79 | 1548.60 | 285.0 | 0.120984 | NaN | 4.0 |
5099 rows × 13 columns
# Checking how many empty values are in the data set
print(df.isnull().sum()) # Count missing values per column
df.shape
Creation_Date 0 Type 0 City 0 State 0 Country 0 Latitude 110 Longitude 110 Price_COP 21 Price_USD 21 Surface 274 Price_m2_USD 1588 Floor 2385 Rooms 2795 dtype: int64
(5099, 13)
Data Wrangling
# Removing the rows with missing values for Latitude and Longitude
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)
print(df.isnull().sum())
Creation_Date 0 Type 0 City 0 State 0 Country 0 Latitude 0 Longitude 0 Price_COP 21 Price_USD 21 Surface 263 Price_m2_USD 1533 Floor 2304 Rooms 2744 dtype: int64
# Replace missing values with the mean of the column for Price_COP, Price_USD, Surface, Price_m2_USD
df[['Price_COP', 'Price_USD', 'Surface', 'Price_m2_USD']] = df[['Price_COP', 'Price_USD', 'Surface', 'Price_m2_USD']].apply(lambda x: x.fillna(x.mean()))
print(df.isnull().sum())
Creation_Date 0 Type 0 City 0 State 0 Country 0 Latitude 0 Longitude 0 Price_COP 0 Price_USD 0 Surface 0 Price_m2_USD 0 Floor 2304 Rooms 2744 dtype: int64
This code will fill the missing values in the Rooms
column based on the Surface
and Property_Type
columns. It groups the data by property type, calculates the surface ranges for each number of rooms, and then assigns the number of rooms to the missing values based on the surface range.
import pandas as pd
import numpy as np
# Assuming df is your DataFrame
# Function to fill missing rooms based on surface range
def fill_rooms(row, surface_ranges):
if pd.isnull(row['Rooms']):
for prop_type, ranges in surface_ranges.items():
if row['Type'] == prop_type:
for min_surf, max_surf, rooms in ranges:
if min_surf <= row['Surface'] <= max_surf:
return rooms
return row['Rooms']
# Group by property type and calculate surface ranges
surface_ranges = {}
for prop_type, group in df.groupby('Type'):
ranges = []
for rooms, sub_group in group.groupby('Rooms'):
if not pd.isnull(rooms):
min_surf = sub_group['Surface'].min()
max_surf = sub_group['Surface'].max()
ranges.append((min_surf, max_surf, rooms))
surface_ranges[prop_type] = ranges
# Apply the function to fill missing rooms
df['Rooms'] = df.apply(lambda row: fill_rooms(row, surface_ranges), axis=1)
# Check the result
print(df.isnull().sum())
Creation_Date 0 Type 0 City 0 State 0 Country 0 Latitude 0 Longitude 0 Price_COP 0 Price_USD 0 Surface 0 Price_m2_USD 0 Floor 2304 Rooms 96 dtype: int64
As 96 values could not be established by the code we will drop them together with 'Floor' column
df = df.drop(columns=['Floor'])
df.dropna(inplace=True)
print(df.isnull().sum())
df.shape
Creation_Date 0 Type 0 City 0 State 0 Country 0 Latitude 0 Longitude 0 Price_COP 0 Price_USD 0 Surface 0 Price_m2_USD 0 Rooms 0 dtype: int64
(4893, 12)
# Turning Rooms into an integer
df['Rooms'] = df['Rooms'].astype(int)
df
Creation_Date | Type | City | State | Country | Latitude | Longitude | Price_COP | Price_USD | Surface | Price_m2_USD | Rooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-08-31 | house | Paipa | Boyacá | Colombia | 5.784000 | -73.116000 | 3821455.54 | 1307.71 | 130.000000 | 10.059308 | 1 |
1 | 2015-10-27 | house | Cartagena | Bolívar | Colombia | 10.424000 | -75.549000 | 8547990.36 | 2925.14 | 300.000000 | 9.750467 | 1 |
2 | 2015-07-04 | house | Armenia | Quindío | Colombia | 4.461000 | -75.667000 | 1508465.45 | 516.20 | 41.000000 | 10.960724 | 1 |
4 | 2015-06-01 | apartment | Antioquia | Antioquia | Colombia | 6.057394 | -75.502792 | 653678.10 | 223.69 | 1371.975032 | 10.960724 | 1 |
5 | 2015-06-01 | apartment | Antioquia | Antioquia | Colombia | 6.057394 | -75.502792 | 653678.10 | 223.69 | 1371.975032 | 10.960724 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5094 | 2015-05-07 | house | Ríomar | Atlántico | Colombia | 11.008000 | -74.822000 | 4499743.79 | 1548.60 | 45.000000 | 34.413333 | 1 |
5095 | 2015-05-07 | house | Ríomar | Atlántico | Colombia | 11.012000 | -74.825000 | 5999658.38 | 2064.80 | 1371.975032 | 4.040705 | 1 |
5096 | 2015-05-08 | house | Ríomar | Atlántico | Colombia | 11.015000 | -74.823000 | 3999762.57 | 1376.53 | 180.000000 | 10.960724 | 3 |
5097 | 2015-05-08 | house | Puerto Colombia | Atlántico | Colombia | 11.023000 | -74.885000 | 1299916.29 | 447.37 | 85.000000 | 5.263176 | 3 |
5098 | 2015-05-08 | house | San Antonio del Tequendama | Cundinamarca | Colombia | 4.616000 | -74.352000 | 4499743.79 | 1548.60 | 285.000000 | 0.120984 | 4 |
4893 rows × 12 columns
#Check for duplicates
duplicates = df.duplicated()
# Print the number of duplicate rows
print(f"Number of duplicate rows: {duplicates.sum()}")
df.drop_duplicates(inplace=True)
df.shape
Number of duplicate rows: 3509
(1384, 12)
# Check Data Types
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 1384 entries, 0 to 4432 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Creation_Date 1384 non-null object 1 Type 1384 non-null object 2 City 1384 non-null object 3 State 1384 non-null object 4 Country 1384 non-null object 5 Latitude 1384 non-null float64 6 Longitude 1384 non-null float64 7 Price_COP 1384 non-null float64 8 Price_USD 1384 non-null float64 9 Surface 1384 non-null float64 10 Price_m2_USD 1384 non-null float64 11 Rooms 1384 non-null int32 dtypes: float64(6), int32(1), object(5) memory usage: 135.2+ KB
This code will round the values in the Surface
and Price_m2_USD
columns to 2 decimal places.
df['Surface'] = df['Surface'].round(2)
df['Price_m2_USD'] = df['Price_m2_USD'].round(2)
df
Creation_Date | Type | City | State | Country | Latitude | Longitude | Price_COP | Price_USD | Surface | Price_m2_USD | Rooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-08-31 | house | Paipa | Boyacá | Colombia | 5.784000 | -73.116000 | 3821455.54 | 1307.71 | 130.00 | 10.06 | 1 |
1 | 2015-10-27 | house | Cartagena | Bolívar | Colombia | 10.424000 | -75.549000 | 8547990.36 | 2925.14 | 300.00 | 9.75 | 1 |
2 | 2015-07-04 | house | Armenia | Quindío | Colombia | 4.461000 | -75.667000 | 1508465.45 | 516.20 | 41.00 | 10.96 | 1 |
4 | 2015-06-01 | apartment | Antioquia | Antioquia | Colombia | 6.057394 | -75.502792 | 653678.10 | 223.69 | 1371.98 | 10.96 | 1 |
8 | 2015-07-01 | apartment | Antioquia | Antioquia | Colombia | 6.057394 | -75.502792 | 3419353.94 | 1170.11 | 1371.98 | 10.96 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4428 | 2015-01-29 | apartment | Bucaramanga | Santander | Colombia | 8.788330 | -76.035560 | 819112.60 | 281.90 | 40.00 | 10.96 | 1 |
4429 | 2015-01-29 | apartment | Bucaramanga | Santander | Colombia | 8.788330 | -76.035560 | 819112.60 | 281.90 | 37.00 | 10.96 | 1 |
4430 | 2015-01-29 | apartment | Bucaramanga | Santander | Colombia | 8.788330 | -76.035560 | 1222566.96 | 420.75 | 54.00 | 10.96 | 1 |
4431 | 2015-01-29 | apartment | Bucaramanga | Santander | Colombia | 8.788330 | -76.035560 | 1650458.13 | 568.01 | 27.00 | 10.96 | 1 |
4432 | 2015-01-29 | house | Bucaramanga | Santander | Colombia | 8.788330 | -76.035560 | 3056402.88 | 1051.87 | 180.00 | 10.96 | 3 |
1384 rows × 12 columns
This code will save the DataFrame to a CSV file without including the index.
df.to_csv('Clean_Colombia_properties_rent_2015.csv', index=False)
Exploratory Data Analysis
# Basic information
print(df.info())
print(df.describe())
<class 'pandas.core.frame.DataFrame'> Index: 1384 entries, 0 to 4432 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Creation_Date 1384 non-null object 1 Type 1384 non-null object 2 City 1384 non-null object 3 State 1384 non-null object 4 Country 1384 non-null object 5 Latitude 1384 non-null float64 6 Longitude 1384 non-null float64 7 Price_COP 1384 non-null float64 8 Price_USD 1384 non-null float64 9 Surface 1384 non-null float64 10 Price_m2_USD 1384 non-null float64 11 Rooms 1384 non-null int32 dtypes: float64(6), int32(1), object(5) memory usage: 135.2+ KB None Latitude Longitude Price_COP Price_USD Surface \ count 1384.000000 1384.000000 1.384000e+03 1384.000000 1384.000000 mean 7.304399 -74.165171 1.138219e+07 3906.051588 518.659379 std 3.259636 6.376757 2.069535e+07 7102.473683 913.167668 min -0.001614 -77.022000 0.000000e+00 0.000000 1.000000 25% 4.675000 -74.822000 2.399951e+06 825.920000 77.000000 50% 4.787000 -74.782000 4.683899e+06 1605.460000 186.000000 75% 10.998000 -74.054981 1.199935e+07 4129.610000 561.250000 max 11.245000 0.000907 2.413553e+08 82592.290000 8000.000000 Price_m2_USD Rooms count 1384.000000 1384.000000 mean 11.015166 1.999277 std 6.985004 1.758976 min 0.090000 1.000000 25% 6.880000 1.000000 50% 10.960000 1.000000 75% 12.120000 3.000000 max 66.250000 17.000000
Key Statistics in the Summary
- count
The number of non-missing values in the column.
Example: Latitude and Longitude have 1,384 entries, meaning no missing data in these columns.
- mean (Average)
The average value of the column.
Example: The average Price_USD is 3906.05 USD, and the average Surface is 518.66 m².
- std (Standard Deviation)
Measures how spread out the values are from the mean.
Example:
Price_USD has a standard deviation of 7102.47, indicating a wide range of prices.
Surface has a standard deviation of 913.16, showing variability in property sizes.
- min (Minimum Value)
The smallest value in the column.
Example: The minimum Price_USD is 0 (possibly indicating missing or free properties).
- 25% (1st Quartile)
The value below which 25% of the data falls.
Example: For Surface, 25% of properties are 130 m² or smaller.
- 50% (Median)
The middle value, where half the data falls below and half above.
Example: The median Surface is 300 m², indicating most properties are medium-sized.
- 75% (3rd Quartile)
The value below which 75% of the data falls.
Example: For Price_USD, 75% of properties cost 3000 USD or less.
- max
The largest value in the column.
Example: The largest Surface is 17,000 m², indicating an exceptionally large property.
Insights
Price Variability: The high standard deviation and large range (from 0 to millions in both Price_COP and Price_USD) suggest a wide variety of property prices.
Surface Insights: The average property size is around 518 m², but the large standard deviation shows some very small and very large properties.
Distribution of Numerical Features:
Plot histograms for numerical columns.
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])
# Plot histograms for the selected numerical columns
df[numerical_columns].hist(bins=30, figsize=(15, 10))
plt.show()
# # Define numerical columns excluding 'Latitude' and 'Longitude'
# numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])
# # Plot histograms for the selected numerical columns using Plotly
# for column in numerical_columns:
# fig = px.histogram(df, x=column, nbins=30, title=f'Histogram of {column}')
# fig.show()
- Price_COP (Colombian Pesos)
Skewed Right (Positively Skewed): Most property prices are concentrated at the lower end, with a few extreme values (outliers) pushing the distribution to the right. Insight: There are significantly expensive properties that drive the maximum price up, but most listings are on the lower end.
- Price_USD (US Dollars)
Similar to Price_COP, this is also right-skewed, meaning most properties are at the lower price range with a few very expensive ones. Insight: The real estate market has a few luxury properties but mostly consists of budget or mid-range properties
- Price_m2_USD (Price per Square Meter in USD)
Right-skewed but slightly more concentrated: There is a clear peak around 10 USD per square meter, indicating that most properties fall within this range. Insight: The majority of properties have similar pricing per square meter, with a few high-end outliers.
- Rooms (Number of Rooms per Property)
Right-skewed distribution: Most properties have 1 or 2 rooms, with a sharp drop-off as the room count increases. Insight: The majority of rental properties are small units, likely apartments or single-bedroom houses. Larger properties (6+ rooms) are rare.
- Surface (Property Size in Square Meters)
Highly Right-Skewed: Most properties are small (below 500m²), but there are some very large properties (up to 8000m²). Insight: The dataset includes a mix of small apartments and large estates, but the majority of properties are compact.
Overall Observations
Most distributions are right-skewed, meaning:
Most properties have low to mid-range values. A few high-priced or large properties significantly increase the maximum values.
Outliers exist in every distribution, especially for price and surface area.
Categorical Features:
Analyze the distribution of categorical columns.
# Exclude 'Creation_Date' and 'Country' columns
categorical_columns = ['Type', 'City', 'State']
# Get the top 20 cities
top_20_cities = df['City'].value_counts().nlargest(20).index
# Filter the DataFrame to include only the top 20 cities
df_top_cities = df[df['City'].isin(top_20_cities)]
# Order states by their count in descending order
state_order = df['State'].value_counts().index
# Plot the distribution of categorical columns
fig, axes = plt.subplots(3, 1, figsize=(15, 20))
for i, col in enumerate(categorical_columns):
if col == 'City':
sns.countplot(data=df_top_cities, x=col, ax=axes[i], order=top_20_cities)
axes[i].set_title('Distribution of Top 20 Cities')
elif col == 'State':
sns.countplot(data=df, x=col, ax=axes[i], order=state_order)
axes[i].set_title('Distribution of State (Descending)')
else:
sns.countplot(data=df, x=col, ax=axes[i])
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Count')
axes[i].tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
- Distribution of Property Type
The bar chart shows the count of different property types in the dataset. Insights: The dataset is heavily dominated by houses, with a significantly smaller number of apartments and a very few properties classified as stores. This suggests that the dataset mainly focuses on residential rentals, with limited commercial properties.
- Distribution of Top 20 Cities
This bar chart displays the count of properties listed in the top 20 cities.
Insights:
Barranquilla has the highest number of listings, standing out significantly. Other major cities like Usaquén, Chía, and Cali also have a notable number of properties. The distribution shows a long tail, meaning most cities have a relatively low number of listings compared to the top few.
- Distribution of State (Descending Order)
This bar chart displays the count of properties by state, sorted in descending order.
Insights:
Atlántico has the most listings, followed by Bogotá D.C., Cundinamarca, and Valle del Cauca. The distribution follows a similar trend as cities, where a few states dominate the dataset. Possible Next Steps:
Overall Insights
The majority of properties are houses, meaning the dataset is more focused on residential properties than apartments or commercial spaces.
Barranquilla and Atlántico have the most listings, suggesting these areas are hotspots for rental properties.
The distribution is uneven, meaning a few cities and states dominate the dataset while many have significantly fewer listings.
Pairplot:
Visualize relationships between numerical features.
sns.pairplot(df)
plt.show()
- Understanding the Pair Plot
Each variable in the dataset is plotted against every other variable. The diagonal contains histograms showing the distribution of each variable. The scatterplots in the off-diagonal cells show the relationship between pairs of variables.
- Key Observations from the Scatterplots
Price_COP vs. Price_USD:
This plot shows a perfect linear relationship, as expected, because one is a converted version of the other (Colombian Pesos to USD). Price_USD vs. Surface:
There is a positive trend, meaning larger properties generally have higher prices. However, some properties with small surfaces have high prices, indicating outliers or luxury properties. Price_m2_USD vs. Surface:
This relationship appears non-linear, with higher price per square meter in certain cases. Some properties with very high price per square meter may be located in premium locations. Rooms vs. Surface:
The number of rooms is positively correlated with surface, as expected (larger properties tend to have more rooms). However, a few large properties have fewer rooms, which could indicate open-layout designs. Latitude & Longitude:
These show clusters of properties in certain areas. There might be geographic hotspots where most properties are concentrated.
- Insights from the Histograms (Diagonal)
Most variables, especially Price_COP, Price_USD, and Surface, have right-skewed distributions. This means that most properties are in the lower price range, with a few very expensive properties pushing the maximum values.
Boxplots:
Identify outliers in numerical features.
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])
# Plot vertical boxplots for the selected numerical columns
fig, axes = plt.subplots(len(numerical_columns), 1, figsize=(15, 5 * len(numerical_columns)))
for i, col in enumerate(numerical_columns):
sns.boxplot(data=df, x=col, ax=axes[i])
axes[i].set_title(f'Boxplot of {col}')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Value')
plt.tight_layout()
plt.show()
- Understanding Boxplots
Each boxplot consists of:
Box (Interquartile Range, IQR):
The bottom of the box represents the 25th percentile (Q1).
The top of the box represents the 75th percentile (Q3).
The middle line inside the box represents the median (Q2).
Whiskers:
Extend to the smallest and largest data points within 1.5 times the IQR.
Outliers (Dots):
Any values beyond 1.5 times the IQR are plotted as individual points, considered outliers.
- Interpretation of Each Boxplot
1️⃣ Boxplot of Price_COP (Colombian Pesos)
Highly Right-Skewed: Most values are concentrated near the lower end, while a few extremely high prices stretch the scale.
Many Outliers: Expensive properties significantly increase the price range.
Insight: The rental market has a few luxury properties that are much more expensive than the majority.
2️⃣ Boxplot of Price_USD
Similar to Price_COP, but converted to USD.
Right-Skewed Distribution: Most properties are on the lower end, with a few ultra-expensive listings.
Outliers: Some properties have prices exceeding 80,000 USD, which might indicate luxury homes.
3️⃣ Boxplot of Price_m2_USD (Price per Square Meter in USD)
Right-Skewed, but more concentrated: Most properties have a Price_m2_USD below 15 USD/m².
Outliers: Properties with very high price per square meter (above 50-60 USD/m²) could be premium locations.
Insight: While most rentals follow a reasonable pricing trend, some luxury or prime-location properties are extremely expensive per square meter.
4️⃣ Boxplot of Rooms
Majority have 1-3 rooms.
Outliers: Some properties have more than 10 rooms, which are likely luxury homes, hotels, or large estates.
Insight: The dataset mainly consists of small to mid-sized rentals, with a few large properties standing out.
5️⃣ Boxplot of Surface (Property Size in m²)
Extreme Right-Skew: Most properties are small (under 500m²), but a few exceed 5000m².
Outliers: Large estates (5000m² - 8000m²) appear as outliers.
Insight: The dataset includes some very large properties, but most properties have a small-to-medium size.
- Key Takeaways
All boxplots are highly right-skewed:
Most values are clustered at the lower end.
A few expensive or large properties significantly raise the max values.
Many extreme outliers:
Especially in Price_COP, Price_USD, and Surface, showing that a few properties are much larger or more expensive than the majority.
Market Characteristics:
Most properties are affordable, small-to-medium-sized rentals.
A small percentage of luxury properties significantly increases the price range.
This code will create vertical boxplots for all numerical columns in the DataFrame, excluding Latitude and Longitude, without displaying the outliers. Each boxplot will be displayed in a separate subplot.
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])
# Plot horizontal boxplots for the selected numerical columns
fig, axes = plt.subplots(len(numerical_columns), 1, figsize=(15, 5 * len(numerical_columns)))
for i, col in enumerate(numerical_columns):
sns.boxplot(data=df, x=col, ax=axes[i], showfliers=False)
axes[i].set_title(f'Boxplot of {col} (without outliers)')
axes[i].set_xlabel(col)
axes[i].set_ylabel('Value')
plt.tight_layout()
plt.show()
- What Changed Compared to the Previous Boxplots?
No Outliers: The extreme values that were previously shown as individual dots are now removed.
Better Representation of Typical Data: The box and whiskers now focus on the majority of properties, making it easier to interpret average values and spread.
More Balanced Scale: Without outliers, the distributions appear less skewed, and differences between variables become more noticeable.
- Interpretation of Each Boxplot
1️⃣ Boxplot of Price_COP (Colombian Pesos)
The whiskers now extend to reasonable price limits.
The median price is around the middle of the box, indicating a more symmetric distribution without extreme values.
Insight: After removing outliers, the data suggests that most properties fall within a reasonable price range, likely making analysis more meaningful.
2️⃣ Boxplot of Price_USD
Follows a similar trend as Price_COP, but values are converted to USD.
Wider box, indicating higher price variation among typical properties.
Insight: The rental market still has some variability, but the removal of extreme prices makes the range more interpretable.
3️⃣ Boxplot of Price_m2_USD (Price per Square Meter in USD)
Much more balanced than before.
The range is now more realistic, avoiding the influence of luxury properties with extremely high prices per square meter.
Insight: Most properties are priced between $5 and $15 per m², with a median close to $10/m².
4️⃣ Boxplot of Rooms
Shows that most properties have 1 to 3 rooms, with very few extending beyond 5 rooms.
Insight: Most rentals in the dataset are likely small to mid-sized properties, such as apartments or small houses.
5️⃣ Boxplot of Surface (Property Size in m²)
Now, the range is limited to around 1,200m², compared to the previous outliers reaching 8,000m².
The median surface area is below 500m², indicating that most properties are relatively small.
Insight: Without the influence of large estates, the dataset suggests a more typical property size range for rentals.
- Key Takeaways
Removing outliers improves clarity: Now, the dataset represents a more accurate and interpretable market view.
Most properties are within reasonable limits:
Price_m2_USD mostly between 5-15 USD/m².
Rooms mostly 1-3 rooms.
Surface mostly under 1,200m².
Skewness is reduced, making comparisons more meaningful.