Colombia Property Rent 2015 Descriptive Data Analysis

Dataset Source:

BigQuery Default Datasets: “properati_properties_co” Database

Data retrieval of all the 2015 tables were united with the following Query:

WITH Colombia_properties_rent_2015 AS (
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201501`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201502`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201503`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201504`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201505`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201506`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201507`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201508`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201509`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201510`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201511`
    UNION ALL
    SELECT * FROM `bigquery-public-data.properati_properties_co.properties_rent_201512`
)

SELECT created_on AS Creation_Date,
  property_type AS Type,
  place_name AS City,
  state_name AS State,
  country_name AS Country, 
  lat AS Latitude,
  lon AS Longitude,
  price_aprox_local_currency AS Price_COP,
  price_aprox_usd AS Price_USD,
  surface_covered_in_m2 AS Surface,
  price_usd_per_m2 AS Price_m2_USD,
  floor AS Floor,
  rooms AS Rooms

 FROM Colombia_properties_rent_2015

The subset was saved on the local machine as "Colombia_properties_rent_2015.CSV"

In [1]:
# Importing relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
# Assigning the data set to a variable
df = pd.read_csv('Colombia_properties_rent_2015.CSV')
df
Out[2]:
Creation_Date Type City State Country Latitude Longitude Price_COP Price_USD Surface Price_m2_USD Floor Rooms
0 2015-08-31 house Paipa Boyacá Colombia 5.784000 -73.116000 3821455.54 1307.71 130.0 10.059308 2.0 NaN
1 2015-10-27 house Cartagena Bolívar Colombia 10.424000 -75.549000 8547990.36 2925.14 300.0 9.750467 2.0 NaN
2 2015-07-04 house Armenia Quindío Colombia 4.461000 -75.667000 1508465.45 516.20 41.0 NaN NaN NaN
3 2015-08-05 house Retiro Antioquia Colombia NaN NaN 25141168.76 8603.36 700.0 NaN 1.0 NaN
4 2015-06-01 apartment Antioquia Antioquia Colombia 6.057394 -75.502792 653678.10 223.69 NaN NaN NaN 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5094 2015-05-07 house Ríomar Atlántico Colombia 11.008000 -74.822000 4499743.79 1548.60 45.0 34.413333 1.0 NaN
5095 2015-05-07 house Ríomar Atlántico Colombia 11.012000 -74.825000 5999658.38 2064.80 NaN 4.040705 NaN NaN
5096 2015-05-08 house Ríomar Atlántico Colombia 11.015000 -74.823000 3999762.57 1376.53 180.0 NaN NaN 3.0
5097 2015-05-08 house Puerto Colombia Atlántico Colombia 11.023000 -74.885000 1299916.29 447.37 85.0 5.263176 NaN 3.0
5098 2015-05-08 house San Antonio del Tequendama Cundinamarca Colombia 4.616000 -74.352000 4499743.79 1548.60 285.0 0.120984 NaN 4.0

5099 rows × 13 columns

In [3]:
# Checking how many empty values are in the data set
print(df.isnull().sum())  # Count missing values per column
df.shape
Creation_Date       0
Type                0
City                0
State               0
Country             0
Latitude          110
Longitude         110
Price_COP          21
Price_USD          21
Surface           274
Price_m2_USD     1588
Floor            2385
Rooms            2795
dtype: int64
Out[3]:
(5099, 13)

Data Wrangling

In [4]:
# Removing the rows with missing values for Latitude and Longitude
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)
print(df.isnull().sum())
Creation_Date       0
Type                0
City                0
State               0
Country             0
Latitude            0
Longitude           0
Price_COP          21
Price_USD          21
Surface           263
Price_m2_USD     1533
Floor            2304
Rooms            2744
dtype: int64
In [5]:
# Replace missing values with the mean of the column for Price_COP, Price_USD, Surface, Price_m2_USD
df[['Price_COP', 'Price_USD', 'Surface', 'Price_m2_USD']] = df[['Price_COP', 'Price_USD', 'Surface', 'Price_m2_USD']].apply(lambda x: x.fillna(x.mean()))
print(df.isnull().sum())
Creation_Date       0
Type                0
City                0
State               0
Country             0
Latitude            0
Longitude           0
Price_COP           0
Price_USD           0
Surface             0
Price_m2_USD        0
Floor            2304
Rooms            2744
dtype: int64

This code will fill the missing values in the Rooms column based on the Surface and Property_Type columns. It groups the data by property type, calculates the surface ranges for each number of rooms, and then assigns the number of rooms to the missing values based on the surface range.

In [6]:
import pandas as pd
import numpy as np

# Assuming df is your DataFrame

# Function to fill missing rooms based on surface range
def fill_rooms(row, surface_ranges):
    if pd.isnull(row['Rooms']):
        for prop_type, ranges in surface_ranges.items():
            if row['Type'] == prop_type:
                for min_surf, max_surf, rooms in ranges:
                    if min_surf <= row['Surface'] <= max_surf:
                        return rooms
    return row['Rooms']

# Group by property type and calculate surface ranges
surface_ranges = {}
for prop_type, group in df.groupby('Type'):
    ranges = []
    for rooms, sub_group in group.groupby('Rooms'):
        if not pd.isnull(rooms):
            min_surf = sub_group['Surface'].min()
            max_surf = sub_group['Surface'].max()
            ranges.append((min_surf, max_surf, rooms))
    surface_ranges[prop_type] = ranges

# Apply the function to fill missing rooms
df['Rooms'] = df.apply(lambda row: fill_rooms(row, surface_ranges), axis=1)

# Check the result
print(df.isnull().sum())
Creation_Date       0
Type                0
City                0
State               0
Country             0
Latitude            0
Longitude           0
Price_COP           0
Price_USD           0
Surface             0
Price_m2_USD        0
Floor            2304
Rooms              96
dtype: int64

As 96 values could not be established by the code we will drop them together with 'Floor' column

In [7]:
df = df.drop(columns=['Floor'])
df.dropna(inplace=True)
print(df.isnull().sum())
df.shape
Creation_Date    0
Type             0
City             0
State            0
Country          0
Latitude         0
Longitude        0
Price_COP        0
Price_USD        0
Surface          0
Price_m2_USD     0
Rooms            0
dtype: int64
Out[7]:
(4893, 12)
In [8]:
# Turning Rooms into an integer
df['Rooms'] = df['Rooms'].astype(int)
df
Out[8]:
Creation_Date Type City State Country Latitude Longitude Price_COP Price_USD Surface Price_m2_USD Rooms
0 2015-08-31 house Paipa Boyacá Colombia 5.784000 -73.116000 3821455.54 1307.71 130.000000 10.059308 1
1 2015-10-27 house Cartagena Bolívar Colombia 10.424000 -75.549000 8547990.36 2925.14 300.000000 9.750467 1
2 2015-07-04 house Armenia Quindío Colombia 4.461000 -75.667000 1508465.45 516.20 41.000000 10.960724 1
4 2015-06-01 apartment Antioquia Antioquia Colombia 6.057394 -75.502792 653678.10 223.69 1371.975032 10.960724 1
5 2015-06-01 apartment Antioquia Antioquia Colombia 6.057394 -75.502792 653678.10 223.69 1371.975032 10.960724 1
... ... ... ... ... ... ... ... ... ... ... ... ...
5094 2015-05-07 house Ríomar Atlántico Colombia 11.008000 -74.822000 4499743.79 1548.60 45.000000 34.413333 1
5095 2015-05-07 house Ríomar Atlántico Colombia 11.012000 -74.825000 5999658.38 2064.80 1371.975032 4.040705 1
5096 2015-05-08 house Ríomar Atlántico Colombia 11.015000 -74.823000 3999762.57 1376.53 180.000000 10.960724 3
5097 2015-05-08 house Puerto Colombia Atlántico Colombia 11.023000 -74.885000 1299916.29 447.37 85.000000 5.263176 3
5098 2015-05-08 house San Antonio del Tequendama Cundinamarca Colombia 4.616000 -74.352000 4499743.79 1548.60 285.000000 0.120984 4

4893 rows × 12 columns

In [9]:
#Check for duplicates
duplicates = df.duplicated()

# Print the number of duplicate rows
print(f"Number of duplicate rows: {duplicates.sum()}")

df.drop_duplicates(inplace=True)

df.shape
Number of duplicate rows: 3509
Out[9]:
(1384, 12)
In [10]:
# Check Data Types
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1384 entries, 0 to 4432
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Creation_Date  1384 non-null   object 
 1   Type           1384 non-null   object 
 2   City           1384 non-null   object 
 3   State          1384 non-null   object 
 4   Country        1384 non-null   object 
 5   Latitude       1384 non-null   float64
 6   Longitude      1384 non-null   float64
 7   Price_COP      1384 non-null   float64
 8   Price_USD      1384 non-null   float64
 9   Surface        1384 non-null   float64
 10  Price_m2_USD   1384 non-null   float64
 11  Rooms          1384 non-null   int32  
dtypes: float64(6), int32(1), object(5)
memory usage: 135.2+ KB

This code will round the values in the Surface and Price_m2_USD columns to 2 decimal places.

In [11]:
df['Surface'] = df['Surface'].round(2)
df['Price_m2_USD'] = df['Price_m2_USD'].round(2)
df
Out[11]:
Creation_Date Type City State Country Latitude Longitude Price_COP Price_USD Surface Price_m2_USD Rooms
0 2015-08-31 house Paipa Boyacá Colombia 5.784000 -73.116000 3821455.54 1307.71 130.00 10.06 1
1 2015-10-27 house Cartagena Bolívar Colombia 10.424000 -75.549000 8547990.36 2925.14 300.00 9.75 1
2 2015-07-04 house Armenia Quindío Colombia 4.461000 -75.667000 1508465.45 516.20 41.00 10.96 1
4 2015-06-01 apartment Antioquia Antioquia Colombia 6.057394 -75.502792 653678.10 223.69 1371.98 10.96 1
8 2015-07-01 apartment Antioquia Antioquia Colombia 6.057394 -75.502792 3419353.94 1170.11 1371.98 10.96 2
... ... ... ... ... ... ... ... ... ... ... ... ...
4428 2015-01-29 apartment Bucaramanga Santander Colombia 8.788330 -76.035560 819112.60 281.90 40.00 10.96 1
4429 2015-01-29 apartment Bucaramanga Santander Colombia 8.788330 -76.035560 819112.60 281.90 37.00 10.96 1
4430 2015-01-29 apartment Bucaramanga Santander Colombia 8.788330 -76.035560 1222566.96 420.75 54.00 10.96 1
4431 2015-01-29 apartment Bucaramanga Santander Colombia 8.788330 -76.035560 1650458.13 568.01 27.00 10.96 1
4432 2015-01-29 house Bucaramanga Santander Colombia 8.788330 -76.035560 3056402.88 1051.87 180.00 10.96 3

1384 rows × 12 columns

This code will save the DataFrame to a CSV file without including the index.

In [12]:
df.to_csv('Clean_Colombia_properties_rent_2015.csv', index=False)

Exploratory Data Analysis

In [13]:
# Basic information
print(df.info())
print(df.describe())
<class 'pandas.core.frame.DataFrame'>
Index: 1384 entries, 0 to 4432
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Creation_Date  1384 non-null   object 
 1   Type           1384 non-null   object 
 2   City           1384 non-null   object 
 3   State          1384 non-null   object 
 4   Country        1384 non-null   object 
 5   Latitude       1384 non-null   float64
 6   Longitude      1384 non-null   float64
 7   Price_COP      1384 non-null   float64
 8   Price_USD      1384 non-null   float64
 9   Surface        1384 non-null   float64
 10  Price_m2_USD   1384 non-null   float64
 11  Rooms          1384 non-null   int32  
dtypes: float64(6), int32(1), object(5)
memory usage: 135.2+ KB
None
          Latitude    Longitude     Price_COP     Price_USD      Surface  \
count  1384.000000  1384.000000  1.384000e+03   1384.000000  1384.000000   
mean      7.304399   -74.165171  1.138219e+07   3906.051588   518.659379   
std       3.259636     6.376757  2.069535e+07   7102.473683   913.167668   
min      -0.001614   -77.022000  0.000000e+00      0.000000     1.000000   
25%       4.675000   -74.822000  2.399951e+06    825.920000    77.000000   
50%       4.787000   -74.782000  4.683899e+06   1605.460000   186.000000   
75%      10.998000   -74.054981  1.199935e+07   4129.610000   561.250000   
max      11.245000     0.000907  2.413553e+08  82592.290000  8000.000000   

       Price_m2_USD        Rooms  
count   1384.000000  1384.000000  
mean      11.015166     1.999277  
std        6.985004     1.758976  
min        0.090000     1.000000  
25%        6.880000     1.000000  
50%       10.960000     1.000000  
75%       12.120000     3.000000  
max       66.250000    17.000000  

Key Statistics in the Summary

  1. count

The number of non-missing values in the column.

Example: Latitude and Longitude have 1,384 entries, meaning no missing data in these columns.

  1. mean (Average)

The average value of the column.

Example: The average Price_USD is 3906.05 USD, and the average Surface is 518.66 m².

  1. std (Standard Deviation)

Measures how spread out the values are from the mean.

Example:

Price_USD has a standard deviation of 7102.47, indicating a wide range of prices.

Surface has a standard deviation of 913.16, showing variability in property sizes.

  1. min (Minimum Value)

The smallest value in the column.

Example: The minimum Price_USD is 0 (possibly indicating missing or free properties).

  1. 25% (1st Quartile)

The value below which 25% of the data falls.

Example: For Surface, 25% of properties are 130 m² or smaller.

  1. 50% (Median)

The middle value, where half the data falls below and half above.

Example: The median Surface is 300 m², indicating most properties are medium-sized.

  1. 75% (3rd Quartile)

The value below which 75% of the data falls.

Example: For Price_USD, 75% of properties cost 3000 USD or less.

  1. max

The largest value in the column.

Example: The largest Surface is 17,000 m², indicating an exceptionally large property.

Insights

Price Variability: The high standard deviation and large range (from 0 to millions in both Price_COP and Price_USD) suggest a wide variety of property prices.

Surface Insights: The average property size is around 518 m², but the large standard deviation shows some very small and very large properties.

Distribution of Numerical Features:

Plot histograms for numerical columns.

In [14]:
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])

# Plot histograms for the selected numerical columns
df[numerical_columns].hist(bins=30, figsize=(15, 10))
plt.show()
# # Define numerical columns excluding 'Latitude' and 'Longitude'
# numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])

# # Plot histograms for the selected numerical columns using Plotly
# for column in numerical_columns:
#     fig = px.histogram(df, x=column, nbins=30, title=f'Histogram of {column}')
#     fig.show()
No description has been provided for this image
  1. Price_COP (Colombian Pesos)

Skewed Right (Positively Skewed): Most property prices are concentrated at the lower end, with a few extreme values (outliers) pushing the distribution to the right. Insight: There are significantly expensive properties that drive the maximum price up, but most listings are on the lower end.

  1. Price_USD (US Dollars)

Similar to Price_COP, this is also right-skewed, meaning most properties are at the lower price range with a few very expensive ones. Insight: The real estate market has a few luxury properties but mostly consists of budget or mid-range properties

  1. Price_m2_USD (Price per Square Meter in USD)

Right-skewed but slightly more concentrated: There is a clear peak around 10 USD per square meter, indicating that most properties fall within this range. Insight: The majority of properties have similar pricing per square meter, with a few high-end outliers.

  1. Rooms (Number of Rooms per Property)

Right-skewed distribution: Most properties have 1 or 2 rooms, with a sharp drop-off as the room count increases. Insight: The majority of rental properties are small units, likely apartments or single-bedroom houses. Larger properties (6+ rooms) are rare.

  1. Surface (Property Size in Square Meters)

Highly Right-Skewed: Most properties are small (below 500m²), but there are some very large properties (up to 8000m²). Insight: The dataset includes a mix of small apartments and large estates, but the majority of properties are compact.

Overall Observations

Most distributions are right-skewed, meaning:

Most properties have low to mid-range values. A few high-priced or large properties significantly increase the maximum values.

Outliers exist in every distribution, especially for price and surface area.

Categorical Features:

Analyze the distribution of categorical columns.

In [15]:
# Exclude 'Creation_Date' and 'Country' columns
categorical_columns = ['Type', 'City', 'State']

# Get the top 20 cities
top_20_cities = df['City'].value_counts().nlargest(20).index

# Filter the DataFrame to include only the top 20 cities
df_top_cities = df[df['City'].isin(top_20_cities)]

# Order states by their count in descending order
state_order = df['State'].value_counts().index

# Plot the distribution of categorical columns
fig, axes = plt.subplots(3, 1, figsize=(15, 20))

for i, col in enumerate(categorical_columns):
    if col == 'City':
        sns.countplot(data=df_top_cities, x=col, ax=axes[i], order=top_20_cities)
        axes[i].set_title('Distribution of Top 20 Cities')
    elif col == 'State':
        sns.countplot(data=df, x=col, ax=axes[i], order=state_order)
        axes[i].set_title('Distribution of State (Descending)')
    else:
        sns.countplot(data=df, x=col, ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=90)

plt.tight_layout()
plt.show()
No description has been provided for this image
  1. Distribution of Property Type

The bar chart shows the count of different property types in the dataset. Insights: The dataset is heavily dominated by houses, with a significantly smaller number of apartments and a very few properties classified as stores. This suggests that the dataset mainly focuses on residential rentals, with limited commercial properties.

  1. Distribution of Top 20 Cities

This bar chart displays the count of properties listed in the top 20 cities.

Insights:

Barranquilla has the highest number of listings, standing out significantly. Other major cities like Usaquén, Chía, and Cali also have a notable number of properties. The distribution shows a long tail, meaning most cities have a relatively low number of listings compared to the top few.

  1. Distribution of State (Descending Order)

This bar chart displays the count of properties by state, sorted in descending order.

Insights:

Atlántico has the most listings, followed by Bogotá D.C., Cundinamarca, and Valle del Cauca. The distribution follows a similar trend as cities, where a few states dominate the dataset. Possible Next Steps:

Overall Insights

The majority of properties are houses, meaning the dataset is more focused on residential properties than apartments or commercial spaces.

Barranquilla and Atlántico have the most listings, suggesting these areas are hotspots for rental properties.

The distribution is uneven, meaning a few cities and states dominate the dataset while many have significantly fewer listings.

Pairplot:

Visualize relationships between numerical features.

In [16]:
sns.pairplot(df)
plt.show()
No description has been provided for this image
  1. Understanding the Pair Plot

Each variable in the dataset is plotted against every other variable. The diagonal contains histograms showing the distribution of each variable. The scatterplots in the off-diagonal cells show the relationship between pairs of variables.

  1. Key Observations from the Scatterplots

Price_COP vs. Price_USD:

This plot shows a perfect linear relationship, as expected, because one is a converted version of the other (Colombian Pesos to USD). Price_USD vs. Surface:

There is a positive trend, meaning larger properties generally have higher prices. However, some properties with small surfaces have high prices, indicating outliers or luxury properties. Price_m2_USD vs. Surface:

This relationship appears non-linear, with higher price per square meter in certain cases. Some properties with very high price per square meter may be located in premium locations. Rooms vs. Surface:

The number of rooms is positively correlated with surface, as expected (larger properties tend to have more rooms). However, a few large properties have fewer rooms, which could indicate open-layout designs. Latitude & Longitude:

These show clusters of properties in certain areas. There might be geographic hotspots where most properties are concentrated.

  1. Insights from the Histograms (Diagonal)

Most variables, especially Price_COP, Price_USD, and Surface, have right-skewed distributions. This means that most properties are in the lower price range, with a few very expensive properties pushing the maximum values.

Boxplots:

Identify outliers in numerical features.

In [17]:
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])

# Plot vertical boxplots for the selected numerical columns
fig, axes = plt.subplots(len(numerical_columns), 1, figsize=(15, 5 * len(numerical_columns)))

for i, col in enumerate(numerical_columns):
    sns.boxplot(data=df, x=col, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Value')

plt.tight_layout()
plt.show()
No description has been provided for this image
  1. Understanding Boxplots

Each boxplot consists of:

Box (Interquartile Range, IQR):

The bottom of the box represents the 25th percentile (Q1).

The top of the box represents the 75th percentile (Q3).

The middle line inside the box represents the median (Q2).

Whiskers:

Extend to the smallest and largest data points within 1.5 times the IQR.

Outliers (Dots):

Any values beyond 1.5 times the IQR are plotted as individual points, considered outliers.

  1. Interpretation of Each Boxplot

1️⃣ Boxplot of Price_COP (Colombian Pesos)

Highly Right-Skewed: Most values are concentrated near the lower end, while a few extremely high prices stretch the scale.

Many Outliers: Expensive properties significantly increase the price range.

Insight: The rental market has a few luxury properties that are much more expensive than the majority.

2️⃣ Boxplot of Price_USD

Similar to Price_COP, but converted to USD.

Right-Skewed Distribution: Most properties are on the lower end, with a few ultra-expensive listings.

Outliers: Some properties have prices exceeding 80,000 USD, which might indicate luxury homes.

3️⃣ Boxplot of Price_m2_USD (Price per Square Meter in USD)

Right-Skewed, but more concentrated: Most properties have a Price_m2_USD below 15 USD/m².

Outliers: Properties with very high price per square meter (above 50-60 USD/m²) could be premium locations.

Insight: While most rentals follow a reasonable pricing trend, some luxury or prime-location properties are extremely expensive per square meter.

4️⃣ Boxplot of Rooms

Majority have 1-3 rooms.

Outliers: Some properties have more than 10 rooms, which are likely luxury homes, hotels, or large estates.

Insight: The dataset mainly consists of small to mid-sized rentals, with a few large properties standing out.

5️⃣ Boxplot of Surface (Property Size in m²)

Extreme Right-Skew: Most properties are small (under 500m²), but a few exceed 5000m².

Outliers: Large estates (5000m² - 8000m²) appear as outliers.

Insight: The dataset includes some very large properties, but most properties have a small-to-medium size.

  1. Key Takeaways

All boxplots are highly right-skewed:

Most values are clustered at the lower end.

A few expensive or large properties significantly raise the max values.

Many extreme outliers:

Especially in Price_COP, Price_USD, and Surface, showing that a few properties are much larger or more expensive than the majority.

Market Characteristics:

Most properties are affordable, small-to-medium-sized rentals.

A small percentage of luxury properties significantly increases the price range.

This code will create vertical boxplots for all numerical columns in the DataFrame, excluding Latitude and Longitude, without displaying the outliers. Each boxplot will be displayed in a separate subplot.

In [18]:
# Define numerical columns excluding 'Latitude' and 'Longitude'
numerical_columns = df.select_dtypes(include=[np.number]).columns.difference(['Latitude', 'Longitude'])

# Plot horizontal boxplots for the selected numerical columns
fig, axes = plt.subplots(len(numerical_columns), 1, figsize=(15, 5 * len(numerical_columns)))

for i, col in enumerate(numerical_columns):
    sns.boxplot(data=df, x=col, ax=axes[i], showfliers=False)
    axes[i].set_title(f'Boxplot of {col} (without outliers)')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Value')

plt.tight_layout()
plt.show()
No description has been provided for this image
  1. What Changed Compared to the Previous Boxplots?

No Outliers: The extreme values that were previously shown as individual dots are now removed.

Better Representation of Typical Data: The box and whiskers now focus on the majority of properties, making it easier to interpret average values and spread.

More Balanced Scale: Without outliers, the distributions appear less skewed, and differences between variables become more noticeable.

  1. Interpretation of Each Boxplot

1️⃣ Boxplot of Price_COP (Colombian Pesos)

The whiskers now extend to reasonable price limits.

The median price is around the middle of the box, indicating a more symmetric distribution without extreme values.

Insight: After removing outliers, the data suggests that most properties fall within a reasonable price range, likely making analysis more meaningful.

2️⃣ Boxplot of Price_USD

Follows a similar trend as Price_COP, but values are converted to USD.

Wider box, indicating higher price variation among typical properties.

Insight: The rental market still has some variability, but the removal of extreme prices makes the range more interpretable.

3️⃣ Boxplot of Price_m2_USD (Price per Square Meter in USD)

Much more balanced than before.

The range is now more realistic, avoiding the influence of luxury properties with extremely high prices per square meter.

Insight: Most properties are priced between $5 and $15 per m², with a median close to $10/m².

4️⃣ Boxplot of Rooms

Shows that most properties have 1 to 3 rooms, with very few extending beyond 5 rooms.

Insight: Most rentals in the dataset are likely small to mid-sized properties, such as apartments or small houses.

5️⃣ Boxplot of Surface (Property Size in m²)

Now, the range is limited to around 1,200m², compared to the previous outliers reaching 8,000m².

The median surface area is below 500m², indicating that most properties are relatively small.

Insight: Without the influence of large estates, the dataset suggests a more typical property size range for rentals.

  1. Key Takeaways

Removing outliers improves clarity: Now, the dataset represents a more accurate and interpretable market view.

Most properties are within reasonable limits:

Price_m2_USD mostly between 5-15 USD/m².

Rooms mostly 1-3 rooms.

Surface mostly under 1,200m².

Skewness is reduced, making comparisons more meaningful.

This is Dashboard 1/2 of the representation of the data in Tableau

This is Dashboard 2/2 of the representation of the data in Tableau