Airbnb Price Prediction Based on Tree Model and Analysis of Influencing Factors, Evidence from Chicago

Introduction¶

The idea of the sharing economy has become increasingly popular, and companies such as Uber and WeWork have been successful under this concept. As its leading symbol, Airbnb's success is remarkable: when Airbnb went public in 2020, it was the largest IPO on the Nasdaq that year. Airbnb allows hosts to rent out spare properties or rooms to people who need them for short periods. Property owners earn extra income, travelers save on accommodation costs, and Airbnb profits from the commission. A reasonable price that reflects demand and supply is therefore particularly important in this business model.

In our study, we use machine learning methods to build tree models on Airbnb data for Chicago. We use the Classification And Regression Tree (CART), Random Forest, and XGBoost methods to forecast Airbnb prices, compare and analyze model performance, and investigate the factors that have important impacts on Airbnb prices.

Literature Review¶

Due to the accuracy and efficiency of machine learning methods in quantitative studies, many researchers have used tree models for price prediction, and their work informs this study. Xu used the XGBoost algorithm to build housing price prediction models and evaluated their performance using RMSE (Xu, 2023). Tekin used the Random Forest and XGBoost methods to predict property prices in Istanbul and found that these two methods performed better (Tekin and Sari, 2022). Perez-Sanchez analyzed Airbnb data from four Spanish Mediterranean cities and built an OLS model to analyze the factors that impact prices: extra bathrooms and the number of accommodated guests both had positive impacts on prices, while the distance to the coast had a negative impact (Perez-Sanchez et al., 2018). Hill used machine learning methods to build an Airbnb price model considering over a thousand influences and found that properties with more reviews command higher premiums (Hill, 2015). We will inspect these findings in our study as well.

Research Question¶

We focus on two questions. First, is it possible to build tree models that predict Airbnb prices using machine learning methods, and how do different algorithms compare in performance? Second, which factors have decisive impacts on Airbnb prices? We will present our conclusions based on the quantitative results.

Presentation of Data¶

In this study, we use the Airbnb data of Chicago provided by Inside Airbnb, an organization that open-sources Airbnb's public website data using web crawlers. The data contains information on 7,747 Airbnb properties in Chicago. We used R to pre-process the raw data, including creating a series of dummy variables from the contents of the amenities long string to record whether a listing offers certain amenities, such as a first aid kit or an oven, and filtering the data down to 5,714 valid listings, effectively reducing the data file size to less than 2 MB. The corresponding R code can be found in the relevant GitHub repository.
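The pre-processing was done in R, but the amenity-dummy step can be sketched equivalently in pandas. The toy `amenities` strings and the keyword list below are illustrative assumptions, not the study's actual code:

```python
import pandas as pd

# Hypothetical raw data: each listing stores its amenities as one long string,
# mimicking the raw Inside Airbnb "amenities" field.
raw = pd.DataFrame({
    'id': [1, 2, 3],
    'amenities': ['["Wifi", "Oven", "First aid kit"]',
                  '["Wifi", "Kitchen"]',
                  '["Oven", "Free parking"]'],
})

# Illustrative amenity keywords; the study records 23 such dummies.
keywords = {'wifi': 'Wifi', 'oven': 'Oven', 'first_aid_kit': 'First aid kit'}

# Create one boolean dummy per amenity by substring matching on the long string.
for col, kw in keywords.items():
    raw[col] = raw['amenities'].str.contains(kw, regex=False)

print(raw[['id', 'wifi', 'oven', 'first_aid_kit']])
```

With the real data, the same loop would run over `df['amenities']` and the full keyword list.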

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for plot
import plotly.express as px # for plot

import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, validation_curve
from sklearn.metrics import mean_squared_error

import rfpimp # for feature importance
from sklearn.tree import DecisionTreeRegressor # CART
from sklearn.ensemble import RandomForestRegressor # Random Forest

import xgboost
from xgboost import XGBRegressor # XGBoost

# set seed to make results reproducible
random_state = 101
In [2]:
# import Chicago Airbnb data
df = pd.read_csv('https://raw.githubusercontent.com/DanteChen0825/ChicagoAirbnb/main/data/airbnb.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5714 entries, 0 to 5713
Data columns (total 56 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   host_since                   5714 non-null   int64  
 1   host_is_superhost            5714 non-null   bool   
 2   host_listings_count          5714 non-null   int64  
 3   host_identity_verified       5714 non-null   bool   
 4   neighbourhood_cleansed       5714 non-null   object 
 5   latitude                     5714 non-null   float64
 6   longitude                    5714 non-null   float64
 7   property_type                5714 non-null   object 
 8   room_type                    5714 non-null   object 
 9   accommodates                 5714 non-null   int64  
 10  bathrooms_text               5714 non-null   float64
 11  bedrooms                     5714 non-null   int64  
 12  beds                         5714 non-null   int64  
 13  price                        5714 non-null   int64  
 14  minimum_nights               5714 non-null   int64  
 15  maximum_nights               5714 non-null   int64  
 16  availability_30              5714 non-null   int64  
 17  availability_60              5714 non-null   int64  
 18  availability_90              5714 non-null   int64  
 19  availability_365             5714 non-null   int64  
 20  number_of_reviews            5714 non-null   int64  
 21  number_of_reviews_ltm        5714 non-null   int64  
 22  number_of_reviews_l30d       5714 non-null   int64  
 23  review_scores_rating         5714 non-null   float64
 24  review_scores_accuracy       5714 non-null   float64
 25  review_scores_cleanliness    5714 non-null   float64
 26  review_scores_checkin        5714 non-null   float64
 27  review_scores_communication  5714 non-null   float64
 28  review_scores_location       5714 non-null   float64
 29  review_scores_value          5714 non-null   float64
 30  instant_bookable             5714 non-null   bool   
 31  reviews_per_month            5714 non-null   float64
 32  self_check_in                5714 non-null   bool   
 33  dishwasher                   5714 non-null   bool   
 34  bathtub                      5714 non-null   bool   
 35  heating                      5714 non-null   bool   
 36  wifi                         5714 non-null   bool   
 37  parking                      5714 non-null   bool   
 38  oven                         5714 non-null   bool   
 39  stove                        5714 non-null   bool   
 40  dryer                        5714 non-null   bool   
 41  air_condition                5714 non-null   bool   
 42  outdoor_furniture            5714 non-null   bool   
 43  shampoo                      5714 non-null   bool   
 44  bbq_grill                    5714 non-null   bool   
 45  hdtv                         5714 non-null   bool   
 46  netflix                      5714 non-null   bool   
 47  microwave                    5714 non-null   bool   
 48  free_parking                 5714 non-null   bool   
 49  refrigerator                 5714 non-null   bool   
 50  alarm                        5714 non-null   bool   
 51  coffee                       5714 non-null   bool   
 52  monoxide                     5714 non-null   bool   
 53  kitchen                      5714 non-null   bool   
 54  silverware                   5714 non-null   bool   
 55  first_aid_kit                5714 non-null   bool   
dtypes: bool(27), float64(11), int64(15), object(3)
memory usage: 1.4+ MB

The data contains information about the host, such as when the host joined; the location of the listing, including the neighborhood and detailed latitude and longitude; and information about the property, such as the number of bedrooms and bathrooms, the number of days available in the next 30 and 60 days, review scores, etc. It also includes 23 dummy variables recording whether the property offers the stated amenities, such as a refrigerator or free parking, and the price per night of each listing.

The interactive map in Figure 1 shows the distribution of all Airbnb listings in Chicago in four categories: entire homes account for about three-quarters of all listings, followed by private rooms at 22.8%, with hotel rooms and shared rooms together accounting for only about 2%. These properties are scattered throughout almost the entire city of Chicago, providing additional options for travelers.

In [3]:
# Chicago Airbnb distribution map
fig = px.scatter_mapbox(df,
                        lat="latitude",
                        lon="longitude",
                        hover_name="neighbourhood_cleansed", hover_data=["bedrooms", "price"],
                        color="room_type", zoom=10, height=500)
fig.update_layout(mapbox_style="carto-positron")
fig.update_traces(marker=dict(size=5,opacity = 0.6))
fig.update_layout(title = 'Figure 1: Airbnb Map of Chicago, Illinois')
fig.show()

In Figure 2 we show the density of Airbnb listings in Chicago. We can visually identify several hotspots: West Town has the highest number of listings, Lake View, a bit further north, the second highest, followed by the Near North Side neighborhood. We can also see clusters of listings in other high-density areas such as the Near South Side.

In [4]:
# Chicago Airbnb density map
fig = px.density_mapbox(df
                        ,lat='latitude'
                        ,lon='longitude'
                        ,range_color = [0, 40]
                        ,zoom=10, height=500
                        ,radius=20
                        ,opacity=0.8
                        ,mapbox_style='carto-positron')                        
fig.update_layout(title = 'Figure 2: Airbnb Density Map of Chicago, Illinois ')
fig.show()

Figure 3 shows the price distribution of Airbnb listings in Chicago. We find that the distribution is significantly right-skewed, with over three-quarters of listings under \$200 a night and a median price of \$116 per night.
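The right-skew can also be confirmed numerically: a right-skewed distribution has its mean above its median and a positive skewness coefficient. A minimal sketch, using synthetic lognormal prices as a stand-in for the real `df['price']` column:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (lognormal), standing in for the real column;
# with the real data this would simply be df['price'].
rng = np.random.default_rng(101)
price = pd.Series(np.exp(rng.normal(np.log(116), 0.6, size=5714)))

# A right-skewed distribution has mean > median and positive sample skewness.
print(f"median = {price.median():.0f}, mean = {price.mean():.0f}, "
      f"skew = {price.skew():.2f}")
```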

In [5]:
# plot price distribution of Airbnb price of Chicago
plt.figure(dpi=500, figsize=(10,6)) 
df['price'].plot.hist(grid=False, bins=30, rwidth=0.85,color='#2874A6')
plt.title('Figure 3: Airbnb Price Distribution of Chicago, Illinois')
plt.xlabel('Airbnb Price')
plt.ylabel('Airbnb Quantity')
plt.grid(axis='y', alpha=0.4)
plt.axvline(x=df['price'].median(), color="#F7DC6F", alpha=0.95)
plt.xticks([0,100,200,300,400,500,600,700,800,900,1000])
plt.text(140, 950, r'Median = $116')
plt.annotate('[Data Source: Inside Airbnb]', (0,0), (440,-25), fontsize=8, 
             xycoords='axes fraction', textcoords='offset points', va='top')
plt.box(False)
plt.axhline(0, color='black', linewidth=1)
Out[5]:
<matplotlib.lines.Line2D at 0x7fc24a035a30>

Table 1 shows the correlations among all factors, and we can see that some are highly correlated. For example, the number of people accommodated, the number of bedrooms, and the number of beds are strongly correlated, which makes linear regression or multiple linear regression inapplicable, as this multicollinearity violates the assumption of independent regressors. However, it is unreasonable to discard any of these variables, as they are core determinants of Airbnb prices. Tree models avoid this problem: the model differentiates prices by selecting the more effective variables, which lets us determine which factors are decisive and which are less important.

Table 1: Correlation matrix of Airbnb factors¶
In [6]:
# Airbnb factors Correlation matrix
df_corr = df.drop(['price'], axis=1)
corr = df_corr.corr()
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
Out[6]:
[52 × 52 correlation matrix of all factors, rendered as a color-gradient table. Notable pairs include accommodates–beds (0.84), accommodates–bedrooms (0.82), bedrooms–beds (0.81), availability_30–availability_60 (0.93), availability_60–availability_90 (0.97), and the review-score variables with one another (roughly 0.47–0.85).]
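The highly correlated pairs discussed above can also be extracted programmatically from the correlation matrix. A sketch on synthetic stand-in features (with the real data, the same scan would run on `df_corr.corr()`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features: 'bedrooms' is built to track 'accommodates',
# mimicking the correlated pattern seen in the real listing data.
rng = np.random.default_rng(101)
accommodates = rng.integers(1, 10, size=500)
data = pd.DataFrame({
    'accommodates': accommodates,
    'bedrooms': accommodates // 2 + rng.integers(0, 2, size=500),
    'minimum_nights': rng.integers(1, 30, size=500),
})

# Keep only the upper triangle (each pair once, no diagonal), then list
# feature pairs whose absolute correlation exceeds a threshold.
corr = data.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().loc[lambda s: s > 0.7]
print(pairs)
```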

We created dummy variables for the three categorical variables and randomly split the data into training and test sets: 75% of the data is used to determine hyperparameters and train the models, and the remaining 25% is held out as test data, on which model outputs will be used to evaluate performance.

In [7]:
# create dummy variable for the categorical value
df_1 = pd.get_dummies(df, columns=['neighbourhood_cleansed', 'property_type', 'room_type'])
df_2 = df_1.drop(["neighbourhood_cleansed_Albany Park", "property_type_Casa particular", "room_type_Entire home/apt"], axis=1)
df_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5714 entries, 0 to 5713
Columns: 170 entries, host_since to room_type_Shared room
dtypes: bool(27), float64(11), int64(15), uint8(117)
memory usage: 1.9 MB
In [8]:
# split the data into train set and test set
train_x, test_x, train_y, test_y = train_test_split(df_2.drop(['price'], axis = 1), df_2.price, random_state=random_state, test_size=0.25)

print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)
(4285, 169)
(4285,)
(1429, 169)
(1429,)

Methodology¶

Classification And Regression Tree¶

First, we build a Classification And Regression Tree. The CART and Random Forest algorithms require two hyperparameters, max_depth and min_samples_split. The max_depth determines the maximum tree height: splitting stops once the number of layers in the tree reaches the maximum. The min_samples_split determines the minimum number of samples a node must contain before it can be split: a node with fewer samples becomes a leaf. These two hyperparameters constrain the decision tree simultaneously, and splitting stops when either condition is triggered. We want max_depth to be large enough and min_samples_split small enough for the model to be precise, but not so large and so small that the model overfits, fitting the training data well while failing to generalize to unseen data.

We use a grid search to determine the best hyperparameter combination. In the CART model we try 8 × 10 = 80 hyperparameter combinations. For each pair of hyperparameters, we use 5-fold cross-validation: the training data is split into five random parts, four of them are used as the training set and one as the validation set, and the performance of the five resulting models is averaged to evaluate that hyperparameter combination. The best combination is then determined from the performance of all combinations. In particular, both the grid search and the cross-validation use only the training data, never the test data, to avoid data leakage.

In [9]:
# set hyperparameter max_depth and min_samples_split for grid search
hyperparameters = {'max_depth':[2,5,10,15,20,30,40,50], 'min_samples_split':[2,4,6,8,10,12,14,16,18,20]}

# grid search with a 5-fold cross-validation
dt = DecisionTreeRegressor(random_state=random_state)
gs_cart = GridSearchCV(dt, hyperparameters, cv=5)
gs_cart.fit(train_x, train_y)

# print best hyperparameter value and corresponding accuracy score
print ("The best hyperparameter value is: ",gs_cart.best_params_)
print ("The best training score is: ", "%.4f" % gs_cart.best_score_)

# This chunk may take longer to run.
# On a MacBook Pro with an M2 chip, using Jupyter in the Anaconda environment, it runs in under 5 minutes.
# The expected output is the following:
# The best hyperparameter value is:  {'max_depth': 5, 'min_samples_split': 16}
# The best training score is:  0.4761
The best hyperparameter value is:  {'max_depth': 5, 'min_samples_split': 16}
The best training score is:  0.4761

We found that the CART model performed best with a max_depth of 5 and a min_samples_split of 16, but the best training score was only 0.4761. We then fit the CART model on the training data with these hyperparameters.

In [10]:
# use the best hyperparameter by grid search to train the CART model
dt_final = DecisionTreeRegressor(max_depth=5, min_samples_split=16, random_state=random_state)
dt_final.fit(train_x, train_y)
Out[10]:
DecisionTreeRegressor(max_depth=5, min_samples_split=16, random_state=101)
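The overfitting trade-off behind the max_depth choice can be visualized with the validation_curve helper imported earlier. This sketch uses synthetic regression data as a stand-in for the Airbnb features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the Airbnb training set.
X, y = make_regression(n_samples=1000, n_features=20, noise=20.0,
                       random_state=101)

# Cross-validated R^2 over a range of max_depth values: the training score
# keeps rising with depth, while the validation score peaks and then drops
# as the tree starts to overfit.
depths = [2, 5, 10, 20, 40]
train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=101), X, y,
    param_name='max_depth', param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1),
                     valid_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train R^2={tr:.2f}  valid R^2={va:.2f}")
```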

Random Forest¶

We then used the Random Forest algorithm to build the model, an ensemble learning method based on bootstrap aggregating. A series of different CART trees is built by randomly sampling the training rows (with replacement) and randomly selecting variables, and the average of the outputs of the different trees is used as the output of the Random Forest model. We apply the same grid search and 5-fold cross-validation to find the best combination of hyperparameters.

In [11]:
# This chunk may take longer to run and can be skipped.
# Using Jupyter with the Anaconda environment on a MacBook Pro with an M2 chip, it runs in under 10 minutes.

# set hyperparameter max_depth and min_samples_split for grid search
#hyperparameters = {'max_depth':[5,10,20,30,40,50], 'min_samples_split':[2,4,6,8,10,15,20]}

# To reduce the running time of the grid search and cross-validation,
# the following alternative code tests fewer combinations of hyperparameters and produces the same output.
# If time allows, running the original code above is highly encouraged.
# Using Jupyter with the Anaconda environment on a MacBook Pro with an M2 chip, it runs in under 5 minutes.

# set hyperparameter max_depth and min_samples_split for grid search
hyperparameters = {'max_depth':[20,30,40,50], 'min_samples_split':[2,4,6,8]}

# grid search with a 5-fold cross-validation
rf = RandomForestRegressor(random_state=random_state)
gs_rf = GridSearchCV(rf, hyperparameters, cv=5)
gs_rf.fit(train_x, train_y)

# print best hyperparameter value and corresponding accuracy score
print("The best hyperparameter value is: ", gs_rf.best_params_)
print("The best training score is: ", "%.4f" % gs_rf.best_score_)

# The expected output is the following:
# The best hyperparameter value is:  {'max_depth': 40, 'min_samples_split': 2}
# The best training score is:  0.6433
The best hyperparameter value is:  {'max_depth': 40, 'min_samples_split': 2}
The best training score is:  0.6433

We found that the model performed best when max_depth was 40 and min_samples_split was 2, with a cross-validated training score of 0.6433; we then apply these hyperparameters to the model.

In [12]:
# use the best hyperparameter by grid search to train the Random Forest model
rf_final = RandomForestRegressor(max_depth=40, min_samples_split=2, random_state=random_state)
rf_final.fit(train_x, train_y)
Out[12]:
RandomForestRegressor(max_depth=40, random_state=101)

XGBoost¶

Finally, we adopt the gradient boosting decision tree algorithm, in which each iteration fits a new CART to the residuals of the previous ensemble, improving accuracy by correcting inaccurate predictions; we use the implementation provided by the XGBoost package. Using grid search and 5-fold cross-validation, we found the best hyperparameter combination to be max_depth of 5 and n_estimators of 200, with a best training score of 0.6493.
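The residual-fitting idea can be shown with a bare-bones boosting loop. This is an illustrative sketch on synthetic data, capturing only the core principle; XGBoost adds regularization and second-order gradient information on top of it.

```python
# A minimal sketch of gradient boosting for regression: start from a constant
# prediction, then repeatedly fit a shallow tree to the current residuals and
# add a shrunken version of its prediction. Names are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(101)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.3, 300)

pred = np.full_like(y, y.mean())  # iteration 0: predict the mean
learning_rate = 0.1
for _ in range(100):
    residual = y - pred                       # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * stump.predict(X)  # correct toward the residuals

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(f"Training MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

Each round shrinks the remaining error, which is why n_estimators (the number of rounds) and max_depth of each tree are the hyperparameters we tune.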

In [13]:
# This chunk may take longer to run and can be skipped.
# Using Jupyter with the Anaconda environment on a MacBook Pro with an M2 chip, it runs in under 10 minutes.

# set hyperparameter max_depth and n_estimators for grid search
# hyperparameters =  {'max_depth':[5,10,20,30,40,50], 'n_estimators':[50,80,100,150,200,250,300]}

# To reduce the running time of the grid search and cross-validation,
# the following alternative code tests fewer combinations of hyperparameters and produces the same output.
# If time allows, running the original code above is highly encouraged.
# Using Jupyter with the Anaconda environment on a MacBook Pro with an M2 chip, it runs in under 5 minutes.

# set hyperparameter max_depth and n_estimators for grid search
hyperparameters =  {'max_depth':[5,10,20,30], 'n_estimators':[100,150,200,250]}

# grid search with a 5-fold cross-validation
xgb = XGBRegressor(random_state=random_state)
gscv_xgb = GridSearchCV(xgb, hyperparameters, cv=5)
gscv_xgb.fit(train_x, train_y)

# print best hyperparameter value and corresponding accuracy score
print("The best parameter value is: ", gscv_xgb.best_params_)
print("The best score is: ", "%.4f" % gscv_xgb.best_score_)

# The expected output is the following:
# The best parameter value is:  {'max_depth': 5, 'n_estimators': 200}
# The best score is:  0.6493
The best parameter value is:  {'max_depth': 5, 'n_estimators': 200}
The best score is:  0.6493

We fit the XGBoost model with the training data using the best combination of hyperparameters.

In [14]:
# use the best hyperparameter by grid search to train the XGBoost model
xgb_final = XGBRegressor(max_depth=5, n_estimators=200, random_state=random_state)
xgb_final.fit(train_x, train_y)
Out[14]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=5, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=200, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=101, ...)

Feature Importance Analysis¶

For feature importance analysis of price variation, we use the importances function from the rfpimp package. This function compares the model's R-squared before and after randomly shuffling each column: columns whose shuffling produces a larger drop in R-squared are considered important variables, while those producing minor changes are considered less important (Parr et al., 2018).
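The shuffle-and-compare mechanism can be sketched by hand. The code below is an illustrative reimplementation of the permutation idea on synthetic data, not a call into rfpimp or our notebook's variables.

```python
# A minimal sketch of permutation importance: shuffle one column at a time,
# breaking its link to the target, and record the resulting drop in R^2.
# Data and names are illustrative; column 0 is built to matter most.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(101)
X = rng.uniform(0, 1, size=(300, 3))
y = 5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.1, 300)  # col 2 is noise

model = RandomForestRegressor(random_state=101).fit(X, y)
baseline = model.score(X, y)  # R^2 with all columns intact

importances = []
for j in range(X.shape[1]):
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # destroy column j's signal
    importances.append(baseline - model.score(X_shuffled, y))

print("Permutation importances:", [f"{v:.3f}" for v in importances])
```

Because the model is never refit, this measures how much the fitted model actually relies on each column, which avoids the biases of impurity-based importances.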

Results and Discussion¶

From Table 2, we observe significant differences in the R-squared of the three tree models. On the training data, with the best hyperparameter combinations, the CART model has an R-squared of 60.7%, while the Random Forest and XGBoost algorithms both exceed 95%. On the test data, Random Forest reaches 69.9% and XGBoost 71.2%. This reflects the strength of XGBoost in extracting effective information, with its variables able to explain about 71% of Airbnb price variation.

Regarding the gap between training and test performance, CART has the smallest difference, while both Random Forest and XGBoost show gaps of more than 25 percentage points, which may indicate a potential overfitting problem; nevertheless, both still outperform CART on the test data.

In [15]:
# print R2 of train and test data for three model
final_output_R2 = dict()
final_output_R2['Train'] = [dt_final.score(X=train_x, y=train_y), rf_final.score(X=train_x, y=train_y), xgb_final.score(X=train_x, y=train_y)]          
final_output_R2['Test'] = [dt_final.score(X=test_x, y=test_y), rf_final.score(X=test_x, y=test_y),xgb_final.score(X=test_x, y=test_y)]
final_output_R2 = pd.DataFrame.from_dict(final_output_R2, orient='index', columns=['CART', 'RF', 'XGB'])
final_output_R2
final_output_R2.style.set_caption("Table 2: R2 of Tree Models").format("{:.2%}")
Out[15]:
Table 2: R2 of Tree Models
  CART RF XGB
Train 60.70% 95.24% 98.77%
Test 56.64% 69.89% 71.18%

We then evaluate model performance using the root mean square error (RMSE), which measures the typical size of the error between predicted and actual values. From Table 3 we find that CART has the largest RMSE, while XGBoost achieves only 14.16 on the training data. However, considering that the average price per night for an Airbnb in Chicago is 154.4, even the best-performing XGBoost model still has a test RMSE of 71.79, so the error band for predictions remains relatively wide.
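Concretely, RMSE is the square root of the mean squared error, which is what `mean_squared_error(..., squared=False)` computes in the cells below. A tiny worked example with made-up prices:

```python
# RMSE = sqrt(mean((y - y_hat)^2)); toy numbers for illustration only.
import math

y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 190.0]  # every prediction is off by 10

rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
print(f"RMSE: {rmse:.2f}")  # -> RMSE: 10.00
```

Because errors are squared before averaging, RMSE is expressed in the same units as the target (dollars per night here) and penalizes large misses more heavily than small ones.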

In [16]:
# print the average Airbnb price
print("The average Airbnb price of Chicago is: ", "%.1f" % df['price'].mean())
The average Airbnb price of Chicago is:  154.4
In [17]:
# print RMSE of train and test data for three model
final_output_RMSE = dict()
final_output_RMSE['Train'] = [mean_squared_error(train_y, dt_final.predict(train_x), squared=False),
                              mean_squared_error(train_y, rf_final.predict(train_x), squared=False), 
                              mean_squared_error(train_y, xgb_final.predict(train_x), squared=False)]          
final_output_RMSE['Test'] = [mean_squared_error(test_y, dt_final.predict(test_x), squared=False),
                             mean_squared_error(test_y, rf_final.predict(test_x), squared=False),
                             mean_squared_error(test_y, xgb_final.predict(test_x), squared=False)]
final_output_RMSE = pd.DataFrame.from_dict(final_output_RMSE, orient='index', columns=['CART', 'RF', 'XGB'])
final_output_RMSE
final_output_RMSE.style.set_caption("Table 3: RMSE of Tree Models").format("{:.2f}")
Out[17]:
Table 3: RMSE of Tree Models
  CART RF XGB
Train 80.16 27.89 14.16
Test 88.06 73.37 71.79

In terms of factors with a significant impact on Airbnb prices, we analyze feature importance for the Random Forest and XGBoost models, which predict more accurately. Figure 4 shows the most important features of the Random Forest model: the most important factor is the number of bathrooms, followed by the number of accommodated guests and the number of bedrooms. The next four include three location-related variables: longitude, latitude, and the rating of the location.

Figure 4: Airbnb Feature Importance Plot of Random Forest Model¶
In [18]:
# Random Forest feature importance plot
rfpimp.plot_importances(rfpimp.importances(rf_final, test_x, test_y))
Out[18]:
[Feature importance bar plot for the Random Forest model, rendered by rfpimp]

Figure 5 illustrates the important features of the XGBoost model: the most important variable for price is the number of accommodated guests, followed by the number of bathrooms, the number of bedrooms, and latitude and longitude, with the number of reviews per month next in importance.

Figure 5: Airbnb Feature Importance Plot of XGBoost Model¶
In [19]:
# XGBoost feature importance plot
rfpimp.plot_importances(rfpimp.importances(xgb_final, test_x, test_y))
Out[19]:
[Feature importance bar plot for the XGBoost model, rendered by rfpimp]

There are still limitations in our study. We only use Airbnb data for Chicago from March 2023 and lack data from other times of the year, while the cost of a stay is highly seasonal. In future studies, adding data from different times of the year could increase the generalizability and transferability of the model. Furthermore, our raw data includes long textual descriptions of properties and hosts that we have not used; introducing semantic analysis tools for textual analysis in future studies may significantly improve the accuracy of the model.

Conclusion¶

In this study, we use three different algorithms to build tree models for predicting Airbnb prices in Chicago, and we find that Random Forest and XGBoost significantly outperform the CART model, with XGBoost performing best. Our model can predict Chicago Airbnb prices relatively accurately and can serve as a pricing reference tool for Airbnb hosts, though there is still potential to improve its accuracy in further studies. We find that the number of bathrooms and bedrooms, the number of accommodated guests, and the location of the property are the most important factors affecting Airbnb prices.

References¶

  1. Inside Airbnb (2023) Get the Data Chicago, Illinois, United States, Inside Airbnb. Available at: http://insideairbnb.com/get-the-data.

  2. Xu, C., 2023. Housing price forecast based on XGBoost algorithm house price: advanced regression techniques, in: Beligiannis, G.N. (Ed.), International Conference on Statistics, Data Science, and Computational Intelligence (CSDSCI 2022). Presented at the International Conference on Statistics, Data Science, and Computational Intelligence (CSDSCI 2022), SPIE, Qingdao, China, p. 57. https://doi.org/10.1117/12.2656904

  3. Tekin, M., Sari, I.U., 2022. Real Estate Market Price Prediction Model of Istanbul. Real Estate Management and Valuation 30, 1–16. https://doi.org/10.2478/remav-2022-0025

  4. Perez-Sanchez, V., Serrano-Estrada, L., Marti, P., Mora-Garcia, R.-T., 2018. The What, Where, and Why of Airbnb Price Determinants. Sustainability 10, 4596. https://doi.org/10.3390/su10124596

  5. Hill, D., 2015. How much is your spare room worth? IEEE Spectr. 52, 32–58. https://doi.org/10.1109/MSPEC.2015.7226609

  6. Parr, T. et al. (2018) Beware default random forest importances, explained.ai. Available at: https://explained.ai/rf-importance/index.html

The relevant data and code can be found in the GitHub repository.