Using Listings' Metadata to Predict Ratings.

Data Preprocessing

Dataset Overview

The dataset used in this analysis comes from the site Inside Airbnb. The listing metadata consists of 36,938 listings with 96 features each, including neighborhood information (e.g., the neighborhood overview), availability information (e.g., the number of available days over the next 30, 60, and 365 days), host information (e.g., the host response rate, the host's total listing count, and host identity verification), and review information (e.g., the review score rating).

Data Cleansing

In order to feed the data into WEKA, we performed the following data cleansing steps (a minimal code sketch follows the list):

1. Transforming nominal features into multi-dimensional binary features.

2. Imputing missing values with the mean of the corresponding feature.

3. Removing duplicate listing instances that share the same listing id.

4. Normalizing all numerical features with unity-based (min-max) normalization.
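The sketch below shows how these steps can be chained with WEKA's filter API. The file name listings.arff is a placeholder, and note that WEKA's RemoveDuplicates filter drops fully identical rows, so deduplicating strictly by listing id may require a separate pass.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveDuplicates;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // listings.arff is a placeholder for the exported Inside Airbnb metadata
        Instances data = new DataSource("listings.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // review score rating as target

        // 1. nominal -> multi-dimensional binary features
        data = apply(new NominalToBinary(), data);
        // 2. fill missing values with the per-feature mean (mode for nominals)
        data = apply(new ReplaceMissingValues(), data);
        // 3. drop duplicate instances (exact duplicates only)
        data = apply(new RemoveDuplicates(), data);
        // 4. unity-based (min-max) normalization of numeric features to [0, 1]
        data = apply(new Normalize(), data);
    }

    private static Instances apply(Filter f, Instances in) throws Exception {
        f.setInputFormat(in);
        return Filter.useFilter(in, f);
    }
}
```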

The following figure shows the distributions of the 12 selected features after normalization and imputation.

Feature Selection

Before performing feature selection, we first investigated the dependence among the 96 features by visualizing their pairwise correlations in WEKA. A snapshot of the correlation graph is shown below. From the graph we can see that there is no strong correlation between any two features, indicating that the features are largely independent.
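The same pairwise statistics can also be computed programmatically; the following is a minimal sketch using WEKA's utility functions, assuming the preprocessed Instances object (all-numeric after binarization) from the cleansing step:

```java
import weka.core.Instances;
import weka.core.Utils;

public class CorrelationMatrix {
    // Pearson correlation between every pair of numeric attributes
    public static double[][] of(Instances data) {
        int m = data.numAttributes();
        int n = data.numInstances();
        double[][] corr = new double[m][m];
        for (int i = 0; i < m; i++) {
            double[] a = data.attributeToDoubleArray(i);
            for (int j = i; j < m; j++) {
                double[] b = data.attributeToDoubleArray(j);
                corr[i][j] = corr[j][i] = Utils.correlation(a, b, n);
            }
        }
        return corr;
    }
}
```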

Given the approximate independence of the features, we calculated the information gain of each attribute with respect to the output variable using the InfoGainAttributeEval attribute evaluator in WEKA. Information gain is the expected reduction in the entropy of the output variable once the attribute's value is known; values range from 0 (no information) to 1 (maximum information). Attributes that contribute more information have a higher information gain value and can be selected, whereas those that add little information have a lower score and can be removed. With this, we reduced the number of features from 96 to the 56 with the most information gain.
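A minimal sketch of this selection step with WEKA's attribute-selection API follows; the numToSelect value of 56 mirrors the reduction described above, and InfoGainAttributeEval expects a nominal class, so a numeric rating target would need to be discretized first.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class SelectFeatures {
    public static Instances top56(Instances data) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());

        // rank all attributes by information gain, keep the best 56
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(56);
        selector.setSearch(ranker);

        selector.SelectAttributes(data);
        return selector.reduceDimensionality(data);
    }
}
```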

[Figure: pairwise correlation of the features]

Training and Testing

Splitting the dataset

The dataset is split into three parts: a 20% development set for data exploration and feature engineering, a 70% cross-validation set for choosing the right algorithm and tuning its parameters, and a 10% final test set for evaluating the performance of the final model.
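A minimal sketch of a randomized 20/70/10 split, assuming the preprocessed Instances object from above; the Instances(source, first, count) copy constructor carves out contiguous blocks after shuffling:

```java
import java.util.Random;
import weka.core.Instances;

public class Split {
    public static Instances[] devCvTest(Instances data) {
        data.randomize(new Random(42)); // fixed seed for reproducibility; 42 is arbitrary
        int n = data.numInstances();
        int nDev = (int) Math.round(0.2 * n);
        int nCv  = (int) Math.round(0.7 * n);
        Instances dev  = new Instances(data, 0, nDev);
        Instances cv   = new Instances(data, nDev, nCv);
        Instances test = new Instances(data, nDev + nCv, n - nDev - nCv);
        return new Instances[] { dev, cv, test };
    }
}
```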

Choosing the right algorithm

We tried three learners in WEKA: logistic regression (1) as the baseline, k-Nearest Neighbors (2, IBk), and a decision tree (3, REPTree). The following two graphs show the correlation and relative absolute error of the three algorithms on each of the five folds of the stratified cross-validation data and on the cross-validation data as a whole. From the graphs, we can see that the decision tree (3) significantly outperformed the other two algorithms in terms of both correlation and relative absolute error.
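The sketch below shows how such numbers can be obtained with WEKA's Evaluation API, illustrated for the IBk and REPTree learners on a numeric rating target; WEKA's Logistic classifier requires a nominal class, so reproducing the baseline would need the target discretized first.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;

public class Compare {
    public static void run(Instances cv) throws Exception {
        Classifier[] learners = { new IBk(), new REPTree() };
        for (Classifier c : learners) {
            Evaluation eval = new Evaluation(cv);
            // 5-fold cross-validation with a fixed seed
            eval.crossValidateModel(c, cv, 5, new Random(1));
            System.out.printf("%s: corr=%.3f, RAE=%.2f%%%n",
                    c.getClass().getSimpleName(),
                    eval.correlationCoefficient(),
                    eval.relativeAbsoluteError());
        }
    }
}
```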

Tuning Parameters

We then tuned two parameters, the minimum number of instances per split (minNum) and the maximum depth of the tree, in order to find a good setting for the decision tree. The settings we tried, in order, were (minNum, tree_depth) = {(2, -1), (2, 17), (2, 54), (140, 1), (140, 27), (140, 54)}. From the following two graphs, for both correlation and relative absolute error, none of the settings beat the baseline setting (WEKA automatically tests whether a setting significantly beats the baseline and marks it with a "v"). Therefore, we chose the decision tree model with the default setting, i.e. (minNum, tree_depth) = (2, -1), where -1 means the depth of the tree is unlimited.
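The same sweep can be scripted with REPTree's setters (setMinNum and setMaxDepth correspond to WEKA's -M and -L options); a sketch, assuming the cross-validation Instances object from the split above:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;

public class Tune {
    public static void sweep(Instances cv) throws Exception {
        double[][] grid = { {2, -1}, {2, 17}, {2, 54}, {140, 1}, {140, 27}, {140, 54} };
        for (double[] g : grid) {
            REPTree tree = new REPTree();
            tree.setMinNum(g[0]);          // minimum number of instances per split
            tree.setMaxDepth((int) g[1]);  // -1 leaves the depth unlimited
            Evaluation eval = new Evaluation(cv);
            eval.crossValidateModel(tree, cv, 5, new Random(1));
            System.out.printf("minNum=%.0f depth=%d: corr=%.3f, RAE=%.2f%%%n",
                    g[0], (int) g[1],
                    eval.correlationCoefficient(),
                    eval.relativeAbsoluteError());
        }
    }
}
```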