Data Documentation

Data Source

The listings data were downloaded from Inside Airbnb. We analyze the New York City data set scraped on March 4, 2018, which contains 48,852 listings across NYC. For computational efficiency, only the top 3,000 listings are used for the visualizations on the host and history pages.

Apart from the listings data, we also scraped data about things to do in NYC, including the top 30 places for food, sightseeing, nightlife, nature and parks. The data were obtained from Airbnb's recommendations.

Correctness

There are no invalid values in the listings data or the things-to-do data. The numeric fields are within reasonable ranges. For the review_score_rating field, which is what we are trying to predict, all values fall in the range 0 to 100.
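A range check like the one described above could be sketched as follows. This is only an illustration, assuming the listings are loaded into a pandas DataFrame; the helper name out_of_range and the toy values are hypothetical and not taken from the project code.

```python
import pandas as pd


def out_of_range(df, column, low=0, high=100):
    """Return the rows whose non-missing value in `column` lies outside [low, high]."""
    values = df[column]
    # Missing values are not treated as range violations here (an assumption).
    mask = values.notna() & ~values.between(low, high)
    return df[mask]


# Toy frame standing in for the listings table (values are made up).
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "review_score_rating": [95.0, None, 100.0, 0.0],
})
bad_rows = out_of_range(listings, "review_score_rating")
```

An empty bad_rows frame means every rating that is present lies within the expected bounds.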

Coherence

The distribution of the data is reasonable. For example, most listings have very few reviews per month, whereas only a handful have an extremely large number of reviews per month. The distribution is heavily skewed, which is explainable because the data are generated by a platform serving users with varied behavior.

Moreover, the values across different columns for a single listing are coherent. For example, every listing with number_of_reviews equal to 0 has a first_review of NaN.
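The cross-column check mentioned above might look like the sketch below. It assumes a pandas DataFrame with the number_of_reviews and first_review columns; the function name incoherent_first_review and the sample rows are hypothetical.

```python
import pandas as pd


def incoherent_first_review(df):
    """Return rows that claim zero reviews but still carry a first_review date."""
    no_reviews = df["number_of_reviews"] == 0
    has_first_review = df["first_review"].notna()
    return df[no_reviews & has_first_review]


# Toy frame standing in for the listings table (values are made up).
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "number_of_reviews": [0, 12, 0],
    "first_review": [None, "2016-05-01", None],
})
violations = incoherent_first_review(listings)
```

If violations is empty, the two columns agree for every listing in the frame.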

Completeness

For the things-to-do data, some places did not come with an address on the original website, so we manually added several addresses. We also appended "NYC" to each scraped address so that the Python geopy library could convert it to latitude/longitude correctly.
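The address preparation described above could be sketched roughly as follows. The helper names to_query and geocode_places are hypothetical, the ", NYC" separator is an assumption, and the geocoder is passed in as a callable so that the sketch does not depend on which geopy geocoder the project actually used.

```python
from typing import Callable, Dict, Optional, Tuple


def to_query(address: str, city: str = "NYC") -> str:
    """Append the city to an address unless the address already ends with it."""
    address = address.strip()
    if address.upper().endswith(city.upper()):
        return address
    return f"{address}, {city}"


def geocode_places(
    places: Dict[str, str], geocode: Callable
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Map each place name to (latitude, longitude), or None when nothing is found."""
    coords = {}
    for name, address in places.items():
        location = geocode(to_query(address))
        if location is None:
            coords[name] = None
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords
```

With geopy, the geocode argument could be, for instance, the geocode method of a Nominatim instance, which returns either None or an object with latitude and longitude attributes.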

When generating the visualization on the history page, listings whose "first_review" is NaN are removed, because we need the first_review time to reflect the history of the listing.

In the given Airbnb data, three columns relate to the price of a listing: 'price', which is the daily price, as well as 'weekly_price' and 'monthly_price'. For some listings, only some of the three columns contain valid values. The price displayed on the map on the home page is therefore the daily price when it is available; otherwise, weekly_price/7 or monthly_price/30 is used as an estimate.
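The fallback rule above can be illustrated with a small sketch. It assumes the three prices have already been parsed into numbers, with None or NaN marking a missing value, and it assumes the weekly price is tried before the monthly price; the function name display_price is hypothetical.

```python
import math
from typing import Optional


def _is_valid(value) -> bool:
    """True when the value is present (neither None nor NaN)."""
    return value is not None and not math.isnan(value)


def display_price(price, weekly_price, monthly_price) -> Optional[float]:
    """Estimate a daily price, preferring the daily column when it is available."""
    if _is_valid(price):
        return float(price)
    if _is_valid(weekly_price):
        return weekly_price / 7   # weekly price spread over 7 days
    if _is_valid(monthly_price):
        return monthly_price / 30  # monthly price spread over 30 days
    return None  # no usable price in any of the three columns
```

For example, a listing with no daily price and a weekly price of 700 would be shown at an estimated 100 per day.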

Accountability

The chosen visualizations are able to explain some of the underlying factors that influence popularity.