Boston
AirBnB Analysis

First deliverable for the Udacity Data Scientist Nanodegree
Have a look at the code on my Github: Github

Programming language: Python
Libraries: numpy, pandas, matplotlib, seaborn, plotly, sklearn

Brief Overview:

Goal

  • Analyze a dataset and write a blog post about it
  • Ask questions and answer them through visualizations

Tools

  • NumPy and Pandas to analyze data
  • Matplotlib, Seaborn and Plotly to create visualizations
  • Scikit-learn for a linear regression (to see the effect of each feature on the price)

Results

  • Blog post on Medium
  • Best travel dates: between January and April on a Monday, Tuesday or Wednesday
  • Worst travel day: date of the Boston Marathon
  • Best location: Roxbury is affordable and close to Downtown

Challenges

  • Creating an interactive map with every AirBnB listing plotted on it, including meta information

What did I learn?

  • Plotting geographic data (longitude and latitude) with Plotly for interactivity
  • Arranging and reshaping data so that it is in the right format for plotting

Introduction

The goal of this project was to write a blog post about a data analysis. This involved coming up with questions about a dataset and then editing the dataset to answer the questions. The final step was then to write the answers in a readable report for a non-technical audience. The article can be accessed here.

I decided to analyze the AirBnB dataset of Boston from the perspective of a traveler who wants to stay in Boston as cheaply as possible. Link to dataset.

What did I do?

First, I looked at the data. Of the three available files I used two:

  • listings.csv: full description of the listing
  • calendar.csv: shows the availability of each listing for each day and a corresponding price, when the listing is available

After assessing the data, I converted columns that weren’t in the right data format and removed outliers based on the respective histograms. After that, the data was ready to be used for the questions. Below I briefly describe what I did from a more technical angle; if you want to read more about the interpretations, please read the Medium article.
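A minimal sketch of this kind of cleaning, assuming the price is stored as text such as "$1,250.00". The toy data, the column names and the cutoff `PRICE_CAP` are illustrative assumptions, not the values used in the project:

```python
import pandas as pd

# Toy stand-in for listings.csv; the real file has many more columns.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": ["$85.00", "$120.00", "$1,250.00", "$60.00", "$4,000.00"],
})

# Strip the currency formatting so the price can be treated as a number.
listings["price"] = (
    listings["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

# Hypothetical cutoff, as one might pick by eye from a histogram.
PRICE_CAP = 1000.0
cleaned = listings[listings["price"] <= PRICE_CAP]

print(cleaned)
```

A fixed cap is only one option; a quantile-based cutoff would work in the same way.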

Question 1: What is the availability over the year and how does the price develop?
For this question, I plotted the number of available AirBnBs against the average price per day. The average price also shows a recurring pattern (seasonality).
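The daily aggregation could look roughly like the sketch below. It assumes a calendar table with a date, a "t"/"f" availability flag and a numeric price; the toy rows and column names are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for calendar.csv: one row per listing and day.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": ["2017-01-02"] * 3 + ["2017-01-03"] * 3,
    "available": ["t", "f", "t", "t", "t", "t"],
    "price": [100.0, None, 80.0, 120.0, 90.0, 75.0],
})

calendar["date"] = pd.to_datetime(calendar["date"])
calendar["is_available"] = calendar["available"].eq("t")

# Number of available listings and average price for each day.
daily = calendar.groupby("date").agg(
    available_listings=("is_available", "sum"),
    avg_price=("price", "mean"),
)

# Average price per weekday, one way to look for a weekly pattern.
by_weekday = calendar.groupby(calendar["date"].dt.day_name())["price"].mean()

print(daily)
print(by_weekday)
```

Missing prices on unavailable days are skipped by `mean`, so the average only reflects listings that can actually be booked.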

Question 2: Where are the BnBs in Boston and how expensive is each of them?
Interactive visualization with Plotly. I plotted every AirBnB listing on a map and color-coded it by price.

Question 3: Which neighborhoods are the most expensive?
Concise visualization to showcase the difference in price per neighborhood (distance to the average price).
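The "distance to the average price" can be computed as in the following sketch. The prices are invented and the column name `neighbourhood` is an assumption:

```python
import pandas as pd

# Invented prices, only to show the calculation.
listings = pd.DataFrame({
    "neighbourhood": ["Roxbury", "Roxbury", "Back Bay",
                      "Back Bay", "Dorchester", "Dorchester"],
    "price": [100, 120, 200, 220, 150, 170],
})

# Average price over all listings.
overall_mean = listings["price"].mean()

# Difference between each neighborhood's average and the overall average.
price_diff = (
    listings.groupby("neighbourhood")["price"].mean() - overall_mean
).sort_values()

print(price_diff)
```

Negative values mark neighborhoods that are cheaper than average, which makes the series easy to show as a diverging bar chart.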

Question 4: What features influence the price of a BnB?
To answer this question, I needed to clean the data further so I could use it in a multiple regression, get the weight of each feature of an AirBnB (based on the listings) and order the features by importance.

  • Dealt with missing values
  • Checked correlations
  • Looked at the relationship between categorical features and price
  • Feature engineering
  • Created dummy variables
With these features and a simple multiple regression, I was able to predict the price with an RMSE of 60, meaning that the model's predictions are typically off by about $60 from the real price. I also achieved an R-squared of 0.64, meaning that the model can explain 64% of the variance in price. That wasn’t really the point, though: I only needed the weights so I could rank the features by their contribution to the prediction.
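The workflow above can be sketched on synthetic data as follows. The feature names, the generated prices and the train/test split are all assumptions for illustration, so the scores will not match the RMSE and R-squared reported here:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic listings with a known relationship between features and price.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "accommodates": rng.integers(1, 7, size=n),
    "bedrooms": rng.integers(1, 4, size=n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
})
data["price"] = (
    40
    + 20 * data["accommodates"]
    + 30 * data["bedrooms"]
    + np.where(data["room_type"] == "Entire home/apt", 50, 0)
    + rng.normal(0, 10, size=n)
)

# Dummy variables for the categorical feature.
X = pd.get_dummies(data.drop(columns="price"), drop_first=True, dtype=float)
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)

# Rank features by the absolute size of their weight.
weights = pd.Series(model.coef_, index=X.columns).sort_values(
    key=np.abs, ascending=False
)
print(f"RMSE: {rmse:.1f}, R-squared: {r2:.2f}")
print(weights)
```

Raw coefficients depend on the scale of each feature, so standardizing the numeric columns first would make such a ranking more comparable across features.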

Question 5: How do amenities impact the price?
For this question I looked at one specific feature: the amenities of an AirBnB. To answer the question I used the weights calculated in question 4.
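To get one weight per amenity, the amenities first have to become separate 0/1 columns. A small sketch, assuming the amenities are stored as a single string such as `{TV,"Wireless Internet",Kitchen}` (the rows below are made up):

```python
import pandas as pd

# Made-up amenity strings in the assumed format.
listings = pd.DataFrame({
    "amenities": [
        '{TV,"Wireless Internet",Kitchen}',
        "{TV,Kitchen}",
        '{"Wireless Internet"}',
    ],
})

# Remove braces and quotes, then split on commas into 0/1 columns.
amenity_text = listings["amenities"].str.replace(r'[{}"]', "", regex=True)
amenity_dummies = amenity_text.str.get_dummies(sep=",")

print(amenity_dummies)
```

Columns like these can be added to the regression features, so each amenity receives its own weight.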