Boston AirBnB Analysis
First deliverable for the Udacity Data Scientist Nanodegree. Have a look at the code on my GitHub.
Libraries: numpy, pandas, matplotlib, seaborn, plotly, sklearn
Brief Overview:
Goal
- Analyze a dataset and write a blog post about it
- Ask questions and answer them through visualizations
Tools
- NumPy and Pandas to analyze data
- Matplotlib, Seaborn and Plotly to create visualizations
- Scikit-learn for a linear regression (to see the effect of each feature on the price)
Results
- Blog post on Medium
- Best travel dates: between January and April on a Monday, Tuesday or Wednesday
- Worst travel day: date of the Boston Marathon
- Best location: Roxbury is affordable and close to Downtown
Challenges
- Creating an interactive map with every AirBnB plotted on it, including meta information
What did I learn?
- Plotting geographic data (longitude and latitude) with Plotly for interactivity
- Plotting and rearranging data extensively to get it into the right format
Introduction
The goal of this project was to write a blog post about a data analysis. This involved coming up with questions about a dataset and then editing the dataset to answer the questions. The final step was then to write the answers in a readable report for a non-technical audience. The article can be accessed here.
I decided to analyze the AirBnB dataset of Boston from the perspective of a traveler who wants to stay in Boston as cheaply as possible. Link to dataset.
What did I do?
First, I looked at the data. Of the three available datasets, I used two:
- listings.csv: full description of the listing
- calendar.csv: shows the availability of each listing for each day and a corresponding price, when the listing is available
After assessing the data, I transformed columns that weren't in the right data format and removed outliers, based on the respective histograms. After that, the data was ready to be used for the questions. Below I briefly describe what I did in a more technical way. If you want to read more about the interpretations, please read the Medium article.
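The cleaning step could look roughly like the sketch below. It is not the project code: the small DataFrame stands in for calendar.csv, the column names and the "$1,250.00" price format are assumptions about the raw file, and PRICE_CUTOFF is a hypothetical threshold of the kind one would read off a histogram.

```python
import pandas as pd

# Tiny stand-in for calendar.csv; column names and value formats are assumed.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3],
    "date": ["2017-01-02", "2017-01-03", "2017-01-02", "2017-01-03", "2017-01-02"],
    "available": ["t", "f", "t", "t", "t"],
    "price": ["$85.00", None, "$1,250.00", "$120.00", "$4,000.00"],
})

# Convert text columns to proper data types.
calendar["date"] = pd.to_datetime(calendar["date"])
calendar["available"] = calendar["available"].eq("t")
calendar["price"] = (
    calendar["price"]
    .str.replace(r"[$,]", "", regex=True)  # strip currency symbol and thousands separator
    .astype(float)
)

# Drop rows whose price lies above a cutoff chosen from the histogram.
# Rows without a price (unavailable days) are kept.
PRICE_CUTOFF = 2000.0  # hypothetical value
cleaned = calendar[calendar["price"].isna() | (calendar["price"] <= PRICE_CUTOFF)]
```

A fixed cutoff is the simplest option; a quantile-based rule would work the same way with a different mask.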
Question 1: What is the availability over the year and how does the price develop? For this question, I visualized the sum of available AirBnBs versus the average price per day. I also noticed a certain pattern (seasonality) in the average price.
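As an illustration of the aggregation behind this plot, here is a minimal sketch that counts available listings and averages the price per day. The sample data and column names are assumptions, not values from the real calendar.csv.

```python
import pandas as pd

# Made-up sample in the shape of an already cleaned calendar table.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime(["2017-01-02"] * 3 + ["2017-01-03"] * 3),
    "available": [True, True, False, True, False, False],
    "price": [100.0, 200.0, None, 120.0, None, None],
})

# One row per day: number of available listings and mean price.
# The mean skips missing prices, so only available listings contribute.
daily = calendar.groupby("date").agg(
    available_listings=("available", "sum"),
    mean_price=("price", "mean"),
)

# Averaging by weekday is one simple way to expose a weekly price pattern.
daily["weekday"] = daily.index.day_name()
by_weekday = daily.groupby("weekday")["mean_price"].mean()
```

The `daily` table can be passed directly to a plotting library, with the two columns on separate y-axes.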
Question 2: Where are the BnBs in Boston and how expensive is each of them? Interactive visualization with Plotly. Plotted every AirBnB listing on a map and color-coded it by price.
Question 3: Which neighborhoods are the most expensive? Concise visualization to showcase the difference in price per neighborhood (distance to the average price).
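The "distance to the average price" can be computed as in the sketch below. The neighborhood labels and prices are placeholders, and the column name `neighbourhood_cleansed` is an assumption about listings.csv.

```python
import pandas as pd

# Placeholder data; neither the names nor the prices come from the real dataset.
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B", "B", "C"],
    "price": [80.0, 100.0, 220.0, 260.0, 250.0],
})

# Mean price over all listings.
overall_mean = listings["price"].mean()

# Mean price per neighborhood minus the overall mean, cheapest first.
price_diff = (
    listings.groupby("neighbourhood_cleansed")["price"].mean() - overall_mean
).sort_values()
```

Negative values mark neighborhoods below the overall average, positive values mark those above it, which maps naturally onto a diverging bar chart.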
Question 4: What features influence the price of a BnB? To answer this question, I needed to clean the data further so I could use it in a multiple regression to get the weight of each feature of an AirBnB (based on the listings) and order the features by importance.
- Dealt with missing values
- Checked correlations
- Looked at the relationship between categorical features and price
- Feature engineering
- Created dummy variables
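The last two steps and the regression itself can be sketched as follows. This is only an illustration under assumptions: the features `accommodates` and `room_type` and all values are made up, and the real analysis used many more columns.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up listings; the prices follow an exact linear rule so the fit is easy to check.
listings = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 4, 6],
    "room_type": [
        "Private room", "Private room", "Entire home/apt",
        "Entire home/apt", "Private room", "Entire home/apt",
    ],
    "price": [60.0, 80.0, 140.0, 180.0, 120.0, 220.0],
})

# Turn the categorical column into dummy variables.
# drop_first avoids a redundant column that is implied by the others.
X = pd.get_dummies(
    listings[["accommodates", "room_type"]],
    columns=["room_type"],
    drop_first=True,
    dtype=float,
)
y = listings["price"]

# Fit a multiple linear regression and read off one weight per feature.
model = LinearRegression().fit(X, y)
weights = pd.Series(model.coef_, index=X.columns)

# Order the features by the absolute size of their weight.
weights = weights.sort_values(key=abs, ascending=False)
```

Raw weights are only comparable across features when the features share a scale, so standardizing numeric columns before fitting is worth considering.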
Question 5: How do amenities impact the price? For this question, I looked at one specific feature: the amenities of an AirBnB. To answer the question, I used the weights calculated in Question 4.
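For amenities to receive individual weights, the column first has to be split into one indicator column per amenity. The sketch below shows one way to do that. It assumes the raw values look like `{TV,"Wireless Internet",Kitchen}`, which may not match the actual file.

```python
import pandas as pd

# Made-up rows in the assumed raw format of the amenities column.
listings = pd.DataFrame({
    "amenities": [
        '{TV,"Wireless Internet",Kitchen}',
        '{"Wireless Internet"}',
        "{}",
    ],
})

# Remove the surrounding braces and the quotes around multi-word amenities.
cleaned = (
    listings["amenities"]
    .str.strip("{}")
    .str.replace('"', "", regex=False)
)

# One 0/1 column per amenity; a listing without amenities gets all zeros.
amenity_dummies = cleaned.str.get_dummies(sep=",")
```

These indicator columns can be joined to the other features before fitting the regression, so that each amenity gets its own weight.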