AirBnB Data Analysis & Preprocessing Tutorial

A comprehensive data science project focused on exploratory data analysis (EDA) and data preprocessing using the real AirBnB NYC 2019 dataset. It demonstrates essential data cleaning, exploration, and feature engineering techniques through practical exercises with real-world data.

Project Overview

This project analyzes 48,895 AirBnB listings from New York City (2019) and provides hands-on experience with:

Data loading and exploration
Statistical analysis and distribution analysis
Feature relationship investigation using appropriate statistical tests
Data cleaning and null value handling
Feature engineering and preprocessing
Advanced visualization techniques

Getting Started

Option 1: GitHub Codespaces (Recommended)

Fork the Repository
- Click the "Fork" button on the top right of the GitHub repository page
- Give the fork a descriptive name including your GitHub username
- Click "Create fork"
- Bookmark or save the link to your fork
Create a GitHub Codespace
- On your forked repository, click the "Code" button
- Select "Create codespace on main"
- Wait for the environment to load (dependencies are pre-installed)
Start Working
- Open notebooks/MVP.ipynb to begin the assignment
- Refer to notebooks/instructions.md for detailed requirements
- Check the notebooks/full_solution/ folder for complete examples

Option 2: Local Development

Prerequisites
- Git
- Python >= 3.10

Clone the repository

git clone https://github.com/YOUR_USERNAME/gperdrizet-data-preprocessing-project-tutorial.git
cd gperdrizet-data-preprocessing-project-tutorial

Set Up Environment

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Launch Jupyter & start the notebook
```
jupyter notebook notebooks/MVP.ipynb
```

Project Structure

├── .devcontainer/                        # Development container configuration
├── notebooks/                            # Jupyter notebook directory
│   ├── MVP.ipynb                         # Assignment notebook
│   ├── MVP_solution.ipynb                # Solution notebook
│   ├── instructions.md                   # Detailed assignment instructions
│   └── full_solution/                    # Detailed solution notebooks
│       ├── 01_distributions.ipynb
│       ├── 02_correlations.ipynb
│       ├── 03_data_cleaning.ipynb
│       ├── 04_feature_engineering.ipynb
│       └── functions.py
│
├── .gitignore                            # Files/directories not tracked by git
├── requirements.txt                      # Python dependencies
└── README.md                             # Project documentation

Dataset

The dataset contains 48,895 AirBnB listings from New York City (2019) with the following key features:

Price: Property prices in USD
Location: Hierarchical location data (latitude, longitude, neighbourhood_group, neighbourhood)
Listing Details: room_type, minimum_nights, availability_365
Host Information: host_name, calculated_host_listings_count
Review Data: number_of_reviews, last_review, reviews_per_month
Identifiers: id, name

Note: The dataset is automatically loaded from the web in the notebooks, so no manual download is required.
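As a rough sketch of what a first look at the data might involve, the snippet below reads a CSV with pandas and summarizes missing values per column. The `summarize_missing` helper and the small inline sample are hypothetical and only for illustration; the column names come from the feature list above, the values are made up, and the notebooks load the real dataset from their own URL, which is not reproduced here.

```python
import io

import pandas as pd

# Tiny hypothetical sample: column names follow the feature list above,
# but the values are invented for illustration only.
SAMPLE_CSV = """id,name,neighbourhood_group,room_type,price,number_of_reviews,reviews_per_month
1,Cozy room,Brooklyn,Private room,80,10,0.5
2,Sunny loft,Manhattan,Entire home/apt,250,0,
3,,Queens,Shared room,40,3,0.1
"""


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually contain missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# pd.read_csv accepts a local path, a URL, or a file-like object.
listings = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing_summary = summarize_missing(listings)

print(listings.describe())  # descriptive statistics for numerical columns
print(missing_summary)
```

With the real data, the same call pattern would apply by passing the dataset's location to `pd.read_csv` in place of the in-memory sample.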

Learning Objectives

By completing this tutorial, students will learn to:

Analyze Data Distributions
- Generate descriptive statistics for numerical and categorical features
- Create appropriate visualizations (histograms, bar plots, scatter plots)
- Identify data quality issues and extreme values
Investigate Feature Relationships
- Apply Chi-squared tests for categorical-categorical relationships
- Use Kruskal-Wallis H-tests for categorical-numerical relationships
- Calculate Spearman/Kendall correlations for numerical-numerical relationships
Clean and Preprocess Data
- Select relevant features for modeling
- Handle missing values using various imputation strategies
- Address extreme values and outliers appropriately
Engineer Features
- Apply one-hot encoding to categorical variables
- Transform skewed distributions using Box-Cox transformation
- Create polynomial features to capture non-linear relationships
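
The three relationship tests named above can be sketched with SciPy as follows. This is a minimal illustration on synthetic data, not the tutorial's own analysis: the column names mirror the dataset, but every value is randomly generated, and the group price levels are assumptions chosen so that the tests have something to detect.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 300

# Synthetic stand-in data (assumption: values are generated, not real listings).
room_type = rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n)
neighbourhood_group = rng.choice(["Brooklyn", "Manhattan", "Queens"], size=n)
base_price = pd.Series(room_type).map(
    {"Entire home/apt": 200.0, "Private room": 90.0, "Shared room": 50.0}
)
number_of_reviews = rng.poisson(lam=20, size=n)

df = pd.DataFrame({
    "room_type": room_type,
    "neighbourhood_group": neighbourhood_group,
    "price": base_price.to_numpy() * rng.lognormal(mean=0.0, sigma=0.3, size=n),
    "number_of_reviews": number_of_reviews,
    "reviews_per_month": number_of_reviews / 12 + rng.normal(0.0, 0.2, size=n),
})

# Categorical vs. categorical: chi-squared test on a contingency table.
contingency = pd.crosstab(df["room_type"], df["neighbourhood_group"])
chi2, chi2_p, dof, _expected = stats.chi2_contingency(contingency)

# Categorical vs. numerical: Kruskal-Wallis H-test across the groups.
price_groups = [group["price"].to_numpy() for _, group in df.groupby("room_type")]
h_stat, kw_p = stats.kruskal(*price_groups)

# Numerical vs. numerical: Spearman and Kendall rank correlations.
rho, rho_p = stats.spearmanr(df["number_of_reviews"], df["reviews_per_month"])
tau, tau_p = stats.kendalltau(df["number_of_reviews"], df["reviews_per_month"])

print(f"chi-squared: stat={chi2:.2f}, p={chi2_p:.3f}, dof={dof}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={kw_p:.3g}")
print(f"Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")
```

Rank-based tests such as Kruskal-Wallis and Spearman do not assume normally distributed values, which is one reason they are a common choice for skewed features like price.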

Solution Reference

The solution to the MVP notebook can be found in notebooks/MVP_solution.ipynb. The notebooks/full_solution/ folder contains more detailed, professional-level implementations that experiment with and evaluate alternative approaches:

01_distributions.ipynb: Comprehensive EDA with detailed analysis of each feature
02_correlations.ipynb: Statistical testing and relationship analysis
03_data_cleaning.ipynb: Systematic comparison of cleaning strategies
04_feature_engineering.ipynb: Advanced preprocessing achieving 60% explained variance

Each solution notebook includes:

Clear goal statements for each section
Professional visualizations and statistical analysis
Detailed interpretation of results
Best practices for data science workflows

Key Technologies

Python 3.10+
Pandas: Data manipulation and analysis
NumPy: Numerical computing
Matplotlib & Seaborn: Data visualization
Scikit-learn: Machine learning and preprocessing
SciPy: Statistical testing
Jupyter Notebook: Interactive development environment

Contributing

This is an educational repository. Students should work on their forked copies. If you find issues or have suggestions for improvements, please open an issue or submit a pull request.
