Exploring the Titanic Dataset: A Data Analysis Project Using SQL and Power BI
As a data analyst with a keen interest in historical datasets and their modern implications, I recently undertook a project analyzing the Titanic passenger dataset. This project involved database creation, SQL querying, and data visualization in Power BI to uncover patterns in survival and socio-economic factors aboard the ill-fated ship. The goal was to not only practice technical skills but also derive meaningful insights that could inform contemporary safety practices in the maritime industry.
I'll walk you through the step-by-step process, highlight the importance of each stage, share key objectives, insights, and provide forward-looking recommendations for how companies could enhance operations over the next 2-10 years.
Project Overview
The Titanic disaster of 1912 remains one of history's most poignant tragedies, with over 1,500 lives lost when the "unsinkable" ship struck an iceberg. The dataset used in this project is a cleaned version of the Titanic passenger manifest (train_clean.csv), containing 891 records with variables such as Age, Sex, Pclass (passenger class), Fare, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Survived (0 or 1), Embarked (port of embarkation), Title (extracted from names, e.g., Mr., Mrs.), and Family_Size (SibSp + Parch, excluding self).
The project followed a structured approach: creating a SQL database, importing the data, writing queries to answer specific questions, and visualizing results in Power BI. This mirrors real-world data analysis workflows, where raw data is transformed into insights for decision-making.
Objectives
The primary objectives were:
These objectives ensured the project was not just an exercise but a bridge between past events and future improvements.
Step-by-Step Process
Step 1: Data Preparation and Database Setup
I began by creating a SQL Server database named "Titanic." This step is crucial as it establishes a structured environment for querying, ensuring data integrity and efficient retrieval. Importance: Without a proper database, queries could lead to errors or inefficiencies, especially with larger datasets. Next, I imported the train_clean.csv file into a table named [dbo].[train_clean] using SQL Server Management Studio's import wizard. This process involved mapping columns to appropriate data types (e.g., Age as float, Survived as int) and handling any null values (e.g., in Age or Cabin).
Importance: Data import ensures cleanliness and accessibility. In this case, the dataset was pre-cleaned (e.g., Titles extracted from Names, Family_Size calculated), but verifying for duplicates or inconsistencies is key to reliable analysis. This step took about 15 minutes and set the foundation for accurate querying.
Step 2: SQL Query Development
With the data loaded, I wrote SQL queries to address 10 specific tasks. Each query was designed to extract targeted insights, using techniques like aggregation (AVG, COUNT), conditional logic (CASE), grouping (GROUP BY), and filtering (WHERE). Below, I detail each query, its execution, and why it's important.
Calculate the Average Age of Passengers Who Survived and Those Who Did Not
Survivors averaged ~28.34 years; non-survivors ~30.63 years. This reveals age-related vulnerabilities, highlighting how younger passengers (potentially more agile) had a slight edge in survival. It's essential for understanding demographic risks in emergencies.
Find the Top 5 Most Common Titles Among Passengers
Mr. (517), Miss (182), Mrs. (125), Master (40), Dr. (7). Titles reflect social structure and gender distribution, aiding in segmentation analysis. This is vital for targeted insights, like how titles correlate with class or survival.
Calculate the Average Fare Paid by Passengers in Each Pclass
1st Class (~84.15), 2nd (~20.66), 3rd (~13.68). Illustrates economic disparities, showing how fare (proxy for wealth) ties to amenities and survival odds. Critical for pricing strategy analysis in business contexts.
Retrieve Passengers with the Highest and Lowest Fare
Highest (~512.33, e.g., Cardeza family); Lowest (0, e.g., crew or complimentary tickets). Identifies outliers, revealing luxury vs. economy dynamics. Useful for anomaly detection in financial data.
Find the Survival Rate for Male and Female Passengers
Females (74.20% survival, 233/314); Males (18.89%, 109/577). Exposes gender biases in evacuation ("women and children first"), underscoring equity in crisis protocols.
Calculate the Average Family Size for Survivors and Non-Survivors
Survivors (~1.97); Non-survivors (~1.88). Shows family size's subtle impact on mobility during disasters, informing group-based safety planning.
Retrieve Names of Passengers with Missing Age
177 passengers (e.g., Moran, Mr. James). Handles missing data, essential for imputation strategies and ensuring complete analysis.
Find the Most Common Port of Embarkation and Passenger Count
Southampton (644). Reveals geographic trends, useful for logistics and market analysis.
Calculate Average Age by Pclass and Gender
1st Class Females (~34.61); 3rd Class Males (~26.51). Highlights intersections of class, gender, and age, key for nuanced segmentation.
Retrieve Passengers Traveling Alone Who Did Not Survive
~163 such passengers (adjusted for Family_Size=0 meaning alone). Identifies isolated individuals' higher risks, emphasizing support systems.
Each query was tested for accuracy, with execution times under 1 second due to the small dataset.
Step 3: Data Visualization in Power BI
I connected the SQL database to Power BI and created dashboards for each query's results. Images like (survival by gender) and (family size comparison) were exported.
Importance: Visualizations make complex data accessible, enabling storytelling. Tableau's interactivity allowed for filters, enhancing exploratory analysis.
Key Insights
These insights humanize the data, turning numbers into narratives about inequality.
Recommendations and Future Insights
Drawing from this analysis, modern cruise companies (e.g., a hypothetical "OceanSafe Cruises") can apply lessons to improve safety:
These steps could reduce risks by 50-70%, based on historical parallels, turning data into lifesaving strategies.
Conclusion
This Titanic project honed my skills in SQL, Power BI, and data storytelling while revealing timeless lessons. It's a testament to how analyzing the past can shape a safer future. I'm eager to apply these techniques in professional roles and connect with me on LinkedIn to discuss collaborations!
Tools Used: SQL Server, Power BI. Dataset Source: Public Titanic repository (cleaned version).
Data Analyst | Power BI & SQL Expert | Turning Business Data into Actionable Insights
4wGood
Freelance Data Analyst | Transform Row Data Into Actionable Insights | Power BI Dashboards | SQL | Python | Data Visualization & Reporting | Excel Dashboard | Data Cleaning
4wWell done! A neat visualization that delivers strong insights.”
B.Sc. Statistics, First Class Honours || Data Scientist || Machine Learning || Deep Learning || Computer Vision || Artificial Intelligence || Researcher
4wYou're doing well Godwill Okwuchukwu
Area Manager at Samsung Electronics India Pvt Ltd
4wWell done
Junior Computer Science Student | Aspiring Cybersecurity Specialist | Data Analysis Enthusiast
1moGodwill Okwuchukwu, your analysis on the Titanic dataset is amazing! The key insights and recommendations you've shared are incredibly inspiring.