gerry leo nugroho

The Gemika's Magical Guide to Sorting Hogwarts Students using the Decision Tree Algorithm (Part #6)

6. Visualizing Data with Charts

Our previous quest to unlock the secrets of sorting at Hogwarts is well underway! We've gathered our essential spellbooks (Python libraries) and mended the forgetful pages (filled in missing data). Now, it's time to unleash the true power of data science – the magic of data visualization! 🪄

Imagine Professor Dumbledore himself, his eyes twinkling with wisdom, holding a magical artifact – a shimmering chart. This isn't your ordinary piece of parchment, mind you! It's a canvas upon which raw data is transformed into a breathtaking spectacle, revealing hidden patterns and trends just like a Marauder's Map unveils secret passages.

6.1 Distribution of Students Across Houses

Now that we've filled those forgetful pages in our book, it's time to delve deeper into the fascinating world of Hogwarts houses! Remember how Harry, Ron, and Hermione were sorted into their houses based on their unique talents and personalities? Well, we're about to embark on a similar quest, using a magical tool called Matplotlib to create a visual map of how the Hogwarts students are distributed across their houses. ✨

With a wave of our metaphorical wand (or a line of Python code!), Matplotlib will conjure a magnificent bar chart. Think of it like a giant sorting hat, but instead of a tear on its brim, this hat boasts colorful bars that reach for the ceiling. Each bar represents a Hogwarts house – Gryffindor, Ravenclaw, Hufflepuff, and Slytherin. 🪄

```python
# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style for our plots
sns.set(style="whitegrid")

# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)
plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()
```

Visualizing the distribution of students across houses
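A bar chart like this is really a picture of a frequency table. As a minimal sketch, the same per-house counts can be read off with pandas' `value_counts`. The tiny `sample_df` below is a hypothetical stand-in for `hogwarts_df`, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for hogwarts_df (the real data lives in the CSV file)
sample_df = pd.DataFrame({
    "house": ["Gryffindor", "Slytherin", "Gryffindor",
              "Ravenclaw", "Hufflepuff", "Gryffindor"],
})

# Count how many students fall into each house, largest house first
house_counts = sample_df["house"].value_counts()
print(house_counts)
```

Each number in this table corresponds to the height of one bar in the chart.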


6.2 Distribution of Students Across Houses (With a Twist)

But this isn't just any ordinary painting. We're going to use the magic of data to bring our picture to life. With a flick of our wand (or a click of a mouse), we'll transform cold numbers into a vibrant tapestry that tells a tale as enchanting as any fairy tale. This time, let's add a twist to the spell: we'll print the exact student count on top of each bar, so the chart becomes more informative. 💫

```python
# Importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the aesthetic style for our plots
sns.set(style="whitegrid")

# Visualizing the distribution of students across houses
plt.figure(figsize=(15, 10))
ax = sns.countplot(x='house', data=hogwarts_df, hue='house', legend=False)

# Adding numerical information on top of each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=12, color='black',
                xytext=(0, 5),  # Offset the text slightly above the bar
                textcoords='offset points')

plt.title('Distribution of Students Across Houses')
plt.xlabel('House')
plt.ylabel('Students')
plt.show()
```

Distribution of Students Across Houses (With a bit of twist)


6.3 Visualizing Age Distribution

But what about the ages of our students? Fear not, for we have another spell, the Histogram. This spell stacks students into towers by age, so we can see at a glance which years are the most crowded, while a smooth density curve (the KDE line) traces the overall shape. It's like rival houses, Gryffindor and Slytherin, competing for the tallest tower. ⚔️

```python
# Visualizing the age distribution
plt.figure(figsize=(10, 6))
sns.histplot(hogwarts_df['age'], kde=True, color='blue')
plt.title('Age Distribution of Hogwarts Students')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
```

Visualizing the age distribution
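The histogram pools every student together. To compare boys and girls at each age, one option is a cross-tabulation of `age` against `gender`. This is only a sketch: `sample_df` is a hypothetical miniature of `hogwarts_df`, and the values in it are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of hogwarts_df with made-up rows
sample_df = pd.DataFrame({
    "age":    [11, 11, 11, 12, 12, 13],
    "gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Rows are ages, columns are genders, cells are student counts
age_by_gender = pd.crosstab(sample_df["age"], sample_df["gender"])
print(age_by_gender)
```

On the real data, `sns.countplot(x='age', hue='gender', data=hogwarts_df)` should draw these same counts as side-by-side bars.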


6.4 Visualizing Relationships Between Features

Next, we weave a more intricate spell, exploring the relationships between different features in our dataset. For instance, does a student’s heritage influence their choice of pet, or is there a connection between a student’s age and the type of wand they use? This step is like exploring the Forbidden Forest, uncovering the connections and mysteries that lie within.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Path to your dataset
dataset_path = 'data/hogwarts-students-02.csv'

# Reading the dataset
hogwarts_df = pd.read_csv(dataset_path)

# Plotting the distribution of Hogwarts Houses with student counts
plt.figure(figsize=(10, 5))
sns.countplot(x='house', hue='pet', data=hogwarts_df, palette='viridis')

# Add data labels (student counts) on top of each bar
for container in plt.gca().containers:
    plt.bar_label(container)

plt.title('Relationship between "House" and "Choice of Pet"')
plt.xlabel('House')
plt.ylabel('Number of Students')
plt.legend(title='Pet Type')
plt.show()
```

Visualizing Relationships Features

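Through this visualization, we can see which pets are most popular within each house. Swapping the x-axis from house to blood_status would let us test other hunches, for example whether Muggle-born students have a penchant for owls while Pure-bloods prefer cats. These insights are akin to understanding the habits of magical creatures, revealing the subtle nuances that define the Hogwarts community.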
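A hunch about heritage and pets can also be checked numerically. The sketch below cross-tabulates `blood_status` against `pet` and converts each row to shares, so groups of different sizes stay comparable. The rows in `sample_df` are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Invented rows, standing in for hogwarts_df
sample_df = pd.DataFrame({
    "blood_status": ["Muggle-born", "Muggle-born", "Pure-blood",
                     "Pure-blood", "Half-blood", "Half-blood"],
    "pet":          ["Owl", "Cat", "Cat", "Cat", "Owl", "Owl"],
})

# normalize="index" turns each row into proportions that sum to 1
pet_share = pd.crosstab(
    sample_df["blood_status"], sample_df["pet"], normalize="index"
)
print(pet_share.round(2))
```

Reading across a row shows how one blood status splits its pet choices. Comparing rows shows whether the groups really differ.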


6.5 Summarizing the Data

This summary provides key statistics such as the mean, median (the 50% row), and standard deviation of numerical columns, and unique counts and most frequent values for categorical columns. For instance, we find that the most common house is Gryffindor, and that the average age of students is about 15 years.

```python
summary = hogwarts_df.describe(include='all')
print(summary)
```

```
        Unnamed: 0          name  gender        age   origin  specialty  \
count    52.000000            52      52  52.000000       52         52
unique         NaN            52       2        NaN        9         24
top            NaN  Harry Potter    Male        NaN  England     Charms
freq           NaN             1      27        NaN       35          7
mean     25.500000           NaN     NaN  14.942308      NaN        NaN
std      15.154757           NaN     NaN   2.492447      NaN        NaN
min       0.000000           NaN     NaN  11.000000      NaN        NaN
25%      12.750000           NaN     NaN  13.250000      NaN        NaN
50%      25.500000           NaN     NaN  16.000000      NaN        NaN
75%      38.250000           NaN     NaN  17.000000      NaN        NaN
max      51.000000           NaN     NaN  18.000000      NaN        NaN

             house  blood_status  pet  wand_type       patronus  \
count           52            52   52         52             52
unique           6             4    9         28             15
top     Gryffindor    Half-blood  Owl        Ash  Non-corporeal
freq            18            25   36          4             36
mean           NaN           NaN  NaN        NaN            NaN
std            NaN           NaN  NaN        NaN            NaN
min            NaN           NaN  NaN        NaN            NaN
25%            NaN           NaN  NaN        NaN            NaN
50%            NaN           NaN  NaN        NaN            NaN
75%            NaN           NaN  NaN        NaN            NaN
max            NaN           NaN  NaN        NaN            NaN

        quidditch_position  boggart  favorite_class  house_points
count                   52       52              52     52.000000
unique                   5       11              21           NaN
top                 Seeker  Failure          Charms           NaN
freq                    47       40               9           NaN
mean                   NaN      NaN             NaN    119.200000
std                    NaN      NaN             NaN     53.057128
min                    NaN      NaN             NaN     10.000000
25%                    NaN      NaN             NaN     77.500000
50%                    NaN      NaN             NaN    119.600000
75%                    NaN      NaN             NaN    160.000000
max                    NaN      NaN             NaN    200.000000
```

6.5.1 Summary of the Results

  1. Count: The number of non-null values in each column.
  2. Unique: The number of unique values in each column.
  3. Top: The most frequent value in each column.
  4. Freq: The number of times the most frequent value appears.
  5. Mean: The arithmetic mean of the values in each column.
  6. Std: The standard deviation of the values in each column.
  7. Min: The minimum value in each column.
  8. 25%: The 25th percentile (lower quartile) of the values in each column.
  9. 50%: The 50th percentile (median) of the values in each column.
  10. 75%: The 75th percentile (upper quartile) of the values in each column.
  11. Max: The maximum value in each column.
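To make the Top and Freq rows concrete, here is a small sketch on a single categorical column. The `pets` Series is a made-up example, not taken from the dataset. It shows that `describe()` reports the same values you would get from `mode()` and `value_counts()`.

```python
import pandas as pd

# Made-up categorical column for illustration
pets = pd.Series(["Owl", "Cat", "Owl", "Toad", "Owl"], name="pet")

# For text columns, describe() returns count, unique, top and freq
summary = pets.describe()
print(summary)

# The same two numbers, computed directly
top_value = pets.mode().iloc[0]          # most frequent value
top_freq = pets.value_counts().iloc[0]   # how often it appears
print(top_value, top_freq)
```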

6.5.2 Key Observations

  1. Age: The mean age is 14.942308, with a standard deviation of 2.492447. The age range is from 11 to 18.
  2. Gender: There are only two unique values: Male and Female.
  3. Origin: There are nine unique values, with England being the most frequent.
  4. Specialty: There are 24 unique values, with Charms being the most frequent.
  5. House: There are six unique values, with Gryffindor being the most frequent.
  6. Blood Status: There are four unique values, with Half-blood being the most frequent.
  7. Pet: There are nine unique values, with Owl being the most frequent.
  8. Wand Type: There are 28 unique values, with Ash being the most frequent.
  9. Patronus: There are 15 unique values, with Non-corporeal being the most frequent.
  10. Quidditch Position: There are five unique values, with Seeker being the most frequent.
  11. Boggart: There are 11 unique values, with Failure being the most frequent.
  12. Favorite Class: There are 21 unique values, with Charms being the most frequent.
  13. House Points: The mean is 119.200000, with a standard deviation of 53.057128. The range is from 10 to 200.

6.5.3 Insights

  • Age Distribution: Ages span 11 to 18, with the middle half of students between roughly 13 and 17 (the 25th and 75th percentiles).
  • Gender: The split is close to even, with males slightly ahead (27 of 52 students).
  • Specialty and House: Charms is the most common specialty (7 of 52 students) and Gryffindor the most common house (18 of 52), though neither is a majority.
  • Blood Status: Half-blood is the largest group (25 of 52 students), just under half of the dataset.
  • Pet and Wand Type: Owls are by far the most common pet (36 of 52), while wand woods are spread widely, with Ash leading at only 4 of 52.
  • Patronus: Most students (36 of 52) have a Non-corporeal patronus.
  • Quidditch Position: Nearly all students (47 of 52) are listed as Seeker.
  • Boggart and Favorite Class: Most students (40 of 52) fear Failure, while favorite classes are more varied, with Charms leading at 9 of 52.
  • House Points: With a mean of about 119 and a range from 10 to 200, students in this dataset have widely varying levels of achievement and participation.

These insights can help you better understand the characteristics of the students in the Hogwarts dataset.


6.6 Correlation Matrix

Finally, we perform statistical analysis to quantify relationships and trends within our data. This step is akin to Snape carefully measuring potion ingredients to ensure the perfect brew. The correlation matrix and its visualization show us how different features relate to each other. For example, we might find a strong correlation between age and year at Hogwarts, as expected. Understanding these relationships helps us build more accurate models and make informed predictions.

```python
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset
dataset_path = 'data/hogwarts-students-02.csv'  # Path to our dataset
hogwarts_df = pd.read_csv(dataset_path)

# Displaying the first few rows to understand the structure of the dataset
print(hogwarts_df.head())

# Checking the data types of each column to identify numerical and categorical data
print(hogwarts_df.dtypes)

# Selecting only numerical columns for correlation matrix
numerical_df = hogwarts_df.select_dtypes(include=[np.number])

# Calculating the correlation matrix using only numerical data
correlation_matrix = numerical_df.corr()
print(correlation_matrix)

# Visualizing the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Hogwarts Student Features')
plt.show()
```

```
               name  gender  age   origin                      specialty  \
0      Harry Potter    Male   11  England  Defense Against the Dark Arts
1  Hermione Granger  Female   11  England                Transfiguration
2       Ron Weasley    Male   11  England                          Chess
3      Draco Malfoy    Male   11  England                        Potions
4     Luna Lovegood  Female   11  Ireland                      Creatures

        house  blood_status  pet  wand_type              patronus  \
0  Gryffindor    Half-blood  Owl      Holly                  Stag
1  Gryffindor   Muggle-born  Cat       Vine                 Otter
2  Gryffindor    Pure-blood  Rat        Ash  Jack Russell Terrier
3   Slytherin    Pure-blood  Owl   Hawthorn         Non-corporeal
4   Ravenclaw    Half-blood  Owl        Fir                  Hare

  quidditch_position         boggart                 favorite_class  \
0             Seeker        Dementor  Defense Against the Dark Arts
1             Seeker         Failure                     Arithmancy
2             Keeper          Spider                         Charms
3             Seeker  Lord Voldemort                        Potions
4             Seeker      Her mother                      Creatures

   house_points
0         150.0
1         200.0
2          50.0
3         100.0
4         120.0

name                   object
gender                 object
age                     int64
origin                 object
specialty              object
house                  object
blood_status           object
pet                    object
wand_type              object
patronus               object
quidditch_position     object
boggart                object
favorite_class         object
house_points          float64
dtype: object

                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000
```

correlation-matrix-age-house-points.png

The correlation analysis above gives the correlation coefficient between the age and house_points columns, the only two numerical features in the dataset. Here’s a breakdown of what these results imply.

6.6.1 Correlation Coefficients Interpretation

A correlation coefficient is like a magical measuring tape, helping us understand how closely two things are linked. It's a number between -1 and 1, and the closer it is to either end, the stronger the connection. Think of it as a magical spell that reveals hidden relationships!

```
                   age  house_points
age           1.000000      0.315227
house_points  0.315227      1.000000
```

A positive correlation is like a friendship charm; as one thing increases, so does the other. For instance, if height and weight have a strong positive correlation, taller students tend to weigh more. On the other hand, a negative correlation is like a mischievous pixie's prank; as one thing increases, the other decreases. If hours of sleep and tiredness have a strong negative correlation, those who sleep more tend to be less tired.

6.6.2 Correlation Value Analysis

The correlation coefficient between age and house_points is 0.315227. This value indicates a positive correlation between the two variables. In general, correlation coefficients range from -1 to 1:

  • 1 indicates a perfect positive correlation.
  • 0 indicates no correlation.
  • -1 indicates a perfect negative correlation.
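To see where these endpoints come from, here is a minimal sketch of the Pearson correlation coefficient written out by hand in plain Python. The function name `pearson_r` and the toy lists are illustrative choices. On real data, `DataFrame.corr()` uses the Pearson method by default.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much x and y move together around their means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each one moves on its own
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Toy data: points that rise or fall in lockstep with age
ages = [11, 12, 13, 14]
rising = [10, 20, 30, 40]
falling = [40, 30, 20, 10]

print(pearson_r(ages, rising))   # perfect positive relationship
print(pearson_r(ages, falling))  # perfect negative relationship
```

Real data rarely lines up this neatly, which is why values such as 0.315 land somewhere in between.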

6.6.3 Strength of the Correlation

A correlation of 0.315 suggests a weak to moderate positive correlation. This means that as the age of the students increases, their house points tend to increase as well, but the relationship is not very strong.
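One way to put a number on "not very strong" is to square the coefficient. In a simple straight-line fit, the squared correlation is the share of the variation in one variable that the other accounts for. A quick sketch using the value printed above:

```python
# Correlation between age and house_points, copied from the matrix above
r = 0.315227

# Squared correlation: share of variation accounted for by a straight-line fit
r_squared = r ** 2
print(f"r^2 = {r_squared:.3f}")
```

This comes to roughly 0.10, so age alone accounts for only about a tenth of the variation in house points.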

6.6.4 Implications

  • Age and Performance: The positive correlation may imply that older students tend to accumulate more house points. This could be due to increased experience, maturity, or participation in activities that earn house points.
  • Further Investigation Needed: While there is a correlation, it does not imply causation. Other factors could be driving the link between age and house points, such as the year of study, involvement in extracurricular activities, or differences in house dynamics.
  • Potential Analysis: Further analysis could involve looking at other variables (like specialty or house) to see if they mediate or moderate the relationship between age and house points.
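As a sketch of what such a follow-up could look like, the snippet below computes the age and house_points correlation separately for each house. The data is invented so that the effect is easy to see and does not reflect the real dataset. The column names match `hogwarts_df`.

```python
import pandas as pd

# Invented data: points rise with age in one house and fall in the other
sample_df = pd.DataFrame({
    "house":        ["Gryffindor"] * 3 + ["Slytherin"] * 3,
    "age":          [11, 13, 15, 11, 13, 15],
    "house_points": [50, 100, 150, 150, 100, 50],
})

# Correlation over all students pooled together
overall_r = sample_df["age"].corr(sample_df["house_points"])

# Correlation within each house
per_house_r = (
    sample_df.groupby("house")[["age", "house_points"]]
    .apply(lambda g: g["age"].corr(g["house_points"]))
)

print(overall_r)
print(per_house_r)
```

In this toy example the pooled correlation is about zero even though each house shows a perfect relationship on its own, which is why a single overall coefficient can hide what happens inside groups.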

6.6.5 Correlation Coefficients Summary

In summary, the correlation analysis indicates a weak to moderate positive relationship between age and house points among Hogwarts students. While older students may tend to earn more points, further analysis is necessary to understand the underlying factors contributing to this correlation.

But beware, young wizard! Correlation doesn't always equal causation. Just because two things are linked doesn't mean one causes the other. It's like finding a lost sock and a lucky penny on the same day; they might be connected, but it doesn't mean one caused the other. 🪄✨


6.7 Gemika's Pop-Up Quiz: Visualizing Data with Charts 🧙‍♂️🪄

And now, dear reader, my son Gemika Haziq Nugroho appears with a twinkle in his eye and a quiz in hand. Are you ready to test your knowledge and prove your mastery of data exploration?

  1. Which magical Python libraries did we use to create our visualizations?
  2. Which metric tells you the number of times the most frequent value appears?
  3. What can be inferred from the "Blood Status" insight?

Answer these questions with confidence, and you will demonstrate your prowess in the art of data exploration. With our dataset now fully explored and understood, we are ready to embark on the next phase of our magical journey. Onward to deeper discoveries and greater insights! 🌟✨🧙‍♂️

