Posted on Sep 17, 2024 • Edited on Nov 7, 2024

Mastering Pandas in Python: A Beginner's Guide to Data Analysis

#pandas #datascience #machinelearning #beginners

In today’s data-driven world, the ability to efficiently clean and analyze large datasets is a key skill. This is where Pandas, one of Python’s most powerful libraries, comes into play. Whether you're handling time series data, numerical data, or categorical data, Pandas provides you with tools that make data manipulation easy and intuitive. Let's jump into Pandas and see how it can transform your approach to data analysis.

Installing pandas

To start using Pandas, you’ll need to install it. Like any other Python library, Pandas can be installed via pip by running the following command:

pip install pandas
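To confirm the installation worked, one quick check is to import the library and print its version. This is a minimal sketch; the version string you see depends on what pip installed in your environment, and the variable name is only illustrative.

```python
import pandas as pd

# pd.__version__ holds the installed pandas version as a string
installed_version = pd.__version__
print(installed_version)
```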

Pandas Data Structures

Pandas has two core data structures: Series and DataFrame. Together they provide a solid foundation for a wide variety of data tasks.

1. Series

According to the pandas documentation, a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

import pandas as pd

# General form: pd.Series(data, index=index)

# Creating a Series from a list (the index defaults to 0, 1, 2, ...)
data = pd.Series([10, 20, 30, 40])

# Creating a Series from a dictionary (the keys become the index labels)
data_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
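To make "labeled" a little more concrete, here is a small sketch that reads values back from a Series by label and by position. The variable names are illustrative and not part of the examples above.

```python
import pandas as pd

scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = scores['b']         # look up a value by its index label
by_position = scores.iloc[0]   # look up a value by its integer position
subset = scores[['a', 'c']]    # select several labels at once (returns a Series)

print(by_label, by_position)
print(subset)
```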

2. DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different value types (numeric, string, Boolean, etc.). You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)
print(df)
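The "dict of Series" picture can be checked directly: selecting a single column from a DataFrame gives back a Series. The sketch below uses a shortened, two-column version of the data above purely to keep it compact.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley'],
    'OWL Scores': [7, 11, 7],
})

name_column = df['Name']            # a single column is a Series
print(type(name_column).__name__)   # class name of the selected column
print(df.shape)                     # (number of rows, number of columns)
print(list(df.columns))             # column labels
```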

Data Manipulation with Pandas

Once you have your data in a DataFrame, Pandas provides powerful methods to explore, clean, and transform it. Let’s start with some of the most commonly used methods for exploring data.

1. Exploring Data

head()

The head() method returns the column headers and a specified number of rows, starting from the top. By default it displays five rows, but you may pass a custom number.

>>> df.head(3)
               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7

tail()

The tail() method returns the headers and a specified number of rows, starting from the bottom.

>>> df.tail(2)
            Name      House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
3   Draco Malfoy  Slytherin      None           Potions                None           8
4  Luna Lovegood  Ravenclaw      Hare            Charms                None           9

info()

The DataFrame object has a method called info() that summarizes the dataset: the index range, the column names, the non-null count and dtype of each column, and the memory usage.

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Name                5 non-null      object
 1   House               5 non-null      object
 2   Patronus            5 non-null      object
 3   Favorite Subject    5 non-null      object
 4   Quidditch Position  5 non-null      object
 5   OWL Scores          5 non-null      int64
dtypes: int64(1), object(5)
memory usage: 368.0 bytes

describe()

The describe() method gives us summary statistics for the numeric columns of the dataset: the count, mean, standard deviation, min, max, and quartiles.

>>> df.describe()
       OWL Scores
count    5.000000
mean     8.400000
std      1.673320
min      7.000000
25%      7.000000
50%      8.000000
75%      9.000000
max     11.000000
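Each figure in that summary can also be computed on its own, which is handy when you only need one of them. The sketch below rebuilds just the OWL Scores column as a standalone Series (an assumption made to keep the example short) and calls the corresponding methods.

```python
import pandas as pd

owl_scores = pd.Series([7, 11, 7, 8, 9], name='OWL Scores')

mean_score = owl_scores.mean()      # arithmetic mean
median_score = owl_scores.median()  # the 50% row of describe()
std_score = owl_scores.std()        # sample standard deviation (ddof=1), as in describe()

print(mean_score, median_score, round(std_score, 6))
```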

2. Filtering

In data analysis, filtering helps you narrow down the data you're interested in. Pandas offers several ways to filter data. The simplest is direct Boolean indexing, which selects the rows that satisfy a condition (e.g., a condition on a column's values). Let’s look at a few examples. In the first example, we’re selecting rows where the House value is Gryffindor:

import pandas as pd

data = {
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'Patronus': ['Stag', 'Otter', 'Jack Russell Terrier', 'None', 'Hare'],
    'Favorite Subject': ['Defense Against the Dark Arts', 'Arithmancy', 'Divination', 'Potions', 'Charms'],
    'Quidditch Position': ['Seeker', 'None', 'Keeper', 'None', 'None'],
    'OWL Scores': [7, 11, 7, 8, 9]
}

df = pd.DataFrame(data)

# Filter rows where the House is Gryffindor
gryffindor_students = df[df['House'] == 'Gryffindor']
print(gryffindor_students)

output

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7

In the second example, we’re filtering data where the OWL score (think of it as a magical equivalent to the SAT in the Harry Potter world) is greater than 8:

# Filter students with OWL Scores greater than 8
high_scorers = df[df['OWL Scores'] > 8]
print(high_scorers)

output

               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9
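Conditions can also be combined. As a sketch of how that looks with Boolean indexing: each condition goes in its own parentheses and they are joined with & (and) or | (or), while isin() tests membership in a list of values. The result variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'House': ['Gryffindor', 'Gryffindor', 'Gryffindor', 'Slytherin', 'Ravenclaw'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

# Both conditions must hold: Gryffindor AND more than 8 OWLs
gryffindor_high = df[(df['House'] == 'Gryffindor') & (df['OWL Scores'] > 8)]

# Membership test: House is any of the listed values
other_houses = df[df['House'].isin(['Slytherin', 'Ravenclaw'])]

print(gryffindor_high)
print(other_houses)
```

The parentheses matter because & and | bind more tightly than comparison operators such as == and > in Python.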

Another way to filter data is by using the .loc indexer. It lets you select by Boolean condition or by label, for both rows and columns. If a specified label doesn’t exist, it raises a KeyError:

# Use .loc[] to filter students who scored more than 8 OWLs
high_owl_scores_loc = df.loc[df['OWL Scores'] > 8]
print(high_owl_scores_loc)

output

               Name       House  Patronus  Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor     Otter        Arithmancy                None          11
4     Luna Lovegood   Ravenclaw      Hare            Charms                None           9
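The label-based side of .loc, including the KeyError, is easier to see when the index holds meaningful labels. The sketch below assumes a small DataFrame indexed by student name (a setup not used elsewhere in this post) and looks up one existing label and one missing label.

```python
import pandas as pd

df_by_name = pd.DataFrame(
    {'House': ['Gryffindor', 'Slytherin'], 'OWL Scores': [7, 8]},
    index=['Harry Potter', 'Draco Malfoy'],
)

# Row label and column label together return a single value
harry_house = df_by_name.loc['Harry Potter', 'House']
print(harry_house)

# A label that is not in the index raises KeyError
try:
    df_by_name.loc['Neville Longbottom']
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(missing_label_raised)
```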

At first glance, this may look like direct Boolean indexing. Still, there’s a key difference: .loc provides finer control, letting you select both rows and columns simultaneously, while Boolean indexing primarily filters rows:

# Use .loc[] to filter and select specific columns
gryffindor_students = df.loc[df['House'] == 'Gryffindor', ['Name', 'OWL Scores']]
print(gryffindor_students)

output

               Name  OWL Scores
0      Harry Potter           7
1  Hermione Granger          11
2       Ron Weasley           7

Finally, we have the .iloc indexer. This is used for position-based selection, meaning you select rows and columns by their integer positions rather than their labels:

# Select the third row (position 2)
third_character = df.iloc[2]
print(third_character)

output

Name                           Ron Weasley
House                           Gryffindor
Patronus              Jack Russell Terrier
Favorite Subject                Divination
Quidditch Position                  Keeper
OWL Scores                               7
Name: 2, dtype: object

You can also pass lists of positions. The next example selects the first and last rows (positions 0 and 4) and the "House" and "OWL Scores" columns (positions 1 and 5):

first_last_info = df.iloc[[0, 4], [1, 5]]
print(first_last_info)

output

        House  OWL Scores
0  Gryffindor           7
4   Ravenclaw           9
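One difference between the two indexers that often surprises beginners is slicing: .iloc slices exclude the end position, like ordinary Python slices, while .loc slices include the end label. A minimal sketch, using a trimmed-down copy of the data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry Potter', 'Hermione Granger', 'Ron Weasley', 'Draco Malfoy', 'Luna Lovegood'],
    'OWL Scores': [7, 11, 7, 8, 9],
})

by_position = df.iloc[0:2]  # positions 0 and 1 (end position excluded)
by_label = df.loc[0:2]      # labels 0, 1 and 2 (end label included)

print(len(by_position), len(by_label))
```

With the default RangeIndex the labels and positions happen to be the same numbers, which is exactly why the two results are easy to confuse.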

3. Sorting

Sorting data with pandas is straightforward and can be done using the sort_values() method. For example, you can sort the students by their OWL scores in ascending order:

# Sort by 'OWL Scores' in ascending order (default)
sorted_by_owl = df.sort_values(by='OWL Scores')
print(sorted_by_owl)

output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11

To sort in descending order, set the ascending parameter to False:

# Sort by 'OWL Scores' in descending order
sorted_by_owl_desc = df.sort_values(by='OWL Scores', ascending=False)
print(sorted_by_owl_desc)

output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7

One of the powerful features of sort_values() is that it allows you to sort by multiple columns. In the example below, students are sorted first by their OWL scores and then by their house:

# Sort by 'OWL Scores' first in descending order, then by 'House' in ascending order
sorted_by_owl_first = df.sort_values(by=['OWL Scores', 'House'], ascending=[False, True])
print(sorted_by_owl_first)

output:

               Name       House              Patronus               Favorite Subject  Quidditch Position  OWL Scores
1  Hermione Granger  Gryffindor                 Otter                     Arithmancy                None          11
4     Luna Lovegood   Ravenclaw                  Hare                         Charms                None           9
3      Draco Malfoy   Slytherin                  None                        Potions                None           8
0      Harry Potter  Gryffindor                  Stag  Defense Against the Dark Arts              Seeker           7
2       Ron Weasley  Gryffindor  Jack Russell Terrier                     Divination              Keeper           7

In this case, the OWL score is the primary sort key, so pandas orders the rows by it first. Only when two students have the same OWL score is the House value used as the secondary key to break the tie.
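In the dataset above, the only tied students (Harry and Ron) are in the same house, so the secondary key never visibly changes the order. The sketch below uses made-up placeholder rows, which are hypothetical and not part of the original data, where two tied students belong to different houses.

```python
import pandas as pd

# Hypothetical rows chosen so that the tie on 'OWL Scores' spans two houses
df_ties = pd.DataFrame({
    'Name': ['Student A', 'Student B', 'Student C'],
    'House': ['Slytherin', 'Gryffindor', 'Ravenclaw'],
    'OWL Scores': [9, 9, 7],
})

# Primary key: 'OWL Scores' descending; secondary key: 'House' ascending
sorted_ties = df_ties.sort_values(by=['OWL Scores', 'House'], ascending=[False, True])
print(sorted_ties)
```

Student A and Student B both have 9 OWLs, so the House column decides their order and Gryffindor comes before Slytherin alphabetically.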

Exploring, filtering, and sorting data is an essential first step before jumping into tasks like data cleaning or wrangling in the data analysis process. Pandas offers a range of built-in methods that help organize and accelerate these operations. Additionally, Pandas integrates seamlessly with other libraries, such as NumPy or SciPy for numerical computations, Matplotlib for data visualization, and analytical tools like Statsmodels and Scikit-learn. By learning Pandas, you can significantly boost your efficiency in handling and analyzing data, making it a valuable skill for any data professional. Happy coding!