Data Science UNIT-3

This document provides an overview of Python's significance in data science, highlighting its libraries such as NumPy, Pandas, and Matplotlib, which facilitate data manipulation, analysis, and visualization. It also discusses various Python Integrated Development Environments (IDEs) like Jupyter Notebook and PyCharm, emphasizing their roles in enhancing productivity for data science projects. Additionally, it covers the basics of NumPy, including array creation and vectorized computation, which are essential for efficient numerical operations.

Uploaded by

vasaviveturi2002

UNIT III

Python for Data Science: Python Libraries, Python Integrated Development Environments (IDEs) for
Data Science.

NumPy Basics: Arrays and Vectorized Computation, The NumPy ndarray, Creating ndarrays, Data
Types for ndarrays, Arithmetic with NumPy Arrays, Basic Indexing and Slicing, Boolean Indexing,
Transposing Arrays and Swapping Axes.
Universal Functions: Fast Element-Wise Array Functions, Mathematical and Statistical Methods,
Sorting, Unique and Other Set Logic.

PYTHON FOR DATA SCIENCE

Python has become the most widely used programming language in the field of data science because
of its simplicity, flexibility, and strong ecosystem of libraries. Data science involves extracting insights,
patterns, and knowledge from structured and unstructured data, and Python provides a complete toolkit
for handling every stage of this process—from data collection and cleaning to analysis, visualization,
and machine learning.

One of the key reasons for Python’s popularity is its readability and ease of learning, which makes it
accessible not only to software developers but also to researchers, analysts, and domain experts who
may not have a strong programming background. Its vast collection of libraries like NumPy, pandas,
Matplotlib, and SciPy supports numerical computing, data manipulation, and visualization. For
advanced analytics and machine learning, libraries such as scikit-learn, TensorFlow, and PyTorch are
extensively used.

Python also integrates seamlessly with databases, cloud services, and big data platforms, making it
suitable for handling real-world data at scale. Additionally, tools like Jupyter Notebook allow data
scientists to write, test, and share code interactively along with visualizations and documentation,
which is essential for collaborative projects.

Beyond its technical strengths, Python has also gained popularity in data science due to its large and
active community. This community continuously contributes new libraries, tools, and resources that
make solving complex data problems easier and more efficient. Hence, Python is not just a
programming language but a complete ecosystem for data science, offering both beginners and experts
the support needed to innovate and apply data-driven solutions across diverse fields.

The applications of Python in data science are vast: it is used in predictive modeling, business
intelligence, natural language processing, computer vision, recommendation systems, and scientific
research. Many industries, such as healthcare, finance, e-commerce, and transportation, rely on
Python-powered data science solutions for decision-making and automation.

Python Libraries:
Python is famous for its library ecosystem, which makes it the most widely used language in data
science. These libraries provide tools for data storage, manipulation, visualization, scientific
computing, and machine learning. Together, they form the backbone of modern data science.
The major libraries are: NumPy, Pandas, Matplotlib, Seaborn, SciPy, and Scikit-learn.

1. NumPy (Numerical Python): NumPy is the fundamental package for numerical computing in
Python. It provides the ndarray (N-dimensional array), which is much faster and more
memory-efficient than Python lists for numerical work.

Key Features:

• ndarray: Efficient storage and manipulation of multi-dimensional arrays.

• Broadcasting: Perform operations on arrays of different shapes without writing loops.
• Mathematical Functions: Includes linear algebra, Fourier transforms, and random
number generation.

Example:

import numpy as np
# Creating an ndarray
array = np.array([1, 2, 3, 4])
print("Array:", array)
# Performing element-wise operation
squared = array ** 2
print("Squared:", squared)

Output:
Array: [1 2 3 4]
Squared: [ 1  4  9 16]

2. Pandas: Pandas is a powerful library for data manipulation and analysis. It introduces two
main data structures:

Series – one-dimensional labeled array

DataFrame – two-dimensional labeled data (like an Excel sheet)

Key Features:

• DataFrame: Stores tabular data with labeled rows and columns.

• Data Cleaning: Handle missing values, duplicates, and filtering.
• GroupBy: Summarize and aggregate data easily.
Example:

import pandas as pd
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
# Filtering data
filtered_df = df[df['Age'] > 28]

print(filtered_df)

Output:
      Name  Age      # df
0    Alice   25
1      Bob   30
2  Charlie   35

      Name  Age      # filtered_df
1      Bob   30
2  Charlie   35

3. Matplotlib: Matplotlib is the most widely used library for data visualization. It allows the
creation of static, animated, and interactive plots.

Key Features:

• Supports line, bar, scatter, histogram, pie charts, etc.

• Highly customizable (titles, labels, legends, colors).
• Integrates with Pandas and NumPy.
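
A minimal sketch of a customized line plot follows. The monthly sales figures, labels, and output file name are made up for illustration and are not taken from these notes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical data, for illustration only
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o", label="Sales")  # line plot with point markers
ax.set_title("Monthly Sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.legend()
fig.savefig("monthly_sales.png")  # hypothetical file name
```

In a notebook, the backend line can be dropped and plt.show() used instead of saving to a file.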

4. Seaborn: Seaborn is a high-level visualization library built on top of Matplotlib. It makes plots
more attractive and easier to create, especially for statistical data.

Key Features:

• Predefined styles and color themes.

• Functions for distribution plots, regression plots, and heatmaps.
• Works directly with Pandas DataFrames.
Example:

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Plotting a histogram with Seaborn
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
sns.histplot(data=df, x='Age', kde=True)
plt.title("Age Distribution")
plt.show()

Output: a histogram of the Age column with a KDE curve overlaid, titled "Age Distribution".

5. SciPy: SciPy builds on NumPy and provides advanced scientific and technical computing
tools. It is especially useful in mathematics, physics, and engineering.

Key Features:

• Optimization: Algorithms to minimize or maximize functions.

• Integration: Solving integrals and differential equations.
• Statistics: Probability distributions and hypothesis testing.

Example:

from scipy import stats

import numpy as np
# Performing a t-test
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(0.5, 1, 100)
t_stat, p_value = stats.ttest_ind(data1, data2)
print(f"T-Statistic: {t_stat}, P-Value: {p_value}")

Output (the exact values vary from run to run because the samples are random):
T-Statistic: -4.2892427387353536, P-Value: 2.800941448830977e-05

6. Scikit-learn: Scikit-learn is the most popular library for machine learning in Python. It
provides tools for both supervised and unsupervised learning.

Key Features:

• Supervised Learning: Classification and regression (e.g., Linear Regression, SVM,
Random Forest).
• Unsupervised Learning: Clustering, PCA, anomaly detection.
• Model Selection: Cross-validation, grid search, evaluation metrics.

Example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Creating a linear regression model
model = LinearRegression()
X = np.array([[1], [2], [3], [4]])
y = np.array([10, 20, 25, 30])
model.fit(X, y)
# Making predictions
predictions = model.predict(np.array([[5], [6]]))
print(predictions)

Output:
[37.5 44. ]

These libraries are often used together in real-world data science projects:

✓ NumPy + Pandas for data handling,

✓ Matplotlib + Seaborn for visualization,
✓ SciPy + Scikit-learn for advanced analysis and machine learning.

Python Integrated Development Environments (IDEs)

Definition: A Python Integrated Development Environment (IDE) is a software application that
provides an interface for writing, testing, and debugging Python code. For data science, IDEs are
especially useful because they integrate tools for data analysis, visualization, and machine learning
development.

Python IDEs combine several features into one environment:

• Code editor with syntax highlighting and auto-completion.

• Debugger to trace and fix errors.
• Execution console to run code interactively.
• Extensions and plugins for scientific computing and visualization.

How Do Python IDEs Work?

Python IDEs act as a bridge between the programmer and the computer. They:

• Provide a workspace to write Python code.
• Support cell-based execution (in Jupyter) for running small portions of code iteratively.
• Offer real-time feedback, so the programmer can immediately see results, adjust code, and re-run.
• Integrate libraries like NumPy, Pandas, Matplotlib, and scikit-learn seamlessly for data
science workflows.

This interactivity makes IDEs indispensable for data exploration and model development.

Some of the Python IDEs:

When choosing an Integrated Development Environment (IDE) for Python in data science, the
suitability depends on the specific tasks, user experience, and project scale. Each tool has a unique set
of strengths:

1. Jupyter Notebook: Jupyter is the most popular tool for data science research and education.
It allows writing code in separate cells, running them independently, and embedding plots,
tables, and even Markdown text in the same document. This makes it highly suitable for data
exploration, visualization, and reproducible research.

2. PyCharm: PyCharm is a professional-grade IDE with advanced features such as intelligent
code completion, version control integration, refactoring tools, and strong debugging support.
The Professional Edition also supports Jupyter integration and provides a “Scientific Mode.”
It is highly suitable for large-scale data science projects, especially when the project
integrates with web frameworks (like Django/Flask) or involves multiple developers working
in collaboration.

3. Spyder: Spyder is designed specifically for scientific and data-driven programming. It
integrates well with libraries like NumPy, SciPy, and Matplotlib, and includes features such
as a variable explorer and interactive console. It is especially suitable for students and
researchers who come from a MATLAB background, as its interface and workflows are quite
similar. Spyder is often used in academic and research-based projects.

4. VS Code: Visual Studio Code is a lightweight but versatile editor. It is highly extensible
through plugins and supports Python very well via official extensions. With Jupyter
integration, debugging, Git integration, and remote development features, VS Code is
suitable for both beginners and professionals. It is widely used in industry because of its
cross-platform adaptability and lightweight nature.

5. Anaconda Distribution: Anaconda is not an IDE itself but a distribution that simplifies the
setup of Python environments for data science. It bundles Jupyter, Spyder, and hundreds of
libraries (NumPy, pandas, scikit-learn, TensorFlow, etc.). This makes it the most suitable tool
for setting up environments quickly, especially for beginners or those who want to avoid
installation issues. It ensures package compatibility, which is critical in machine learning
workflows.

Advantages:

• Interactive Development: Allows live execution of code (e.g., Jupyter).

• Debugging Tools: Helps identify and fix errors quickly.
• Code Completion & Syntax Highlighting: Improves coding speed and reduces mistakes.
• Version Control: Supports Git integration for team collaboration.
• Environment Management: Tools like Anaconda simplify package management.
Disadvantages:

• Resource Intensive: Full-featured IDEs may slow down low-spec machines.

• Steeper Learning Curve: Beginners may find professional IDEs (like PyCharm) overwhelming.
• Complex Setup: Configuring IDEs and environments can be time-consuming.
• Unnecessary Overhead: For simple coding, heavy IDEs may add complexity.
Applications:

Python IDEs in data science are not just for writing code; they are workflow enablers that streamline
the process of data preparation, analysis, model building, and deployment. Their applications include:

• Data Analysis & Visualization: Use Jupyter, Spyder for importing, cleaning (pandas), and
visualizing data (Matplotlib, Seaborn, Plotly) to find trends and patterns.

• Machine Learning Development: Build and test models in PyCharm, VS Code, or Spyder
using scikit-learn, TensorFlow, or PyTorch with support for feature engineering and tuning.

• Big Data Processing: Handle large datasets with PySpark or Dask in Python IDEs, vital for
industries like finance, healthcare, and e-commerce.

• Web Development & Deployment: Deploy ML models using Flask/Django in PyCharm or
VS Code as APIs or full applications.

• Scientific Research & Academia: Jupyter Notebooks are ideal for teaching, experiments, and
publishing reproducible research with code, LaTeX, and visuals.

• Collaboration & Version Control: IDEs integrate with Git/GitHub for team collaboration,
version tracking, and project management.

• Automation & Scripting: Automate tasks like report generation, data scraping, and scheduled
model retraining using Python scripts.

Example:

Imagine a data science project on customer retention prediction:

1. Data Import & Cleaning: Use Jupyter Notebook with Pandas to import customer transaction
data, handle missing values, and visualize patterns.
2. Feature Engineering: Derive features like transaction frequency, spending average, and
recency.
3. Model Development: Use Spyder or PyCharm to build machine learning models (e.g., logistic
regression, decision trees, random forest) with scikit-learn.
4. Model Evaluation: Evaluate models using accuracy, precision, recall, and ROC curves.
5. Deployment: Deploy the final model using Flask/Django inside PyCharm or VS Code to create
a web application for business stakeholders.
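
Steps 2 to 4 above can be sketched with scikit-learn as follows. This is a rough illustration under stated assumptions: the data is synthetic, the feature names and the labelling rule for churn are hypothetical, and logistic regression is only one of the model choices mentioned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_customers = 500

# Step 2 (feature engineering): synthetic stand-ins for the three features
frequency = rng.poisson(lam=5, size=n_customers)                # transactions per month
avg_spend = rng.gamma(shape=2.0, scale=50.0, size=n_customers)  # average spend
recency_days = rng.integers(1, 365, size=n_customers)           # days since last purchase
X = np.column_stack([frequency, avg_spend, recency_days])

# Hypothetical labelling rule: long-inactive, low-frequency customers churn, plus noise
churn_score = recency_days / 365 - frequency / 20 + rng.normal(0, 0.2, n_customers)
y = (churn_score > 0.3).astype(int)

# Step 3 (model development): scale the features, then fit logistic regression
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Step 4 (model evaluation): metrics on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With real customer data, the synthetic arrays would be replaced by columns derived from the transaction table in Pandas.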

NUMPY BASICS
Arrays and Vectorized Computation
NumPy (Numerical Python) is a fundamental package for scientific computing in Python. It provides
support for arrays, which are grid-like data structures used to represent vectors, matrices, and higher-
dimensional datasets. Arrays are more efficient than Python lists for numerical operations, making
NumPy an essential tool for data science and machine learning.

Arrays
ndarrays: An ndarray (n-dimensional array) is a multidimensional, homogeneous array of fixed-size
items.

Creation: Arrays can be created from Python lists or tuples using np.array(), and there are functions
like np.zeros(), np.ones(), and np.arange() for generating arrays.

Example:

import numpy as np
#Creating an array from a list
array_from_list = np.array([1, 2, 3, 4])
print(array_from_list)
#Creating a 3x3 array of zeros
zeros_array = np.zeros((3, 3))
print(zeros_array)
#Creating an array with a range of values
range_array = np.arange(10)
print(range_array)

Output:
[1 2 3 4]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[0 1 2 3 4 5 6 7 8 9]

Data Types: Each ndarray has a dtype (data type) object that describes the type of elements in the
array. You can specify the dtype during array creation or convert it using the astype() method.

Example:

float_array = np.array([1, 2, 3], dtype=np.float64)
print(float_array)
#Converting data type
int_array = float_array.astype(np.int32)
print(int_array)

Output:
[1. 2. 3.]
[1 2 3]

Vectorized Computation
Element-wise Operations: NumPy allows for element-wise operations on arrays without explicit
loops, which is known as Vectorization. These operations include addition, subtraction, multiplication,
and division.

Example:

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
#Element-wise addition
result = array1 + array2
print(result)
#Element-wise multiplication
result = array1 * array2
print(result)

Output:
[5 7 9]
[ 4 10 18]

Universal Functions (ufuncs): Universal functions are functions that operate element-wise on
ndarrays. Examples include mathematical functions like np.sqrt(), np.exp(), and np.sin().

Example:

array = np.array([1, 4, 9, 16])
#Square root of each element
sqrt_array = np.sqrt(array)
print(sqrt_array)
#Exponential of each element
exp_array = np.exp(array)
print(exp_array)

Output:
[1. 2. 3. 4.]
[2.71828183e+00 5.45981500e+01 8.10308393e+03 8.88611052e+06]

Broadcasting: Broadcasting allows NumPy to perform operations on arrays of different shapes.
Smaller arrays are "broadcast" across larger arrays so that they have compatible shapes.

Example:

array1 = np.array([1, 2, 3])
array2 = np.array([[1], [2], [3]])
#Broadcasting and element-wise addition
result = array1 + array2
print(result)

Output:
[[2 3 4]
[3 4 5]
[4 5 6]]

Reductions: Reduction operations like summing, finding the minimum, or maximum can be
performed using functions like np.sum(), np.min(), and np.max(). These functions can be applied to the
entire array or along a specific axis.

Example:
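
A minimal sketch consistent with the output listed below, assuming a 2x3 array holding the values 1 to 6 (the input array is an assumption inferred from that output):

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])

# Sum of all elements
print(np.sum(array))
# Sum down each column (axis=0)
print(np.sum(array, axis=0))
# Sum across each row (axis=1)
print(np.sum(array, axis=1))
```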

Output:
21
[5 7 9]
[ 6 15]

Applications:

NumPy's array operations and vectorized computations are used in various applications, including:

• Data Analysis: Efficient manipulation and analysis of large datasets.

• Machine Learning: Implementation of algorithms that require fast numerical computations.
• Scientific Computing: Solving mathematical problems involving linear algebra, Fourier
transforms, and random number generation.
• Image Processing: Handling and processing image data as arrays of pixel values.

Example:

Let's consider an example where we perform basic data manipulation using NumPy:

import numpy as np
#Generate a random dataset of 1000 samples with 3 features
data = np.random.rand(1000, 3)
#Normalize the data (feature scaling)
data_mean = np.mean(data, axis=0)
data_std = np.std(data, axis=0)
normalized_data = (data - data_mean) / data_std
print("Original Data:\n", data[:5]) # Display first 5 samples
print("Normalized Data:\n", normalized_data[:5]) # Display first 5 normalized samples

Output:
Original Data:
[[0.73417432 0.74037936 0.90738968]
[0.72763622 0.67006168 0.41185916]
[0.920575 0.57639958 0.82472397]
[0.05283245 0.25583882 0.85750512]
[0.98428961 0.05666713 0.8623092]]
Normalized Data:
[[ 0.80828951  0.77270154  1.40698663]
 [ 0.78535802  0.53086531 -0.30160892]
 [ 1.46206495  0.20874309  1.12195421]
 [-1.5814258  -0.89372793  1.23498405]
 [ 1.68553539 -1.57871823  1.25154857]]

In this example, we generate a random dataset, normalize it by subtracting the mean and dividing by
the standard deviation for each feature, and then print the first five samples of the original and
normalized data. This showcases how NumPy can efficiently handle data manipulation and
preprocessing tasks crucial for data science and machine learning workflows.
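
As a quick sanity check on the normalization described above, each feature column should end up with a mean of about 0 and a standard deviation of about 1. The sketch below assumes a seeded random generator so that the run is repeatable; the seed is an addition, since the example above is unseeded.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability (an assumption)
data = rng.random((1000, 3))    # 1000 samples, 3 features, values in [0, 1)

# Subtract the per-column mean and divide by the per-column standard deviation
normalized = (data - data.mean(axis=0)) / data.std(axis=0)

means_ok = np.allclose(normalized.mean(axis=0), 0.0)
stds_ok = np.allclose(normalized.std(axis=0), 1.0)
print(means_ok, stds_ok)
```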

The NumPy ndarray:

The NumPy ndarray (N-dimensional array) is a special data structure in the NumPy library. It is mainly
used to store and work with large sets of numbers in an efficient way. Unlike normal Python lists,
which are flexible but slow when doing many calculations, ndarrays are faster because all the elements
inside them are of the same type (for example, all integers or all floats).

An ndarray is described by two main things:

• Its shape → tells us how many rows, columns, or dimensions it has.

• Its dtype (data type) → tells us whether the numbers are integers, decimals (floats), or complex
numbers.
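
Both properties can be read directly from an array. A minimal sketch, using a small made-up 2x3 array:

```python
import numpy as np

matrix = np.array([[1.5, 2.0, 3.5],
                   [4.0, 5.5, 6.0]])

print(matrix.shape)  # (2, 3): 2 rows, 3 columns
print(matrix.ndim)   # 2: number of dimensions
print(matrix.dtype)  # float64
print(matrix.size)   # 6: total number of elements
```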

How Do ndarrays Work?

Inside the computer, an ndarray stores all its data in a contiguous block of memory, which makes
access much faster than with normal Python lists. Since all the elements are of the same type, NumPy
can run its operations in compiled C code in the background instead of slow Python loops. This makes
ndarrays very useful for data science, machine learning, image processing, and scientific research.
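
The memory layout can be inspected from Python. The sketch below is illustrative only (the array sizes are arbitrary): it checks that a freshly created array is contiguous, shows its strides in bytes, and confirms that a vectorized expression gives the same result as an explicit loop.

```python
import numpy as np

grid = np.zeros((3, 4), dtype=np.float64)
print(grid.flags["C_CONTIGUOUS"])  # True: rows are laid out one after another
print(grid.strides)                # (32, 8): 32 bytes to the next row, 8 to the next column

values = np.arange(1000, dtype=np.float64)

# Explicit Python loop: one interpreter step per element
loop_total = 0.0
for v in values:
    loop_total += v * 2

# Vectorized: the multiplication and the sum both run in compiled code
vectorized_total = (values * 2).sum()

print(loop_total == vectorized_total)  # True
```

On large arrays the vectorized form is usually much faster; the standard timeit module can be used to measure the difference on a given machine.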

Creating ndarrays in NumPy:

An ndarray (N-dimensional array) is the core data structure of NumPy. It is similar to a Python list but
more powerful because it can store large amounts of numerical data efficiently and allows
mathematical operations to be applied directly on the data.
There are many ways to create ndarrays in NumPy, depending on the requirement. They are:

1. Using np.array(): The np.array() function converts Python lists, tuples, or other array-like
objects into an ndarray. You can also specify the data type (dtype).

Syntax:
np.array(object, dtype=None)

Example:

import numpy as np
# 1D array from a list
array1d = np.array([1, 2, 3, 4, 5])
print("1D Array:\n", array1d)
# 2D array from a list of lists
array2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", array2d)

Output:
1D Array:
[1 2 3 4 5]
2D Array:
[[1 2 3]
[4 5 6]]

2. Using np.zeros() and np.ones(): You provide the shape (rows × columns), and NumPy fills the array
with zeros or ones (as floats by default).
• np.zeros() creates an array filled with 0s.
• np.ones() creates an array filled with 1s.
Syntax:

np.zeros(shape)
np.ones(shape)
Example:

# Array of zeros (3x3)

zeros_array = np.zeros((3, 3))
print("Array of zeros:\n", zeros_array)
# Array of ones (2x4)
ones_array = np.ones((2, 4))
print("Array of ones:\n", ones_array)

Output:
Array of zeros:
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Array of ones:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]

3. Using np.arange(): np.arange() creates an array with values in a given range, similar to
Python’s range() function, but returns an ndarray. You specify the start, stop, and step size;
the stop value itself is not included.

Syntax:

np.arange(start, stop, step)

Example:

# Array with values from 0 up to (but not including) 10, step of 2

range_array = np.arange(0, 10, 2)
print("Array with range values:", range_array)

Output:
Array with range values: [0 2 4 6 8]

4. Using np.linspace(): np.linspace() creates an array with evenly spaced values between a start
and end point, both included by default. You give the start, end, and number of elements required.

Syntax:
np.linspace(start, stop, num)

Example:

# 5 evenly spaced values between 0 and 1

linspace_array = np.linspace(0, 1, 5)
print("Array with evenly spaced values:", linspace_array)

Output:
Array with evenly spaced values: [0.   0.25 0.5  0.75 1.  ]

5. Using np.eye(): np.eye() creates an identity matrix (a square matrix with 1s on the diagonal
and 0s elsewhere). It is mostly used in linear algebra and mathematical computations.

Syntax:

np.eye(N)

Example:

# Identity matrix of size 4x4

identity_matrix = np.eye(4)
print("Identity matrix:\n", identity_matrix)

Output:
Identity matrix:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

6. Using Random Functions (np.random): NumPy provides random number functions to create
arrays filled with random values.
• np.random.rand() → random values between 0 and 1.
• np.random.randint() → random integers in a given range.
Syntax:
np.random.rand(d0, d1, ...)
np.random.randint(low, high, size)
Example:

# Random values between 0 and 1

random_array = np.random.rand(3, 3)
print("Random values:\n", random_array)
# Random integers from 0 to 9 (the upper bound 10 is excluded)
random_int_array = np.random.randint(0, 10, (3, 3))
print("Random integers:\n", random_int_array)
print("Random integers:\n", random_int_array)

Output:
Random values:
[[0.79023899 0.83268165 0.30868424]
[0.01097371 0.0861339 0.11680078]
[0.79638802 0.75836068 0.49767694]]
Random integers:
[[4 0 2]
[1 4 0]
[0 5 6]]
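
Two details of the creation functions above are easy to miss: np.arange() leaves out the stop value while np.linspace() includes it, and random arrays differ on every run unless a seed is fixed. The sketch below illustrates both. It uses np.random.default_rng, NumPy's newer Generator interface, which is not the interface used in these notes, and the seed value is arbitrary.

```python
import numpy as np

# arange excludes the stop value; linspace includes it by default
stepped = np.arange(0, 10, 2)     # 0, 2, 4, 6, 8
spaced = np.linspace(0, 10, 6)    # 0, 2, 4, 6, 8, 10 (as floats)
print(stepped)
print(spaced)

# Two generators created with the same seed produce the same values
rng_a = np.random.default_rng(seed=123)
rng_b = np.random.default_rng(seed=123)
sample_a = rng_a.integers(0, 10, size=(3, 3))
sample_b = rng_b.integers(0, 10, size=(3, 3))
print(np.array_equal(sample_a, sample_b))  # True
```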

How to Store Different Types of Data in a NumPy Array?

NumPy ndarrays are designed to store elements of the same type. However, you can use structured
arrays to store different types of data.

Structured Arrays: Structured arrays allow storing different data types (like strings, integers, floats)
in a single array, similar to a database table. You define a custom data type with fields and then create
an array with that type.

Syntax:

np.dtype([('field1','datatype1'), ('field2','datatype2'), ...])

dtype: The dtype (data type) tells NumPy what type of values each field will store. In structured arrays,
dtype defines a blueprint of the structure:

• Field name → acts like a column name (e.g., 'name', 'age', 'height').
• Data type → defines the storage type (e.g., string, int, float).
Example:

# Define structured data type

data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
# Create structured array
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)
print("Structured array:")
print(structured_array)
print("Names:", structured_array['name'])
print("Ages:", structured_array['age'])
print("Heights:", structured_array['height'])

Output:
Structured array:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
Names: ['Alice' 'Bob']
Ages: [25 30]
Heights: [5.5 5.8]

Setting the Data Type (dtype Parameter):

The dtype parameter in NumPy lets you set the data type of an array explicitly. If you don’t give
dtype, NumPy automatically infers it from your input values. If you give it explicitly, the values
are converted to that type (for example, floats cast to an integer type lose their fractional part).
Syntax:

np.array(data, dtype=datatype)

Example:

import numpy as np
int_array = np.array([1.5, 2.7, 3.8], dtype='int32')
print("Integer array:", int_array)
float_array = np.array([1, 2, 3], dtype='float64')
print("Float array:", float_array)
complex_array = np.array([1, 2, 3], dtype='complex')
print("Complex array:", complex_array)
print("Complex array:", complex_array)

Output:
Integer array: [1 2 3]
Float array: [1. 2. 3.]
Complex array: [1.+0.j 2.+0.j 3.+0.j]
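
Note that converting floats to an integer type drops the fractional part and does not round. A small sketch with made-up values, including a negative one, shows the difference between truncating and rounding first:

```python
import numpy as np

values = np.array([1.5, 2.7, -3.8])

# astype truncates toward zero
truncated = values.astype("int32")
# np.rint rounds to the nearest integer first, then the cast is exact
rounded = np.rint(values).astype("int32")

print(truncated)  # [ 1  2 -3]
print(rounded)    # [ 2  3 -4]
```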

Advantages of NumPy ndarray:

• Performance: Ndarrays store data in contiguous memory blocks and use optimized C
operations. This makes array operations much faster and more efficient than standard Python
lists.
• Ease of Use: NumPy provides ready-made functions for creating, manipulating, and operating
on arrays. Complex computations (like matrix multiplication or Fourier transforms) can be
done with simple functions.
• Broadcasting: Enables performing operations on arrays of different shapes and sizes without
writing loops. Saves coding effort and improves performance.
• Integration: Ndarrays work seamlessly with SciPy, Pandas, Matplotlib, Scikit-learn, and many
other libraries. This makes them the backbone of scientific Python programming.

Disadvantages of NumPy ndarray:

• Homogeneous Data Types: All elements in an ndarray must have the same type (all integers,
all floats, etc.). This makes it less suitable for mixed data (like names + numbers).
• Memory Usage: Handling very large arrays requires a lot of memory. High-dimensional data
can cause memory errors on low-resource systems.
• Learning Curve: Beginners coming from basic Python lists may find it difficult to understand
concepts like broadcasting, slicing, and vectorization.
Applications of NumPy ndarray:

• Scientific Computing: Used in physics, chemistry, and biology for simulations, modeling, and
data analysis.
• Data Analysis: Core tool for processing, cleaning, and analyzing large datasets.
• Machine Learning: Acts as the base for handling training and testing datasets before feeding
them into ML algorithms.
• Image Processing: Images are stored as pixel arrays; NumPy makes it easy to manipulate and
process them.
• Financial Modeling: Applied in quantitative finance for risk analysis, option pricing, and
algorithmic trading.

Data Types for NumPy ndarrays:

In NumPy, an ndarray (N-Dimensional Array) can store many different kinds of data. The type of data
stored inside an array is called its data type (dtype). Choosing the right data type is very important
because it affects memory usage, speed, and accuracy of calculations. The data types supported by
NumPy ndarrays are:

1. Numeric Data Types:

(a) Integer Types: Integers are whole numbers (positive or negative, without decimals). Based on
size, they come in different forms:

• int8 → 8-bit integer (from -128 to 127)

• int16 → 16-bit integer (from -32,768 to 32,767)
• int32 → 32-bit integer (from -2,147,483,648 to 2,147,483,647)
• int64 → 64-bit integer (very large range, up to about 9 quintillion)
Example: np.array([1, 2, 3], dtype='int16')

(b) Unsigned Integers: Unsigned means only non-negative values (zero and positive). They can store
larger positive numbers with the same bit size:

• uint8 → 0 to 255.
• uint16 → 0 to 65,535.
• uint32 → 0 to 4,294,967,295.
• uint64 → 0 to about 18 quintillion.
Example: np.array([10, 20, 30], dtype='uint8')
(c) Floating-Point Types: These are numbers with decimals. Based on precision (accuracy), they
are:

• float16 → half-precision (less accurate, uses less memory)

• float32 → single-precision (commonly used)
• float64 → double-precision (default in NumPy, more accurate but uses more
memory)
Example: np.array([1.5, 2.5], dtype='float32')

(d) Complex Numbers: Complex numbers have a real part and an imaginary part. NumPy
supports:

• complex64 → 32-bit real + 32-bit imaginary

• complex128 → 64-bit real + 64-bit imaginary
Example: np.array([2+3j, 4+5j], dtype='complex128')

2. Boolean Data Type (bool): It stores only True or False.

Example: np.array([True, False, True], dtype='bool')

3. String Data Type (str): It stores fixed-length text. For variable-length text, we can use object
dtype.

Example: np.array(["apple", "banana"], dtype='str')

4. Object Data Type (object): It allows storing Python objects of different types inside one array
(like a mix of numbers, strings, etc.).

Example: np.array([1, "two", 3.5], dtype='object')

5. Structured Data Types: Sometimes, we want to store records (like rows in a table) where each
row has different fields (name, age, height). NumPy supports this through structured arrays
using np.dtype. This works like a small table inside an array.

Example:

import numpy as np
data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)
print(structured_array)

Output:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
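
The integer ranges and floating-point precisions listed above can be checked with np.iinfo and np.finfo. The sketch below also shows one consequence of fixed-width integers: results that do not fit wrap around silently. The specific values are chosen only for illustration.

```python
import numpy as np

# Limits of integer types
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
print(np.iinfo(np.uint16).max)                        # 65535

# Machine epsilon: the gap between 1.0 and the next representable float
eps32 = np.finfo(np.float32).eps
eps64 = np.finfo(np.float64).eps
print(eps32 > eps64)  # True: float32 is less precise than float64

# Fixed-width integers wrap around when a result does not fit
a = np.array([250], dtype=np.uint8)
b = np.array([10], dtype=np.uint8)
wrapped = a + b
print(wrapped)  # [4], because 260 does not fit in 8 bits
```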

Advantages of NumPy Data Types:

• Memory Efficiency → Uses less memory compared to normal Python lists.

• High Performance → Computations are much faster (written in C under the hood).
• Precision Control → We can choose how accurate we want numbers to be (float16, float32,
float64).
• Flexibility → Can handle structured and complex data easily.
Disadvantages of NumPy Data Types:

• Homogeneous Requirement → Normal ndarrays must store elements of the same type (except
structured arrays).
• Memory Usage → Large arrays with high precision (float64) take a lot of memory.
• Learning Curve → Beginners may find it confusing to select the right dtype.
Applications of Numpy Data Types:

• Scientific Computing – Used for simulations and complex calculations. Floating types help
control accuracy.

• Data Analysis – Handle large datasets efficiently; right data type saves memory and speeds up
processing.
• Machine Learning – Store and process big datasets; improves model training speed and reduces
memory use.
• Image Processing – Images stored as arrays; data type (e.g., uint8) affects memory use and
image quality.
Example:

import numpy as np
# Integer Array
int_array = np.array([1, 2, 3, 4], dtype='int32')
print("Integer array:", int_array)
# Float Array
float_array = np.array([1.1, 2.2, 3.3], dtype='float64')
print("Float array:", float_array)
# Complex Array
complex_array = np.array([1+2j, 3+4j], dtype='complex128')
print("Complex array:", complex_array)
# Boolean Array
bool_array = np.array([True, False, True], dtype='bool')
print("Boolean array:", bool_array)
# Structured Array
data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)
print("Structured array:\n", structured_array)

Output:
Integer array: [1 2 3 4]
Float array: [1.1 2.2 3.3]
Complex array: [1.+2.j 3.+4.j]
Boolean array: [ True False  True]
Structured array:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
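
The effect of the data type on memory use can be measured with the nbytes attribute. A minimal sketch, assuming an array of one million elements (the size is arbitrary):

```python
import numpy as np

n = 1_000_000
as_float64 = np.zeros(n, dtype=np.float64)
as_float32 = as_float64.astype(np.float32)
as_uint8 = np.zeros(n, dtype=np.uint8)

print(as_float64.nbytes)  # 8000000 bytes (8 per element)
print(as_float32.nbytes)  # 4000000 bytes (4 per element)
print(as_uint8.nbytes)    # 1000000 bytes (1 per element)
```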

Arithmetic with NumPy Arrays:

NumPy is a Python library mainly used for numerical and scientific computing. One of its most
powerful features is the ability to perform arithmetic operations on arrays very efficiently. Normally
in Python, if we want to add or multiply elements in a list, we need to use loops, which can be slow
for large data. With NumPy arrays, we can do the same operations faster and with simpler code,
because NumPy is built using optimized C code behind the scenes.

1. Basic Arithmetic Operations

a) Element-wise Operations: When we add, subtract, multiply, or divide two NumPy arrays, the
operation is done element by element. That means the first element of one array is added to the
first element of the other array, and so on.

Example:

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)

Output:
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4 0.5 ]

b) Universal Functions (ufuncs): NumPy also provides ready-made functions to do the same
operations, such as np.add, np.subtract, np.multiply, and np.divide. These do the same
element-wise operations but in a more explicit way.
Example:

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
addition = np.add(a, b)
multiplication = np.multiply(a, b)
print("Addition using ufunc:", addition)
print("Multiplication using ufunc:", multiplication)

Output:
Addition using ufunc: [5 7 9]
Multiplication using ufunc: [ 4 10 18]

2. Broadcasting: Broadcasting is a very important feature in NumPy. It means that arrays of
different shapes can still be combined in arithmetic operations, as long as their shapes are
compatible, by automatically adjusting the smaller one.
a) Scalar with Array: If we add a single number (scalar) to an array, NumPy automatically adds
that number to all elements of the array.

Example:

a = np.array([1, 2, 3])
print(a + 5)

Output:
[6 7 8]

b) Arrays of Different Shapes: If the arrays have compatible shapes, NumPy “stretches” the
smaller array to match the bigger one.
Example:

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(a + b)

Output:
[[11 22 33]
[14 25 36]]
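
What "compatible shapes" means can be sketched with a case that fails and one way to fix it. This is a minimal illustration; the array col is an assumption made up for this example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
col = np.array([100, 200])             # shape (2,)

# Shapes are compared from the right: 3 and 2 differ and neither is 1,
# so NumPy refuses to combine these two arrays.
try:
    a + col
except ValueError as err:
    print("Cannot broadcast:", err)

# Adding a new axis gives col the shape (2, 1), which can be
# stretched across the three columns of a.
result = a + col[:, np.newaxis]
print(result)
```

Two dimensions are compatible when they are equal or when one of them is 1; checking the shapes first avoids most broadcasting surprises.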

3. Advanced Arithmetic Operations:

a) Matrix Multiplication: In mathematics, matrix multiplication is different from element-wise
multiplication. NumPy provides the @ operator or np.matmul() for this (np.dot() gives the same
result for 2D arrays).
Example:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)

Output:
[[19 22]
[43 50]]
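
The difference between the two kinds of multiplication can be seen by applying both to the same pair of matrices. This is a small sketch that reuses the values from the example above.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B        # multiplies matching positions only
matrix_product = A @ B     # rows of A combined with columns of B

print("Element-wise:\n", elementwise)
print("Matrix product:\n", matrix_product)

# The @ operator and np.matmul() give the same result.
same_result = np.array_equal(matrix_product, np.matmul(A, B))
print("Same as np.matmul:", same_result)
```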

b) Element-wise Power: We can raise each element of an array to a power.

Example:
a = np.array([1, 2, 3])
print(a ** 2)

Output:
[1 4 9]

Advantages of Arithmetic with NumPy Arrays:

• Fast and Efficient: Operations are much faster than normal Python lists because NumPy uses
optimized C code.
• Less Code: No need for loops. One line of code can handle the whole array.
• Broadcasting: Easy to work with arrays of different sizes.
• Consistency: Functions are simple and work in the same way for different operations.
Disadvantages of Arithmetic with NumPy Arrays:

• Memory Usage: For very large arrays, it may use a lot of memory.
• Same Data Type: All elements in a NumPy array must be of the same type (all integers or all
floats).
• Broadcasting Confusion: If we don’t understand broadcasting rules, we may get wrong results.
Applications:

• Scientific Computing: Used in simulations, physics, and mathematics.

• Data Analysis: Used for cleaning, transforming, and analyzing datasets.
• Machine Learning: Arrays are used to store features and perform calculations in algorithms.
• Image Processing: Images are stored as arrays of pixels, so operations like brightness
adjustment or filtering are done using NumPy arithmetic.

Example:

import numpy as np
# Basic arithmetic
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
# Broadcasting with scalar
scalar = 10
print("Scalar addition:", a + scalar)
# Broadcasting with arrays
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([10, 20, 30])
print("Broadcasted addition:\n", c + d)
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix multiplication:\n", A @ B)
# Element-wise power
print("Element-wise power:", a ** 2)

Output:
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4  0.5 ]
Scalar addition: [11 12 13]
Broadcasted addition:
[[11 22 33]
[14 25 36]]
Matrix multiplication:
[[19 22]
[43 50]]
Element-wise power: [1 4 9]

Basic Indexing and Slicing in NumPy:

When we work with data in NumPy arrays, we often need to pick out specific values or parts of the
array. This is done using indexing (to select individual elements) and slicing (to select a range or block
of elements). These two operations are the foundation for handling data in Python with NumPy.

1. Indexing: Indexing refers to accessing individual elements of an array using their positions.
a) Indexing in 1D Arrays: A 1D array is like a simple list. Index numbers start from 0 (first
element) and go up to n-1 (last element).

Example:

import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
print("First element:", array1D[0])
print("Third element:", array1D[2])
print("Last element:", array1D[-1])

Output:
First element: 10
Third element: 30
Last element: 50

b) Indexing in 2D Arrays: A 2D array looks like a table (rows and columns). We use a pair
of indices:
• First index → Row number
• Second index → Column number
Example:

import numpy as np
array2D = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])
print("Element at (1,2):", array2D[1, 2]) # 6
print("Element at (0,0):", array2D[0, 0]) # 1
print("Element at (-1,-1):", array2D[-1, -1]) # 9

Output:
Element at (1,2): 6
Element at (0,0): 1
Element at (-1,-1): 9

2. Slicing: Slicing allows us to select subarrays (a block of data) instead of single
elements. Slicing in Python is a technique used to extract a portion of data from sequences such
as lists, strings, or NumPy arrays by specifying a range of indices. It follows the general syntax
sequence[start:stop:step], where the start index is included, the stop index is excluded, and
the step defines the interval between elements. If the start or stop values are omitted, Python
assumes defaults (beginning or end of the sequence), and if the step is omitted, it defaults to 1.

Syntax:

array[start : stop : step]

• start → Index where slice begins (default = 0).

• stop → Index where slice ends (not included).
• step → Interval between elements (default = 1).

a) Slicing in 1D Arrays:
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
print("Slice from index 1 to 4:", array1D[1:4]) # [20 30 40]
print("Every second element:", array1D[::2]) # [10 30 50]
print("From index 2 till end:", array1D[2:]) # [30 40 50]
print("From start to index 3:", array1D[:3]) # [10 20 30]

Output:
Slice from index 1 to 4: [20 30 40]
Every second element: [10 30 50]
From index 2 till end: [30 40 50]
From start to index 3: [10 20 30]
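
The start, stop, and step values can also be negative. The following is a small sketch on the same array as above; the variable names are used only for illustration.

```python
import numpy as np

array1D = np.array([10, 20, 30, 40, 50])

reversed_array = array1D[::-1]       # a negative step walks backwards
last_two = array1D[-2:]              # negative start counts from the end
middle_backwards = array1D[3:0:-1]   # from index 3 down to index 1 (0 is excluded)

print("Reversed:", reversed_array)
print("Last two:", last_two)
print("Index 3 down to 1:", middle_backwards)
```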

b) Slicing in 2D Arrays: We can slice rows and columns together. Slicing creates a view of the
original array (not a copy), so changes made through the slice also change the original array.

Example:
import numpy as np
array2D = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])
print("Rows 0 to 1, columns 1 to 2:\n", array2D[0:2, 1:3])
print("Every second row and column:\n", array2D[::2, ::2])

Output:
Rows 0 to 1, columns 1 to 2:
[[2 3]
[5 6]]
Every second row and column:
[[1 3]
[7 9]]
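
Because a slice is a view, writing to it also writes to the original array. The sketch below shows this behaviour and how an explicit copy avoids it; the names view and independent are chosen only for this example.

```python
import numpy as np

array2D = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

view = array2D[0:2, 1:3]      # basic slice: shares memory with array2D
view[0, 0] = 99               # this also changes array2D[0, 1]
print("After editing the view:", array2D[0, 1])

independent = array2D[0:2, 1:3].copy()   # explicit copy: separate memory
independent[0, 0] = -5                   # array2D is not affected
print("After editing the copy:", array2D[0, 1])

print("View shares memory:", np.shares_memory(array2D, view))
print("Copy shares memory:", np.shares_memory(array2D, independent))
```

Calling .copy() is the usual choice when a slice must be modified without touching the source data.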

c) Advanced Indexing: Sometimes basic indexing and slicing are not enough. NumPy provides
advanced indexing methods:
• Boolean Indexing: We can select elements that satisfy a condition using a boolean mask.
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
mask = array1D > 25
print("Boolean mask:", mask)
print("Elements greater than 25:", array1D[mask])

Output:
Boolean mask: [False False True True True]
Elements greater than 25: [30 40 50]

• Integer Array Indexing: We can specify a list of indices to pick multiple elements.
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
indices = [0, 2, 4]
print("Selected elements:", array1D[indices])
Output:
Selected elements: [10 30 50]

Advantages of Indexing & Slicing:

• Fast and efficient – No need for loops, operations are optimized

• Flexible – Works on 1D, 2D, and higher dimensions.
• Concise – Easy to write and read.
• Powerful – Supports conditions (boolean masks) and multiple indices.
Disadvantages of Indexing & Slicing:

• Care needed – Wrong indices may cause errors.

• Complexity increases – With high-dimensional arrays, syntax may get confusing.
• Advanced indexing is slower than simple slicing because it copies the selected data.
• Views vs Copies – Basic slicing creates a view, not a copy, so if we modify the slice, the
original array also changes. Boolean and integer array indexing return copies.

Applications:

• Data Analysis – Extract rows/columns of data.

• Image Processing – Crop, zoom, or modify parts of an image.
• Scientific Computing – Analyze subsets of large data.
• Machine Learning – Split data into training and testing sets.
Example:

import numpy as np
# Create a 2D array
array = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120]])
# Basic indexing
print("Element at (1, 2):", array[1, 2])
# Negative indexing
print("Element at (-1, -1):", array[-1, -1])
# Basic slicing (the stop index is excluded)
print("Rows 0 to 1 and columns 1 to 2:\n", array[0:2, 1:3])
# Slicing with step
print("Every second row, first two columns:\n", array[::2, :2])
# Boolean indexing
mask = array[:, 0] > 30
print("Rows where first column > 30:\n", array[mask])
# Integer array indexing
indices = [0, 2]
print("Rows at indices 0 and 2:\n", array[indices])

Output:
Element at (1, 2): 70
Element at (-1, -1): 120
Rows 0 to 1 and columns 1 to 2:
[[20 30]
[60 70]]
Every second row, first two columns:
[[ 10 20]
[ 90 100]]
Rows where first column > 30:
[[ 50 60 70 80]
[ 90 100 110 120]]
Rows at indices 0 and 2:
[[ 10 20 30 40]
[ 90 100 110 120]]

Boolean Indexing in NumPy:

When we work with large datasets, we often want to pick out only the values that meet certain
conditions. For example, we may want to extract all numbers greater than 50, or all marks greater than
35 from a student marks list. In NumPy, this can be done easily using Boolean Indexing.

Boolean indexing means selecting elements from an array based on a condition.

First, we create a Boolean mask → an array of the same shape that contains only True or False values.
Each True represents an element that satisfies the condition, and False means it does not. When this
mask is applied to the original array, we get only those elements where the mask is True.
How Boolean Indexing Works:

Step 1: Create a Boolean Mask

We apply a condition on the array. This gives us a Boolean array.

Example:

import numpy as np
# Create an array
array = np.array([10, 20, 30, 40, 50])
# Condition: elements greater than 25
mask = array > 25
print("Boolean mask:", mask)

Output:
Boolean mask: [False False True True True]

Step 2: Apply the Mask

We use this mask to filter the array.

Example:

import numpy as np
# Create an array
array = np.array([10, 20, 30, 40, 50])
# Condition: elements greater than 25
mask = array > 25
filtered_array = array[mask]
print("Filtered array:", filtered_array)

Output:
Filtered array: [30 40 50]
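
Several conditions can be combined into one mask. The sketch below uses a hypothetical marks array, following the student marks example from the introduction; note that NumPy needs the operators &, |, and ~ (with parentheses around each condition) instead of the Python keywords and, or, and not.

```python
import numpy as np

marks = np.array([28, 35, 47, 62, 90, 15])

passed = marks[marks >= 35]                        # single condition
mid_range = marks[(marks >= 35) & (marks < 70)]    # AND: both must be true
extremes = marks[(marks < 20) | (marks > 80)]      # OR: either may be true
failed = marks[~(marks >= 35)]                     # NOT: inverts the mask

print("Passed:", passed)
print("Between 35 and 69:", mid_range)
print("Below 20 or above 80:", extremes)
print("Failed:", failed)
```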

Advantages of Boolean Indexing:

• Readability → The condition is written clearly in one line. For example, array[array > 25]
is more readable than writing a loop.
• Efficiency → Very fast because NumPy operations are optimized in C.
• Flexibility → Works for 1D, 2D, or higher-dimensional arrays.
Disadvantages of Boolean Indexing:

• Memory Usage → The mask itself is another array, so with huge datasets, it takes extra
memory.
• Performance Issues with Large Masks → Creating very large boolean masks can slow down
performance.
• Errors Possible → If the mask shape does not match the array shape, NumPy raises an IndexError.
Applications of Boolean Indexing:

• Data Filtering → Extracting specific rows or values (e.g., marks > 35).
• Data Cleaning → Removing unwanted data points.
• Conditional Updates → Changing only the values that satisfy a condition.

Example:
import numpy as np
# Create a 2D array
array = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
# Create a boolean mask to select elements greater than 50
mask = array > 50
print("Boolean mask:\n", mask)
# Apply the mask to filter elements
filtered_array = array[mask]
print("Filtered array:", filtered_array)
# Example of modifying elements using boolean indexing
array[mask] = -1
print("Modified array:\n", array)

Output:
Boolean mask:
[[False False False]
[False False  True]
[ True  True  True]]
Filtered array: [60 70 80 90]
Modified array:
[[10 20 30]
[40 50 -1]
[-1 -1 -1]]

Transposing Arrays and Swapping Axes in NumPy:
Working with data often requires changing its orientation or the way its rows and columns are arranged.
In NumPy, two very important operations help us do this:

1. Transposing arrays
2. Swapping axes

Both operations are useful for reshaping data, making it ready for mathematical operations like
multiplication, or preparing it for analysis in machine learning and data science.

1. Transposing Arrays:
Meaning: Transposing an array means flipping its rows into columns and columns into rows.
For a 2D array (matrix), it simply means exchanging rows with columns. For arrays with more
than two dimensions, the transpose operation rearranges the axes based on the given order.

How it is done in NumPy:

• np.transpose(a, axes=None) → Returns a view of array a with its axes permuted
according to axes (or reversed if not given).
• array.T → Shortcut for the transpose with all axes reversed; for a 2D array, rows become
columns and vice versa.
Example:

import numpy as np
# 2D array (matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D array:\n", array_2d)
# Transpose the array
transposed_array = array_2d.T
print("Transposed array:\n", transposed_array)

Output:
Original 2D array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
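
For arrays with more than two dimensions, the axes argument decides the new order. The following is a minimal sketch, assuming a 3D array of shape (2, 3, 4) created only for illustration.

```python
import numpy as np

array_3d = np.arange(24).reshape(2, 3, 4)    # shape (2, 3, 4)

reversed_axes = array_3d.T                   # all axes reversed
reordered = np.transpose(array_3d, axes=(1, 0, 2))   # swap only the first two axes

print("Original shape:", array_3d.shape)
print("Shape after .T:", reversed_axes.shape)
print("Shape after axes=(1, 0, 2):", reordered.shape)

# The data is unchanged, only the index order differs:
# reordered[j, i, k] refers to the same value as array_3d[i, j, k].
print(array_3d[1, 2, 3], reordered[2, 1, 3])
```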

Why is transpose important?

• In Mathematics – Used in linear algebra for matrix multiplication, dot products, and solving
equations.
• In Data Science – Aligns data when rows and columns need to be flipped (e.g., features vs.
samples).
• In Image Processing – Transposing can flip images or rearrange pixel values.

2. Swapping Axes:

Meaning: Sometimes, instead of flipping all rows and columns, we may want to rearrange only
specific dimensions (axes) of a multidimensional array. Swapping axes means exchanging any
two given axes (dimensions) of the array.

How it is done in NumPy:

• np.swapaxes(a, axis1, axis2) → Returns a view of array a with the two specified
axes interchanged.

Here:

o axis1 → The first axis you want to swap.

o axis2 → The second axis you want to swap.

Example:

import numpy as np
# 3D array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("Original 3D array:\n", array_3d)
# Swap the first and last axes
swapped_array = np.swapaxes(array_3d, 0, 2)
print("Array after swapping axes:\n", swapped_array)

Output:
Original 3D array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Array after swapping axes:
[[[1 5]
[3 7]]
[[2 6]
[4 8]]]
Here:

• The first axis (0) and the last axis (2) have been swapped.
• This changes how the data is arranged, but it doesn’t lose any information.
Advantages:

• Flexibility – We can easily rearrange data dimensions as required.

• Convenience – NumPy provides simple functions (.T and np.swapaxes) for these operations.
• Essential for Mathematics – Many matrix operations (like finding transpose in linear algebra)
directly use these.

Disadvantages:

• Memory Usage – Transposing and swapping axes normally give just a “view” of the data, but the
result is no longer stored contiguously, so later steps that need a contiguous copy may use
extra memory for large arrays.
• Complexity in Higher Dimensions – For arrays with 3 or more dimensions, it can become
confusing to keep track of which axis is being swapped.

Applications:

• Data Alignment → Adjusting dimensions of datasets for compatibility.

• Matrix Operations → Transpose is required in matrix multiplication, solving equations,
eigenvalue problems, etc.
• Image Processing → Images are stored as arrays. Transpose or axis swapping helps in rotating or
reordering pixel data.
• Machine Learning → Data often needs to be rearranged before feeding into models (for example,
channels-first vs channels-last representation in deep learning).
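
The channels-first versus channels-last case can be sketched as follows. This is a minimal illustration, assuming a blank 32 x 48 RGB image; the sizes and variable names are hypothetical.

```python
import numpy as np

# Channels-last layout: (height, width, channels)
image_hwc = np.zeros((32, 48, 3), dtype=np.uint8)

# Channels-first layout: (channels, height, width)
image_chw = np.transpose(image_hwc, (2, 0, 1))

# Swapping height and width only: (width, height, channels)
image_whc = np.swapaxes(image_hwc, 0, 1)

print(image_hwc.shape, image_chw.shape, image_whc.shape)

# Both results are views, so no pixel data has been copied yet.
print("Shares memory:", np.shares_memory(image_hwc, image_chw))

# A contiguous copy is made only when explicitly requested.
contiguous = np.ascontiguousarray(image_chw)
print("Copy shares memory:", np.shares_memory(image_hwc, contiguous))
```

np.transpose is used for the layout change because three axes move at once, while np.swapaxes exchanges exactly two.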

Example:

import numpy as np
# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D array:\n", array_2d)
# Transpose the array
transposed_array = array_2d.T
print("Transposed array:\n", transposed_array)
# Create a 3D array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("Original 3D array:\n", array_3d)
# Swap the first and last axes
swapped_array = np.swapaxes(array_3d, 0, 2)
print("Array with swapped axes:\n", swapped_array)

Output:
Original 2D array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
Original 3D array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Array with swapped axes:
[[[1 5]
[3 7]]
[[2 6]
[4 8]]]

Universal functions (ufuncs): Fast element-wise array functions
In NumPy, Universal Functions (ufuncs) are special functions that work on each element of an array
individually. They are written in the C programming language inside NumPy, which makes them very
fast compared to normal Python loops. Instead of writing long loops to apply a formula on each
element, we can simply use a ufunc, and it will apply the operation to all elements automatically.

For example, if we want to find the square root of each element in an array, a ufunc like np.sqrt() can
do this in one line.

How They Work:

Universal functions perform operations element by element. They use vectorization, meaning the
operation happens on the whole array at once (without writing loops). Since they are optimized in C
language, they run much faster than pure Python code.

Example:

import numpy as np
array = np.array([1, 2, 3, 4])
sqrt_array = np.sqrt(array)
print("Square root:", sqrt_array)

Output:
Square root: [1. 1.41421356 1.73205081 2.]
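
To see what a ufunc replaces, the same calculation can be written once as a Python loop and once as a vectorized call. This is a small sketch with illustrative values; it compares results only and does not measure speed.

```python
import math
import numpy as np

values = [1, 4, 9, 16]

# Pure Python: one function call per element.
loop_result = [math.sqrt(v) for v in values]

# NumPy ufunc: one call for the whole array.
ufunc_result = np.sqrt(np.array(values))

print("Loop result:", loop_result)
print("Ufunc result:", ufunc_result)
print("Same values:", np.allclose(loop_result, ufunc_result))

# Ufuncs that take two inputs also broadcast, e.g. a scalar against an array.
clipped = np.maximum(np.array(values), 5)
print("Element-wise maximum with 5:", clipped)
```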

Advantages:

• Speed – Ufuncs are very fast because they are built in C and avoid Python loops.
• Simple Code – One line of code replaces many lines of loops.
• Broadcasting Support – Ufuncs can also work with arrays of different shapes, making them
flexible.

Disadvantages:

• High Memory Use – For very large arrays, they may consume more memory.
• Limited Scope – They are mainly for element-wise operations (not for complex logic).
• Errors in Special Cases – Some operations (like division by zero on floating-point arrays) give
a warning by default and produce inf or nan values instead of stopping the program.
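
The division-by-zero case can be sketched as follows. This is one possible way to handle it, using np.errstate to silence the warning and the where argument of np.divide to skip zero denominators; the input values are made up for illustration.

```python
import numpy as np

numerator = np.array([1.0, 0.0, -1.0, 3.0])
denominator = np.array([0.0, 0.0, 0.0, 2.0])

# Silence the warnings for this block only; the special values remain.
with np.errstate(divide="ignore", invalid="ignore"):
    result = np.divide(numerator, denominator)
print("Raw result:", result)          # contains inf, nan and -inf

# Divide only where the denominator is non-zero; other positions keep
# the value already stored in the out array (0.0 here).
safe = np.divide(numerator, denominator,
                 out=np.zeros_like(numerator),
                 where=denominator != 0)
print("Safe result:", safe)
```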
Applications:

• Mathematics – For operations like logarithm, square root, trigonometric functions.

• Data Transformation – Normalizing or scaling data in machine learning.
• Image Processing – Working on pixels element by element (e.g., filters, brightness changes).
Example:

import numpy as np
array = np.array([1, 4, 9, 16])
square_root = np.sqrt(array) # Square root
squared = np.square(array) # Square of each element
logarithm = np.log(array) # Natural log
print("Original array:", array)
print("Square root:", square_root)
print("Squared:", squared)
print("Logarithm:", logarithm)

Output:
Original array: [ 1  4  9 16]
Square root: [1. 2. 3. 4.]
Squared: [  1  16  81 256]
Logarithm: [0.         1.38629436 2.19722458 2.77258872]

Mathematical and Statistical Methods – Sorting:

Sorting means arranging data in order, either from smallest to largest (ascending) or from largest to
smallest (descending). It is one of the most important methods in mathematics, statistics, and computer
science because it helps us organize information and makes it easier to analyze.

In NumPy, sorting is very easy and efficient. The two main functions used are:

• np.sort() → Gives a new array with all elements arranged in ascending order.
• np.argsort() → Gives the positions (indices) that can be used to rearrange the array into sorted
order.

For example, if we have an array of numbers, we can sort it directly or find the order in which we
should arrange its elements. In the case of multidimensional arrays (matrices), sorting can be done
along rows or columns, which makes it flexible for handling real-world datasets.

How Sorting Works:

• np.sort() → sorts the values directly and returns a new sorted array.
• np.argsort() → returns the index numbers of elements in sorted order (which can be used to
rearrange the data).
Example:

import numpy as np
# Create an array
array = np.array([3, 1, 2, 5, 4])
# Sort the array
sorted_array = np.sort(array)
print("Sorted array:", sorted_array)
# Get indices that sort the array
sorted_indices = np.argsort(array)
print("Indices to sort the array:", sorted_indices)

Output:
Sorted array: [1 2 3 4 5]
Indices to sort the array: [1 2 0 4 3]

Here, the numbers are arranged in ascending order. The indices show where the numbers originally
were before sorting.
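
The indices from np.argsort() are most useful for reordering a second, related array. The sketch below assumes hypothetical names and marks arrays; it also shows one common way to get descending order, by reversing the ascending result.

```python
import numpy as np

names = np.array(["Asha", "Ben", "Chen", "Dev"])
marks = np.array([72, 91, 65, 84])

order = np.argsort(marks)             # positions of marks from lowest to highest
print("Lowest to highest:", names[order])

descending_marks = np.sort(marks)[::-1]   # reverse the ascending result
top_first = names[order[::-1]]            # same reversal applied to the indices
print("Marks, highest first:", descending_marks)
print("Names, highest first:", top_first)
```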
Advantages of Sorting:

• Efficiency – Sorting in NumPy is fast and can handle large data sets easily because it uses
powerful algorithms like quicksort, mergesort, and heapsort.
• Flexibility – Sorting works not only on 1D arrays but also on multi-dimensional arrays along
any row or column.

Disadvantages of Sorting:

• Extra Memory Usage – When using np.sort(), a new array is created, which requires extra
memory.
• Complexity in Higher Dimensions – Sorting 2D or 3D arrays can be confusing because we
need to decide whether to sort by row, by column, or by another axis.
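
When the extra array is a concern, the array's own sort() method can be used instead, since it reorders the data in place. This is a minimal sketch that contrasts the two calls on a small 2D array.

```python
import numpy as np

array_2d = np.array([[3, 1, 2], [6, 5, 4]])

# np.sort() returns a new array and leaves array_2d unchanged.
sorted_copy = np.sort(array_2d, axis=1)
unchanged_first_row = array_2d[0].tolist()
print("Original first row after np.sort():", unchanged_first_row)

# ndarray.sort() sorts each row in place and returns None.
array_2d.sort(axis=1)
print("After in-place sort:\n", array_2d)
```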

Applications of Sorting:

• Data Analysis – Helps in finding patterns, trends, or outliers in data.

• Algorithm Optimization – Many algorithms work faster on sorted data.
• Data Preparation – Before making graphs or performing statistical analysis, sorting helps
organize the dataset.

Example:

import numpy as np
# 1D Array
array = np.array([7, 2, 5, 1, 9])
sorted_array = np.sort(array)
print("Sorted 1D array:", sorted_array)
# 2D Array
array_2d = np.array([[3, 1, 2], [6, 5, 4]])
# Sort column-wise (axis=0)
sorted_axis0 = np.sort(array_2d, axis=0)
print("Sorted 2D array along axis 0:\n", sorted_axis0)
# Sort row-wise (axis=1)
sorted_axis1 = np.sort(array_2d, axis=1)
print("Sorted 2D array along axis 1:\n", sorted_axis1)

Output:
Sorted 1D array: [1 2 5 7 9]
Sorted 2D array along axis 0:
[[3 1 2]
[6 5 4]]
Sorted 2D array along axis 1:
[[1 2 3]
[4 5 6]]

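Here, the 1D array is sorted normally. In the 2D array, axis=0 sorts the values within each column,
while axis=1 sorts the values within each row. The columns of this array are already in ascending
order, so the axis=0 result is the same as the original array.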

Unique and Set Logic in NumPy:
When we work with data, sometimes we get repeated values or need to compare two datasets. NumPy
makes this easy by giving us special functions to handle unique values and set operations (like
intersection, union, and difference). These operations are very useful in data cleaning, analysis, and
comparison.

1. Unique Elements:
The np.unique() function helps us find the unique (non-repeated) values in an array. It removes
duplicates. It can also give us how many times each value appears (with return_counts=True). By
default, it returns the results in sorted order.

Example:
import numpy as np
array = np.array([1, 2, 2, 3, 4, 4, 5])
unique_elements = np.unique(array)
print("Unique elements:", unique_elements)

Output:
Unique elements: [1 2 3 4 5]

Here, duplicates like 2 and 4 are removed.
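
Counting how often each value appears can be sketched with the return_counts option of np.unique(), using the same array as above.

```python
import numpy as np

array = np.array([1, 2, 2, 3, 4, 4, 5])

# With return_counts=True, a second array holds the number of occurrences.
values, counts = np.unique(array, return_counts=True)
print("Values:", values)
print("Counts:", counts)

# Pair each value with its count for easier reading.
frequency = dict(zip(values.tolist(), counts.tolist()))
print("Frequency table:", frequency)
```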

2. Set Logic Operations:

Just like in Mathematics, NumPy allows us to do set operations on arrays:

• Intersection (np.intersect1d()) → Finds the common elements in two arrays.

• Union (np.union1d()) → Combines both arrays and gives all unique elements.
• Difference (np.setdiff1d()) → Finds elements that are in the first array but not in the second.
Example:

import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([3, 4, 5, 6])
# Intersection
print("Intersection:", np.intersect1d(array1, array2))
# Union
print("Union:", np.union1d(array1, array2))
# Difference
print("Difference:", np.setdiff1d(array1, array2))

Output:
Intersection: [3 4]
Union: [1 2 3 4 5 6]
Difference: [1 2]

Advantages:

• Fast and efficient: Works quickly even on large datasets.

• Simple to use: No need to write long code for finding unique, union, or intersection.
Disadvantages:

• Memory usage: Large arrays may need more memory.

• Order change: By default, results are sorted, so original order may be lost.
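
If the original order matters, one possible workaround is to ask np.unique() for the position of each first occurrence and then sort those positions. This is a sketch under that assumption; the labels array is hypothetical.

```python
import numpy as np

labels = np.array(["b", "a", "b", "c", "a"])

sorted_unique = np.unique(labels)     # sorted result: original order is lost
print("Sorted unique:", sorted_unique)

# return_index=True also gives the index of each value's first occurrence.
_, first_index = np.unique(labels, return_index=True)

# Sorting those indices restores the order of first appearance.
in_original_order = labels[np.sort(first_index)]
print("Unique in original order:", in_original_order)
```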
Applications:

• Data Cleaning: Removing duplicates.

• Data Analysis: Comparing two datasets to see what is common or different.
• Machine Learning: Useful when checking for unique labels or categories.

Top 20 Python Libraries For Data Science
No ratings yet
Top 20 Python Libraries For Data Science
15 pages
TY FDS Workbook
No ratings yet
TY FDS Workbook
56 pages
PYTHON
No ratings yet
PYTHON
11 pages
Essential Python Libraries for Data Science
No ratings yet
Essential Python Libraries for Data Science
4 pages
Python For Data Science
No ratings yet
Python For Data Science
8 pages
Dsbda Unit4
No ratings yet
Dsbda Unit4
110 pages
Python Libraries for B.Tech Students
No ratings yet
Python Libraries for B.Tech Students
17 pages
Python Libraries for Data Science
No ratings yet
Python Libraries for Data Science
6 pages
Data Ty
No ratings yet
Data Ty
59 pages
10 Essential Python Libraries For Data Professionals - by Sigli Mumuni - Medium
No ratings yet
10 Essential Python Libraries For Data Professionals - by Sigli Mumuni - Medium
6 pages
Data Science
No ratings yet
Data Science
17 pages
Python for Data Analysis Overview
No ratings yet
Python for Data Analysis Overview
49 pages
Top 18 Python Libraries for Data Science
100% (1)
Top 18 Python Libraries for Data Science
11 pages
Machine Learning Document
No ratings yet
Machine Learning Document
7 pages
Data Science Lecture 5 6th Semster
No ratings yet
Data Science Lecture 5 6th Semster
3 pages
Python For Data Science
No ratings yet
Python For Data Science
17 pages
Essential Python Tools for Data Science
No ratings yet
Essential Python Tools for Data Science
2 pages
40 Most Popular Python Scientific Libraries
No ratings yet
40 Most Popular Python Scientific Libraries
9 pages
Python Data Science - A Beginner's Guide To Mastering Analysis, Visualization, and Machine Learning by A. Eich Liana
No ratings yet
Python Data Science - A Beginner's Guide To Mastering Analysis, Visualization, and Machine Learning by A. Eich Liana
86 pages
Ass1 DSBDA Writeup
No ratings yet
Ass1 DSBDA Writeup
8 pages
Machine Learning With Data Science (1) - 5-31 (1) - 1-25
No ratings yet
Machine Learning With Data Science (1) - 5-31 (1) - 1-25
25 pages
Report Format (1) .Docx - 20240508 - 124537 - 0000
No ratings yet
Report Format (1) .Docx - 20240508 - 124537 - 0000
11 pages
Data Science With Python Unlocking Insights
No ratings yet
Data Science With Python Unlocking Insights
8 pages
Data Science 2
No ratings yet
Data Science 2
15 pages
Important Libraries For Data Science
No ratings yet
Important Libraries For Data Science
29 pages
5 Essential Python Libraries For Every Data Scientist
No ratings yet
5 Essential Python Libraries For Every Data Scientist
10 pages
Python Libraries Seminar Report
100% (2)
Python Libraries Seminar Report
16 pages
Practical 1
No ratings yet
Practical 1
8 pages
Python's Role in Data Science Explained
No ratings yet
Python's Role in Data Science Explained
2 pages
Exp 1
No ratings yet
Exp 1
22 pages
Cs3361 Data Science Laboratory
No ratings yet
Cs3361 Data Science Laboratory
139 pages
Essential Python Libraries for Data Science
No ratings yet
Essential Python Libraries for Data Science
12 pages
Data Science with Python Overview
No ratings yet
Data Science with Python Overview
14 pages
Final ppt-1
No ratings yet
Final ppt-1
17 pages
Core Libraries For Machine Learning
No ratings yet
Core Libraries For Machine Learning
5 pages
Data Preprocessing-AIML Algorithm1
No ratings yet
Data Preprocessing-AIML Algorithm1
47 pages
2 IntroPython
No ratings yet
2 IntroPython
18 pages
Staple Python Libraries For Data Science
No ratings yet
Staple Python Libraries For Data Science
26 pages
Python for Data Science Overview
No ratings yet
Python for Data Science Overview
20 pages
Top 5 Python Libraries for Data Science
100% (1)
Top 5 Python Libraries for Data Science
5 pages
Wa0005.
No ratings yet
Wa0005.
29 pages
Introduction To Popular-1
No ratings yet
Introduction To Popular-1
15 pages
D P Lab Manual
No ratings yet
D P Lab Manual
54 pages
Micro Project Report Format
No ratings yet
Micro Project Report Format
11 pages
Pai 6
No ratings yet
Pai 6
17 pages
AIES Assignment1
No ratings yet
AIES Assignment1
15 pages
Python For Data Science Extended Ebook PDF
100% (5)
Python For Data Science Extended Ebook PDF
56 pages
Libraries For Data Science
No ratings yet
Libraries For Data Science
2 pages
Lecture 4
No ratings yet
Lecture 4
33 pages
Exp 1 Dav
No ratings yet
Exp 1 Dav
3 pages
Content Beyond The Syllabus-23ge301
No ratings yet
Content Beyond The Syllabus-23ge301
16 pages
Module 1.foundations of Data Science
No ratings yet
Module 1.foundations of Data Science
17 pages
Asm 135233
No ratings yet
Asm 135233
3 pages
Unit 5 Notes: 1. Integrated Development Environment
No ratings yet
Unit 5 Notes: 1. Integrated Development Environment
6 pages
15 Python Libraries For Data Science
No ratings yet
15 Python Libraries For Data Science
17 pages
Lab - Manual FDS
No ratings yet
Lab - Manual FDS
12 pages
Python Libraries For Data Science
No ratings yet
Python Libraries For Data Science
10 pages
ABAP Objects: Comprehensive Guide
No ratings yet
ABAP Objects: Comprehensive Guide
51 pages
Virus Scan Log with Errors
No ratings yet
Virus Scan Log with Errors
3 pages
(Ebook PDF) Illustrated Microsoft Office 365 & Access 2016: Introductory Full Access
100% (6)
(Ebook PDF) Illustrated Microsoft Office 365 & Access 2016: Introductory Full Access
146 pages
In The Name of God Ravi Subramanian PDF Download
No ratings yet
In The Name of God Ravi Subramanian PDF Download
31 pages
Cisco IOS Shellcodes
No ratings yet
Cisco IOS Shellcodes
45 pages
Wacom Ink SDK Evaluation Agreement v202302 CLN With Email Verbiage
No ratings yet
Wacom Ink SDK Evaluation Agreement v202302 CLN With Email Verbiage
8 pages
MTP Main Document
No ratings yet
MTP Main Document
3 pages
Aditya Tiwari Pune 8.10 Yrs
No ratings yet
Aditya Tiwari Pune 8.10 Yrs
3 pages
C++ Software Development Expertise
No ratings yet
C++ Software Development Expertise
13 pages
Lexium 28 Servo Drives and BCH2 Servo Motors
No ratings yet
Lexium 28 Servo Drives and BCH2 Servo Motors
26 pages
Final Term Past Papers Mgt101
100% (1)
Final Term Past Papers Mgt101
8 pages
Homeworks QS Software Overview and Guide
100% (2)
Homeworks QS Software Overview and Guide
4 pages
C - Type Casting - Tutorialspoint
No ratings yet
C - Type Casting - Tutorialspoint
3 pages
Airline Crew Scheduling Review
100% (1)
Airline Crew Scheduling Review
27 pages
Hex D-type Flip-Flop HEF40174B
No ratings yet
Hex D-type Flip-Flop HEF40174B
5 pages
GATE Preparation Books for CS/IT
No ratings yet
GATE Preparation Books for CS/IT
2 pages
ApplicationServer 2014R2 RevA Presentation1
No ratings yet
ApplicationServer 2014R2 RevA Presentation1
96 pages
IOE Data Structure Notes
No ratings yet
IOE Data Structure Notes
132 pages
Prosteel AutoCAD Training Overview
No ratings yet
Prosteel AutoCAD Training Overview
10 pages
ICT Insights for Educators & Students
No ratings yet
ICT Insights for Educators & Students
131 pages
Microsoft Office Skills Self Assessment Guide
No ratings yet
Microsoft Office Skills Self Assessment Guide
10 pages
Free Cam
No ratings yet
Free Cam
16 pages
PRD - New CRM
No ratings yet
PRD - New CRM
11 pages
Unit 1
No ratings yet
Unit 1
24 pages
EU Declaration for MiR200 Autonomous Vehicle
No ratings yet
EU Declaration for MiR200 Autonomous Vehicle
1 page
Annotation in Rap
No ratings yet
Annotation in Rap
133 pages
Seminar 10
No ratings yet
Seminar 10
3 pages
Arati Musale Angular Developer
No ratings yet
Arati Musale Angular Developer
4 pages
The Ultimate Guide To Intelligent Document Processing 1709708578
No ratings yet
The Ultimate Guide To Intelligent Document Processing 1709708578
8 pages
OS Concepts Practice Quiz
No ratings yet
OS Concepts Practice Quiz
4 pages

UNIT III

PYTHON FOR DATA SCIENCE

• ndarray: Efficient storage and manipulation of multi-dimensional arrays.

Series – one-dimensional labeled array

• DataFrame: Stores tabular data with labeled rows and columns.

Name Age #filtered_df
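
As a minimal sketch of the Series and DataFrame ideas above, the following builds a small table with Name and Age columns and filters it. The sample names and ages, and the age threshold, are made up for illustration and are not taken from the notes.

```python
import pandas as pd

# Hypothetical sample data (illustrative only).
df = pd.DataFrame({"Name": ["Asha", "Ravi", "Meena"], "Age": [22, 35, 29]})

# A single column of a DataFrame is a Series (one-dimensional, labeled).
ages = df["Age"]

# A Boolean condition on a column keeps only the matching rows.
filtered_df = df[df["Age"] > 25]
print(filtered_df)
```

The filtered result keeps the original row labels, so the remaining rows are still indexed 1 and 2.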

• Supports line, bar, scatter, histogram, pie charts, etc.

import matplotlib.pyplot as plt

• Predefined styles and color themes.

import seaborn as sns

• Optimization: Algorithms to minimize or maximize functions.

from scipy import stats

• Supervised Learning: Classification and regression (e.g., Linear Regression, SVM,

✓ NumPy + Pandas for data handling,

Python Integrated Development Environments (IDEs)

Python IDEs combine several features into one environment:

• Code editor with syntax highlighting and auto-completion.

How Do Python IDEs Work?

Some of the Python IDEs:

3. Spyder: Spyder is designed specifically for scientific and data-driven programming. It

• Interactive Development: Allows live execution of code (e.g., Jupyter).

• Resource Intensive: Full-featured IDEs may slow down low-spec machines.

• Web Development & Deployment: Deploy ML models using Flask/Django in PyCharm or

Imagine a data science project on customer retention prediction:

float_array = np.array([1, 2, 3], dtype=np.float64)

array1 = np.array([1, 2, 3])

array = np.array([1, 4, 9, 16])
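
A possible use of an array like this one is a vectorized (universal) function call, which applies an operation to every element without a Python loop. The choice of np.sqrt here is an assumption; the notes do not show which function follows.

```python
import numpy as np

array = np.array([1, 4, 9, 16])

# np.sqrt is a universal function: it works element by element.
roots = np.sqrt(array)
print(roots)  # [1. 2. 3. 4.]
```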

Broadcasting: Broadcasting allows NumPy to perform operations on arrays of different shapes.

array1 = np.array([1, 2, 3])

array = np.array([[1, 2, 3], [4, 5, 6]])

• Data Analysis: Efficient manipulation and analysis of large datasets.

The NumPy ndarray:

An ndarray is described by two main things:

• Its shape → tells us how many rows, columns, or dimensions it has.
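
A short sketch of inspecting these properties on a small array. It assumes the second property is the data type (dtype), as in the standard NumPy description; the dtype is set explicitly so the result does not depend on the platform.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(arr.shape)  # (2, 3) -> 2 rows, 3 columns
print(arr.ndim)   # 2      -> number of dimensions
print(arr.dtype)  # int64  -> type of every element
```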

Creating ndarrays in NumPy:

# Array of zeros (3x3)

np.arange(start, stop, step)

# Array with values from 0 to 10 (step of 2)

# 5 evenly spaced values between 0 and 1

# Identity matrix of size 4x4

# Random values between 0 and 1
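
The creation routines listed above can be sketched together as follows. This is one possible set of calls, not necessarily the ones used in the notes; the random generator and its seed are assumptions chosen for repeatable output.

```python
import numpy as np

zeros = np.zeros((3, 3))        # 3x3 array of zeros
evens = np.arange(0, 10, 2)     # 0, 2, 4, 6, 8 (stop value is excluded)
spaced = np.linspace(0, 1, 5)   # 5 evenly spaced values, both ends included
identity = np.eye(4)            # 4x4 identity matrix

# Random values in the half-open interval [0, 1).
rng = np.random.default_rng(seed=0)
rand = rng.random((2, 3))

print(evens)
print(spaced)
```

Note that np.arange stops before the stop value, so np.arange(0, 11, 2) is needed if 10 itself must be included.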

How to Store Different Types of Data in a NumPy Array?

np.dtype([('field1','datatype1'), ('field2','datatype2'), ...])

# Define structured data type
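
A minimal sketch of a structured data type, assuming three hypothetical fields (name, age, height); the field names and records are invented for illustration.

```python
import numpy as np

# Each field has a name and its own data type.
person_dtype = np.dtype([("name", "U10"), ("age", "i4"), ("height", "f4")])

people = np.array([("Asha", 22, 1.60), ("Ravi", 35, 1.75)], dtype=person_dtype)

print(people["name"])             # access one field across all records
print(people[people["age"] > 30]) # filter records by a field value
```

Each record keeps its fields together, while each field can still be read as an ordinary array.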

Setting default Data Type (dtype Parameter):

Disadvantages of NumPy ndarray:

Data Types for NumPy ndarrays:

• int8 → 8-bit integer (from -128 to 127)

• float16 → half-precision (less accurate, uses less memory)

• complex64 → 32-bit real + 32-bit imaginary

2. Boolean Data Type (bool): It stores only True or False.

Example: np.array([True, False, True], dtype='bool')

Example: np.array(["apple", "banana"], dtype='str')

Example: np.array([1, "two", 3.5], dtype='object')
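
The examples above can be run and inspected as in the following sketch. The variable names are illustrative, and the int8 lines are added only to show the range and memory use mentioned for integer types.

```python
import numpy as np

flags = np.array([True, False, True], dtype="bool")
fruits = np.array(["apple", "banana"], dtype="str")
mixed = np.array([1, "two", 3.5], dtype="object")

print(flags.dtype)        # bool
print(fruits.dtype.kind)  # U (fixed-width Unicode string)
print(mixed.dtype)        # object

# An 8-bit integer uses 1 byte per element and holds -128 to 127.
small = np.array([1, 2, 3], dtype=np.int8)
print(small.itemsize)                                # 1
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```

An object array stores references to ordinary Python objects, so it is flexible but loses most of the speed and memory benefits of fixed-size types.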

Advantages of NumPy Data Types:

• Memory Efficiency → Uses less memory compared to normal Python lists.

Arithmetic with NumPy Arrays:

1. Basic Arithmetic Operations

2. Broadcasting: Broadcasting is a very important feature in NumPy. It means that arrays of

a = np.array([[1, 2, 3], [4, 5, 6]])
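
A minimal broadcasting sketch using this 2x3 array. The second array b and the scalar are assumptions added for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
b = np.array([10, 20, 30])            # shape (3,)

# b is stretched across both rows of a, without copying data by hand.
result = a + b
print(result)  # [[11 22 33]
               #  [14 25 36]]

# A scalar is broadcast to every element.
scaled = a * 2
print(scaled)  # [[ 2  4  6]
               #  [ 8 10 12]]
```

Broadcasting works here because the trailing dimensions match (3 and 3); shapes such as (2, 3) and (2,) would raise an error.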

3. Advanced Arithmetic Operations:

b) Element-wise Power: We can raise each element of an array to a power.
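
Element-wise power can be sketched with either the ** operator or np.power; the sample values are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

squares = arr ** 2        # operator form
cubes = np.power(arr, 3)  # function form, same element-wise behaviour

print(squares)  # [ 1  4  9 16]
print(cubes)    # [ 1  8 27 64]
```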

Advantages of Arithmetic with NumPy Arrays:

• Scientific Computing: Used in simulations, physics, and mathematics.

Basic Indexing and Slicing in NumPy:

array[start : stop : step]

• start → Index where slice begins (default = 0).
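
The slice syntax above can be tried on a small one-dimensional array, as in this sketch; the values are illustrative.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

print(arr[1:4])   # [20 30 40]  -> index 1 up to, but not including, 4
print(arr[:3])    # [10 20 30]  -> start defaults to 0
print(arr[::2])   # [10 30 50]  -> every second element
print(arr[::-1])  # [60 50 40 30 20 10] -> reversed

# A slice is a view: changing it changes the original array.
view = arr[1:4]
view[0] = 99
print(arr[1])     # 99
```

Use arr[1:4].copy() when an independent copy is needed.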

Advantages of Indexing & Slicing:

• Fast and efficient – No need for loops, operations are optimized

• Care needed – Wrong indices may cause errors.

• Data Analysis – Extract rows/columns of data.

Boolean Indexing in NumPy:

Boolean indexing means selecting elements from an array based on a condition.

Step 1: Create a Boolean Mask

We apply a condition on the array. This gives us a Boolean array.

Step 2: Apply the Mask
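
The two steps can be sketched as follows, assuming a small sample array and the condition "greater than 6"; both are chosen only for illustration.

```python
import numpy as np

arr = np.array([5, 12, 7, 20, 3])

# Step 1: the condition produces a Boolean array of the same shape.
mask = arr > 6
print(mask)      # [False  True  True  True False]

# Step 2: indexing with the mask keeps only the True positions.
selected = arr[mask]
print(selected)  # [12  7 20]

# Both steps are often written in one line.
print(arr[arr > 6])
```

Unlike a slice, Boolean indexing returns a new array, so changing selected does not change arr.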