Data Science UNIT-3
Python for Data Science: Python Libraries, Python Integrated Development Environments (IDEs) for
Data Science.
NumPy Basics: Arrays and Vectorized Computation, The NumPy ndarray, Creating ndarrays, Data
Types for ndarrays, Arithmetic with NumPy Arrays, Basic Indexing and Slicing, Boolean Indexing,
Transposing Arrays and Swapping Axes.
Universal Functions: Fast Element-Wise Array Functions, Mathematical and Statistical Methods,
Sorting, Unique and Other Set Logic.
One of the key reasons for Python’s popularity is its readability and ease of learning, which makes it
accessible not only to software developers but also to researchers, analysts, and domain experts who
may not have a strong programming background. Its vast collection of libraries like NumPy, pandas,
Matplotlib, and SciPy supports numerical computing, data manipulation, and visualization. For
advanced analytics and machine learning, libraries such as scikit-learn, TensorFlow, and PyTorch are
extensively used.
Python also integrates seamlessly with databases, cloud services, and big data platforms, making it
suitable for handling real-world data at scale. Additionally, tools like Jupyter Notebook allow data
scientists to write, test, and share code interactively along with visualizations and documentation,
which is essential for collaborative projects.
Beyond its technical strengths, Python has also gained popularity in data science due to its large and
active community. This community continuously contributes new libraries, tools, and resources that
make solving complex data problems easier and more efficient. Hence, Python is not just a
programming language but a complete ecosystem for data science, offering both beginners and experts
the support needed to innovate and apply data-driven solutions across diverse fields.
The applications of Python in data science are vast: it is used in predictive modeling, business
intelligence, natural language processing, computer vision, recommendation systems, and scientific
research. Many industries, such as healthcare, finance, e-commerce, and transportation, rely on
Python-powered data science solutions for decision-making and automation.
Python Libraries:
Python is known for its library ecosystem, which has helped make it one of the most widely used languages in data
science. These libraries provide tools for data storage, manipulation, visualization, scientific
computing, and machine learning. Together, they form the backbone of modern data science.
The major libraries are: NumPy, Pandas, Matplotlib, Seaborn, SciPy, and Scikit-learn.
1. NumPy (Numerical Python): NumPy is the fundamental package for numerical computing in
Python. It provides the ndarray (N-dimensional array), which is much faster and more memory-efficient
than Python lists for numerical work.
Key Features:
Example:
import numpy as np
# Creating an ndarray
array = np.array([1, 2, 3, 4])
print("Array:", array)
# Performing element-wise operation
squared = array ** 2
print("Squared:", squared)
Output:
Array: [1 2 3 4]
Squared: [1 4 9 16]
2. Pandas: Pandas is a powerful library for data manipulation and analysis. It introduces two
main data structures:
Key Features:
import pandas as pd
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
# Filtering data
filtered_df = df[df['Age'] > 28]
print(filtered_df)
Output:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
      Name  Age
1      Bob   30
2  Charlie   35
3. Matplotlib: Matplotlib is the most widely used library for data visualization. It allows the
creation of static, animated, and interactive plots.
Key Features:
Output:
4. Seaborn: Seaborn is a high-level visualization library built on top of Matplotlib. It makes plots
more attractive and easier to create, especially for statistical data.
Key Features:
Output:
5. SciPy: SciPy builds on NumPy and provides advanced scientific and technical computing
tools. It is especially useful in mathematics, physics, and engineering.
Key Features:
Example:
Output:
T-Statistic: -4.2892427387353536, P-Value: 2.800941448830977e-05
6. Scikit-learn: Scikit-learn is the most popular library for machine learning in Python. It
provides tools for both supervised and unsupervised learning.
Key Features:
Example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Creating a linear regression model
model = LinearRegression()
X = np.array([[1], [2], [3], [4]])
y = np.array([10, 20, 25, 30])
model.fit(X, y)
# Making predictions
predictions = model.predict(np.array([[5], [6]]))
print(predictions)
Output:
[37.5 44. ]
These libraries are often used together in real-world data science projects:
This interactivity makes IDEs indispensable for data exploration and model development.
1. Jupyter Notebook: Jupyter is the most popular tool for data science research and education.
It allows writing code in separate cells, running them independently, and embedding plots,
tables, and even Markdown text in the same document. This makes it highly suitable for data
exploration, visualization, and reproducible research.
2. PyCharm: PyCharm is a professional-grade IDE with advanced features such as intelligent
code completion, version control integration, refactoring tools, and strong debugging support.
The Professional Edition also supports Jupyter integration and provides a “Scientific Mode.”
It is highly suitable for large-scale data science projects, especially when the project
integrates with web frameworks (like Django/Flask) or involves multiple developers working
in collaboration.
4. VS Code: Visual Studio Code is a lightweight but versatile editor. It is highly extensible
through plugins and supports Python very well via official extensions. With Jupyter
integration, debugging, Git integration, and remote development features, VS Code is
suitable for both beginners and professionals. It is widely used in industry because of its
cross-platform adaptability and lightweight nature.
5. Anaconda Distribution: Anaconda is not an IDE itself but a distribution that simplifies the
setup of Python environments for data science. It bundles Jupyter, Spyder, and hundreds of
libraries (NumPy, pandas, scikit-learn, TensorFlow, etc.). This makes it the most suitable tool
for setting up environments quickly, especially for beginners or those who want to avoid
installation issues. It ensures package compatibility, which is critical in machine learning
workflows.
Advantages:
Python IDEs in data science are not just for writing code; they are workflow enablers that streamline
the process of data preparation, analysis, model building, and deployment. Their applications include:
• Data Analysis & Visualization: Use Jupyter or Spyder for importing, cleaning (pandas), and
visualizing data (Matplotlib, Seaborn, Plotly) to find trends and patterns.
• Machine Learning Development: Build and test models in PyCharm, VS Code, or Spyder
using scikit-learn, TensorFlow, or PyTorch with support for feature engineering and tuning.
• Big Data Processing: Handle large datasets with PySpark or Dask in Python IDEs, vital for
industries like finance, healthcare, and e-commerce.
• Scientific Research & Academia: Jupyter Notebooks are ideal for teaching, experiments, and
publishing reproducible research with code, LaTeX, and visuals.
• Collaboration & Version Control: IDEs integrate with Git/GitHub for team collaboration,
version tracking, and project management.
• Automation & Scripting: Automate tasks like report generation, data scraping, and scheduled
model retraining using Python scripts.
Example:
1. Data Import & Cleaning: Use Jupyter Notebook with Pandas to import customer transaction
data, handle missing values, and visualize patterns.
2. Feature Engineering: Derive features like transaction frequency, spending average, and
recency.
3. Model Development: Use Spyder or PyCharm to build machine learning models (e.g., logistic
regression, decision trees, random forest) with scikit-learn.
4. Model Evaluation: Evaluate models using accuracy, precision, recall, and ROC curves.
5. Deployment: Deploy the final model using Flask/Django inside PyCharm or VS Code to create
a web application for business stakeholders.
NUMPY BASICS
Arrays and Vectorized Computation
NumPy (Numerical Python) is a fundamental package for scientific computing in Python. It provides
support for arrays, which are grid-like data structures used to represent vectors, matrices, and higher-
dimensional datasets. Arrays are more efficient than Python lists for numerical operations, making
NumPy an essential tool for data science and machine learning.
Arrays
ndarrays: An ndarray (n-dimensional array) is a multidimensional, homogeneous array of fixed-size
items.
Creation: Arrays can be created from Python lists or tuples using np.array(), and there are functions
like np.zeros(), np.ones(), and np.arange() for generating arrays.
Example:
import numpy as np
#Creating an array from a list
array_from_list = np.array([1, 2, 3, 4])
print(array_from_list)
#Creating a 3x3 array of zeros
zeros_array = np.zeros((3, 3))
print(zeros_array)
#Creating an array with a range of values
range_array = np.arange(10)
print(range_array)
Output:
[1 2 3 4]
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
[0 1 2 3 4 5 6 7 8 9]
Data Types: Each ndarray has a dtype (data type) object that describes the type of elements in the
array. You can specify the dtype during array creation or convert it using the astype() method.
Example:
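A minimal sketch of setting a dtype at creation time and converting it afterwards with astype(). The input values and variable names are illustrative assumptions.

```python
import numpy as np

# Request 64-bit floats explicitly when creating the array
float_array = np.array([1, 2, 3], dtype='float64')
print(float_array)

# astype() returns a new array converted to the requested dtype
int_array = float_array.astype('int32')
print(int_array)
```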
Output:
[1. 2. 3.]
[1 2 3]
Vectorized Computation
Element-wise Operations: NumPy allows for element-wise operations on arrays without explicit
loops, which is known as Vectorization. These operations include addition, subtraction, multiplication,
and division.
Example:
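A minimal sketch of element-wise arithmetic without an explicit loop, assuming two small integer arrays a and b (illustrative names and values):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Each operator is applied position by position across both arrays
print(a + b)
print(a * b)
```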
Output:
[5 7 9]
[4 10 18]
Universal Functions (ufuncs): Universal functions are functions that operate element-wise on
ndarrays. Examples include mathematical functions like np.sqrt(), np.exp(), and np.log().
Example:
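A minimal sketch of two ufuncs applied to a whole array at once. The input values are an assumption, chosen as perfect squares so the square roots are easy to check.

```python
import numpy as np

values = np.array([1, 4, 9, 16])

# np.sqrt and np.exp are applied to every element in one call
sqrt_values = np.sqrt(values)
exp_values = np.exp(values)

print(sqrt_values)
print(exp_values)
```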
Output:
[1. 2. 3. 4.]
[2.71828183e+00 5.45981500e+01 8.10308393e+03 8.88611052e+06]
Example:
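Broadcasting lets NumPy combine arrays of different but compatible shapes. A minimal sketch, assuming a 3x1 column and a 1-D row of length 3 (illustrative inputs that combine into a 3x3 result):

```python
import numpy as np

column = np.array([[1], [2], [3]])   # shape (3, 1)
row = np.array([1, 2, 3])            # shape (3,)

# The column is stretched across columns and the row across rows
result = column + row
print(result)
```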
Output:
[[2 3 4]
[3 4 5]
[4 5 6]]
Reductions: Reduction operations like summing, finding the minimum, or maximum can be
performed using functions like np.sum(), np.min(), and np.max(). These functions can be applied to the
entire array or along a specific axis.
Example:
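A minimal sketch of a reduction over the whole array and along each axis, assuming a 2x3 integer matrix (illustrative values):

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])

total = np.sum(matrix)             # sum of all elements
column_sums = np.sum(matrix, axis=0)  # collapse the rows, one sum per column
row_sums = np.sum(matrix, axis=1)     # collapse the columns, one sum per row

print(total)
print(column_sums)
print(row_sums)
```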
Output:
21
[5 7 9]
[6 15]
Applications:
NumPy's array operations and vectorized computations are used in various applications, including:
Example:
Let's consider an example where we perform basic data manipulation using NumPy:
import numpy as np
#Generate a random dataset of 1000 samples with 3 features
data = np.random.rand(1000, 3)
#Normalize the data (feature scaling)
data_mean = np.mean(data, axis=0)
data_std = np.std(data, axis=0)
normalized_data = (data - data_mean) / data_std
print("Original Data:\n", data[:5]) # Display first 5 samples
print("Normalized Data:\n", normalized_data[:5]) # Display first 5 normalized samples
Output:
Original Data:
[[0.73417432 0.74037936 0.90738968]
[0.72763622 0.67006168 0.41185916]
[0.920575 0.57639958 0.82472397]
[0.05283245 0.25583882 0.85750512]
[0.98428961 0.05666713 0.8623092]]
Normalized Data:
[[ 0.80828951  0.77270154  1.40698663]
[ 0.78535802  0.53086531 -0.30160892]
[ 1.46206495  0.20874309  1.12195421]
[-1.5814258  -0.89372793  1.23498405]
[ 1.68553539 -1.57871823  1.25154857]]
In this example, we generate a random dataset, normalize it by subtracting the mean and dividing by
the standard deviation for each feature, and then print the first five samples of the original and
normalized data. This showcases how NumPy can efficiently handle data manipulation and
preprocessing tasks crucial for data science and machine learning workflows.
How ndarrays Work?
Inside the computer, an ndarray stores all its data in a continuous block of memory, which makes it
much faster than normal Python lists. Since all the elements are of the same type, NumPy can use C
programming speed in the background, instead of slow Python loops. This makes ndarrays very useful
for data science, machine learning, image processing, and scientific research.
An ndarray (N-dimensional array) is the core data structure of NumPy. It is similar to a Python list but
more powerful because it can store large amounts of numerical data efficiently and allows
mathematical operations to be applied directly on the data.
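The memory layout described above can be inspected directly. A minimal sketch, assuming a small 2x3 array of 32-bit integers (illustrative values):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr.dtype)                   # every element has the same type
print(arr.itemsize)                # bytes used by one element
print(arr.nbytes)                  # bytes used by the whole data block
print(arr.strides)                 # bytes to step to the next row / next column
print(arr.flags['C_CONTIGUOUS'])   # True when the data is one contiguous block
```

Here each element takes 4 bytes, so the six elements occupy 24 bytes, and moving to the next row means stepping over 12 bytes.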
There are many ways to create ndarrays in NumPy, depending on the requirement. They are:
1. Using np.array(): The np.array() function converts Python lists, tuples, or other array-like
objects into an ndarray. You can also specify the data type (dtype).
Syntax:
np.array(object, dtype=None)
Example:
import numpy as np
# 1D array from a list
array1d = np.array([1, 2, 3, 4, 5])
print("1D Array:\n", array1d)
# 2D array from a list of lists
array2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", array2d)
Output:
1D Array:
[1 2 3 4 5]
2D Array:
[[1 2 3]
[4 5 6]]
2. Using np.zeros() and np.ones(): You provide the shape (for example, rows × columns), and the
function fills the array with zeros or ones.
• np.zeros() creates an array filled with 0s.
• np.ones() creates an array filled with 1s.
Syntax:
np.zeros(shape)
np.ones(shape)
Example:
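A minimal sketch, assuming a 3x3 array of zeros and a 2x4 array of ones (illustrative shapes). By default both functions return float64 arrays, so the elements print as 0. and 1.

```python
import numpy as np

# The shape is passed as a tuple: (rows, columns)
zeros_array = np.zeros((3, 3))
print("Array of zeros:\n", zeros_array)

ones_array = np.ones((2, 4))
print("Array of ones:\n", ones_array)
```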
Output:
Array of zeros:
[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
Array of ones:
[[1. 1. 1. 1.]
[1. 1. 1. 1.]]
3. Using np.arange(): np.arange() creates an array with values in a given range, similar to
Python’s range() function, but returns an ndarray. You specify the start, stop, and step size.
Syntax:
np.arange(start, stop, step)
Example:
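A minimal sketch of np.arange(), assuming a start of 0, a stop of 10, and a step of 2 (illustrative values). As with range(), the stop value itself is excluded.

```python
import numpy as np

# Values from 0 up to (but not including) 10, in steps of 2
range_array = np.arange(0, 10, 2)
print("Array with range values:", range_array)
```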
Output:
Array with range values: [0 2 4 6 8]
4. Using np.linspace(): np.linspace() creates an array with evenly spaced values between a start
and end point, both included by default. You give the start, end, and number of elements required.
Syntax:
np.linspace(start, stop, num)
Example:
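A minimal sketch of np.linspace(), assuming five points between 0 and 1 (illustrative values):

```python
import numpy as np

# Five evenly spaced values, with both end points included
linspace_array = np.linspace(0, 1, 5)
print("Array with evenly spaced values:", linspace_array)
```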
Output:
Array with evenly spaced values: [0.   0.25 0.5  0.75 1.  ]
5. Using np.eye(): np.eye() creates an identity matrix (a square matrix with 1s on the diagonal
and 0s elsewhere). It is mostly used in linear algebra and mathematical computations.
Syntax:
np.eye(N)
Example:
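A minimal sketch of np.eye(), assuming a 4x4 identity matrix (illustrative size):

```python
import numpy as np

# N = 4 gives a 4x4 matrix with 1s on the main diagonal
identity_matrix = np.eye(4)
print("Identity matrix:\n", identity_matrix)
```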
Output:
Identity matrix:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
6. Using Random Functions (np.random): NumPy provides random number functions to create
arrays filled with random values.
• np.random.rand() → random floats in the interval [0, 1).
• np.random.randint() → random integers in a given range (low included, high excluded).
Syntax:
np.random.rand(d0, d1, ...)
np.random.randint(low, high, size)
Example:
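A minimal sketch of both functions, assuming 3x3 arrays and an integer range of 0 to 9 (illustrative choices). The printed values differ on every run because no seed is set.

```python
import numpy as np

# rand() takes the dimensions as separate arguments
random_values = np.random.rand(3, 3)
print("Random values:\n", random_values)

# randint() draws integers from low (included) to high (excluded)
random_ints = np.random.randint(0, 10, (3, 3))
print("Random integers:\n", random_ints)
```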
Output:
Random values:
[[0.79023899 0.83268165 0.30868424]
[0.01097371 0.0861339 0.11680078]
[0.79638802 0.75836068 0.49767694]]
Random integers:
[[4 0 2]
[1 4 0]
[0 5 6]]
Structured Arrays: Structured arrays allow storing different data types (like strings, integers, floats)
in a single array, similar to a database table. You define a custom data type with fields and then create
an array with that type.
Syntax:
dtype: The dtype (data type) tells NumPy what type of values each field will store. In structured arrays,
dtype defines a blueprint of the structure:
• Field name → acts like a column name (e.g., 'name', 'age', 'height').
• Data type → defines the storage type (e.g., string, int, float).
Example:
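A minimal sketch of a structured array with three fields, assuming the field names 'name', 'age', and 'height' mentioned above and two illustrative records:

```python
import numpy as np

# Each field has a name and a type: 10-char string, 32-bit int, 32-bit float
data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)

print("Structured array:\n", structured_array)
# A field is accessed by name, like a column in a table
print("Names:", structured_array['name'])
print("Ages:", structured_array['age'])
print("Heights:", structured_array['height'])
```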
Output:
Structured array:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
Names: ['Alice' 'Bob']
Ages: [25 30]
Heights: [5.5 5.8]
The dtype parameter in NumPy allows you to set the data type of an array explicitly. If you don’t give
dtype, NumPy automatically infers it from your input values. If you give it explicitly, the values will
be converted to that type.
Syntax:
np.array(data, dtype=datatype)
Example:
import numpy as np
int_array = np.array([1.5, 2.7, 3.8], dtype='int32')  # fractional parts are truncated
print("Integer array:", int_array)
float_array = np.array([1, 2, 3], dtype='float64')
print("Float array:", float_array)
complex_array = np.array([1, 2, 3], dtype='complex')
print("Complex array:", complex_array)
Output:
Integer array: [1 2 3]
Float array: [1. 2. 3.]
Complex array: [1.+0.j 2.+0.j 3.+0.j]
Advantages of NumPy ndarray:
• Performance: Ndarrays store data in contiguous memory blocks and use optimized C
operations. This makes array operations much faster and more efficient than standard Python
lists.
• Ease of Use: NumPy provides ready-made functions for creating, manipulating, and operating
on arrays. Complex computations (like matrix multiplication or Fourier transforms) can be
done with simple functions.
• Broadcasting: Enables performing operations on arrays of different shapes and sizes without
writing loops. Saves coding effort and improves performance.
• Integration: Ndarrays work seamlessly with SciPy, Pandas, Matplotlib, Scikit-learn, and many
other libraries. This makes them the backbone of scientific Python programming.
Disadvantages of NumPy ndarray:
• Homogeneous Data Types: All elements in an ndarray must have the same type (all integers,
all floats, etc.). This makes it less suitable for mixed data (like names + numbers).
• Memory Usage: Handling very large arrays requires a lot of memory. High-dimensional data
can cause memory errors on low-resource systems.
• Learning Curve: Beginners coming from basic Python lists may find it difficult to understand
concepts like broadcasting, slicing, and vectorization.
Applications of NumPy ndarray:
• Scientific Computing: Used in physics, chemistry, and biology for simulations, modeling, and
data analysis.
• Data Analysis: Core tool for processing, cleaning, and analyzing large datasets.
• Machine Learning: Acts as the base for handling training and testing datasets before feeding
them into ML algorithms.
• Image Processing: Images are stored as pixel arrays; NumPy makes it easy to manipulate and
process them.
• Financial Modeling: Applied in quantitative finance for risk analysis, option pricing, and
algorithmic trading.
1. Numeric Data Types:
(a) Integer Types: Integers are whole numbers (positive or negative, without decimals). Based on
size, they come in different forms:
(b) Unsigned Integers: Unsigned means only positive values (no negatives). They can store larger
positive numbers with the same bit size:
• uint8 → 0 to 255.
• uint16 → 0 to 65,535.
• uint32 → 0 to 4,294,967,295.
• uint64 → 0 to 18,446,744,073,709,551,615.
Example: np.array([10, 20, 30], dtype='uint8')
(c) Floating-Point Types: These are numbers with decimals. Based on precision (accuracy), they
are:
(d) Complex Numbers: Complex numbers have a real part and an imaginary part. NumPy
supports:
• complex128 → 64-bit real + 64-bit imaginary
Example: np.array([2+3j, 4+5j], dtype='complex128')
3. String Data Type(str): It stores fixed-length text. For variable-length text, we can use object
dtype.
4. Object Data Type(object): It allows storing Python objects of different types inside one array
(like a mix of numbers, strings, etc.).
Example:
import numpy as np
data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)
print(structured_array)
Output:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
Disadvantages of NumPy Data Types:
• Homogeneous Requirement → Normal ndarrays must store elements of the same type (except
structured arrays).
• Memory Usage → Large arrays with high precision (float64) take a lot of memory.
• Learning Curve → Beginners may find it confusing to select the right dtype.
Applications of NumPy Data Types:
• Scientific Computing – Used for simulations and complex calculations. Floating types help
control accuracy.
• Data Analysis – Handle large datasets efficiently; right data type saves memory and speeds up
processing.
• Machine Learning – Store and process big datasets; improves model training speed and reduces
memory use.
• Image Processing – Images stored as arrays; data type (e.g., uint8) affects memory use and
image quality.
Example:
import numpy as np
# Integer Array
int_array = np.array([1, 2, 3, 4], dtype='int32')
print("Integer array:", int_array)
# Float Array
float_array = np.array([1.1, 2.2, 3.3], dtype='float64')
print("Float array:", float_array)
# Complex Array
complex_array = np.array([1+2j, 3+4j], dtype='complex128')
print("Complex array:", complex_array)
# Boolean Array
bool_array = np.array([True, False, True], dtype='bool')
print("Boolean array:", bool_array)
# Structured Array
data_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 5.5), ('Bob', 30, 5.8)], dtype=data_type)
print("Structured array:\n", structured_array)
Output:
Integer array: [1 2 3 4]
Float array: [1.1 2.2 3.3]
Complex array: [1.+2.j 3.+4.j]
Boolean array: [ True False  True]
Structured array:
[('Alice', 25, 5.5) ('Bob', 30, 5.8)]
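The value ranges quoted for the integer types do not need to be memorized: np.iinfo() and np.finfo() report the limits of a dtype. A minimal sketch, with the particular dtypes chosen only for illustration:

```python
import numpy as np

# Report the smallest and largest value each integer dtype can hold
for int_type in (np.int8, np.uint8, np.uint16, np.uint32):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# Approximate number of reliable decimal digits for each float dtype
float32_digits = np.finfo(np.float32).precision
float64_digits = np.finfo(np.float64).precision
print(float32_digits, float64_digits)
```

Checking the limits before choosing a dtype helps avoid silent overflow when a smaller type is picked to save memory.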
Example:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
Output:
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4 0.5 ]
b) Universal Functions (ufuncs): NumPy also provides ready-made functions to do the same
operations, such as np.add, np.subtract, np.multiply, and np.divide. These perform the same
element-wise operations in a more explicit form.
Example:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
addition = np.add(a, b)
multiplication = np.multiply(a, b)
print("Addition using ufunc:", addition)
print("Multiplication using ufunc:", multiplication)
Output:
Addition using ufunc: [5 7 9]
Multiplication using ufunc: [ 4 10 18]
Example:
a = np.array([1, 2, 3])
print(a + 5)
Output:
[6 7 8]
b) Arrays of Different Shapes: If the arrays have compatible shapes, NumPy “stretches” the
smaller array to match the bigger one.
Example:
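A minimal sketch of broadcasting between a 2x3 array and a 1-D array of length 3 (illustrative values). The shapes are compatible because the trailing dimensions match, so the 1-D array is applied to every row.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
d = np.array([10, 20, 30])             # shape (3,)

# d is treated as if it were repeated once per row of c
broadcast_sum = c + d
print(broadcast_sum)
```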
Output:
[[11 22 33]
[14 25 36]]
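Matrix multiplication is a different operation from the element-wise `*`: the `@` operator combines rows of the first matrix with columns of the second. A minimal sketch, assuming two 2x2 matrices (illustrative values):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Row-by-column matrix product
matrix_product = A @ B
print(matrix_product)

# For comparison, * multiplies matching positions only
elementwise_product = A * B
```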
Output:
[[19 22]
[43 50]]
Example:
a = np.array([1, 2, 3])
print(a ** 2)
Output:
[1 4 9]
Advantages of Arithmetic with NumPy Arrays:
• Fast and Efficient: Operations are much faster than normal Python lists because NumPy uses
optimized C code.
• Less Code: No need for loops. One line of code can handle the whole array.
• Broadcasting: Easy to work with arrays of different sizes.
• Consistency: Functions are simple and work in the same way for different operations.
Disadvantages of Arithmetic with NumPy Arrays:
• Memory Usage: For very large arrays, it may use a lot of memory.
• Same Data Type: All elements in a NumPy array must be of the same type (all integers or all
floats).
• Broadcasting Confusion: If we don’t understand broadcasting rules, we may get wrong results.
Applications:
Example:
import numpy as np
#Basic arithmetic
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
#Broadcasting with scalar
scalar = 10
print("Scalar addition:", a + scalar)
#Broadcasting with arrays
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([10, 20, 30])
print("Broadcasted addition:\n", c + d)
# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix multiplication:\n", A @ B)
#Element-wise power
print("Element-wise power:", a ** 2)
Output:
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [4 10 18]
Division: [0.25 0.4 0.5]
Scalar addition: [11 12 13]
Broadcasted addition:
[[11 22 33]
[14 25 36]]
Matrix multiplication:
[[19 22]
[43 50]]
Element-wise power: [1 4 9]
1. Indexing: Indexing refers to accessing individual elements of an array using their positions.
a) Indexing in 1D Arrays: A 1D array is like a simple list. Index numbers start from 0 (first
element) and go up to n-1 (last element).
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
print("First element:", array1D[0])
print("Third element:", array1D[2])
print("Last element:", array1D[-1])
Output:
First element: 10
Third element: 30
Last element: 50
b) Indexing in 2D Arrays: A 2D array looks like a table (rows and columns). We use a pair
of indices:
• First index → Row number
• Second index → Column number
Example:
import numpy as np
array2D = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print("Element at (1,2):", array2D[1, 2]) # 6
print("Element at (0,0):", array2D[0, 0]) # 1
print("Element at (-1,-1):", array2D[-1, -1]) # 9
Output:
Element at (1,2): 6
Element at (0,0): 1
Element at (-1,-1): 9
2. Slicing: Slicing allows us to select subarrays (a continuous block of data) instead of single
elements. Slicing in Python is a technique used to extract a portion of data from sequences such
as lists, strings, or NumPy arrays by specifying a range of indices. It follows the general syntax
as sequence[start:stop:step], where the start index is included, the stop index is excluded, and
the step defines the interval between elements. If the start or stop values are omitted, Python
assumes defaults (beginning or end of the sequence), and if the step is omitted, it defaults to 1.
Syntax:
array[start:stop:step]
a) Slicing in 1D Arrays:
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
print("Slice from index 1 to 4:", array1D[1:4]) # [20 30 40]
print("Every second element:", array1D[::2]) # [10 30 50]
print("From index 2 till end:", array1D[2:]) # [30 40 50]
print("From start to index 3:", array1D[:3]) # [10 20 30]
Output:
Slice from index 1 to 4: [20 30 40]
Every second element: [10 30 50]
From index 2 till end: [30 40 50]
From start to index 3: [10 20 30]
b) Slicing in 2D Arrays: We can slice rows and columns together. Slicing creates a view of the
original array (not a copy).
Example:
import numpy as np
array2D = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print("Rows 0 to 1, columns 1 to 2:\n", array2D[0:2, 1:3])
print("Every second row and column:\n", array2D[::2, ::2])
Output:
Rows 0 to 1, columns 1 to 2:
[[2 3]
[5 6]]
Every second row and column:
[[1 3]
[7 9]]
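Because a slice is a view, writing to the slice also changes the original array. A minimal sketch of this behaviour and of using .copy() to get independent data (the variable names are illustrative):

```python
import numpy as np

array2D = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

# A slice shares memory with the original array
sub = array2D[0:2, 1:3]
sub[0, 0] = 99
print(array2D[0, 1])   # the original now holds 99 as well

# .copy() creates independent data, so the original is left alone
independent = array2D[0:2, 1:3].copy()
independent[0, 0] = -1
print(array2D[0, 1])   # still 99

print(np.shares_memory(array2D, sub))
print(np.shares_memory(array2D, independent))
```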
c) Advanced Indexing: Sometimes basic indexing and slicing are not enough. NumPy provides
advanced indexing methods:
• Boolean Indexing: We can select elements that satisfy a condition using a boolean mask.
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
mask = array1D > 25
print("Boolean mask:", mask)
print("Elements greater than 25:", array1D[mask])
Output:
Boolean mask: [False False True True True]
Elements greater than 25: [30 40 50]
• Integer Array Indexing: We can specify a list of indices to pick multiple elements.
Example:
import numpy as np
array1D = np.array([10, 20, 30, 40, 50])
indices = [0, 2, 4]
print("Selected elements:", array1D[indices])
Output:
Selected elements: [10 30 50]
Applications:
import numpy as np
#Create a 2D array
array = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [90, 100, 110, 120]])
# Basic indexing
print("Element at (1, 2):", array[1, 2])
# Negative indexing
print("Element at (-1, -1):", array[-1, -1])
# Basic slicing
print("Slice from rows 0 to 2 and columns 1 to 3:\n", array[0:2, 1:3])
# Slicing with step
print("Every second row, first two columns:\n", array[::2, :2])
# Boolean indexing
mask = array[:, 0] > 30
print("Rows where first column > 30:\n", array[mask])
# Integer array indexing
indices = [0, 2]
print("Rows at indices 0 and 2:\n", array[indices])
Output:
Element at (1, 2): 70
Element at (-1, -1): 120
Slice from rows 0 to 2 and columns 1 to 3:
[[20 30]
[60 70]]
Every second row, first two columns:
[[ 10 20]
[ 90 100]]
Rows where first column > 30:
[[ 50 60 70 80]
[ 90 100 110 120]]
Rows at indices 0 and 2:
[[ 10 20 30 40]
[ 90 100 110 120]]
First, we create a Boolean mask → an array of the same shape that contains only True or False values.
Each True represents an element that satisfies the condition, and False means it does not. When this
mask is applied to the original array, we get only those elements where the mask is True.
How Boolean Indexing Works?
Example:
import numpy as np
# Create an array
array = np.array([10, 20, 30, 40, 50])
# Condition: elements greater than 25
mask = array > 25
print("Boolean mask:", mask)
Output:
Boolean mask: [False False True True True]
import numpy as np
# Create an array
array = np.array([10, 20, 30, 40, 50])
# Condition: elements greater than 25
mask = array > 25
filtered_array = array[mask]
print("Filtered array:", filtered_array)
Output:
Filtered array: [30 40 50]
Disadvantages of Boolean Indexing:
• Memory Usage → The mask itself is another array, so with huge datasets, it takes extra
memory.
• Performance Issues with Large Masks → Creating very large boolean masks can slow down
performance.
• Errors Possible → If the mask shape does not match the array shape, it causes errors.
Applications of Boolean Indexing:
• Data Filtering → Extracting specific rows or values (e.g., marks > 35).
• Data Cleaning → Removing unwanted data points.
• Conditional Updates → Changing only the values that satisfy a condition.
Example:
import numpy as np
# Create a 2D array
array = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
# Create a boolean mask to select elements greater than 50
mask = array > 50
print("Boolean mask:\n", mask)
#Apply the mask to filter elements
filtered_array = array[mask]
print("Filtered array:", filtered_array)
# Example of modifying elements using boolean indexing
array[mask] = -1
print("Modified array:\n", array)
Output:
Boolean mask:
[[False False False]
[False False  True]
[ True  True  True]]
Filtered array: [60 70 80 90]
Modified array:
[[10 20 30]
[40 50 -1]
[-1 -1 -1]]
Transposing Arrays and Swapping Axes in NumPy:
Working with data often requires changing its orientation or the way its rows and columns are arranged.
In NumPy, two very important operations help us do this:
1. Transposing arrays
2. Swapping axes
Both operations are useful for reshaping data, making it ready for mathematical operations like
multiplication, or preparing it for analysis in machine learning and data science.
1. Transposing Arrays:
Meaning: Transposing an array means flipping its rows into columns and columns into rows.
For a 2D array (matrix), it simply means exchanging rows with columns. For arrays with more
than two dimensions, the transpose operation rearranges the axes based on the given order.
import numpy as np
# 2D array (matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D array:\n", array_2d)
# Transpose the array
transposed_array = array_2d.T
print("Transposed array:\n", transposed_array)
Output:
Original 2D array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
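For arrays with more than two dimensions, the axis order can be given explicitly. A minimal sketch, assuming a 2x3x4 array built with np.arange() (illustrative shape):

```python
import numpy as np

array_3d = np.arange(24).reshape(2, 3, 4)

# With no axis order given, .T reverses all axes
reversed_axes = array_3d.T
print(reversed_axes.shape)

# An explicit order: new axis 0 is old axis 1, new axis 1 is old axis 0
reordered = np.transpose(array_3d, (1, 0, 2))
print(reordered.shape)

# The same element is reached with the first two indices exchanged
print(reordered[2, 1, 3] == array_3d[1, 2, 3])
```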
• In Mathematics: Used in linear algebra for matrix multiplication, dot products, and solving equations.
• In Data Science: Aligns data when rows and columns need to be flipped (e.g., features vs. samples).
• In Image Processing: Transposing an image array swaps its height and width axes, mirroring the image across its main diagonal.
2. Swapping Axes:
Meaning: Sometimes, instead of flipping all rows and columns, we may want to rearrange only
specific dimensions (axes) of a multidimensional array. Swapping axes means exchanging any
two given axes (dimensions) of the array.
• np.swapaxes(a, axis1, axis2) → Returns a view of array a with the two specified
axes interchanged.
Here:
• a → the input array.
• axis1, axis2 → the two axes to be swapped.
Example:
import numpy as np
# 3D array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("Original 3D array:\n", array_3d)
# Swap the first and last axes
swapped_array = np.swapaxes(array_3d, 0, 2)
print("Array after swapping axes:\n", swapped_array)
Output:
Original 3D array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Array after swapping axes:
[[[1 5]
[3 7]]
[[2 6]
[4 8]]]
Here:
• The first axis (0) and the last axis (2) have been swapped.
• This changes how the data is arranged, but it doesn’t lose any information.
Advantages:
Disadvantages:
• Memory Usage – Transposing and swapping axes return only a “view” of the data, but the view is usually
no longer contiguous in memory, so later operations that need contiguous data may copy a large array.
• Complexity in Higher Dimensions – For arrays with 3 or more dimensions, it can become
confusing to keep track of which axis is being swapped.
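Whether an operation returned a view or a copy can be checked with np.shares_memory(). A minimal sketch, assuming a 2x3x4 array (illustrative shape) and using np.ascontiguousarray() as one example of an operation that forces a copy:

```python
import numpy as np

array_3d = np.arange(24).reshape(2, 3, 4)

# swapaxes returns a view: same data, different axis order
swapped = np.swapaxes(array_3d, 0, 2)
print(swapped.shape)
print(np.shares_memory(array_3d, swapped))
print(swapped.flags['C_CONTIGUOUS'])

# Asking for contiguous data makes a real copy of the swapped array
compact = np.ascontiguousarray(swapped)
print(np.shares_memory(array_3d, compact))
```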
Applications:
Example:
import numpy as np
#Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D array:\n", array_2d)
#Transpose the array
transposed_array = array_2d.T
print("Transposed array:\n", transposed_array)
#Create a 3D array
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print("Original 3D array:\n", array_3d)
#Swap the first and last axes
swapped_array = np.swapaxes(array_3d, 0, 2)
print("Array with swapped axes:\n", swapped_array)
Output:
Original 2D array:
[[1 2 3]
[4 5 6]]
Transposed array:
[[1 4]
[2 5]
[3 6]]
Original 3D array:
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
Array with swapped axes:
[[[1 5]
[3 7]]
[[2 6]
[4 8]]]
Universal functions (ufuncs): Fast element-wise array functions
In NumPy, Universal Functions (ufuncs) are special functions that work on each element of an array
individually. They are written in the C programming language inside NumPy, which makes them very
fast compared to normal Python loops. Instead of writing long loops to apply a formula on each
element, we can simply use a ufunc, and it will apply the operation to all elements automatically.
For example, if we want to find the square root of each element in an array, a ufunc like np.sqrt() can
do this in one line.
Example:
import numpy as np
array = np.array([1, 2, 3, 4])
sqrt_array = np.sqrt(array)
print("Square root:", sqrt_array)
Output:
Square root: [1. 1.41421356 1.73205081 2.]
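To make the comparison with a normal Python loop concrete, the sketch below computes the same square roots in both ways. The variable names are illustrative, and math.sqrt is used here only as the loop-based counterpart:

```python
import math
import numpy as np

values = np.array([1, 2, 3, 4])

# Plain Python loop: one math.sqrt call per element
loop_result = np.array([math.sqrt(v) for v in values])

# ufunc: a single call that processes the whole array in compiled code
ufunc_result = np.sqrt(values)

# Both approaches give the same numbers
print(np.allclose(loop_result, ufunc_result))  # True
```

The results are identical; the difference is that the ufunc does the looping inside NumPy instead of in Python, which is where the speed advantage comes from on large arrays.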
Advantages:
• Speed – Ufuncs are very fast because they are built in C and avoid Python loops.
• Simple Code – One line of code replaces many lines of loops.
• Broadcasting Support – Ufuncs can also work with arrays of different shapes, making them
flexible.
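Broadcasting support can be illustrated with a small sketch. The arrays column and row below are made-up examples; np.add and np.multiply are standard ufuncs:

```python
import numpy as np

column = np.array([[10], [20], [30]])   # shape (3, 1)
row = np.array([1, 2, 3])               # shape (3,)

# The two shapes are broadcast to a common shape (3, 3) before adding
total = np.add(column, row)
print(total)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]

# A single number is broadcast across the whole array in the same way
doubled = np.multiply(row, 2)
print(doubled)                          # [2 4 6]
```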
Disadvantages:
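• High Memory Use – By default each ufunc call creates a new output array, so chained operations
on very large arrays can consume a lot of memory.
• Limited Scope – They are mainly for element-wise operations (not for complex logic).
• Errors in Special Cases – Some operations (like division by zero) do not stop the program by
default; they produce inf or nan in the result along with a RuntimeWarning.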
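Two of these points can be handled with standard NumPy features. The sketch below is one possible approach, with illustrative variable names: np.errstate() controls how division by zero is reported, and the out= argument of a ufunc reuses an existing array instead of allocating a new one:

```python
import numpy as np

numerator = np.array([1.0, 0.0, -1.0])
denominator = np.array([0.0, 0.0, 0.0])

# Silence the warnings locally; the result holds inf, nan and -inf
with np.errstate(divide="ignore", invalid="ignore"):
    result = np.divide(numerator, denominator)
print(result)

# errstate can also turn division by zero into an exception
caught = False
try:
    with np.errstate(divide="raise", invalid="ignore"):
        np.divide(numerator, denominator)
except FloatingPointError:
    caught = True
print("Exception raised:", caught)      # Exception raised: True

# out= writes the result into an existing array (no new array is created)
data = np.array([1.0, 4.0, 9.0])
np.sqrt(data, out=data)
print(data)                             # [1. 2. 3.]
```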
Applications:
import numpy as np
array = np.array([1, 4, 9, 16])
square_root = np.sqrt(array) # Square root
squared = np.square(array) # Square of each element
logarithm = np.log(array) # Natural log
print("Original array:", array)
print("Square root:", square_root)
print("Squared:", squared)
print("Logarithm:", logarithm)
Output:
Original array: [1 4 9 16]
Square root: [1. 2. 3. 4.]
Squared: [1 16 81 256]
Logarithm: [0. 1.38629436 2.19722458 2.77258872]
Sorting:
In NumPy, sorting is easy and efficient. The two main functions used are:
• np.sort() → Sorts the values and returns a new array with all elements arranged in ascending
order (by default).
• np.argsort() → Returns the positions (indices) of the elements in sorted order, which can be
used to rearrange the data.
For example, if we have an array of numbers, we can sort it directly or find the order in which we
should arrange its elements. In the case of multidimensional arrays (matrices), sorting can be done
along rows or columns, which makes it flexible for handling real-world datasets.
Example:
import numpy as np
# Create an array
array = np.array([3, 1, 2, 5, 4])
# Sort the array
sorted_array = np.sort(array)
print("Sorted array:", sorted_array)
# Get indices that sort the array
sorted_indices = np.argsort(array)
print("Indices to sort the array:", sorted_indices)
Output:
Sorted array: [1 2 3 4 5]
Indices to sort the array: [1 2 0 4 3]
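Here, the numbers are arranged in ascending order. Each index shows where the corresponding sorted
value was located in the original array; for example, the first index is 1 because the smallest
value (1) was at position 1 before sorting.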
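A common use of these indices is to reorder a second, related array in the same way. The sketch below assumes a hypothetical labels array that holds one label per value:

```python
import numpy as np

scores = np.array([3, 1, 2, 5, 4])
labels = np.array(["c", "a", "b", "e", "d"])   # hypothetical labels, one per score

order = np.argsort(scores)

# Indexing with the argsort result gives the same values as np.sort
print(scores[order])    # [1 2 3 4 5]

# The same index order rearranges the related array
print(labels[order])    # ['a' 'b' 'c' 'd' 'e']
```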
Advantages of Sorting:
• Efficiency – Sorting in NumPy is fast and can handle large data sets easily because it uses
powerful algorithms like quicksort, mergesort, and heapsort.
• Flexibility – Sorting works not only on 1D arrays but also on multi-dimensional arrays along
any row or column.
Disadvantages of Sorting:
• Extra Memory Usage – When using np.sort(), a new array is created, which requires extra
memory.
• Complexity in Higher Dimensions – Sorting 2D or 3D arrays can be confusing because we
need to decide whether to sort by row, by column, or by another axis.
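When the extra copy is a concern, one option is the sort() method of the array itself, which sorts in place. The sketch below (with an illustrative data array) contrasts the two and also shows a simple way to get descending order:

```python
import numpy as np

data = np.array([7, 2, 5, 1, 9])

# np.sort returns a new array and leaves the original unchanged
sorted_copy = np.sort(data)
print(data)          # [7 2 5 1 9]

# The ndarray.sort() method sorts the existing array in place
data.sort()
print(data)          # [1 2 5 7 9]

# Descending order: reverse the ascending result with slicing
descending = data[::-1]
print(descending)    # [9 7 5 2 1]
```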
Applications of Sorting:
Example:
import numpy as np
# 1D Array
array = np.array([7, 2, 5, 1, 9])
sorted_array = np.sort(array)
print("Sorted 1D array:", sorted_array)
# 2D Array
array_2d = np.array([[3, 1, 2], [6, 5, 4]])
# Sort column-wise (axis=0)
sorted_axis0 = np.sort(array_2d, axis=0)
print("Sorted 2D array along axis 0:\n", sorted_axis0)
# Sort row-wise (axis=1)
sorted_axis1 = np.sort(array_2d, axis=1)
print("Sorted 2D array along axis 1:\n", sorted_axis1)
Output:
Sorted 1D array: [1 2 5 7 9]
Sorted 2D array along axis 0:
[[3 1 2]
[6 5 4]]
Sorted 2D array along axis 1:
[[1 2 3]
[4 5 6]]
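Here, the 1D array is sorted normally. In the 2D array, sorting along axis=0 sorts each column from
top to bottom, while axis=1 sorts each row from left to right. The axis=0 output is the same as the
input because every column of this array was already in ascending order.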
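The difference between the two axes is clearer on an array whose columns are not already in order. The following sketch uses a made-up array named grid:

```python
import numpy as np

grid = np.array([[9, 1, 6],
                 [4, 8, 2]])

# axis=0: each column is sorted from top to bottom
by_column = np.sort(grid, axis=0)
print(by_column)
# [[4 1 2]
#  [9 8 6]]

# axis=1: each row is sorted from left to right
by_row = np.sort(grid, axis=1)
print(by_row)
# [[1 6 9]
#  [2 4 8]]

# axis=None: the array is flattened first, then sorted
flat = np.sort(grid, axis=None)
print(flat)     # [1 2 4 6 8 9]
```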
Unique and Set Logic in NumPy:
When we work with data, sometimes we get repeated values or need to compare two datasets. NumPy
makes this easy by giving us special functions to handle unique values and set operations (like
intersection, union, and difference). These operations are very useful in data cleaning, analysis, and
comparison.
1. Unique Elements:
The np.unique() function helps us find the unique (non-repeated) values in an array by removing
duplicates. With return_counts=True, it also tells us how many times each value appears. It
returns the results in sorted order.
Example:
import numpy as np
array = np.array([1, 2, 2, 3, 4, 4, 5])
unique_elements = np.unique(array)
print("Unique elements:", unique_elements)
Output:
Unique elements: [1 2 3 4 5]
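Counting how often each value appears can be sketched with the return_counts argument of np.unique(). The variable names values and counts are illustrative:

```python
import numpy as np

array = np.array([1, 2, 2, 3, 4, 4, 5])

# return_counts=True returns a second array with the count of each unique value
values, counts = np.unique(array, return_counts=True)
print("Unique values:", values)   # Unique values: [1 2 3 4 5]
print("Counts:", counts)          # Counts: [1 2 1 2 1]

# Pair each value with its count
for value, count in zip(values, counts):
    print(value, "appears", count, "time(s)")
```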
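2. Set Operations:
NumPy provides np.intersect1d() (values common to both arrays), np.union1d() (all values
from both arrays), and np.setdiff1d() (values in the first array that are not in the second).
Example:
import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([3, 4, 5, 6])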
# Intersection
print("Intersection:", np.intersect1d(array1, array2))
# Union
print("Union:", np.union1d(array1, array2))
# Difference
print("Difference:", np.setdiff1d(array1, array2))
Output:
Intersection: [3 4]
Union: [1 2 3 4 5 6]
Difference: [1 2]
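Note that the difference depends on the order of the arguments. The sketch below shows this, along with two related NumPy functions that the example above does not use: np.setxor1d() for values found in exactly one array, and np.isin() for an element-by-element membership test:

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([3, 4, 5, 6])

# Difference is not symmetric: the first argument decides which values are kept
only_in_1 = np.setdiff1d(array1, array2)
only_in_2 = np.setdiff1d(array2, array1)
print(only_in_1)    # [1 2]
print(only_in_2)    # [5 6]

# Symmetric difference: values that appear in exactly one of the arrays
either = np.setxor1d(array1, array2)
print(either)       # [1 2 5 6]

# Membership test: True where an element of array1 also occurs in array2
mask = np.isin(array1, array2)
print(mask)         # [False False  True  True]
```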
Advantages: