Data Preprocessing (SY BTech, Sem III)
What is Data Preprocessing?
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first, and a crucial, step in building a machine learning model.
Why do we need Data Preprocessing?
• Raw data generally contains noise and missing values, or arrives in an unusable format.
• Preprocessing covers the tasks needed to clean the data and make it suitable for a machine learning model.
• It also helps increase the accuracy and efficiency of a machine learning model.
Steps in Data Preprocessing
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling
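The later steps in this list can be sketched with NumPy alone. This is a minimal illustration, not the course's reference pipeline: the toy values, the column meanings, the mean-fill strategy, the 4/2 split, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy dataset: two numeric columns (age, salary) and a city label.
ages = np.array([25.0, np.nan, 35.0, 45.0, 30.0, 50.0])
salaries = np.array([40000.0, 52000.0, np.nan, 80000.0, 48000.0, 90000.0])
cities = np.array(["Pune", "Mumbai", "Pune", "Delhi", "Mumbai", "Delhi"])

# Finding missing data: locate NaNs and fill each with its column mean.
X = np.column_stack((ages, salaries))
missing = np.isnan(X)
col_means = np.nanmean(X, axis=0)
X[missing] = np.take(col_means, np.where(missing)[1])

# Encoding categorical data: map each distinct label to an integer code.
labels, city_codes = np.unique(cities, return_inverse=True)

# Splitting into training and test sets (4 rows for training, 2 for testing).
rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
order = rng.permutation(len(X))
train, test = X[order[:4]], X[order[4:]]

# Feature scaling: standardize with statistics taken from the training rows.
mean, std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(labels, city_codes)
print(train_scaled)
```

In practice these steps are usually done with Pandas and a machine learning library; the NumPy version only shows what each step does to the numbers.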
Python Libraries for Data Preprocessing
• NumPy
• Pandas
• Matplotlib
NumPy: Numerical Python
• NumPy is used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant.
• It is an open source project and we can use it freely.
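Import NumPy
• import numpy
• import numpy as np
    # Plain import: refer to the package by its full name.
    import numpy
    arr = numpy.array([1, 2, 3, 4, 5])
    print(arr)

    # Aliased import: the name to use is now np, not numpy.
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)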
Create a NumPy ndarray Object
• The array object in NumPy is called ndarray.
• We can create a NumPy ndarray object by using the array() function.
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)
    print(type(arr))  # <class 'numpy.ndarray'>
Dimensions in Arrays
• 0-D Arrays
    import numpy as np
    arr = np.array(42)
    print(arr)
• 1-D Arrays
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5])
    print(arr)
Array cont…
• 2-D Arrays
    import numpy as np
    arr = np.array([[1, 2, 3], [4, 5, 6]])
    print(arr)
• 3-D Arrays
    import numpy as np
    arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
    print(arr)
Check Number of Dimensions
• NumPy arrays provide the ndim attribute, which returns an integer telling us how many dimensions the array has.
    import numpy as np
    a = np.array(42)
    b = np.array([1, 2, 3, 4, 5])
    c = np.array([[1, 2, 3], [4, 5, 6]])
    d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
    print(a.ndim)  # 0
    print(b.ndim)  # 1
    print(c.ndim)  # 2
    print(d.ndim)  # 3
NumPy Array Indexing
    import numpy as np
    arr = np.array([1, 2, 3, 4])
    print(arr[0])  # 1

    import numpy as np
    arr = np.array([1, 2, 3, 4])
    print(arr[2] + arr[3])  # 3 + 4 = 7
Cont…
    import numpy as np
    arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    print('2nd element on 1st row: ', arr[0, 1])  # 2

    import numpy as np
    arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    print('5th element on 2nd row: ', arr[1, 4])  # 10
Cont…
    import numpy as np
    arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
    print(arr[0, 1, 2])  # 6

    import numpy as np
    arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    print('Last element from 2nd dim: ', arr[1, -1])  # 10
Arrays, creation
• np.ones, np.zeros
• np.arange
• np.concatenate
• astype (called on an existing array, e.g. arr.astype(float))
• np.zeros_like, np.ones_like
• np.random.random
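A short sketch of the creation helpers listed on this slide. The shapes and values are chosen arbitrarily for illustration.

```python
import numpy as np

ones = np.ones((2, 3))             # 2x3 array filled with 1.0
zeros = np.zeros(4)                # 1-D array of four 0.0 values
steps = np.arange(0, 10, 2)        # [0 2 4 6 8]
joined = np.concatenate((steps, np.arange(3)))  # [0 2 4 6 8 0 1 2]
as_float = steps.astype(float)     # same values, converted to float64
like_ones = np.ones_like(steps)    # shape and dtype copied from steps
like_zeros = np.zeros_like(ones)   # shape and dtype copied from ones
noise = np.random.random((2, 2))   # uniform samples in [0.0, 1.0)

print(ones, zeros, steps, joined, as_float, like_ones, like_zeros, noise, sep="\n")
```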
Arrays, danger zone
• Must be dense, no holes.
• Must be one type.
• Cannot combine arrays of different shape.
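Two of these restrictions are easy to see in a few lines. The sketch below assumes that "combine" refers to joining with np.concatenate; the sample values are made up.

```python
import numpy as np

# One type: mixing ints and a float upcasts every element to float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)        # float64

# Mixing numbers and text converts every element to a string.
texty = np.array([1, "two", 3])
print(texty.dtype.kind)   # U (Unicode string)

# Different shapes: joining a 2x2 array with a 1x3 array along axis 0 fails.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6, 7]])
try:
    np.concatenate((a, b), axis=0)
    combined = True
except ValueError as err:
    combined = False
    print("cannot combine:", err)
```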
Slicing arrays
• Slicing means taking elements from one given index to another given index.
• [start:end]
• [start:end:step]
• If we don't pass start, it is considered 0.
• If we don't pass end, it is considered the length of the array in that dimension.
• If we don't pass step, it is considered 1.
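The three defaults can be checked by comparing a shortened slice with its fully written form. The array values here are arbitrary.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70])

# Omitted start behaves like 0.
same_start = np.array_equal(arr[:3], arr[0:3])
# Omitted end behaves like the length of the array.
same_end = np.array_equal(arr[4:], arr[4:len(arr)])
# Omitted step behaves like 1.
same_step = np.array_equal(arr[1:5], arr[1:5:1])

print(same_start, same_end, same_step)  # True True True
print(arr[1:6:2])                       # [20 40 60]
print(arr[::3])                         # [10 40 70]
```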
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    print(arr[1:5])  # [2 3 4 5]

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    print(arr[4:])  # [5 6 7]

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    print(arr[:4])  # [1 2 3 4]

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    print(arr[-3:-1])  # [5 6]

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    print(arr[1:5:2])  # [2 4]

    import numpy as np
    arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    print(arr[1, 1:4])  # [7 8 9]

    import numpy as np
    arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
    print(arr[0:2, 1:4])  # rows 0-1, columns 1-3
Data Types in NumPy
• strings - used to represent text data; the text is given in quote marks, e.g. "ABCD"
• integer - used to represent integer numbers, e.g. -1, -2, -3
• float - used to represent real numbers, e.g. 1.2, 42.42
• boolean - used to represent True or False
• complex - used to represent complex numbers, e.g. 1.0 + 2.0j, 1.5 + 2.5j
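As a small illustration, NumPy picks one of these types automatically from the values passed to array(). The sample values repeat the ones on the slide; the dictionary is only a convenience for the example.

```python
import numpy as np

samples = {
    "strings": np.array(["ABCD", "xyz"]),
    "integer": np.array([-1, -2, -3]),
    "float": np.array([1.2, 42.42]),
    "boolean": np.array([True, False]),
    "complex": np.array([1.0 + 2.0j, 1.5 + 2.5j]),
}

# dtype.kind is a one-letter code: U = Unicode string, i = integer,
# f = float, b = boolean, c = complex.
kinds = {name: arr.dtype.kind for name, arr in samples.items()}
for name, arr in samples.items():
    print(name, arr.dtype, kinds[name])
```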
Cont…
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype='i4')  # 'i4' requests 4-byte integers
    print(arr)
    print(arr.dtype)  # int32

    import numpy as np
    arr = np.array([1.1, 2.1, 3.1])
    newarr = arr.astype(int)  # fractional parts are dropped
    print(newarr)  # [1 2 3]
    print(newarr.dtype)
NumPy Array Shape/Reshape
    import numpy as np
    arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
    print(arr.shape)  # (2, 4)

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
    newarr = arr.reshape(2, 3, 2)  # 12 elements as 2 blocks of 3 rows x 2 columns
    print(newarr)
NumPy Array Iterating
    import numpy as np
    arr = np.array([[1, 2, 3], [4, 5, 6]])
    for x in arr:  # each x is one row
        print(x)

    import numpy as np
    arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
    for x in arr:
        for y in x:
            for z in y:  # one loop per dimension to reach the scalars
                print(z)
Iterating Arrays Using nditer()
    import numpy as np
    arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
    for x in np.nditer(arr):  # visits every scalar without nested loops
        print(x)

    import numpy as np
    arr = np.array([1, 2, 3])
    for idx, x in np.ndenumerate(arr):  # yields (index tuple, value) pairs
        print(idx, x)
Joining NumPy Arrays
    import numpy as np
    arr1 = np.array([1, 2, 3])
    arr2 = np.array([4, 5, 6])
    arr = np.concatenate((arr1, arr2))
    print(arr)  # [1 2 3 4 5 6]

    import numpy as np
    arr1 = np.array([[1, 2], [3, 4]])
    arr2 = np.array([[5, 6], [7, 8]])
    arr = np.concatenate((arr1, arr2), axis=1)  # join side by side
    print(arr)
Joining Arrays Using Stack Functions
    import numpy as np
    arr1 = np.array([1, 2, 3])
    arr2 = np.array([4, 5, 6])
    arr = np.stack((arr1, arr2), axis=1)
    print(arr)
• Stacking Along Rows
    import numpy as np
    arr1 = np.array([1, 2, 3])
    arr2 = np.array([4, 5, 6])
    arr = np.hstack((arr1, arr2))
    print(arr)
• Stacking Along Columns
    import numpy as np
    arr1 = np.array([1, 2, 3])
    arr2 = np.array([4, 5, 6])
    arr = np.vstack((arr1, arr2))
    print(arr)
• Stacking Along Height (depth)
    import numpy as np
    arr1 = np.array([1, 2, 3])
    arr2 = np.array([4, 5, 6])
    arr = np.dstack((arr1, arr2))
    print(arr)
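Comparing the result shapes may make the difference between the four stacking calls easier to see. This sketch reuses the two 3-element arrays from the slides; the dictionary is only there to print the shapes together.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

shapes = {
    "stack axis=1": np.stack((arr1, arr2), axis=1).shape,  # (3, 2)
    "hstack": np.hstack((arr1, arr2)).shape,               # (6,)
    "vstack": np.vstack((arr1, arr2)).shape,               # (2, 3)
    "dstack": np.dstack((arr1, arr2)).shape,               # (1, 3, 2)
}
for name, shape in shapes.items():
    print(name, shape)
```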
Splitting NumPy Arrays
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6])
    newarr = np.array_split(arr, 3)  # list of three arrays: [1 2], [3 4], [5 6]
    print(newarr)
NumPy Searching Arrays
    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 4, 4])
    x = np.where(arr == 4)  # indexes where the value is 4
    print(x)

    import numpy as np
    arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    x = np.where(arr % 2 == 0)  # indexes where the value is even
    print(x)
Sorting Arrays
    import numpy as np
    arr = np.array([3, 2, 0, 1])
    print(np.sort(arr))  # [0 1 2 3]

    import numpy as np
    arr = np.array(['banana', 'cherry', 'apple'])
    print(np.sort(arr))  # alphabetical order
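np.where with a single condition returns a tuple containing one index array per dimension, which is why the printed result is wrapped in parentheses. A small sketch of unpacking it and using the indexes, with the same sample array as above:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

result = np.where(arr == 4)
(positions,) = result      # 1-D input, so the tuple holds one index array

print(positions)           # [3 5 6]
print(arr[positions])      # [4 4 4]
```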
Random Numbers in NumPy
• What is a Random Number?
  – Random means something that cannot be predicted logically.
• Generate Random Number
    from numpy import random
    x = random.randint(100)  # integer from 0 up to, but not including, 100
    print(x)
Generate Random Float
    from numpy import random
    x = random.rand()  # float in [0.0, 1.0)
    print(x)
• Generate Random Array
    from numpy import random
    x = random.randint(100, size=5)  # 1-D array of 5 random integers
    print(x)
  – x = random.randint(100, size=(3, 5))
  – x = random.rand(3, 5)
  – x = random.choice([3, 5, 7, 9])
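Numbers produced by a program are generated by an algorithm, so they are pseudo-random rather than truly unpredictable. One way to see this is to fix the generator's seed, which makes the same sequence come out again. The sketch below uses the legacy np.random.seed call to match the random.randint style of the slides; the seed value 42 is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)                      # fix the starting state
first = np.random.randint(100, size=5)

np.random.seed(42)                      # same seed, same starting state
second = np.random.randint(100, size=5)

print(first)
print(second)
print(np.array_equal(first, second))    # True
```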

