teach tech toe presents Machine Learning with Python
What is Machine Learning? • Machine Learning means making a computer learn by studying data and statistics. • Machine Learning is a step in the direction of artificial intelligence (AI). • A Machine Learning program analyzes data and learns to predict the outcome.
What is a data set? • Data Set • In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database. • Example of an array: • [99,86,87,88,111,86,103,87,94,78,77,85,86]
Example of a database:
Carname  Color  Age  Speed  AutoPass
BMW      red    5    99     Y
Volvo    black  7    86     Y
VW       gray   8    87     N
VW       white  7    88     Y
Ford     white  2    111    Y
VW       white  17   86     Y
Tesla    red    2    103    Y
BMW      black  9    87     Y
Volvo    gray   4    94     N
Ford     white  11   78     N
Toyota   gray   12   77     N
VW       white  9    85     N
Toyota   blue   6    86     Y
Array vs Dataset • By looking at the array, we can guess that the average value is probably around 80 or 90, and we can also determine the highest and the lowest value, but what else can we do? • By looking at the database, we can see that the most popular color is white and that the oldest car is 17 years old, but what if we could predict whether a car has an AutoPass just by looking at the other values? • That is what Machine Learning is for: analyzing data and predicting the outcome!
Types of Data? • To analyze data, it is important to know what type of data we are dealing with. • We can split the data types into three main categories: • Numerical • Categorical • Ordinal • Numerical data are numbers, and can be split into two numerical categories: • Discrete Data - numbers that are limited to integers. Example: The number of cars passing by. • Continuous Data - numbers that can take any value within a range, not only whole numbers. Example: The price of an item, or the size of an item.
Types of Data? • Categorical data are values that cannot be measured up against each other. Example: a color value, or any yes/no values. • Ordinal data are like categorical data, but can be measured up against each other. Example: school grades where A is better than B and so on. • By knowing the data type of your data source, you will know which technique to use when analyzing it.
Machine Learning - Mean Median Mode • In Machine Learning (and in mathematics) there are often three values that interest us: • Mean - The average value • Median - The mid point value • Mode - The most common value • Example: We have registered the speed of 13 cars: • speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
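The three values can be computed for the 13 speeds above with a short sketch. It assumes NumPy is installed and uses the standard-library statistics module for the mode; the variable names are chosen here for illustration.

```python
import statistics

import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_speed = np.mean(speed)          # sum of all values divided by their count
median_speed = np.median(speed)      # middle value once the list is sorted
mode_speed = statistics.mode(speed)  # value that appears most often

print(mean_speed, median_speed, mode_speed)
```

For this list the median is 87 and the mode is 86 (it appears three times); the mean is a little under 90.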
Standard Deviation: • Standard deviation is a number that describes how spread out the values are. • A low standard deviation means that most of the numbers are close to the mean (average) value. • A high standard deviation means that the values are spread out over a wider range. • Example: This time we have registered the speed of 7 cars: • speed = [86,87,88,86,87,85,86] • The standard deviation is: • 0.9
Standard Deviation: • Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4. • Let us do the same with a selection of numbers with a wider range: • speed = [32,111,138,28,59,77,97] • The standard deviation is: 37.85 • Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4. • As you can see, a higher standard deviation indicates that the values are spread out over a wider range. • The NumPy module has a method to calculate the standard deviation:
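A minimal sketch of that calculation, assuming NumPy is installed. Note that np.std() computes the population standard deviation by default (it divides by the number of values), which matches the figures on the slides. The names narrow and wide are illustrative.

```python
import numpy as np

narrow = [86, 87, 88, 86, 87, 85, 86]    # speeds close to each other
wide = [32, 111, 138, 28, 59, 77, 97]    # speeds spread over a wide range

narrow_std = np.std(narrow)  # population standard deviation (ddof=0)
wide_std = np.std(wide)

print(round(narrow_std, 2), round(wide_std, 2))
```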
Variance • Variance is another number that indicates how spread out the values are. • In fact, if you take the square root of the variance, you get the standard deviation! • Or the other way around, if you multiply the standard deviation by itself, you get the variance! • To calculate the variance you have to do as follows:
Variance • 1. Find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
• 2. For each value, find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
Variance • 3. For each difference, find the square value: • (-45.4)² = 2061.16 • (33.6)² = 1128.96 • (60.6)² = 3672.36 • (-49.4)² = 2440.36 • (-18.4)² = 338.56 • (-0.4)² = 0.16 • (19.6)² = 384.16
Variance • 4. The variance is the average of these squared differences: • (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.25 • Luckily, NumPy has a method to calculate the variance:
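The four steps can be sketched in plain Python and checked against NumPy. This assumes NumPy is installed; np.var() computes the population variance by default, which is what the steps above describe. Because the code does not round the mean to 77.4, its result differs from the hand calculation in the second decimal.

```python
import numpy as np

speed = [32, 111, 138, 28, 59, 77, 97]

# Step 1: find the mean
mean = sum(speed) / len(speed)

# Step 2: difference of each value from the mean
differences = [value - mean for value in speed]

# Step 3: square each difference
squared = [diff ** 2 for diff in differences]

# Step 4: the variance is the average of the squared differences
variance_by_hand = sum(squared) / len(squared)

# NumPy performs the same calculation in one call
variance_numpy = np.var(speed)

# The square root of the variance is the standard deviation
std_from_variance = np.sqrt(variance_numpy)

print(variance_by_hand, variance_numpy, std_from_variance)
```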
Standard Deviation • As we have learned, the formula to find the standard deviation is the square root of the variance: • √1432.25 = 37.85
Machine Learning - Percentiles • What are Percentiles? • Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than. • Example: Let's say we have an array of the ages of all the people who live on a street. • ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31] • What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.
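As a sketch, the 75th percentile of the ages above can be computed with NumPy's percentile() function, assuming NumPy is installed. By default it interpolates linearly between neighboring values when the percentile falls between two of them.

```python
import numpy as np

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82,
        32, 2, 8, 6, 25, 36, 27, 61, 31]

# Value below which 75% of the ages fall
p75 = np.percentile(ages, 75)

print(p75)
```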
Machine Learning - Data Distribution • Data Distribution • In the real world, the data sets are much bigger than we have seen so far, but it can be difficult to gather real world data, at least at an early stage of a project. • How Can we Get Big Data Sets? • To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets, of any size.
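One possible way to generate such a data set is sketched below, assuming NumPy is installed. The range 0.0 to 5.0, the size of 250, and the seed are arbitrary choices made here for illustration.

```python
import numpy as np

# A seeded generator makes the "random" data reproducible between runs
rng = np.random.default_rng(seed=42)

# 250 random floats drawn uniformly between 0.0 and 5.0
values = rng.uniform(0.0, 5.0, 250)

print(values[:5])
```

Changing the last argument gives a data set of any other size.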
Histogram • To visualize the data set we can draw a histogram with the data we collected. • We will use the Python module Matplotlib to draw a histogram:
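A minimal sketch of a histogram, assuming NumPy and Matplotlib are installed. The data set, the choice of 5 bars, and the output file name are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed for reproducibility
values = rng.uniform(0.0, 5.0, 250)    # 250 random floats between 0.0 and 5.0

# Draw a histogram with 5 bars; hist() also returns the counts per bar
counts, bin_edges, _ = plt.hist(values, bins=5)

# Hypothetical output file; in an interactive session plt.show() works too
plt.savefig("histogram.png")

print(counts)
```

Each bar shows how many of the 250 values fall into that part of the range.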
Normal Data Distribution: • Now, we will learn how to create an array where the values are concentrated around a given value. • In probability theory this kind of data distribution is known as the normal data distribution, or the Gaussian data distribution, after the mathematician Carl Friedrich Gauss who came up with the formula of this data distribution.
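A sketch of creating normally distributed data, assuming NumPy is installed. The mean of 5.0, the standard deviation of 1.0, the size of 100000, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for reproducibility

# 100000 values concentrated around 5.0 with a standard deviation of 1.0
values = rng.normal(5.0, 1.0, 100000)

print(values.mean(), values.std())
```

With this many samples, the measured mean and standard deviation land very close to the requested 5.0 and 1.0.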
Machine Learning - Scatter Plot • Scatter Plot • A scatter plot is a diagram where each value in the data set is represented by a dot.
Machine Learning - Scatter Plot • The Matplotlib module has a method for drawing scatter plots. It needs two arrays of the same length, one for the values of the x-axis and one for the values of the y-axis: • x = [5,7,8,7,2,17,2,9,4,11,12,9,6] • y = [99,86,87,88,111,86,103,87,94,78,77,85,86] • The x array represents the age of each car. • The y array represents the speed of each car.
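The scatter plot for these two arrays could be drawn as follows, assuming Matplotlib is installed. The axis labels and the output file name are additions made for this sketch.

```python
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]             # age of each car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # speed of each car

plt.scatter(x, y)       # one dot per (age, speed) pair
plt.xlabel("Age")
plt.ylabel("Speed")

# Hypothetical output file; in an interactive session plt.show() works too
plt.savefig("scatter.png")
```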
Machine Learning - Linear Regression • The term regression is used when you try to find the relationship between variables. • In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events. • Linear Regression: • Linear regression uses the relationship between the data points to draw the straight line that fits them best. • This line can be used to predict future values.
How Does it Work? • Python has methods for finding a relationship between data points and for drawing a line of linear regression. We will show you how to use these methods instead of going through the mathematical formula. • In the example below, the x-axis represents age, and the y-axis represents speed. We have registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected could be used in a linear regression:
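A minimal sketch of the fit, assuming SciPy is installed and reusing the age and speed arrays from the scatter plot slide. It uses scipy.stats.linregress(), which returns the slope and intercept of the best-fitting straight line.

```python
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]             # age of each car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # speed of each car

# Least-squares fit of a straight line: speed = slope * age + intercept
result = stats.linregress(x, y)
slope = result.slope
intercept = result.intercept

# Speed on the regression line for every age in x
line = [slope * age + intercept for age in x]

print(slope, intercept)
```

The slope is negative for this data, which means that older cars tend to pass more slowly.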
R-Squared • It is important to know how strong the relationship between the values of the x-axis and the values of the y-axis is. If there is no relationship, the linear regression cannot be used to predict anything. • The relationship is measured with a value called the r-squared. • The r-squared value ranges from 0 to 1, where 0 means no relationship, and 1 means 100% related. • Python and the SciPy module will compute the correlation coefficient r for you (squaring it gives the r-squared value); all you have to do is feed it the x and y values:
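As a sketch, assuming SciPy is installed: scipy.stats.linregress() reports the correlation coefficient r as rvalue, and squaring it gives r-squared. The data is the age and speed example from the earlier slides.

```python
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]             # age of each car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # speed of each car

result = stats.linregress(x, y)

r = result.rvalue   # correlation coefficient, between -1 and 1
r_squared = r ** 2  # share of the variation in speed explained by age

print(r, r_squared)
```

Here r is about -0.76, so r-squared is about 0.58: there is a relationship, though not a perfect one.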
Predict Future Values • Now we can use the information we have gathered to predict future values. • Example: Let us try to predict the speed of a 10-year-old car. • To do so, we need the same myfunc() function from the example above:
def myfunc(x):
    return slope * x + intercept
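Put together with the fit, the prediction could look like the sketch below. It assumes SciPy is installed and reuses the age and speed arrays from the earlier slides.

```python
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]             # age of each car
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # speed of each car

result = stats.linregress(x, y)
slope = result.slope
intercept = result.intercept


def myfunc(x):
    # Speed on the regression line for a car of age x
    return slope * x + intercept


speed = myfunc(10)  # predicted speed of a 10-year-old car

print(speed)
```

The predicted speed is roughly 85.6.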
Machine Learning - Polynomial Regression • If your data points clearly do not fit a linear regression (a straight line through the data points), polynomial regression might be a better choice. • Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points.
How Does it Work? • Python has methods for finding a relationship between data points and for drawing a line of polynomial regression. We will show you how to use these methods instead of going through the mathematical formula. • In the example below, we have registered 18 cars as they were passing a certain tollbooth. • We have registered each car's speed and the time of day (hour) the passing occurred. • The x-axis represents the hours of the day and the y-axis represents the speed:
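A sketch of a polynomial fit, assuming NumPy is installed. The slide's data is not shown, so the 18 hour and speed values below are illustrative assumptions, as is the choice of a third-degree polynomial. np.polyfit() finds the coefficients and np.poly1d() turns them into a callable model.

```python
import numpy as np

# Assumed illustrative data: hour of day and speed for 18 cars
hours = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
speeds = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# Fit a polynomial of degree 3 (an assumption) to the data
coefficients = np.polyfit(hours, speeds, 3)
mymodel = np.poly1d(coefficients)

# Evaluate the fitted curve at 100 evenly spaced hours, e.g. for plotting
curve_hours = np.linspace(1, 22, 100)
curve_speeds = mymodel(curve_hours)

print(coefficients)
```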
R-Squared • It is important to know how strong the relationship between the values of the x- and y-axis is. If there is no relationship, the polynomial regression cannot be used to predict anything. • The relationship is measured with a value called the r-squared. • The r-squared value ranges from 0 to 1, where 0 means no relationship, and 1 means 100% related. • Python and the sklearn module will compute this value for you; all you have to do is feed it the x and y arrays:
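One way to compute r-squared for a polynomial fit is sketched below, assuming NumPy and scikit-learn are installed. It uses r2_score() from sklearn.metrics. The data values and the polynomial degree are illustrative assumptions, since the slide's own data is not shown.

```python
import numpy as np
from sklearn.metrics import r2_score

# Assumed illustrative data: hour of day and speed for 18 cars
hours = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
speeds = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# Third-degree polynomial model (the degree is an assumption)
mymodel = np.poly1d(np.polyfit(hours, speeds, 3))

# Compare the actual speeds with the speeds the model gives for the same hours
r_squared = r2_score(speeds, mymodel(hours))

print(r_squared)
```

A value close to 1 suggests the curve follows the data well; a value close to 0 suggests it does not.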
Predict Future Values • Now we can use the information we have gathered to predict future values. • Example: Let us try to predict the speed of a car that passes the tollbooth at around 17:00 (5 P.M.): • To do so, we need the same mymodel model from the example above:
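A sketch of that prediction, assuming NumPy is installed. The data values and the third-degree polynomial are illustrative assumptions; only the name mymodel comes from the slide.

```python
import numpy as np

# Assumed illustrative data: hour of day and speed for 18 cars
hours = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
speeds = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# Third-degree polynomial model (the degree is an assumption)
mymodel = np.poly1d(np.polyfit(hours, speeds, 3))

# Predicted speed of a car passing at hour 17
speed_at_17 = mymodel(17)

print(speed_at_17)
```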
Machine Learning - Multiple Regression • Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables. • Take a look at the data set below. It contains some information about cars. • https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv • We can predict the CO2 emission of a car based on the size of the engine, but with multiple regression we can throw in more variables, like the weight of the car, to make the prediction more accurate.
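A minimal sketch of multiple regression, assuming NumPy and scikit-learn are installed. To stay self-contained it does not download the CSV file. The handful of rows below, the new car being predicted, and all variable names are hypothetical values made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample rows: [weight, engine volume in ccm] for each car
X = np.array([
    [790, 1000],
    [1160, 1200],
    [929, 1000],
    [865, 900],
    [1140, 1500],
    [1109, 1400],
    [1365, 1500],
])
# Hypothetical CO2 emission for each of the cars above
y = np.array([99, 95, 95, 90, 105, 90, 92])

# Fit one coefficient per independent variable, plus an intercept
model = LinearRegression()
model.fit(X, y)

# Predict the CO2 emission of a car with weight 2300 and volume 1300 ccm
predicted_co2 = model.predict([[2300, 1300]])[0]

print(model.coef_, predicted_co2)
```

The two entries of coef_ describe how much the predicted CO2 changes when the weight or the volume increases by one unit while the other stays fixed.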
Machine Learning • In the next video we will cover: • 1.) Scale • 2.) Train / Test • 3.) Decision Tree
teach tech toe presents Machine Learning with Python- 2
Machine Learning - Scale • When the data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time? • The answer to this problem is scaling. We can scale data into new values that are easier to compare. • Take a look at the table below. It is the same data set that we used in the multiple regression, but this time the volume column contains values in liters instead of ccm (1.0 instead of 1000).
Machine Learning - Scale • It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into comparable values, we can easily see how one value relates to the other. • There are different methods for scaling data. In this tutorial we will use a method called standardization. • The standardization method uses this formula: • z = (x - u) / s
Machine Learning - Scale • Where z is the new value, x is the original value, u is the mean and s is the standard deviation. • If you take the weight column from the data set above, the first value is 790, and the scaled value will be: • (790 - 1292.23) / 238.74 = -2.1 • If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be: • (1.0 - 1.61) / 0.38 = -1.59 (the mean and standard deviation shown here are rounded; recomputing from the rounded figures gives about -1.61) • Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.
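The formula can be tried out directly with the figures from the slides. The helper name standardize is hypothetical. Because the mean and standard deviation quoted on the slide are rounded, the volume result comes out near -1.61 rather than the -1.59 shown there.

```python
def standardize(x, mean, std):
    # z = (x - u) / s: distance from the mean, measured in standard deviations
    return (x - mean) / std


# Weight column: first value 790, mean 1292.23, standard deviation 238.74
weight_z = standardize(790, 1292.23, 238.74)

# Volume column: first value 1.0, mean 1.61, standard deviation 0.38
volume_z = standardize(1.0, 1.61, 0.38)

print(round(weight_z, 2), round(volume_z, 2))
```

Both results are negative, which shows that this car is lighter and has a smaller engine than the average car in the data set.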
Machine Learning - Scale • You do not have to do this manually. The Python sklearn module has a class called StandardScaler; calling StandardScaler() returns a scaler object with methods for transforming data sets.
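A sketch of StandardScaler in use, assuming NumPy and scikit-learn are installed. The five rows of weight and volume values are hypothetical and stand in for the full data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows: [weight, engine volume in liters]
X = np.array([
    [790, 1.0],
    [1160, 1.2],
    [929, 1.0],
    [865, 0.9],
    [1140, 1.5],
])

scaler = StandardScaler()

# Learn the mean and standard deviation of each column, then apply z = (x - u) / s
scaled_X = scaler.fit_transform(X)

print(scaled_X)
```

After scaling, every column has a mean of 0 and a standard deviation of 1, so weight and volume can be compared directly.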
Machine Learning - Scale • Predict CO2 Values • The task in the Multiple Regression was to predict the CO2 emission from a car when you only knew its weight and volume. • When the data set is scaled, you will have to use the scale when you predict values:
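A sketch of predicting with scaled data, assuming NumPy and scikit-learn are installed. The rows, the CO2 values, and the new car (weight 2300, volume 1.3 liters) are hypothetical. The key point is that the new car's values go through the same fitted scaler before they are passed to the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows: [weight, engine volume in liters]
X = np.array([
    [790, 1.0],
    [1160, 1.2],
    [929, 1.0],
    [865, 0.9],
    [1140, 1.5],
    [1109, 1.4],
    [1365, 1.5],
])
# Hypothetical CO2 emission for each of the cars above
y = np.array([99, 95, 95, 90, 105, 90, 92])

# Scale the training data and remember the scale
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)

# Train the regression on the scaled values
model = LinearRegression()
model.fit(scaled_X, y)

# Scale the new car with the SAME scaler, then predict its CO2 emission
new_car = scaler.transform([[2300, 1.3]])
predicted_co2 = model.predict(new_car)[0]

print(predicted_co2)
```

Calling transform() rather than fit_transform() on the new car is deliberate: the new value must be expressed on the scale learned from the training data.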
Machine Learning with Python made easy and simple
