KNN Classification From Scratch in Python
The k-Nearest Neighbors classification algorithm is one of the most useful but simplest algorithms to implement. In this article, we will implement KNN classification from scratch in Python.
What is the K-Nearset Neighbors Classification Algorithm?
K-Nearest Neighbors (KNN) is a popular classification algorithm in machine learning that belongs to the family of instance-based learning or lazy learning algorithms. KNN is a simple, non-parametric, and easy-to-understand algorithm that is often used for solving classification problems in machine learning.
In the KNN algorithm, the classification of a new instance is based on the majority class of its K nearest neighbors in the training data. K is a hyperparameter that represents the number of nearest neighbors to consider. The KNN algorithm works as follows:
- Calculate the distance between the new instance and all the instances in the training data using a distance metric such as Euclidean distance, Manhattan distance, or Minkowski distance.
- Select the K data points from the training dataset that are closest to the new data point based on the calculated distance.
- Classify the new data point by assigning it to the majority class among its K nearest neighbors. In the case of a tie, the class is randomly assigned.
- Repeat steps 1-3 for all new data points that we need to classify.
KNN is a simple and intuitive algorithm that works well on small to medium-sized datasets with a small number of classes. However, it can be computationally expensive for large datasets and high-dimensional feature spaces. In addition, the KNN algorithm requires the use of a distance metric, which may not be suitable for all types of data.
To learn the algorithm, advantages, and disadvantages of KNN classification, you can read this article on KNN classification numerical example. You can also read this article on the implementation of the KNN classification using the sklearn module in python.
Let us now implement the KNN classification algorithm from scratch in python.
How to Implement KNN classification from scratch in Python?
To implement the KNN classification algorithm from scratch in python, we will use the following steps.
- First, we will load the training dataset into the program and separate the features and class labels.
- Next, we will calculate the distance between the new data point and all the existing data points in the training data set.
- After this, we will select the k nearest neighbors of the new data point in the training set.
- Finally, we will find the class label for the new data point using the majority of the class labels of k nearest neighbors.
Let us implement each step one by one.
Load the Training Dataset
For this example, we will use the following dataset.
| Point | X_Coordinate | Y_Coordinate | Class Label |
| A1 | 2 | 10 | C2 |
| A2 | 2 | 6 | C1 |
| A3 | 11 | 11 | C3 |
| A4 | 6 | 9 | C2 |
| A5 | 6 | 5 | C1 |
| A6 | 1 | 2 | C1 |
| A7 | 5 | 10 | C2 |
| A8 | 4 | 9 | C2 |
| A9 | 10 | 12 | C3 |
| A10 | 7 | 5 | C1 |
| A11 | 9 | 11 | C3 |
| A12 | 4 | 6 | C1 |
| A13 | 3 | 10 | C2 |
| A15 | 3 | 8 | C2 |
| A15 | 6 | 11 | C2 |
This dataset contains 15 data points with their coordinates and class labels. Now, let us predict the class label for a new data point (5, 7) by implementing KNN classification from scratch in python.
We have saved the training dataset in the following CSV file.
To load the dataset into our program, we will use the read_csv() function defined in the pandas module. The read_csv() function takes the name of the CSV file as its input argument and returns a data frame as shown below.
import pandas as pd training_data=pd.read_csv("KNN_Dataset.csv") print("The training data is:") print(training_data) Output:
The training data is: Point X_Coordinate Y_Coordinate Class Label 0 A1 2 10 C2 1 A2 2 6 C1 2 A3 11 11 C3 3 A4 6 9 C2 4 A5 6 5 C1 5 A6 1 2 C1 6 A7 5 10 C2 7 A8 4 9 C2 8 A9 10 12 C3 9 A10 7 5 C1 10 A11 9 11 C3 11 A12 4 6 C1 12 A13 3 10 C2 13 A15 3 8 C2 14 A15 6 11 C2 In the above dataset, the name of the points is not a data attribute but an identifier for the points. So, we will convert the Point column of the dataframe to an index. For this, we will use the set_index() method. The set_index() method, when invoked on a dataframe, takes a column name as its input argument. After execution, it converts the column into an index column and returns the modified dataframe. You can observe this in the following example.
training_data=training_data.set_index("Point") print("The training data is:") print(training_data) Output:
The training data is: X_Coordinate Y_Coordinate Class Label Point A1 2 10 C2 A2 2 6 C1 A3 11 11 C3 A4 6 9 C2 A5 6 5 C1 A6 1 2 C1 A7 5 10 C2 A8 4 9 C2 A9 10 12 C3 A10 7 5 C1 A11 9 11 C3 A12 4 6 C1 A13 3 10 C2 A15 3 8 C2 A15 6 11 C2 The above dataset is pretty simple and doesn’t require any data cleaning. If your data set contains missing values or any other abnormalities, you should perform data preprocessing first to make sure that the data is suitable for classification tasks.
Now, we will extract the attributes columns and the class labels from the training dataset as shown below.
data_points=training_data[["X_Coordinate","Y_Coordinate"]].applymap(int) print("The data points are:") print(data_points) class_labels=training_data["Class Label"] print("The class labels are:") print(class_labels) Output:
The data points are: X_Coordinate Y_Coordinate Point A1 2 10 A2 2 6 A3 11 11 A4 6 9 A5 6 5 A6 1 2 A7 5 10 A8 4 9 A9 10 12 A10 7 5 A11 9 11 A12 4 6 A13 3 10 A15 3 8 A15 6 11 The class labels are: Point A1 C2 A2 C1 A3 C3 A4 C2 A5 C1 A6 C1 A7 C2 A8 C2 A9 C3 A10 C1 A11 C3 A12 C1 A13 C2 A15 C2 A15 C2 Name: Class Label, dtype: object Calculate the Distance Between the New Data Point and the Points in the Training Data
Now, we will calculate the distance between the new data point and the existing data points in the training data. For this, we will define a function that takes the new data point and the training dataset as its input argument and executes the following steps.
- First, we will define an empty list named “
distances” to store the distance of the new data point from existing data points. - Next, we will iterate through the rows of the training data using the
iterrows()function. While iterating, we will extract the coordinates of the data points. - Once we get the coordinates of the data points in the training data, we will calculate the distance between the data point in the training data and the new data point. For this, we will use the
dist()function defined in the math module. Thedist()function takes two data points as its input arguments and returns the euclidean distance between them. - We will save the distances in the list
“distances”and return this list from the function.
Once we get the list of distances, we will insert it into the training dataset as a column. You can observe this in the following code.
import math new_data_point=(5,7) def distance_calculation(data_points,new_data_point): distances=list() for index,row in data_points.iterrows(): point=(row["X_Coordinate"],row["Y_Coordinate"]) distance=math.dist(point, new_data_point) distances.append(distance) return distances distance_list=distance_calculation(data_points,new_data_point) training_data["distances"]=distance_list print("The dataset with distances from new data point is:") print(training_data) Output:
The dataset with distances from new data point is: X_Coordinate Y_Coordinate Class Label distances Point A1 2 10 C2 4.242641 A2 2 6 C1 3.162278 A3 11 11 C3 7.211103 A4 6 9 C2 2.236068 A5 6 5 C1 2.236068 A6 1 2 C1 6.403124 A7 5 10 C2 3.000000 A8 4 9 C2 2.236068 A9 10 12 C3 7.071068 A10 7 5 C1 2.828427 A11 9 11 C3 5.656854 A12 4 6 C1 1.414214 A13 3 10 C2 3.605551 A15 3 8 C2 2.236068 A15 6 11 C2 4.123106 Select K Nearest Neighbors of The New Data Point
In the next step to implement KNN classification from scratch in python, we will find the k nearest neighbors of the new data point. As we have already calculated the distance between the data points in the training data and the new data point, we will sort the training data according to their distance from the new data point.
To sort the training data, we will use the sort_values() method. The sort_values() method, when invoked on the training dataframe, takes the take the name of distance column as its input argument and returns the sorted dataframe by distance.
After sorting, we will select the k-nearest data points from the sorted data using the iloc attribute of the dataframe. You can observe this in the following code.
training_data=training_data.sort_values(by="distances") print("The dataset sorted by distance from new data point is:") print(training_data) k=3 nearest_data_points=training_data.iloc[0:k] print("The nearest data points are:") print(nearest_data_points) Output:
The dataset sorted by distance from new data point is: X_Coordinate Y_Coordinate Class Label distances Point A12 4 6 C1 1.414214 A4 6 9 C2 2.236068 A5 6 5 C1 2.236068 A8 4 9 C2 2.236068 A15 3 8 C2 2.236068 A10 7 5 C1 2.828427 A7 5 10 C2 3.000000 A2 2 6 C1 3.162278 A13 3 10 C2 3.605551 A15 6 11 C2 4.123106 A1 2 10 C2 4.242641 A11 9 11 C3 5.656854 A6 1 2 C1 6.403124 A9 10 12 C3 7.071068 A3 11 11 C3 7.211103 The nearest data points are: X_Coordinate Y_Coordinate Class Label distances Point A12 4 6 C1 1.414214 A4 6 9 C2 2.236068 A5 6 5 C1 2.236068 Find The Class Label of The New Data Point
In the final step to implement the KNN classification algorithm from scratch in python, we have to find the class label of the new data point. For this, we will select the class labels of the k-nearest data points. Then, we will find the mode of the class labels. For this, we will use the mode() function defined in the statistics module.
The mode() function takes the class labels of the k nearest data points and returns the mode i.e. the most frequent class label. We will assign the class label returned by the mode() function to the new data point.
You can observe this in the following example.
neighbor_class_labels=nearest_data_points["Class Label"] print("The k nearest class labels are") print(neighbor_class_labels) import statistics predicted_class_label=statistics.mode(neighbor_class_labels) print("Predicted class label for new data point is:") print(predicted_class_label) Output:
The k nearest class labels are Point A12 C1 A4 C2 A5 C1 Name: Class Label, dtype: object Predicted class label for new data point is: C1 Hence, point (5, 7) is assigned the class C1 after implementing the KNN classification algorithm from scratch in Python.
Conclusion
In this article, we have discussed the implementation of the KNN classification algorithm from scratch in Python. To learn more about machine learning algorithms, you can read this article on k-means clustering numerical example. You might also like this article on KNN regression using the sklearn module in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!
