K-Nearest Neighbor Classifier — Implement Homemade Class & Compare with Sklearn Import

Blakelobato
3 min read · Oct 25, 2020


What is K-Nearest Neighbors?

[Image: K-Nearest Neighbors illustration, courtesy of DataCamp]

K-Nearest Neighbors is an algorithm used in both classification and regression problems; this example application uses it as a classifier. K-NN is a form of supervised learning and is considered lazy learning, since it simply stores the training data and defers computation until a prediction is requested. In classification, the output is a class membership: the object in question is assigned a class based on the classes of its neighbors, and the behavior can be tuned by changing the number of neighbors (K). In regression, the output is a property value for the object, computed as the average of the values of its K nearest neighbors.
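As a quick sketch of the two prediction rules (a toy example with made-up neighbor labels and values, not the code from this post):

import numpy as np

# Suppose the K = 5 nearest neighbors have already been found
neighbor_classes = np.array([1, 0, 1, 1, 0])            # their class labels (classification)
neighbor_values = np.array([3.2, 2.8, 3.5, 3.0, 2.9])   # their target values (regression)

# Classification: predict the most common class among the neighbors
labels, counts = np.unique(neighbor_classes, return_counts=True)
predicted_class = labels[np.argmax(counts)]   # -> 1

# Regression: predict the average of the neighbors' values
predicted_value = neighbor_values.mean()      # -> 3.08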

~Objective~

The goal here was to create a model that takes in rows of feature values, computes the Euclidean distances between them, and outputs a prediction of the proper class. The results were then compared to the KNN classifier that can be imported from the Scikit-Learn library.
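For reference, the Euclidean distance between two rows is the square root of the sum of the squared differences of their features. A minimal NumPy sketch (the row values here are made up):

import numpy as np

# Two hypothetical rows of feature values
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.2, 2.9, 4.3, 1.3])

# Square the feature differences, sum them, and take the square root
distance = np.sqrt(np.sum((a - b) ** 2))   # equivalently: np.linalg.norm(a - b)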

~Implementation~

The class was implemented to fit the data, predict on the data, and display the data. A class method calculates the Euclidean distance used to compare rows. When the model is fit, the input data are stored as X_train and y_train (target). To predict, the model calculates the distance from a query row to every training row, sorts those distances to find the K nearest neighbors, counts the classes among those neighbors, and outputs the most common class. Lastly, a quick function was created to display the nearest neighbors and their distances, similar to the Scikit-Learn model, so that comparison is easy across the three datasets tested: the Iris, Breast Cancer, and Wine datasets from Scikit-Learn. The results and comparison between the imported model and my model are shown below for the listed datasets (a sketch of how such a comparison can be scored follows the list):

Iris Dataset Comparison
Breast Cancer Dataset Comparison
Wine Dataset Comparison
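
The exact comparison script is in the GitHub repo; as a rough sketch of how the two models might be scored side by side on one of these datasets (the train/test split shown here is an assumption, and the k_nn class is the one defined below):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Homemade model (class implementation shown below)
homemade = k_nn(num_neighbors=5)
homemade.fit_knn(X_train, y_train)
homemade_acc = accuracy_score(y_test, homemade.predict_knn(X_test))

# Imported Scikit-Learn model
sk_model = KNeighborsClassifier(n_neighbors=5)
sk_model.fit(X_train, y_train)
sklearn_acc = accuracy_score(y_test, sk_model.predict(X_test))

print(f"homemade: {homemade_acc:.3f}   sklearn: {sklearn_acc:.3f}")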

Results indicate my model performs as well as, if not better than, the imported model on these rudimentary tests. Please feel free to take this code to the next level and see what you can make of it! The code for the model, the testing, and this blog can be found on GitHub. The class implementation can be seen below:

# Imports
import numpy as np


# Class
class k_nn:
    # Calc euclidean distance to compare neighbors: Fit - predict - display
    def __init__(self, num_neighbors=5):
        """Init definition"""
        self.num_neighbors = num_neighbors

    # Euclidean distance
    def euclidean_distance(self, a, b):
        """Returns euclidean distance between rows"""
        euclidean_distance_sum = 0.0  # initial value
        for i in range(len(a)):
            # subtract - square - add to euclidean_distance_sum
            euclidean_distance_sum += (a[i] - b[i]) ** 2
        euclidean_distance = np.sqrt(euclidean_distance_sum)
        return euclidean_distance

    # Fit k Nearest Neighbors
    def fit_knn(self, X_train, y_train):
        """Fits the model using training data. X_train and y_train inputs for func"""
        self.X_train = X_train
        self.y_train = y_train

    # Predict X for kNN
    def predict_knn(self, X):
        """Return predictions for X based on the fit X_train and y_train data"""
        # initialize prediction_knn as empty list
        prediction_knn = []
        for i in range(len(X)):
            # initialize euclidean_distance as empty list
            euclidean_distance = []
            for row in self.X_train:
                # find distance from each training row to X[i] using
                # euclidean_distance() and append to the euclidean_distance list
                euclidean_distance_sum = self.euclidean_distance(row, X[i])
                euclidean_distance.append(euclidean_distance_sum)
            # indices of the num_neighbors closest training rows
            neighbors = np.array(euclidean_distance).argsort()[: self.num_neighbors]
            # dict to count class occurrences in y_train among the neighbors
            neighbor_count = {}
            for num in neighbors:
                if self.y_train[num] in neighbor_count:
                    neighbor_count[self.y_train[num]] += 1
                else:
                    neighbor_count[self.y_train[num]] = 1
            # append the max-count label to prediction_knn
            prediction_knn.append(max(neighbor_count, key=neighbor_count.get))
        return prediction_knn

    # display list of nearest neighbors & euclidean distances
    def display_knn(self, x):
        """Inputs -- x // outputs a list w/ nearest neighbors and euclidean distances."""
        # initialize euclidean_distance as empty list
        euclidean_distance = []
        for row in self.X_train:
            euclidean_distance_sum = self.euclidean_distance(row, x)
            euclidean_distance.append(euclidean_distance_sum)
        # indices of the num_neighbors closest training rows
        neighbors = np.array(euclidean_distance).argsort()[: self.num_neighbors]
        # empty display_knn_values list
        display_knn_values = []
        for i in range(len(neighbors)):
            n_i = neighbors[i]
            e_dist = euclidean_distance[n_i]  # distance to that neighbor, indexed by n_i
            display_knn_values.append((n_i, e_dist))  # list of (index, distance) tuples
        return display_knn_values
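
A small usage example (the training rows and query row here are made up) showing the prediction and the display helper, which returns (training-row index, distance) tuples for the nearest neighbors:

import numpy as np

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [8.0, 9.0], [9.0, 9.0]])
y_train = np.array([0, 0, 0, 1, 1])

model = k_nn(num_neighbors=3)
model.fit_knn(X_train, y_train)

print(model.predict_knn(np.array([[2.5, 3.0]])))   # -> [0]
print(model.display_knn(np.array([2.5, 3.0])))     # three (index, distance) tuples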
