K-NEAREST NEIGHBOR ALGORITHM

  The K-nearest neighbor (KNN) algorithm is a non-parametric classification algorithm. It is also known as a lazy learning algorithm because it defers all computation until prediction time.

KNN uses a database in which the data points are separated into several classes to predict the classification of a new sample point.
The technique is non-parametric, meaning it makes no assumptions about the underlying data distribution. In other words, the model structure is determined by the data.

KNN can be, and often should be, one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data. In short, the KNN algorithm is based on feature similarity.

The algorithm uses the information from neighboring points to predict the target class.


K-nearest neighbor classification: step-by-step procedure



First, choose the number K of neighbors. Then take the K nearest neighbors of the new data point according to the Euclidean distance.
Next, count how many of those K neighbors fall into each category. Finally, assign the new data point to the category with the most neighbors.
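The steps above can be sketched as a small, self-contained Python function (a minimal illustration using only the standard library; the function name and toy data are hypothetical):

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    # Step 1/2: compute the Euclidean distance from the query to every
    # training point and take the indices of the K nearest neighbors
    dists = [math.dist(query, p) for p in train_points]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Step 3: count how many of those neighbors fall into each category
    votes = Counter(train_labels[i] for i in nearest)
    # Step 4: assign the category with the most neighbors
    return votes.most_common(1)[0][0]

# toy data: two classes in 2-D
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=3))  # → A
```

Note that there is no training step at all: the "model" is just the stored data, which is exactly why KNN is called a lazy algorithm.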


How to choose the value of K?

Selecting the value of K in K-nearest neighbors is the most critical problem. A small value of K means that noise has a higher influence on the result, i.e., the probability of overfitting is very high. A large value of K makes the algorithm computationally expensive and defeats the basic idea behind KNN (that points that are near are likely to have similar classes). A simple approach to selecting k is k = n^(1/2), where n is the number of training samples.


Advantages of the K-nearest neighbors algorithm

  • KNN is one of the simplest algorithms to implement.
  • KNN executes quickly for small training data sets.
  • It needs no prior knowledge about the structure of the data in the training set.
  • No retraining is required when a new training pattern is added to the existing training set.

Limitations of the K-nearest neighbors algorithm

  • When the training set is large, it may require a lot of storage space.
  • It is computationally expensive because the algorithm stores all (or almost all) of the training data.
  • It has a high memory requirement.
  • The prediction stage can be slow.