[8] Understanding K-Nearest Neighbors (KNN)



KNN

KNN is a typical 'lazy learner'. Instead of learning a discriminant function from the training data, it simply stores the training dataset in memory. Thanks to this, the training phase costs nothing; the computational cost is paid in the prediction phase instead.

 

Memory-based classifiers have the advantage of adapting immediately to newly collected training data. On large datasets, however, storage can become a problem.

 

 

Parametric Model and Nonparametric Model

Parametric models estimate their parameters from the training dataset in order to learn a function that can classify new data points. Typical parametric models include the perceptron, logistic regression, and the linear SVM.

 

A nonparametric model, by contrast, cannot be described by a fixed set of parameters: as the training data grows, so does the number of parameters. Typical nonparametric models include decision trees, random forests, and kernel (nonlinear) SVMs. KNN also belongs to this group.

 

In addition, KNN is classified as an instance-based model. Instance-based models are characterized by storing the training dataset in memory, and lazy learning is a special case of instance-based learning.

 

 

KNN Algorithm Learning Process
1. Choose the number k and a distance metric.
2. Find the k nearest neighbors of the sample you want to classify.
3. Assign the class label by majority vote.
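
These three steps map almost directly to code. Below is a minimal from-scratch sketch in Python; the function name knn_predict and the toy data are my own illustration, not part of the original post or any particular library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 1: k and the distance metric (here, Euclidean) are chosen up front.
    # Step 2: find the k nearest neighbors of x_new in the training set.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Step 3: assign the class label by majority vote among the neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [8.5, 8.7]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```

Note that "fitting" here is nothing more than keeping X_train and y_train around, which is exactly the lazy-learner behavior described above.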

 

As the steps above show, only k and a distance metric are needed to train the KNN algorithm. The most common choice of metric is the Euclidean distance.

Manhattan distance : \( d(p, q) = \sum_{i=1}^{n}|p_i - q_i| \)

Euclidean distance : \( d(p, q) = \sqrt{ \sum_{i=1}^{n}(p_i - q_i)^2 } \)

Minkowski distance : \( d(p, q) = \left( \sum_{i=1}^{n}|p_i - q_i|^p \right)^{1/p} \)

* The Minkowski distance reduces to the Manhattan distance when p = 1 and to the Euclidean distance when p = 2.
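
As a quick sanity check, all three metrics can be computed in a few lines of NumPy (the vectors here are made up; SciPy also ships ready-made versions as scipy.spatial.distance.cityblock, euclidean, and minkowski):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

manhattan = np.abs(p - q).sum()                    # 3 + 2 + 0 = 5.0
euclidean = np.sqrt(((p - q) ** 2).sum())          # sqrt(9 + 4) ~ 3.606
minkowski = (np.abs(p - q) ** 3).sum() ** (1 / 3)  # the p = 3 case

print(manhattan, euclidean, minkowski)
```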

 

Before applying any of these distance measures, it is critical to standardize the features so that each is treated equally. If the features are on different scales, the feature with the larger scale dominates the distance and is effectively treated as far more important.
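
In scikit-learn this is typically done with StandardScaler before fitting KNeighborsClassifier. Here is a minimal sketch, assuming scikit-learn is available and using the built-in Iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits,
# so information from the test set does not leak into the standardization.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))
```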

 

 

Advantages of KNN

- Because it is a memory-based learning algorithm, it adapts immediately to newly collected data.
- The training phase costs nothing.

 

 

Disadvantages of KNN

- Large datasets can cause storage problems.
- Training on non-standardized data is likely to produce a poor model.
- If there are more features than training samples, overfitting is likely due to the *curse of dimensionality*.

 

* Curse of dimensionality: as the number of dimensions grows while the amount of training data stays fixed, the feature space becomes increasingly sparse.

** To avoid this phenomenon, feature selection and dimensionality reduction techniques are commonly applied before training, as in the sketch below.
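
For example, with scikit-learn one can chain a PCA projection in front of KNN. A sketch follows; PCA is just one possible dimensionality reduction technique, and the choice of 10 components is an arbitrary assumption for illustration, not a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # 64 features per sample

# Standardize, project the 64 features down to 10, then run KNN.
pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                     KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipe, X, y, cv=5).mean())
```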

 
