What is a k-mean algorithm

A detailed explanation of the K-mean algorithm

K-means is our most widely used distance-based clustering algorithm. The closer the distance between two targets, the greater the similarity.

K-means has a famous explanation: the Pastor Village model:
Four priests went to the suburbs to preach. First, the pastors randomly selected several evangelistic items and announced the situation of these evangelistic items to all the villagers in the suburbs, so that each villager would go to the evangelistic item closest to their home. Listen to the lesson.
After the lecture everyone felt that it was too far away, so each pastor counted the addresses of all the villagers in his class, walked in the middle of each address, and updated his evangelistic item's position on the poster.
Every time the pastor moves, it is impossible to be closer to everyone. Some people find that after Pastor A's move it is better to go to Pastor B to take the lesson, so each villager went to the nearest evangelism point ...
In this way, the pastor updated his position every week, and the villagers chose the evangelism point according to their own situation and eventually stabilized.
We can see that the pastor's purpose is to determine the distance and minimum of each villager to the nearest center point.
So the algorithm steps of K-means are:
  1. Select the initial k samples as the initial clustering center.
  2. For each sample in the data set, calculate the distance to k cluster centers and divide them into the class that corresponds to the cluster center with the smallest distance.
  3. Recalculate the clustering center for each category (that is, the center of gravity of all samples belonging to that category).
  4. Repeat the two above steps 2 to 3 until a certain stop condition (number of iterations, minimum error change, etc.) is reached.
Let's look at the pseudocode first:
Time complexity: where t is the number of iterations, k is the number of clusters, n is the number of sample points, and m is the dimension of the sample points.
Spatial complexity: where k is the number of clusters, m is the dimension of the sample points and n is the number of sample points.

  • It's easy to see that the clustering effect is good. Although it is locally optimal, it is often sufficient to be locally optimal.
  • When processing large amounts of data, the algorithm can ensure good scalability.
  • If the clusters are roughly Gaussian, the effect is very good;
  • The complexity of the algorithm is low.
  • The K-value must be set manually and the results obtained with different K-values ​​will be different.
  • Different selection methods that are sensitive to the original cluster center produce different results.
  • Sensitive to outliers;
  • The sample can only be divided into one category that is not suitable for multiple classification tasks.
  • It is not suitable for too discrete classification, unbalanced sample classification, and non-convex shape classification.

Optimization and improvement of the algorithm

Given the shortcomings of the K-mean algorithm, we can apply many kinds of optimization methods: such as data preprocessing (removing abnormal points), judicious choice of K-value, high dimensional mapping, etc. The following is briefly introduced:
The essence of K-means is a data division algorithm based on the Euclidean distance. The large mean, large variance dimension will have a critical impact on the clustering of data. Therefore, it is impossible to directly participate in the calculation and comparison without normalizing and unifying the unit data. Common data preprocessing methods include: data normalization and data standardization.
In addition, outliers or noisy data have a greater influence on the mean value, which leads to a center shift. Therefore, we also need to perform anomaly detection on the data.
3.2 Appropriate choice of the K value
The selection of the K value has a great influence on the K mean values, which is also the biggest shortcoming of the K mean values. Common methods for selecting the K value are: elbow method and gap statistics method.
When K <3, the curve drops sharply, and when K> 3, the curve tends to be stable. By the elbow method, we think that inflection point 3 is the best value of K.
The disadvantage of the elbow method is that it is not automatic enough to be displayed manually. Therefore we have the gap statistics method, which was derived from the work of several scientists at Stanford University: Estimating the number of clusters in a data set via the gap statistic
Below this is the loss function, here it relates to the expectations. This value is usually generated by Monte Carlo simulation. We randomly generate as many random samples as the original number of samples according to the even distribution in the area where the sample is located, and we run K-means on that random sample to get a. So many times, usually 20 times, we can get 20. Averaging these 20 values ​​gives the approximate value. Finally, gap statistics can be calculated. The K corresponding to the maximum value obtained from the gap statistic is the best K.
The figure shows that at K = 3 Gap (K) assumes the largest value, so that the optimal number of clusters is K = 3.
The last project on Github was called gap_statistic, the number of proposed clusters can be determined more conveniently.
3.3 Using the kernel function
K-means based on Euclidean distance assume that the data of each data cluster has the same prior probability and has a spherical distribution, but this distribution is not common in real life. Given the non-convex data distribution shape, we can introduce the kernel function for optimization. Currently, the algorithm is also known as the Kernel K-Means Algorithm, which is a kind of kernel clustering method. The main idea of ​​the kernel clustering method is to map the data points in the input space to the superordinate feature space by means of a non-linear assignment and to carry out the clustering in the new feature space. Non-linear assignment increases the probability that the data points can be linearly separated, so that if the classic clustering algorithm fails, a more precise clustering result can be achieved by introducing a kernel function.
We know that choosing the starting value has a huge impact on the outcome and improving the selection of the starting value is an important part. Of all the improved algorithms, K-means ++ is the best known.
The steps of the K-means ++ algorithm are as follows:
  1. Randomly pick a central point.
  2. Calculate the furthest distance from the previous n cluster centers to the data and select a new center with a certain probability.
  3. Repeat the second step.
Simply put, K-means ++ means picking the point farthest from the selected center point. This is more in line with common sense. Of course, the cluster centers are as far apart as possible.
However, the disadvantage of this algorithm is that it is difficult to parallelize. Therefore, instead of sampling just one sample per traverse as in k-means ++, k-means II changed the sampling strategy, but sampling k per traverse, repeating the sampling, and then a set of sample points was obtained Select k points. Of course, no sampling is generally required, just five times.
The full name of ISODATA is an iterative self-organizing data analysis method. It overcomes the shortcoming that the value of K has to be manually determined in advance. In the case of high-dimensional, massive data sets, it is often difficult for humans to accurately estimate the size of K. ISODATA aims to improve this problem, and its idea is also very intuitive: if the number of samples belonging to a category is too small, remove that category, if the number of samples belonging to a category is too large and the degree of dispersion is large in two Divided into subcategories.

First, let's look at the steps of the K-Means algorithm: first randomly pick the starting node, then calculate the category to which each sample belongs, and then follow the new initialization node through the category. Have you ever thought about this process? EM algorithm。
What we need to know is that the iterative K-means clustering algorithm is actually the EM algorithm. The EM algorithm solves the problem of parameter estimation when there are hidden variables that cannot be observed in the probabilistic model. The hidden variables in K-means are the categories to which each category belongs. In the iteration step of the K mean value algorithm corresponds to step E in the EM algorithm. While Corresponds to the M step in the EM algorithm。
First we consider the shape of the loss function:
To find the extreme value, we make the partial derivative of the loss function equal to 0:
k refers to the k th center point, so we have:
It can be seen that the new focal point are all focal points of this type.
The disadvantage of the EM algorithm is that it is easy to fall into the local minimum value, which is why the K-means sometimes get the local optimal solution.

[1] "Machine Learning" Zhou Zhihua
[2] https://zhuanlan.zhihu.com/p/20463356
[3] http://sofasofa.io/forum_main_post.php?postid=1000282


Programmer encyclopedia

Copyright © 2020-2021 - All Rights Reserved - programmerwiki.com