上QQ阅读APP看书,第一时间看更新
How it works...
K-means was developed by James MacQueen, who, in 1967, designed it for the purpose of dividing groups of objects into k partitions based on their attributes. It is a variation of the expectation-maximization (EM) algorithm, whose objective is to determine the groups of k data generated by Gaussian distributions. The difference between the two algorithms lies in the Euclidean distance calculation method. In k-means, it is assumed that the attributes of the object can be represented as vectors, and thus form a vector space. The goal is to minimize the total intra-cluster variance (or standard deviation). Each cluster is identified by a centroid.
The algorithm follows an iterative procedure, as follows:
- Choose the number of k clusters
- Initially, create k partitions and assign each entry partition either randomly, or by using some heuristic information
- Calculate the centroid of each group
- Calculate the distance between each observation and each cluster centroid
- Then, construct a new partition by associating each entry point with the cluster whose centroid is closer to it
- The centroid for new clusters is recalculated
- Repeat steps 4 to 6 until the algorithm converges