Mastering Machine Learning Algorithms
上QQ阅读APP看书,第一时间看更新

Cluster assumption

This assumption is strictly linked to the previous one and it's probably easier to accept. It can be expressed with a chain of interdependent conditions. Clusters are high density regions; therefore, if two points are close, they are likely to belong to the same cluster and their labels must be the same. Low density regions are separation spaces; therefore, samples belonging to a low density region are likely to be boundary points and their classes can be different. To better understand this concept, it's useful to think about supervised SVM: only the support vectors should be in low density regions. Let's consider the following bidimensional example:

In a semi-supervised scenario, we couldn't know the label of a point belonging to a high density region; however, if it is close enough to a labeled point that it's possible to build a ball where all the points have the same average density, we are allowed to predict the label of our test sample. Instead, if we move to a low-density region, the process becomes harder, because two points can be very close but with different labels. We are going to discuss the semi-supervised, low-density separation problem at the end of this chapter.