Example of label spreading
We can test this algorithm using the Scikit-Learn implementation. Let's start by creating a dense, bidimensional dataset with 5,000 samples, 1,000 of which will be left unlabeled:
from sklearn.datasets import make_classification

nb_samples = 5000
nb_unlabeled = 1000

# Create a bidimensional dataset with two informative features
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)

# Mark the last nb_unlabeled samples as unlabeled (-1)
Y[nb_samples - nb_unlabeled:nb_samples] = -1
We can now train a LabelSpreading instance with the clamping factor alpha=0.2: we want to preserve 80% of the original label information while, at the same time, obtaining a smooth solution:
from sklearn.semi_supervised import LabelSpreading

# Soft clamping: at each iteration, every sample keeps 80% of its current
# label distribution and adopts 20% from its neighbors
ls = LabelSpreading(kernel='rbf', gamma=10.0, alpha=0.2)
ls.fit(X, Y)

Y_final = ls.predict(X)
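Since we masked the labels ourselves, we can also quantify how well the transductive assignments recover them. The following is a minimal sketch, assuming we keep a copy of the ground-truth labels before masking (the variable Y_true is introduced here only for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

nb_samples = 5000
nb_unlabeled = 1000

X, Y_true = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=100)

# Keep a copy of the ground truth, then mask the last nb_unlabeled samples
Y = Y_true.copy()
Y[nb_samples - nb_unlabeled:] = -1

ls = LabelSpreading(kernel='rbf', gamma=10.0, alpha=0.2)
ls.fit(X, Y)

# Accuracy measured on the originally unlabeled samples only
mask = (Y == -1)
accuracy = np.mean(ls.transduction_[mask] == Y_true[mask])
print('Accuracy on unlabeled samples: {:.3f}'.format(accuracy))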
The result is shown, as usual, together with the original dataset:
Original dataset (left). Dataset after a complete label spreading (right)
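A comparable pair of plots can be produced with a minimal matplotlib sketch, continuing with X, Y, and Y_final from the snippets above (colormap and marker size are illustrative choices):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: original dataset (unlabeled samples have Y = -1)
ax1.scatter(X[:, 0], X[:, 1], c=Y, cmap='coolwarm', s=10)
ax1.set_title('Original dataset')

# Right: dataset after a complete label spreading
ax2.scatter(X[:, 0], X[:, 1], c=Y_final, cmap='coolwarm', s=10)
ax2.set_title('Dataset after label spreading')

plt.show()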
As it's possible to see in the left plot, in the central part of the cluster (x ∈ [-1, 0]) there's an area of circle dots. Using hard clamping, this region would remain unchanged, violating both the smoothness and clustering assumptions. Setting α > 0 makes it possible to avoid this problem. Of course, the choice of α is strictly problem-dependent. If we know that the original labels are absolutely correct, allowing the algorithm to change them can be counterproductive; in this case, for example, it would be better to preprocess the dataset, filtering out all the samples that violate the semi-supervised assumptions. If, instead, we are not sure that all samples are drawn from the same data-generating distribution p_data, and spurious elements may be present, using a higher α value can smooth the dataset without any further operation.
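As a sketch of this trade-off, we can compare how many of the originally labeled samples get relabeled for different clamping factors, reusing X and Y from the snippets above (the α values below are arbitrary illustrative choices):

import numpy as np
from sklearn.semi_supervised import LabelSpreading

labeled = (Y != -1)

for alpha in (0.1, 0.2, 0.5, 0.9):
    ls_alpha = LabelSpreading(kernel='rbf', gamma=10.0, alpha=alpha)
    ls_alpha.fit(X, Y)
    # Count the originally labeled samples whose label was changed
    changed = np.sum(ls_alpha.transduction_[labeled] != Y[labeled])
    print('alpha = {:.1f} -> {} labeled samples relabeled'.format(alpha, changed))

Higher α values should relabel more of the originally labeled samples, which is desirable only when we suspect that some of those labels are noisy.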