Hands-On Unsupervised Learning with Python

Completeness score

This measure (together with all the other ones discussed from now on) is based on knowledge of the ground truth. Before introducing the index, it's helpful to define some common values. If we denote with Ytrue the set containing the true assignments and with Ypred the set of predictions (both containing M values and K clusters), we can estimate the following probabilities:

\[ P(y_{true}=k) = \frac{n_{true}(k)}{M} \quad \text{and} \quad P(y_{pred}=k) = \frac{n_{pred}(k)}{M} \]
In the previous formulas, ntrue/pred(k) represents the number of true/predicted samples belonging to cluster k ∈ K. At this point, we can compute the entropies of Ytrue and Ypred:

\[ H(Y_{true}) = -\sum_{k} \frac{n_{true}(k)}{M} \log \frac{n_{true}(k)}{M} \quad \text{and} \quad H(Y_{pred}) = -\sum_{k} \frac{n_{pred}(k)}{M} \log \frac{n_{pred}(k)}{M} \]
Considering the definition of entropy, H(•) is maximized by a uniform distribution, which, in turn, corresponds to the maximum uncertainty of every assignment. For our purposes, it's also necessary to introduce the conditional entropies (representing the uncertainty of a distribution given the knowledge of another one) of Ytrue given Ypred and the other way around:

\[ H(Y_{true}|Y_{pred}) = -\sum_{i} \sum_{j} \frac{n(i,j)}{M} \log \frac{n(i,j)}{n_{pred}(j)} \quad \text{and} \quad H(Y_{pred}|Y_{true}) = -\sum_{i} \sum_{j} \frac{n(i,j)}{M} \log \frac{n(i,j)}{n_{true}(j)} \]
The function n(i, j) represents, in the first case, the number of samples with true label i assigned to Kj and, in the second case, the number of samples with true label j assigned to Ki.
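The entropies and conditional entropies described above can be computed directly from two label arrays. The following is a minimal sketch using NumPy (the function names `entropy` and `conditional_entropy` are illustrative, not part of any library):

```python
import numpy as np

def entropy(labels):
    # H(Y) = -sum_k p(k) * log(p(k)), with p(k) = n(k) / M
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.shape[0]
    return -np.sum(p * np.log(p))

def conditional_entropy(y_a, y_b):
    # H(Y_a | Y_b) = -sum_{i,j} (n(i,j) / M) * log(n(i,j) / n_b(j))
    M = y_a.shape[0]
    h = 0.0
    for j in np.unique(y_b):
        mask = y_b == j
        n_j = np.sum(mask)
        for i in np.unique(y_a[mask]):
            n_ij = np.sum(y_a[mask] == i)
            h -= (n_ij / M) * np.log(n_ij / n_j)
    return h

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

print(entropy(y_true))   # log(2): the maximum for two balanced classes
print(conditional_entropy(y_pred, y_true))
```

Note that `conditional_entropy(y_pred, y_true)` is exactly the H(Ypred|Ytrue) term used below, and it vanishes when each true class is mapped into a single cluster.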

The completeness score is defined as:

\[ c = 1 - \frac{H(Y_{pred}|Y_{true})}{H(Y_{pred})} \]
It's straightforward to understand that when H(Ypred|Ytrue) → 0, the knowledge of Ytrue reduces the uncertainty of the predictions and, therefore, c → 1. This is equivalent to saying that all samples with the same true label are assigned to the same cluster. Conversely, when H(Ypred|Ytrue) → H(Ypred), it means the ground truth doesn't provide any information that reduces the uncertainty of the predictions and c → 0.
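The score can also be reproduced from its definition, c = 1 - H(Ypred|Ytrue)/H(Ypred), and checked against scikit-learn's implementation. A minimal sketch (the `completeness` function name is an assumption; `contingency_matrix` is scikit-learn's helper for the n(i, j) counts):

```python
import numpy as np
from sklearn.metrics import completeness_score
from sklearn.metrics.cluster import contingency_matrix

def completeness(y_true, y_pred):
    n = contingency_matrix(y_true, y_pred)   # n[i, j] = |true class i ∩ cluster j|
    M = n.sum()
    # H(Y_pred): entropy of the cluster sizes (column sums)
    p_pred = n.sum(axis=0) / M
    h_pred = -np.sum(p_pred * np.log(p_pred))
    # H(Y_pred | Y_true): condition on the rows (true classes)
    h_cond = 0.0
    for row in n:
        n_i = row.sum()
        for n_ij in row[row > 0]:
            h_cond -= (n_ij / M) * np.log(n_ij / n_i)
    return 1.0 - h_cond / h_pred if h_pred > 0 else 1.0

# Each true class ends up entirely inside one cluster, so c = 1
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1]
print(completeness(y_true, y_pred), completeness_score(y_true, y_pred))
```

The example also shows that completeness alone doesn't penalize merging several true classes into the same cluster; that is measured by the complementary indices.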

Of course, a good clustering is characterized by c → 1. In the case of the Breast Cancer Wisconsin dataset, the completeness score, computed using the scikit-learn function completeness_score() (which also works with textual labels) and K=2 (the only configuration associated with the ground truth), is as follows:

import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score

# cdf (the feature DataFrame) and dff (the DataFrame containing the
# diagnosis column) have been defined in the previous sections
km = KMeans(n_clusters=2, max_iter=1000, random_state=1000)
Y_pred = km.fit_predict(cdf)

# Attach the cluster assignments to the diagnosis labels
df_km = pd.DataFrame(Y_pred, columns=['prediction'], index=cdf.index)
kmdff = pd.concat([dff, df_km], axis=1)

print('Completeness: {}'.format(completeness_score(kmdff['diagnosis'], kmdff['prediction'])))

The output of the previous snippet is as follows:

Completeness: 0.5168089972809706

This result confirms that, for K=2, K-means is not perfectly able to separate the two classes because, as we have seen, some malignant samples are wrongly assigned to the cluster containing the vast majority of benign samples. However, as c is not extremely small, we can be sure that most of the samples belonging to the two classes have been assigned to different clusters. The reader is invited to check this value using other methods (discussed in Chapter 3, Advanced Clustering) and to provide a brief explanation of the different results.
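To see where the score is lost, it's useful to cross-tabulate the diagnosis against the cluster assignments, for example with pd.crosstab(kmdff['diagnosis'], kmdff['prediction']). A minimal sketch with synthetic labels standing in for the real columns (the counts are illustrative only, not those of the actual dataset):

```python
import pandas as pd

# Synthetic stand-ins for kmdff['diagnosis'] and kmdff['prediction']
diagnosis = pd.Series(['B'] * 8 + ['M'] * 3 + ['M'] * 4, name='diagnosis')
prediction = pd.Series([0] * 8 + [0] * 3 + [1] * 4, name='prediction')

# Rows: true labels, columns: clusters. The malignant samples that fall
# into the mostly-benign cluster 0 are what lowers the completeness score
print(pd.crosstab(diagnosis, prediction))
```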