Hands-On Unsupervised Learning with Python

Diagnostic analysis

Until now, we have worked with output data, observed after a specific underlying process generated it. Once the system has been described, the natural next question concerns its causes. Temperature depends on many meteorological and geographical factors, some of which are easily observable while others are completely hidden. Seasonality in the time series is clearly driven by the time of year, but what about the outliers?

For example, we have discovered a peak in a region identified as winter. How can we justify it? In a simplistic approach, it can be treated as a noisy outlier and filtered out. However, if it has actually been observed and there's a ground truth behind the measurement (for example, all the parties agree that it's not an error), we should assume the presence of a hidden (or latent) cause.

It may be surprising, but most complex scenarios are characterized by a huge number of latent causes (sometimes called factors) that are too difficult to analyze individually. In general, this is not a bad condition but, as we're going to discuss, it's important to include them in the model so that their influence can be learned from the dataset.

On the other hand, deciding to drop all unknown elements reduces the predictive ability of the model, with a corresponding loss of accuracy. Therefore, the primary goal of diagnostic analysis is not necessarily to identify every cause but to list the observable and measurable elements (known as factors), together with all the potential latent ones (which are generally summarized into a single global element).
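As a minimal sketch of this idea, the following snippet splits an observed effect into the part explained by one observable factor and a single leftover term that stands in for all the latent causes. The data, the choice of daylight hours as the observable factor, and the helper `fit_line` are assumptions made purely for illustration; a straight-line fit is only one of many possible ways to model the observable part.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical measurements (not taken from the book's dataset):
# hours of daylight as the observable factor, temperature as the effect.
daylight = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
temperature = [2.0, 4.5, 6.0, 8.5, 10.0, 12.5]

intercept, slope = fit_line(daylight, temperature)

# Part of the effect explained by the observable factor.
explained = [intercept + slope * d for d in daylight]

# Everything else is lumped into a single global "latent" element.
latent = [t - e for t, e in zip(temperature, explained)]

for d, t, l in zip(daylight, temperature, latent):
    print(f"daylight={d:4.1f}  temperature={t:5.1f}  latent={l:+.3f}")
```

An unusually large value in `latent` marks an observation that the observable factor alone cannot account for, which is exactly where a hidden cause may be at work.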

To a certain extent, a diagnostic analysis is similar to a reverse-engineering process: we can easily monitor the effects, but it's much harder to detect the relationships between potential causes and observable effects. For this reason, such an analysis is often probabilistic and aims to estimate the probability that a certain identified cause brings about a specific effect. In this way, it's also easier to exclude non-influencing elements and to determine relationships that were initially excluded. However, this process requires a deeper knowledge of statistical learning methods and it won't be discussed in this book, apart from a few examples, such as Gaussian mixtures.
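To make the probabilistic view concrete, here is a small sketch that uses Bayes' rule to rank candidate causes of an observed effect (for instance, the winter peak). The candidate causes, the prior probabilities, and the likelihoods are all hypothetical values chosen for illustration, and `posterior` is a helper defined here, not a library function.

```python
def posterior(priors, likelihoods):
    """Return P(cause | effect) for each candidate cause via Bayes' rule.

    priors[c]      = P(cause c)
    likelihoods[c] = P(effect | cause c)
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(effect), summed over all causes
    return {c: joint[c] / evidence for c in joint}

# Hypothetical numbers: how plausible each cause is a priori, and how
# likely each one is to produce an anomalous winter temperature peak.
priors = {
    "warm_air_mass": 0.10,
    "sensor_fault": 0.05,
    "ordinary_variability": 0.85,
}
likelihoods = {
    "warm_air_mass": 0.60,
    "sensor_fault": 0.30,
    "ordinary_variability": 0.01,
}

post = posterior(priors, likelihoods)
for cause, p in sorted(post.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause:22s} {p:.3f}")
```

Causes whose posterior probability remains small are natural candidates for exclusion, while a cause with a low prior can still dominate if it explains the effect much better than the alternatives.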