Combinatorial entropy
The information gain criterion is based on the notion of Shannon entropy. Shannon entropy is an important concept in information theory, physics, and other domains. Mathematically, it is expressed as:

H = -\sum_{i=1}^{N} p_i \log_2 p_i

where i is a state of the system, N is the total number of possible states, and p_i is the probability of the system being in state i. Entropy describes the amount of uncertainty in the system: the more order there is in the system, the less entropy it has.
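To make the formula concrete, here is a minimal Python sketch of an entropy function (the function name and the probability-list interface are choices made for illustration, not something from the original text):

import math

def entropy(probabilities):
    """Shannon entropy: H = -sum(p_i * log2(p_i)) over all states i."""
    # States with p == 0 or p == 1 contribute nothing (0*log 0 -> 0, log2(1) = 0).
    return -sum(p * math.log2(p) for p in probabilities if 0 < p < 1)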
If you want to learn more about entropy, check the nice, interactive blog Entropy Explained, With Sheep by Aatish Bhatia at: https://aatishb.com/entropy/.
Let's work through a simple example of how entropy can be useful for decision tree construction. For this, we'll simplify the task of alien creature classification by assuming that we can measure only one feature: body length. We have 10 individuals (6 platyhogs and 4 rabbosauruses) whose body lengths we have measured.
If we take one random individual from the group, it will be a platyhog with a probability of 0.6, or a rabbosaurus with a probability of 0.4, so the system has two possible states. Let's calculate its entropy:

H = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971
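The same value can be reproduced with a couple of lines of Python (a quick check, nothing more):

import math

# Entropy of the full dataset: P(platyhog) = 0.6, P(rabbosaurus) = 0.4.
H_parent = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))
print(round(H_parent, 3))  # prints 0.971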
So, the amount of uncertainty in this dataset is about 0.97. Is it a lot, or a little? We don't yet have anything to compare it with, so let's divide the set at the middle (body length > 5 meters) and calculate the entropy of both subsets. The left subset contains only platyhogs, while the right one contains 1 platyhog and 4 rabbosauruses:

H_1 = 0
H_2 = -0.2 \log_2 0.2 - 0.8 \log_2 0.8 \approx 0.722

Both H_1 and H_2 are now less than the original H. This demonstrates how you can reduce the entropy by splitting the dataset in the right place. This idea lies at the foundation of decision tree learning algorithms.
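The two subset entropies can be checked the same way; this sketch assumes the subset composition described above (5 platyhogs on the left, 1 platyhog and 4 rabbosauruses on the right):

import math

def entropy(probabilities):
    # Skip p == 0 and p == 1: both contribute zero to the sum.
    return -sum(p * math.log2(p) for p in probabilities if 0 < p < 1)

H1 = entropy([5/5])        # left subset: pure, 5 platyhogs
H2 = entropy([1/5, 4/5])   # right subset: 1 platyhog, 4 rabbosauruses
print(H1, round(H2, 3))    # prints 0 0.722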
We can calculate how effectively we reduced the entropy by splitting the set using the Information Gain (IG) criterion:

Information Gain = Entropy(parent) - Weighted Sum of Entropy(children), or:

IG = H_{parent} - \sum_{i=1}^{q} \frac{N_i}{N} H_i

where q is the number of groups after the split, N_i is the number of elements in the i-th group, and N is the total number of elements before the split. In our example, q = 2, N = 10, and N_1 = N_2 = 5:

IG = 0.971 - \frac{5}{10} \cdot 0 - \frac{5}{10} \cdot 0.722 \approx 0.61
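The same arithmetic can be written as a small helper (a sketch; the function name and its (count, entropy) pair interface are assumptions for illustration):

def information_gain(parent_entropy, children):
    """children: list of (element_count, entropy) pairs, one per subset."""
    total = sum(count for count, _ in children)
    weighted = sum(count / total * h for count, h in children)
    return parent_entropy - weighted

ig = information_gain(0.971, [(5, 0.0), (5, 0.722)])
print(round(ig, 2))  # prints 0.61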
This means that asking the question Is the body length greater than 5? gives us an information gain of 0.61. Is it a lot, or a little? Let's compare it to the information gain of a split around a length of 7; it turns out to be smaller. Apparently, the choice of the middle point was a lucky one, because none of the other splits look as promising. But you are free to check them yourself if you want.
There is no point in splitting the left subset further, but we can keep splitting the right subset until the entropy of each of its children equals zero (see Figure 2.9).
So, this is our decision tree, and a recursive algorithm for building it. But now comes an interesting question: how do we know which split yields the maximal information gain? The simplest way is a greedy search: just check all the possible splits.
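A minimal sketch of such a greedy search follows. The body lengths here are made-up illustrative values (the original measurements are not reproduced above), chosen so that 6 platyhogs and 4 rabbosauruses split the way the example describes:

import math

def entropy_of_labels(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = (labels.count(c) / n for c in set(labels))
    return -sum(p * math.log2(p) for p in probs if 0 < p < 1)

def best_split(lengths, labels):
    """Greedy search: try a threshold between every pair of neighboring values."""
    parent = entropy_of_labels(labels)
    n = len(labels)
    best_threshold, best_gain = None, 0.0
    values = sorted(set(lengths))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [c for x, c in zip(lengths, labels) if x <= threshold]
        right = [c for x, c in zip(lengths, labels) if x > threshold]
        gain = (parent
                - len(left) / n * entropy_of_labels(left)
                - len(right) / n * entropy_of_labels(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Hypothetical body lengths in meters: 6 platyhogs followed by 4 rabbosauruses.
lengths = [1.5, 2.0, 3.0, 4.0, 4.5, 6.5, 6.0, 7.5, 8.0, 9.0]
labels = ['platyhog'] * 6 + ['rabbosaurus'] * 4
print(best_split(lengths, labels))  # prints roughly (5.25, 0.61)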
Information gain is only one of several possible heuristics; for instance, the scikit-learn decision tree learner we used earlier relies on Gini impurity by default. See also Michigan State University's lecture notes on decision trees (http://www.cse.msu.edu/~cse802/DecisionTrees.pdf).
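Gini impurity is defined as 1 - \sum_{i} p_i^2: the probability of mislabeling a randomly chosen element if it were labeled at random according to the class distribution. A quick sketch for the class proportions from this example (the function name is an illustrative choice; scikit-learn computes this internally):

def gini_impurity(probabilities):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)

print(round(gini_impurity([0.6, 0.4]), 2))  # 0.48 for the whole dataset
print(round(gini_impurity([0.2, 0.8]), 2))  # 0.32 for the mixed right subset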
Check the documentation on the criterion property of DecisionTreeClassifier for more information about the different heuristics available for tree learning in scikit-learn. In practice, Gini impurity works very similarly to information gain. A historical fact to dilute the theoretical exposition: Corrado Gini was an Italian statistician and the author of The Scientific Basis of Fascism (1927).