Combinatorial entropy
The information gain criterion is based on the notion of Shannon entropy. Shannon entropy is an important concept in information theory, physics, and other domains. Mathematically, it is expressed as:

H = -\sum_{i=1}^{N} p_i \log_2 p_i

where i is a state of the system, N is the total number of possible states, and p_i is the probability of the system being in state i. Entropy describes the amount of uncertainty in the system: the more order there is in the system, the less entropy it has.
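To make the formula concrete, here is a minimal Python sketch of an entropy function (the function name and the probability-list interface are choices made for illustration, not something from the original text):

import math

def entropy(probabilities):
    """Shannon entropy: H = -sum(p_i * log2(p_i)) over all states i."""
    # States with p == 0 or p == 1 contribute nothing (0*log 0 -> 0, log2(1) = 0).
    return -sum(p * math.log2(p) for p in probabilities if 0 < p < 1)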
If you want to learn more about entropy, check the nice, interactive blog Entropy Explained, With Sheep by Aatish Bhatia at: https://aatishb.com/entropy/.
Let's work through a simple example of how entropy can be useful for decision tree construction. For this, we'll simplify the task of alien creature classification by assuming that we can measure only one feature: body length. We have 10 individuals (6 platyhogs and 4 rabbosauruses) whose body lengths we have measured.
If we take one random individual from the group, it will be a platyhog with a probability of 0.6, or a rabbosaurus with a probability of 0.4, so the system has two possible states. Let's calculate its entropy:

H = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971
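The same value can be reproduced with a couple of lines of Python (a quick check, nothing more):

import math

# Entropy of the full dataset: P(platyhog) = 0.6, P(rabbosaurus) = 0.4.
H_parent = -(0.6 * math.log2(0.6) + 0.4 * math.log2(0.4))
print(round(H_parent, 3))  # prints 0.971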
So, the amount of uncertainty in this dataset is about 0.97. Is it a lot, or a little? We don't yet have anything to compare it with, so let's divide the set at the middle (body length > 5 meters) and calculate the entropy of both subsets. The left subset contains only platyhogs, while the right one contains 1 platyhog and 4 rabbosauruses:

H_1 = 0
H_2 = -0.2 \log_2 0.2 - 0.8 \log_2 0.8 \approx 0.722

Both H_1 and H_2 are now less than the original H. This demonstrates how you can reduce the entropy by splitting the dataset in the right place. This idea lies at the foundation of decision tree learning algorithms.
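The two subset entropies can be checked the same way; this sketch assumes the subset composition described above (5 platyhogs on the left, 1 platyhog and 4 rabbosauruses on the right):

import math

def entropy(probabilities):
    # Skip p == 0 and p == 1: both contribute zero to the sum.
    return -sum(p * math.log2(p) for p in probabilities if 0 < p < 1)

H1 = entropy([5/5])        # left subset: pure, 5 platyhogs
H2 = entropy([1/5, 4/5])   # right subset: 1 platyhog, 4 rabbosauruses
print(H1, round(H2, 3))    # prints 0 0.722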
We can calculate how effectively we reduced the entropy by splitting the set using the Information Gain (IG) criterion:

Information Gain = Entropy(parent) - Weighted Sum of Entropy(children), or:

IG = H_{parent} - \sum_{i=1}^{q} \frac{N_i}{N} H_i

where q is the number of groups after the split, N_i is the number of elements in the i-th group, and N is the total number of elements before the split. In our example, q = 2, N = 10, and N_1 = N_2 = 5:

IG = 0.971 - \frac{5}{10} \cdot 0 - \frac{5}{10} \cdot 0.722 \approx 0.61
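The same arithmetic can be written as a small helper (a sketch; the function name and its (count, entropy) pair interface are assumptions for illustration):

def information_gain(parent_entropy, children):
    """children: list of (element_count, entropy) pairs, one per subset."""
    total = sum(count for count, _ in children)
    weighted = sum(count / total * h for count, h in children)
    return parent_entropy - weighted

ig = information_gain(0.971, [(5, 0.0), (5, 0.722)])
print(round(ig, 2))  # prints 0.61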
This means that asking the question Is the body length greater than 5? gives us an information gain of 0.61. Is it a lot, or a little? Let's compare it to the information gain of a split around a length of 7; it turns out to be smaller. Apparently, the choice of the middle point was a lucky one, because none of the other splits look as promising. But you are free to check them yourself if you want.
There is no point in splitting the left subset further, but we can keep splitting the right subset until the entropy of each of its children equals zero (see Figure 2.9).
So, this is our decision tree, and a recursive algorithm for building it. But now comes an interesting question: how do we know which split yields the maximal information gain? The simplest way is a greedy search: just check all the possible splits.
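A minimal sketch of such a greedy search follows. The body lengths here are made-up illustrative values (the original measurements are not reproduced above), chosen so that 6 platyhogs and 4 rabbosauruses split the way the example describes:

import math

def entropy_of_labels(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = (labels.count(c) / n for c in set(labels))
    return -sum(p * math.log2(p) for p in probs if 0 < p < 1)

def best_split(lengths, labels):
    """Greedy search: try a threshold between every pair of neighboring values."""
    parent = entropy_of_labels(labels)
    n = len(labels)
    best_threshold, best_gain = None, 0.0
    values = sorted(set(lengths))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [c for x, c in zip(lengths, labels) if x <= threshold]
        right = [c for x, c in zip(lengths, labels) if x > threshold]
        gain = (parent
                - len(left) / n * entropy_of_labels(left)
                - len(right) / n * entropy_of_labels(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Hypothetical body lengths in meters: 6 platyhogs followed by 4 rabbosauruses.
lengths = [1.5, 2.0, 3.0, 4.0, 4.5, 6.5, 6.0, 7.5, 8.0, 9.0]
labels = ['platyhog'] * 6 + ['rabbosaurus'] * 4
print(best_split(lengths, labels))  # prints roughly (5.25, 0.61)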
Information gain is only one of several possible heuristics; for instance, the scikit-learn decision tree learner we used earlier relies on Gini impurity by default. See also Michigan State University's lecture notes on decision trees (http://www.cse.msu.edu/~cse802/DecisionTrees.pdf).
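Gini impurity is defined as 1 - \sum_{i} p_i^2: the probability of mislabeling a randomly chosen element if it were labeled at random according to the class distribution. A quick sketch for the class proportions from this example (the function name is an illustrative choice; scikit-learn computes this internally):

def gini_impurity(probabilities):
    """Gini impurity: 1 - sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)

print(round(gini_impurity([0.6, 0.4]), 2))  # 0.48 for the whole dataset
print(round(gini_impurity([0.2, 0.8]), 2))  # 0.32 for the mixed right subset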
Check the documentation on the criterion property of DecisionTreeClassifier for more information about the different heuristics available for tree learning in scikit-learn. In practice, Gini impurity works very similarly to information gain. A historical fact to dilute the theoretical exposition: Corrado Gini was an Italian statistician and the author of The Scientific Basis of Fascism (1927).