Machine Learning with Swift

Tree visualization

Let's take a look at the code for visualizing the tree. First, extract the class labels as a list:

In []: 
labels = df.label.astype('category').cat.categories 
labels = list(labels) 
labels 
Out[]: 
[u'platyhog', u'rabbosaurus']  

Define a variable to store the names of all the features:

In []: 
feature_names = map(lambda x: x.encode('utf-8'), features.columns.get_values()) 
feature_names 
Out[]: 
['length', 
 'fluffy', 
 'color_light black', 
 'color_pink gold', 
 'color_purple polka-dot', 
 'color_space gray'] 

Then, generate a DOT-format description of the tree using the export_graphviz function, and turn it into a graph object with pydotplus:

In []: 
import pydotplus  
dot_data = tree.export_graphviz(tree_model, out_file=None, 
                                feature_names=feature_names, 
                                class_names=labels, 
                                filled=True, rounded=True, 
                                special_characters=True) 
dot_data 
Out[]: 
u'digraph Tree {\nnode [shape=box, style="filled, rounded", color="black", fontname=helvetica] ;\nedge [fontname=helvetica] ;\n0 [label=<length &le; 26.6917<br/>entropy = 0.9971<br/>samples = 700<br/>value = [372, ... 
In []: 
graph = pydotplus.graph_from_dot_data(dot_data.encode('utf-8')) 
graph.write_png('tree1.png') 
Out[]: 
True 

Add a Markdown image link in the next cell to display the newly created file:

![](tree1.png) 
Figure 2.5: Decision tree structure and a close-up of one of its fragments
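
If you prefer to stay in code cells, the same image can also be displayed inline with IPython's display utilities. This is just an optional sketch, assuming the notebook runs under Jupyter/IPython:

In []: 
from IPython.display import Image 
# Render the PNG produced by pydotplus directly in the output area 
Image(filename='tree1.png') 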

The preceding diagram shows what our decision tree looks like. It is drawn upside down: it grows from the root (at the top) down to the leaves (at the bottom), and data (features) travels through it in the same direction. To predict the label of a sample from our dataset with this classifier, we start at the root and move down until we reach a leaf. At each node, one feature is compared to a threshold; for example, the root node checks whether the length is ≤ 26.6917. If the condition is met, we follow the left branch; if not, the right one.
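
To make this traversal concrete, the following sketch walks a single sample down the fitted tree by hand, using the tree_ arrays that scikit-learn exposes (children_left, children_right, feature, and threshold). The tree_model, feature_names, and labels variables are the ones defined above; the sample values themselves are purely hypothetical:

In []: 
import numpy as np 

t = tree_model.tree_ 
# A hypothetical sample: [length, fluffy, color_* one-hot columns] 
sample = np.array([20.0, 1.0, 0.0, 1.0, 0.0, 0.0]) 

node = 0                                # start at the root 
while t.children_left[node] != -1:      # -1 marks a leaf node 
    f, thr = t.feature[node], t.threshold[node] 
    go_left = sample[f] <= thr 
    print('%s <= %.4f ? %s' % (feature_names[f], thr, go_left)) 
    # condition met -> left branch, otherwise -> right branch 
    node = t.children_left[node] if go_left else t.children_right[node] 

labels[t.value[node][0].argmax()]       # the most likely class at the leaf 

Calling tree_model.predict() on the same sample would, of course, give the same answer without the manual walk.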

Let's take a closer look at part of the tree. In addition to the split condition, each node carries some useful information (the sketch after this list shows how to read these values programmatically):

  • The entropy value
  • The number of training samples that reach this node
  • How many of those samples belong to each class (the value field)
  • The most likely outcome (class) at this node
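
These per-node statistics can also be read programmatically, which is handy for double-checking the picture. A minimal sketch, assuming tree_model is the fitted scikit-learn classifier from the previous section (impurity, n_node_samples, and value are attributes of its tree_ object):

In []: 
t = tree_model.tree_ 
root = 0  # index of the root node 
print('entropy = %.4f' % t.impurity[root])        # node impurity (entropy, as in the dot output above) 
print('samples = %d' % t.n_node_samples[root])    # training samples reaching this node 
print('value   = %s' % t.value[root][0])          # per-class sample counts 
print('class   = %s' % labels[t.value[root][0].argmax()])  # most likely outcome 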