Feature Engineering Made Easy
上QQ阅读APP看书,第一时间看更新

Supervised learning

Oftentimes, we hear about feature engineering in the specific context of supervised learning, otherwise known as predictive analytics. Supervised learning algorithms specifically deal with the task of predicting a value, usually one of the attributes of the data, using the other attributes of the data. Take, for example, the dataset representing the network intrusion:

This is the same dataset as before, but let's dissect it further in the context of predictive analytics.

Notice that we have four attributes of this dataset: DateTime, Protocol, Urgent, and Malicious. Suppose now that the malicious attribute contains values that represent whether or not the observation was a malicious intrusion attempt. So in our very small dataset of four network connections, the first, second, and fourth connection were malicious attempts to intrude a network.

Suppose further that given this dataset, our task is to be able to take in three of the attributes (datetime, protocol, and urgent) and be able to accurately predict the value of malicious. In laymen’s terms, we want a system that can map the values of datetime, protocol, and urgent to the values in malicious. This is exactly how a supervised learning problem is set up:

Network_features = pd.DataFrame({'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'], 'protocol': ['tcp', 'http', 'http', 'http'], 'urgent': [False, True, True, False]})
Network_response = pd.Series([True, True, False, True])
Network_features
>>
datetime protocol urgent 0 6/2/2018 tcp False 1 6/2/2018 http True 2 6/2/2018 http True 3 6/3/2018 http False
Network_response
>>
0 True 1 True 2 False 3 True dtype: bool

When we are working with supervised learning, we generally call the attribute (usually only one of them, but that is not necessary) of the dataset that we are attempting to predict the response of. The remaining attributes of the dataset are then called the features.

Supervised learning can also be considered the class of algorithms attempting to exploit the structure in data. By this, we mean that the machine learning algorithms try to extract patterns in usually very nice and neat data. As discussed earlier, we should not always expect data to come in tidy; this is where feature engineering comes in.

But if we are not predicting something, what good is machine learning you may ask? I’m glad you did. Before machine learning can exploit the structure of data, sometimes we have to alter or even create structure. That’s where unsupervised learning becomes a valuable tool.