Focusing on model features
As a simplified representation of reality, a model also includes a set of variables that contain the relevant information describing the different parts of the problem we are representing. These variables can be something as concrete as 1 kg of ice cream, as we saw in our previous example, or as abstract as a numerical value that represents how similar the meanings of two words in a text document are.
In the particular case of a machine learning model, these variables are called features. Choosing significant features that provide relevant information about the phenomenon we are trying to explain or predict is of paramount importance. In unsupervised learning, the relevant features are those that best represent the clustering or association structure of the dataset. In supervised learning, the most important features are those that correlate strongly with the target variable – that is, the value that we want to predict or explain.
The quality of the insights that can be obtained from a machine learning model depends on the features used as its input. Feature selection and feature engineering are regularly used techniques to improve a model's input. Feature selection (also called variable selection or attribute selection) is the process of selecting a subset of relevant features to use when constructing a model; together with data cleaning, it should be one of the first and most important steps when building any machine learning model. Feature engineering is the process of using domain knowledge about the data to create features that make the machine learning algorithms work well. Done correctly, it increases the predictive power of a model by deriving more informative features from the raw data that is fed into it.
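As a quick illustration, the following sketch uses scikit-learn's SelectKBest to keep only the features that score highest against a continuous target; the synthetic data and the choice of k=2 are purely illustrative assumptions, not part of our running example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Illustrative data: three candidate features (columns) and a continuous target.
# Only the first two columns are actually related to y; the third is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Keep the k features with the strongest univariate relationship to the target.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())   # which features were kept, e.g. [ True  True False]
print(X_selected.shape)         # (100, 2)
```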
In our previous example, the model features are the mean temperature and the amount of ice cream sold. Since we have already seen that there are more variables involved, we could add some additional features to better explain the daily ice cream consumption. For example, we could take into account which day of the week we are recording data for and include this information as another feature. In general, any other relevant piece of information can be represented, more or less accurately, as a feature. In supervised learning, it is customary to call the input variables features and the target or predicted variable the label.
Features can be numerical (such as the temperature in our previous example) or categorical (such as the day of the week). Since everything in a computer is ultimately represented as numerical data, categorical data has to be converted into numerical form by assigning numbers to the categories. One-hot encoding, which we will see shortly, is a process by which categorical variables are converted (or encoded) into a numerical form so that they can be used as input to machine learning algorithms.
Following our example, we could translate the day of the week into a day number.
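A minimal sketch of this mapping in Python (the exact numbers are an illustrative assumption, with the week starting on Monday) could be:

```python
# Illustrative ordinal encoding: Monday through Sunday mapped to 1-7,
# so that the weekend receives the highest values.
day_to_number = {
    "Monday": 1,
    "Tuesday": 2,
    "Wednesday": 3,
    "Thursday": 4,
    "Friday": 5,
    "Saturday": 6,
    "Sunday": 7,
}

day_of_week = "Saturday"
print(day_to_number[day_of_week])  # 6
```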
This encoding reflects the order of the days and reserves the highest values for the weekend.
Let's say that you want to be more specific and predict the amount of ice cream sold for each flavor. For simplicity, let's say that you produce four different flavors: chocolate, strawberry, lemon, and vanilla. Could you just assign one number to each flavor, in the same way you did for the day of the week? The answer, as we shall see, is no. Let's try it and see what goes wrong.
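A straightforward attempt is to map each flavor to an integer (the particular numbers below are an illustrative assumption):

```python
# Naive integer encoding of the flavors (illustrative assignment).
flavor_to_number = {
    "chocolate": 1,
    "strawberry": 2,
    "lemon": 3,
    "vanilla": 4,
}
```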
By using this encoding, we are implicitly saying that chocolate is closer to strawberry than to vanilla (1 unit versus 3 units), which is not a real property of the flavors. The right way of translating the flavors to numbers is to create binary variables: one per flavor, set to 1 for the flavor in question and 0 otherwise. This approach is known as one-hot encoding.
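A small sketch of this encoding, using pandas (the data frame here is an illustrative assumption), could look like this:

```python
import pandas as pd

# One-hot encoding of the flavor column: one binary column per flavor value.
df = pd.DataFrame({"flavor": ["chocolate", "strawberry", "lemon", "vanilla"]})
one_hot = pd.get_dummies(df["flavor"], prefix="flavor", dtype=int)
print(one_hot)
# Roughly:
#    flavor_chocolate  flavor_lemon  flavor_strawberry  flavor_vanilla
# 0                 1             0                  0               0
# 1                 0             0                  1               0
# 2                 0             1                  0               0
# 3                 0             0                  0               1
```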
This method creates some overhead, since it increases the number of features by creating one binary variable for each possible value of the original variable. On the positive side, it correctly represents the properties of the feature: no flavor is implied to be closer to any other. We will see some examples of this in the next chapter.
Depending on the type of target variable, we can classify models into regression models (continuous target variables) or classification models (discrete target variables). For example, if we want to predict a real number or an integer quantity, such as the amount of ice cream sold, we use regression, whereas if we are trying to predict a tag with a finite number of options, such as the flavor, we use classification.
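As a rough sketch of the distinction (the estimators and data are illustrative assumptions, not part of our example), scikit-learn offers separate estimators for the two cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

# Regression: the target is a continuous quantity (e.g. kilograms of ice cream sold).
y_continuous = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
reg = LinearRegression().fit(X, y_continuous)

# Classification: the target is one of a finite set of tags (e.g. "high" vs "low" sales).
y_discrete = (y_continuous > 0).astype(int)
clf = LogisticRegression().fit(X, y_discrete)

print(reg.predict(X[:2]))  # continuous predictions
print(clf.predict(X[:2]))  # discrete class labels (0 or 1)
```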