Studying machine learning models in practice
We have already seen a very simple example and used it to explain some basic concepts, restricting ourselves to a very small dataset for clarity and to begin our journey towards mastering machine learning with an easy task. In the next chapter, we are going to explore more complex models. There are some general considerations that we need to be aware of when working with machine learning models to solve real problems:
- The amount of data is usually very large. A larger dataset generally yields a more accurate model and more reliable predictions, but extremely large datasets, usually called big data, can present storage and manipulation challenges.
- Data is never clean and ready to use, so data cleansing is extremely important and takes a lot of time.
- The number of features required to correctly represent a real-life problem is often large. The feature engineering techniques previously mentioned are impractical to perform by hand, so automatic methods must be devised and applied.
- It is far more important to assess the predictive power of a combination of input features than the significance of each individual one. Some simple examples of how to select features are given in Chapter 5, Correlations and the Importance of Variables.
- It is very unlikely that the first model we apply will give a very good result. Testing and evaluating many different machine learning models means repeating the same steps several times, which usually calls for automation as well.
- The dataset should be large enough to use a percentage of the data for training purposes (usually 80%) and the rest for testing. Evaluating the accuracy of a model only on the training data is misleading: a model can be very precise at explaining and predicting the training dataset yet fail to generalize, delivering wrong results when presented with new, previously unseen data.
- Training and test data should be selected, usually at random, from the same full dataset. Trying to make a prediction based on input that lies far away from the training range (extrapolation) is unlikely to give good results.
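The random 80/20 split just described can be sketched in a few lines of Python. This is purely illustrative (this book does not supply such code); the function and variable names are placeholders:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the full dataset and split it into training and test parts.

    Both parts are drawn at random from the same dataset, so the test
    rows come from the same range of values the model was trained on.
    """
    rows = list(rows)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows) # fixed seed makes the split repeatable
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))               # stand-in for 100 dataset rows
train, test = train_test_split(data)
print(len(train), len(test))          # 80 20
```

Seeding the shuffle is a common convenience so that repeated runs produce the same split; in a real project the split would be redrawn or cross-validated rather than fixed.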
Supervised machine learning models are usually trained using a fraction of the input data and tested on the remaining part. The model can then be used to predict the outcome when fed with new and unknown feature values, as shown in the following diagram:
A typical supervised machine learning project includes the following steps:
- Obtaining the data and merging different data sources (there is more on this in Chapter 3, Importing Data into Excel from Different Data Sources)
- Cleansing the data (you can refer to Chapter 4, Data Cleansing and Preliminary Data Analysis)
- Preliminary analysis and feature engineering (you can refer to Chapter 5, Correlations and the Importance of Variables)
- Trying different models and parameters for each of them, training on a percentage of the full dataset and using the rest for testing
- Deploying the model so that it can be used in a continuous analysis flow and not only in small, isolated tests
- Predicting values for new input data
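The steps above can be sketched as a minimal pipeline. Everything here is a hypothetical placeholder, not from this book: the "data sources" are inline toy data following roughly y = 2x, and the "model" is a trivial one-parameter least-squares line through the origin, standing in for the real cleansing, feature engineering, and modeling steps:

```python
import random

def obtain_data():
    # Obtaining and merging the data (toy stand-in: y = 2x plus noise).
    rng = random.Random(0)
    return [(x, 2 * x + rng.uniform(-1, 1)) for x in range(50)]

def cleanse(rows):
    # Cleansing the data (here: simply drop rows with missing values).
    return [r for r in rows if None not in r]

def split(rows, test_fraction=0.2):
    # Hold out part of the data for testing, chosen at random.
    rows = list(rows)
    random.Random(1).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def train(rows):
    # "Train" a trivial model y = a*x by least squares through the origin.
    a = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return lambda x: a * x

def evaluate(model, rows):
    # Mean absolute error on the held-out test rows (unseen during training).
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

train_rows, test_rows = split(cleanse(obtain_data()))
model = train(train_rows)
print(evaluate(model, test_rows))  # error on unseen data, not training data
print(model(100))                  # predicting a value for new input data
```

Note that the error is measured on `test_rows` only, matching the point made earlier: accuracy on the training data alone says little about how the model will behave on new values.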
This procedure will become clear in the examples shown in the next chapter.