Hands-On Predictive Analytics with Python
上QQ阅读APP看书,第一时间看更新

Data collection and preparation

Goal: Get a dataset that is ready for analysis.

This phase is where we take a look at the data that is available. Depending on the project, you will need to interact with the database administrators and ask them to provide you with the data. You may also need to rely on many different sources to get the data that is needed. Sometimes, the data may not exist yet and you may be part of the team that comes up with a plan to collect it. Remember, the goal of this phase is to have a dataset you will be using for building the predictive model.

In the process of getting the dataset, potential problems with the data may be identified, which is why this phase is, of course, very closely related with the previous one. While performing the tasks for getting the dataset ready, you will go back and forth between this and the former phase as you may realize that the available data is not enough to solve the proposed problem as was formulated in the business understanding phase, so you may need to go back to the stakeholders and discuss the situation and maybe reformulate the problem and solution.

While building the dataset, you may notice some problems with some of the features. Maybe one column has a lot of missing values or the values have not been properly encoded. Although in principle it would be great to deal with problems such as missing values and outliers in this phase, that is often not the case, which is why there isn't a hard boundary between this phase and the next phase: EDA.