Statistics for Data Science

Understanding basic data cleaning

The importance of having clean (and therefore reliable) data in any statistical project cannot be overstated. Dirty data, even when handled with sound statistical practice, produces unreliable results and can suggest courses of action that are incorrect or that may even cause harm or financial loss. It has been stated that a data scientist spends nearly 90 percent of their time cleaning data and only 10 percent on actually modeling the data and deriving results from it.

So, just what is data cleaning?

Data cleaning, also referred to as data cleansing or data scrubbing, involves both detecting and addressing errors, omissions, and inconsistencies within a population of data.

This may be done interactively with data wrangling tools or in batch mode through scripting. We will use R in this book because it is well suited to data science: it works with even very complex datasets, provides a wide range of modeling functions for handling the data, and can generate visualizations that represent the data and test theories and assumptions, all in just a few lines of code.
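As a small illustration of that claim, here is a minimal sketch in R; the sales data frame and its region and amount columns are hypothetical examples chosen for illustration, not data from this book.

# A small, made-up data frame standing in for a real dataset
sales <- data.frame(
  region = c("North", "South", "East", "West", "North"),
  amount = c(120.5, 98.0, NA, 143.2, 110.7)
)

str(sales)        # structure: column types and a preview of the values
summary(sales)    # quick statistical summary, including the count of NA values
hist(sales$amount, main = "Distribution of amount")  # a simple visualization in one line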

During cleansing, you first examine and evaluate your data pool to establish its level of quality. That level of quality is affected by the way the data is entered, stored, and managed. Cleansing itself can involve correcting, replacing, or simply removing individual data points or entire records.
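To make those three options concrete, here is a minimal sketch in R, continuing with the hypothetical sales data frame introduced earlier (assumed for illustration only).

# Correcting: repair a known data-entry error, such as an amount entered as negative
bad <- !is.na(sales$amount) & sales$amount < 0
sales$amount[bad] <- abs(sales$amount[bad])

# Replacing: substitute a missing amount with the column median
sales$amount[is.na(sales$amount)] <- median(sales$amount, na.rm = TRUE)

# Removing: drop any records that are still incomplete
sales <- sales[complete.cases(sales), ]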

Cleansing should not be confused with validation; the two differ. Validation is a pass-or-fail check that usually occurs as the data is captured (at the time of entry), rather than an operation performed later on the data in preparation for an intended purpose.
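A minimal sketch of such a pass-or-fail check in R follows; the validate_record function and the record layout are hypothetical, introduced here only to illustrate the idea.

# Hypothetical pass-or-fail validation applied at the time of entry
validate_record <- function(record) {
  !is.na(record$amount) &&                                   # amount must be present
    record$amount >= 0 &&                                    # and non-negative
    record$region %in% c("North", "South", "East", "West")   # and use a known region code
}

new_record <- list(region = "East", amount = 57.3)
if (validate_record(new_record)) {
  message("Record accepted at time of entry")
} else {
  message("Record rejected; request re-entry")
}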

As a data developer, you should not be new to the concept of data cleaning or the importance of improving data quality. You will also agree that addressing data quality requires routine, regular review and evaluation of the data; in fact, most organizations have enterprise tools, processes, or at least policies in place to routinely preprocess and cleanse their data.

There is quite a list of both free and paid tools to sample, if you are interested, including iManageData, Data Manager, DataPreparator, Trifacta Wrangler, and so on. From a statistical perspective, the top choices would be R, Python, and Julia.

Before you can address specific issues within your data, you need to detect them, and detection requires that you determine what qualifies as an issue or error, given the context of your objective (more on this later in this section).
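As a minimal sketch of what such detection rules might look like in R (again using the hypothetical sales data frame), each line below asks a question whose answer depends on the objective at hand.

colSums(is.na(sales))       # how many values are missing in each column?
which(sales$amount < 0)     # which rows hold an impossible, negative amount?
sales[duplicated(sales), ]  # are any records exact duplicates of another?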