Applied Supervised Learning with R

Introduction

Chapter 1, R for Advanced Analytics, introduced you to the R language and its ecosystem for data science. We are now ready to enter a crucial part of data science and machine learning: Exploratory Data Analysis (EDA), the art of understanding the data.

In this chapter, we will approach EDA with the same banking dataset used in the previous chapter, but in a more problem-centric way. We will start by defining the problem statement with industry-standard artifacts, then design a solution for the problem, and finally learn how EDA fits into the larger problem framework. We will then carry out EDA for the use case, the direct marketing campaigns (phone calls) of a Portuguese banking institution, using a combination of data engineering, data wrangling, and data visualization techniques in R, backed by a business-centric approach.

In any data science use case, understanding the data consumes the bulk of the time and effort. A commonly cited estimate is that data science professionals spend around 80% of their time understanding data. Given that this is the most crucial part of your journey, it is important to have a macro view of the overall process for any data science use case.

A typical data science use case takes the path of either a core business-analytics problem or a machine-learning problem. Whichever path is taken, EDA is unavoidable. Figure 2.1 shows the life cycle of a basic data science use case. It starts with defining the problem statement using one or more standard frameworks, then moves on to data gathering, and then reaches EDA. Most of the effort and time in any project goes into EDA. Once the process of understanding the data is complete, a project may take a different path depending on the scope of the use case. In most business analytics-based use cases, the next step is to assimilate all the observed patterns into meaningful insights. Though this might sound trivial, it is an iterative and arduous task. This step then evolves into storytelling, where the condensed insights are tailored into a meaningful story for the business stakeholders. Similarly, in scenarios where the objective is to develop a predictive model, the next step is to develop a machine learning model and then deploy it into a production system or product.

Figure 2.1: Life cycle of a data science use case

Let's take a brief look at the first step, Defining the Problem Statement.