3 Explaining Machine Learning with Facets
Lack of the right data often poisons an artificial intelligence (AI) project from the start. We are used to downloading ready-to-use datasets from Kaggle, scikit-learn, and other reliable sources.
We focus on learning how to use and implement machine learning (ML) algorithms. However, reality hits AI project managers hard on day one of a project.
Companies rarely have clean or even sufficient data for a project. Corporations have massive amounts of data, but it often comes from different departments.
Each department of a company may have its own data management system and policy. When you finally obtain a training dataset sample, you may find that your AI model does not work as planned. You might have to change ML models or find out what is wrong with the data. You are trapped right from the start. What you thought would be an excellent AI project has turned into a nightmare.
You need to get out of this trap rapidly by first explaining the data availability problem. You must find a way to explain why the datasets require improvements. You must also explain which features require more data or better-quality data. You do not have the time or resources to develop a new explainable AI (XAI) solution for each project.
Facets Overview and Facets Dive provide visualization tools to analyze your training and testing data feature by feature.
We will start by installing and exploring Facets Overview, a statistical visualization tool. We will use the input virus detection data we are familiar with, taken from Chapter 1, Explaining Artificial Intelligence with Python.
We will then build the Facets Dive display code to visualize data points. Facets Dive offers many options to display and explain data point features. The interactive interface has labels and color options. You will define the binning of the x-axis and y-axis, among other useful functions.
This chapter covers the following topics:
- Installing and running Facets Overview in a Jupyter Notebook on Google Colaboratory
- Implementing the feature statistics code
- Implementing the HTML code to display statistics
- Analyzing the features by feature order
- Visualizing the minimum, maximum, median, and mean values feature by feature
- Looking for non-uniformity in the data distributions
- Sorting the features by missing records or zeros
- Analyzing the distribution distances and the Kullback-Leibler divergence
- Building the Facets Dive display code
- Comparing the values of a data point and counterfactual data points
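To make the statistics named in this list more concrete before we turn to the tool itself, the following is a minimal sketch of what a feature-by-feature summary and a distribution distance look like in plain Python. It is not Facets code and does not reproduce how Facets computes its values. The function names (`feature_summary`, `histogram`, `kl_divergence`), the sample values, and the bin edges are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, median


def feature_summary(values):
    """Summarize one numeric feature; None marks a missing record."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "min": min(present),
        "max": max(present),
        "median": median(present),
        "mean": mean(present),
    }


def histogram(values, edges, eps=1e-9):
    """Normalized histogram over shared bin edges.

    A tiny constant (eps, an assumption of this sketch) is added to every
    bin so that no bin has probability zero, which keeps the logarithm in
    the KL divergence defined.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last_bin = i == len(counts) - 1
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (is_last_bin and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical values of a single feature in a training and a testing set.
train_feature = [0, 1, 2, None, 4, 0]
test_feature = [3, 4, 4, 2, 4, 3]

summary = feature_summary(train_feature)

# Both histograms must share the same bin edges to be comparable.
edges = [0, 1, 2, 3, 4]
p_train = histogram([v for v in train_feature if v is not None], edges)
p_test = histogram(test_feature, edges)

print(summary)
print("KL(train || test):", kl_divergence(p_train, p_test))
```

A divergence near zero suggests that the training and testing values of a feature are distributed similarly, while a large value flags a feature worth inspecting. Note that the measure is not symmetric: swapping the two distributions generally gives a different result.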
Our first step will be to install and run Facets.