R Machine Learning By Example

Machine learning basics

Now that you have refreshed your memory about R, we will cover the basics of machine learning: what it is, how it is used today, and what its main areas are. This section provides an overview of machine learning that paves the way for the next chapter, where we will explore it in more depth.

Machine learning – what does it really mean?

Machine learning does not have just one distinct textbook definition because it is a field that encompasses and borrows concepts and techniques from several other areas in computer science. It is also taught as an academic course in universities and has recently gained more prominence, with machine learning and data science being widely taught online in the form of educational videos, courses, and training. Machine learning is essentially an intersection of computer science, statistics, and mathematics. It uses concepts from artificial intelligence, pattern detection, optimization, and learning theory to develop algorithms and techniques that can learn from data and make predictions on it without being explicitly programmed.

The learning here refers to the ability to make computers or machines intelligent based on the data and algorithms that we provide to them, so that they start detecting patterns and insights in that data. This learning means that machines can detect patterns in the data fed to them without being explicitly programmed every time. The initial data or observations are fed to the machine, and the machine learning algorithm works on that data to generate some output, which can be a prediction, a hypothesis, or even some numerical result. Based on this output, there can be feedback mechanisms to our machine learning algorithm to improve our results. This whole system forms a machine learning model, which can be used directly on completely new data or observations to get results without needing to write a separate algorithm for that data.
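
The loop described above (feed data, produce a model, apply it to new observations, collect feedback) can be sketched in a few lines of base R. This is only an illustrative sketch: the choice of the built-in mtcars data set, the linear model, the 24/8 split, and the error measure are all assumptions made for the example, not a prescribed method.

```r
# Minimal sketch of the learn-then-predict loop using base R only.
set.seed(42)                                   # make the random split reproducible

# 1. Initial data: the built-in mtcars data set (32 cars).
train_idx <- sample(nrow(mtcars), size = 24)   # 24 rows used for learning
train     <- mtcars[train_idx, ]
new_data  <- mtcars[-train_idx, ]              # 8 rows the model never sees

# 2. The algorithm works on the data and produces a model.
model <- lm(mpg ~ wt + hp, data = train)

# 3. The model is applied directly to new observations.
predicted <- predict(model, newdata = new_data)

# 4. Feedback: root mean squared error between predictions and actual values.
rmse <- sqrt(mean((new_data$mpg - predicted)^2))
print(round(rmse, 2))
```

A large error at step 4 is the kind of feedback that would prompt us to change the features or the algorithm and fit the model again.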

Machine learning – how is it used in the world?

You might be wondering how on earth some algorithms or code can be used in the real world. It turns out they are used in a wide variety of use-cases in different verticals. Some examples are as follows:

  • Retail: Machine learning is widely used in the retail and e-commerce vertical where each store wants to outperform its competitors.
    • Pricing analytics: Machine learning algorithms are used to compare prices for items across various stores so that a store can sell the item at the most competitive price.
    • Market basket analysis: They are used for analysis of customer shopping trends and recommendation of products to buy, which we will be covering in Chapter 3, Predicting Customer Shopping Trends with Market Basket Analysis.
    • Recommendation engines: They are used to analyze customer purchases, ratings, and satisfaction to recommend products to various users. We will be building some recommendation systems of our own in Chapter 4, Building a Product Recommendation System.
  • Advertising: The advertising industry heavily relies on machine learning to promote and show the right advertisements to consumers for maximum conversion.
    • Web analytics: Analyzes website traffic
    • Churn analytics: Predicts customer churn rate
    • Advertisement click-through prediction: Used to predict how effective an advertisement would be to consumers such that they click on it to buy the relevant product
  • Healthcare: Machine learning algorithms are used widely in the healthcare vertical for more effective treatment of patients.
    • Disease detection and prediction: Used to detect and predict chances of a disease based on the patient's medical history.
    • Studying complex structures such as the human brain and DNA to understand the human body's functionality better for more effective treatment.
  • Detection and filtering of spam e-mails and messages.
  • Predicting election results.
  • Fraud detection and prediction. We will be taking a stab at one of the most critical fraud detection problems in Chapter 5, Credit Risk Detection and Prediction – Descriptive Analytics, and Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics.
  • Text prediction in a messaging application.
  • Self-driving cars, planes, and other vehicles.
  • Weather, traffic, and crime activity forecasting and prediction.
  • Sentiment and emotion analysis, which we will be covering in Chapter 8, Sentiment Analysis of Twitter Data.

The preceding examples just scratch the surface of what machine learning can really do and by now I am sure that you have got a good flavor of the various areas where machine learning is being used extensively.

Types of machine learning algorithms

As we talked about earlier, to make machines learn, you need machine learning algorithms. Machine learning algorithms are a special class of algorithms which work on data and gather insights from it. The idea is to build a model using a combination of data and algorithms which can then be used to work on new data and derive actionable insights.

The choice of machine learning algorithm depends on what type of data it can work on and what type of problem we are trying to solve. You might be tempted to learn a couple of algorithms and then try to apply them to every problem you face. Do remember that there is no universal machine learning algorithm which fits all problems. The main input to machine learning algorithms is data which consists of features, where each feature can be described as an attribute of the data set, such as height, weight, and so on if we were dealing with data related to human beings. Machine learning algorithms can be divided into two main areas, namely supervised and unsupervised learning algorithms.
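
To make the idea of features concrete, here is a small sketch of how such data usually looks in R. The data frame and its values are entirely hypothetical and only meant for illustration: each column is a feature and each row is one observation.

```r
# Hypothetical data about people: columns are features, rows are observations.
people <- data.frame(
  height_cm = c(170, 182, 165, 158),
  weight_kg = c(68, 85, 59, 52),
  smoker    = factor(c("no", "yes", "no", "no"))
)

str(people)                   # structure: one line per feature
n_features     <- ncol(people)
n_observations <- nrow(people)
print(sapply(people, class))  # numeric and categorical features can be mixed
```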

Supervised machine learning algorithms

The supervised learning algorithms are a subset of the family of machine learning algorithms which are mainly used in predictive modeling. A predictive model is basically a model constructed from a machine learning algorithm and features or attributes from training data such that we can predict a value using the other values obtained from the input data. Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features, such that we can predict the output values for new data based on the relationships they learned from the previous data sets. The main types of supervised learning algorithms include:

  • Classification algorithms: These algorithms build predictive models from training data which have features and class labels. These predictive models in turn apply what they learned from the training data to new, previously unseen data to predict its class labels. The output classes are discrete. Types of classification algorithms include decision trees, random forests, support vector machines, and many more. We will be using several of these algorithms in Chapter 2, Let's Help Machines Learn, Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, and Chapter 8, Sentiment Analysis of Twitter Data.
  • Regression algorithms: These algorithms are used to predict output values based on some input features obtained from the data. To do this, the algorithm builds a model based on features and output values of the training data and this model is used to predict values for new data. The output values in this case are continuous and not discrete. Types of regression algorithms include linear regression, multivariate regression, regression trees, and lasso regression, among many others. We explore some of these in Chapter 2, Let's Help Machines Learn.
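
The difference between the two types can be sketched with a single package. The following is a minimal example, assuming the rpart package is installed (it normally ships with standard R distributions); the built-in iris data set, the 120/30 split, and the chosen target columns are assumptions made purely for illustration.

```r
library(rpart)   # assumed to be installed

set.seed(1)
# Hold out 30 of the 150 iris flowers as new, previously unseen data.
test_idx <- sample(nrow(iris), size = 30)
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Classification: the target (Species) is a discrete class label.
clf <- rpart(Species ~ ., data = train, method = "class")
pred_class <- predict(clf, newdata = test, type = "class")
accuracy <- mean(pred_class == test$Species)

# Regression: the target (Petal.Width) is a continuous number.
reg <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
             data = train, method = "anova")
pred_value <- predict(reg, newdata = test)

print(table(predicted = pred_class, actual = test$Species))
print(accuracy)
print(head(pred_value))
```

The classification tree returns a factor of class labels, while the regression tree returns a numeric vector, which mirrors the discrete versus continuous distinction made above.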

Unsupervised machine learning algorithms

The unsupervised learning algorithms are the family of machine learning algorithms which are mainly used in pattern detection and descriptive modeling. A descriptive model is basically a model constructed from an unsupervised machine learning algorithm and features from input data similar to the supervised learning process. However, there are no output categories or labels here based on which the algorithm can try to model relationships. These algorithms try to use techniques on the input data to mine for rules, detect patterns, and summarize and group the data points which help in deriving meaningful insights and describe the data better to the users. There is no specific concept of training or testing data here since we do not have any specific relationship mapping and we are just trying to get useful insights and descriptions from the data we are trying to analyze. The main types of unsupervised learning algorithms include:

  • Clustering algorithms: The main objective of these algorithms is to cluster or group input data points into different classes or categories using just the features derived from the input data alone and no other external information. Unlike classification, the output labels are not known beforehand in clustering. There are different approaches to build clustering models, such as by using means, medoids, hierarchies, and many more. Some popular clustering algorithms include k-means, k-medoids, and hierarchical clustering. We will look at some clustering algorithms in Chapter 2, Let's Help Machines Learn, and Chapter 7, Social Media Analysis – Analyzing Twitter Data.
  • Association rule learning algorithms: These algorithms are used to mine and extract rules and patterns from data sets. These rules explain relationships between different variables and attributes, and also depict frequent item sets and patterns which occur in the data. These rules in turn help discover useful insights for any business or organization from their huge data repositories. Popular algorithms include Apriori and FP Growth. We will be using some of these in Chapter 2, Let's Help Machines Learn, and Chapter 3, Predicting Customer Shopping Trends with Market Basket Analysis.
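
Both ideas can be sketched in base R. This is an illustrative sketch only: the iris data set, the choice of three clusters, and the shopping baskets are assumptions made for the example. The second half computes support and confidence by hand for one rule rather than running Apriori or FP Growth, which search for such rules automatically.

```r
set.seed(7)
features <- iris[, 1:4]          # drop Species: clustering sees no labels

# k-means: partition the flowers into 3 groups around cluster means.
km <- kmeans(features, centers = 3, nstart = 10)

# Hierarchical clustering: build a tree of merges, then cut it into 3 groups.
hc <- hclust(dist(features), method = "average")
hc_groups <- cutree(hc, k = 3)

# Only afterwards do we compare the discovered groups with the known species.
print(table(cluster = km$cluster, species = iris$Species))

# Association rules by hand, on hypothetical shopping baskets.
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# TRUE for each basket that contains all of the given items.
has <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

support_bread      <- mean(has("bread"))             # share of baskets with bread
support_bread_milk <- mean(has(c("bread", "milk")))  # share with bread and milk
confidence <- support_bread_milk / support_bread     # rule {bread} => {milk}

print(c(support = support_bread_milk, confidence = confidence))
```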

Popular machine learning packages in R

After getting a brief overview of machine learning basics and types of algorithms, you must be getting inquisitive as to how we apply some of these algorithms to solve real world problems using R. It turns out, there are a whole lot of packages in R which are dedicated to just solving machine learning problems. These packages consist of algorithms which are optimized and ready to be used to solve problems. We will list several popular machine learning packages in R, so that you are aware of what tools you might need later on and also feel more familiar with some of these packages when used in the later chapters. Based on usage and functionality, the following R packages are quite popular in solving machine learning problems:

  • caret: This package (short for classification and regression training) consists of several machine learning algorithms for building predictive models
  • randomForest: This package deals with implementations of the random forest algorithm for classification and regression
  • rpart: This package focuses on recursive partitioning and decision trees
  • glmnet: The main focus of this package is lasso and elastic-net regularized regression models
  • e1071: This deals with Fourier transforms, clustering, support vector machines, and many more supervised and unsupervised algorithms
  • party: This deals with recursive partitioning
  • arules: This package is used for association rule learning algorithms
  • recommenderlab: This is a library to build recommendation engines
  • nnet: This package enables predictive modeling using neural networks
  • h2o: It is one of the most popular packages being used in data science these days and offers fast and scalable algorithms including gradient boosting and deep learning
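
Before moving on, it can be handy to check which of these packages are already available on your machine. The following is a small sketch using only base R; the installation line is left commented out on the assumption that you want to decide yourself what gets installed.

```r
pkgs <- c("caret", "randomForest", "rpart", "glmnet", "e1071",
          "party", "arules", "recommenderlab", "nnet", "h2o")

# system.file() returns "" when a package is not installed,
# so nzchar() gives TRUE for installed packages and FALSE otherwise.
installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                    logical(1))
print(installed)

missing_pkgs <- pkgs[!installed]
# To install the missing ones from CRAN, uncomment the next line:
# install.packages(missing_pkgs)
```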

Besides the preceding libraries, there are many other packages related to machine learning in R. What matters is choosing the right algorithm and model based on the data and problem at hand.