Statistics for Data Science

Boosting

Boosting is a process, generally accepted in data science, for improving the accuracy of a weak learning process.

Data science processes defined as weak learners are those that produce results only slightly better than random guessing. Weak learners are typically simple thresholds or one-level decision trees (decision stumps).
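To make this concrete, here is a minimal sketch of a decision stump in Python; the threshold and feature values are illustrative assumptions, not taken from the text.

# A decision stump: a weak learner that classifies with a single threshold test.
def stump_predict(x, threshold=5.0):    # threshold is a made-up illustrative value
    return 1 if x > threshold else -1

samples = [2.0, 7.5, 4.9, 9.1]                # made-up feature values
print([stump_predict(x) for x in samples])    # prints [-1, 1, -1, 1]

However simple, a learner like this still does slightly better than chance on many problems, which is all that boosting requires of it.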

Specifically, boosting is aimed at reducing bias and variance in supervised learning.

What do we mean by bias and variance? Before going any further with boosting, let's pin down both terms.

Data scientists describe bias as a level of favoritism present in the data collection process; it leads to uneven, misleading results and can arise in a variety of ways. A sampling method is called biased if it systematically favors some outcomes over others.

Variance may be defined (by a data scientist) simply as the distance of results from the variable's mean (that is, how far from the average a result typically falls).
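As a quick illustration with made-up numbers, the sample variance is the average squared distance of each result from the mean:

# Variance: the average squared distance of each result from the mean.
results = [4, 7, 6, 3, 10]          # made-up results
mean = sum(results) / len(results)  # 6.0
variance = sum((r - mean) ** 2 for r in results) / len(results)
print(mean, variance)               # 6.0 6.0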

The boosting method can be described as a data scientist repeatedly running a data science process (one that has been identified as a weak learner), with each iteration running on different, randomly sampled examples of data drawn from the original population recordset. All of the results (the classifiers, or, in gradient boosting, the models fit to the residuals of earlier runs) produced by each run are then combined into a single merged result.
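As a hedged sketch of what this looks like in practice, the example below uses scikit-learn's AdaBoostClassifier, whose default weak learner is a one-level decision tree; the synthetic dataset and parameter values are illustrative assumptions, not prescriptions from the text.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the original population recordset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single weak learner: a one-level decision tree (decision stump).
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# Boosting: 50 rounds, each fitting a fresh stump to a re-weighted view of
# the data, then combining all 50 into one weighted-vote classifier.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("single stump accuracy:", stump.score(X_test, y_test))
print("boosted accuracy:     ", booster.score(X_test, y_test))

On a run like this, the merged ensemble will usually score noticeably higher than the lone stump, which is the point of the combining step.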

This concept of using a random subset of the original recordset for each iteration originates from bootstrap sampling in bagging and has a similar variance-reducing effect on the combined model.
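A minimal sketch of bootstrap sampling, assuming a small made-up recordset, is shown below.

import random

random.seed(0)
recordset = list(range(10))   # a made-up original recordset

# Draw a bootstrap sample: the same size as the original, sampled with
# replacement, so some records repeat and others are left out.
bootstrap_sample = [random.choice(recordset) for _ in recordset]
print(bootstrap_sample)

Because each iteration sees a different such sample, the individual models disagree in different places, and averaging their outputs smooths that disagreement away.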

In addition, some data scientists consider boosting a means of converting weak learners into strong ones; indeed, to some, boosting simply means turning a weak learner into a strong one.