Mastering Machine Learning on AWS

Preparing the data

Now that we have the source data in text files, we need to convert it into a format that can be used as input to an ML library. Most general-purpose ML packages, such as scikit-learn and Apache Spark, only accept a matrix of numbers as input, so feature transformation is required for a text dataset. A common approach is to use language models such as bag-of-words (BoW). In this example, we build a BoW for each tweet and construct a matrix in which each row represents a tweet and each column signals the presence of a particular word. We also add a column for the label, which distinguishes tweets by Republicans (1) from tweets by Democrats (0), as we can see in the following table:

Table 3: Converting a text dataset to a structured dataset

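To make this concrete, the following is a minimal sketch of how such a matrix could be built with scikit-learn's CountVectorizer. The file-loading code and variable names are assumptions for illustration; dems.txt and gop.txt are the text files prepared earlier, assumed here to contain one tweet per line.

# A minimal sketch (not necessarily the book's exact code) of building a
# binary bag-of-words matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

# Load one tweet per line from each file (assumed format).
with open('dems.txt') as f:
    dem_tweets = [line.strip() for line in f if line.strip()]
with open('gop.txt') as f:
    gop_tweets = [line.strip() for line in f if line.strip()]

tweets = dem_tweets + gop_tweets
labels = [0] * len(dem_tweets) + [1] * len(gop_tweets)  # 0 = Democrat, 1 = Republican

# binary=True records only the presence of a word, not how often it occurs.
vectorizer = CountVectorizer(binary=True)
bow_matrix = vectorizer.fit_transform(tweets)  # rows = tweets, columns = words

print(bow_matrix.shape)                          # (number of tweets, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])   # the first few word columns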

The preceding table represents the matrix that can be derived from the tweets. However, there are several points to keep in mind when generating such a matrix. Because of the number of terms in the language lexicon, the number of columns in the matrix can be very high. This poses a problem in ML known as the curse of dimensionality (see section X). There are several ways to tackle this problem; however, as our example is fairly small in terms of data, we will only briefly discuss methods for reducing the number of columns (a short preprocessing sketch follows the list):

  • Stopwords: Certain common words add little or no value to our task (for example, the, for, and as). We call these words stopwords, and we shall remove them from dems.txt and gop.txt.
  • Stemming: A word may appear in the text in many variant forms. For example, argue, argued, argues, and arguing all stem from the word argue. Techniques such as stemming and lemmatization can be used to find the stem of a word and replace its variants with that stem.
  • Tokenization: Tokenization can also be used to combine words into phrases, which reduces the number of features. For example, tea party has a totally different meaning, politically, than the literal meaning of the two words taken separately. We won't consider this in our simple example, but tokenization techniques help to find such phrases.
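The following is a minimal sketch of the first two ideas (stopword removal and stemming) using NLTK; the simple regex tokenizer and the overall pipeline are illustrative assumptions rather than the book's exact preprocessing code.

# A sketch of stopword removal and stemming with NLTK (illustrative only).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')   # one-time download of the English stopword list

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(tweet):
    # Lowercase, split into alphabetic tokens, drop stopwords, stem the rest.
    tokens = re.findall(r'[a-z]+', tweet.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess('They argued about the tea party for hours'))
# ['argu', 'tea', 'parti', 'hour']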

Another issue to consider is that words appearing more than once in a tweet carry no more weight on a training row than words appearing only once, because each column merely signals presence. Multinomial or term frequency-inverse document frequency (TF-IDF) models offer ways to make use of this frequency information. Since tweets are relatively short texts, we will not consider this aspect in our implementation.
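For completeness, if you did want to incorporate word counts or weights, scikit-learn makes this a small change to the earlier sketch. The snippet below assumes the same tweets list and is only an illustration of the alternatives, since our implementation sticks with binary presence.

# Alternative weightings (assuming the `tweets` list from the earlier sketch):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_matrix = CountVectorizer().fit_transform(tweets)   # raw term counts (multinomial)
tfidf_matrix = TfidfVectorizer().fit_transform(tweets)   # TF-IDF weights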

The matrix describes the words you would find for each class (that is, each political party). However, when we want to predict the source of a tweet, the inverse problem is posed. Given a specific bag of words, we're interested in assessing how likely it is that the terms were used by one party or the other. In other words, we know the probability of a bag of words given a particular party, and we are interested in the reverse: the probability that a tweet was written by a party given its bag of words. This is where the Naive Bayes algorithm is applied.
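As a rough sketch of what this inversion looks like in code, a Bernoulli Naive Bayes model can be trained on the binary matrix built earlier and queried for posterior probabilities. The variable names reuse the assumptions from the previous snippets, and the example tweet is hypothetical.

# Fit a Bernoulli Naive Bayes classifier on the binary bag-of-words matrix
# and inspect the posterior P(party | bag of words).
# Reuses bow_matrix, labels, and vectorizer from the earlier sketch.
from sklearn.naive_bayes import BernoulliNB

model = BernoulliNB()
model.fit(bow_matrix, labels)

new_tweet = ['We will cut taxes and defend the constitution']  # hypothetical example
features = vectorizer.transform(new_tweet)

# Columns of predict_proba follow model.classes_: [P(Democrat), P(Republican)]
print(model.predict_proba(features))
print(model.predict(features))   # 0 = Democrat, 1 = Republican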