- Get the 20 Newsgroups dataset
- Use sklearn to vectorize the text data with TF-IDF
- Modify the conditional probability
- Build a multinomial naive Bayes (MNB) classifier
Build MNB using
A working example is shown below:
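The steps listed above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes scikit-learn is installed, the helper name `build_mnb` is made up for this sketch, and the four training sentences from Table 13.1 stand in for the real corpus so that the snippet runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_mnb(alpha=1.0):
    """Return a TF-IDF + multinomial naive Bayes pipeline (hypothetical helper).

    alpha=1.0 is additive smoothing with a pseudo-count of one per feature.
    """
    return make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=alpha))


# Stand-in corpus taken from Table 13.1. For the real task, the data could be
# loaded with sklearn.datasets.fetch_20newsgroups(subset="train") and its
# .data and .target attributes passed to fit() instead.
train_texts = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
train_labels = ["yes", "yes", "yes", "no"]

model = build_mnb()
model.fit(train_texts, train_labels)
predicted = model.predict(["Chinese Chinese Chinese Tokyo Japan"])
print(predicted[0])
```

Because the features here are TF-IDF weights rather than raw counts, the fitted probabilities will not match a hand calculation on term counts; swapping in `CountVectorizer` would be the closer analogue of the count-based estimates.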
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

$$
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B} \tag{119}
$$

where $T_{ct}$ is the number of occurrences of term $t$ in training documents of class $c$ and $B = |V|$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term, as opposed to the prior probability of a class, which we estimate in Equation 116 on the document level.
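To make the effect of smoothing concrete, here is a small sketch that compares the unsmoothed and add-one estimates. The function names and the term counts are hypothetical, chosen only for illustration; exact fractions are used so the results can be checked by hand.

```python
from fractions import Fraction


def unsmoothed_prob(term_count, class_token_total):
    """Maximum-likelihood estimate: count / total, which can be zero."""
    return Fraction(term_count, class_token_total)


def smoothed_prob(term_count, class_token_total, vocab_size):
    """Add-one estimate: (count + 1) / (total + vocabulary size)."""
    return Fraction(term_count + 1, class_token_total + vocab_size)


# Hypothetical counts: one class with 8 tokens over a 6-term vocabulary.
counts = {"a": 5, "b": 1, "c": 1, "d": 1, "e": 0, "f": 0}
total = sum(counts.values())   # 8 tokens in the class
vocab_size = len(counts)       # 6 distinct terms

smoothed = {t: smoothed_prob(n, total, vocab_size) for t, n in counts.items()}

print(unsmoothed_prob(counts["e"], total))  # 0
print(smoothed["e"])                        # 1/14
print(sum(smoothed.values()))               # 1
```

The unseen term "e" no longer has probability zero, and the smoothed estimates still sum to one over the vocabulary.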
We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.
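Figure 13.2 is not reproduced in these notes, so the following is only a sketch of how training and applying a multinomial NB classifier with add-one smoothing might look. The function names and the whitespace tokenization are assumptions, and scores are summed in log space, a common choice to avoid floating-point underflow. The data are the documents from Table 13.1.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels):
    """Estimate log priors and add-one smoothed log conditional probabilities."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)   # class -> term -> count
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_cond = {}
    for c in class_doc_counts:
        # Denominator: tokens in the class plus the vocabulary size.
        denom = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / denom) for t in vocab}
    return vocab, log_prior, log_cond


def apply_multinomial_nb(model, doc):
    """Return the best class and the per-class log scores for doc."""
    vocab, log_prior, log_cond = model
    tokens = [t for t in doc.split() if t in vocab]   # ignore unseen terms
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens)
        for c in log_prior
    }
    return max(scores, key=scores.get), scores


docs = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
labels = ["yes", "yes", "yes", "no"]

model = train_multinomial_nb(docs, labels)
label, scores = apply_multinomial_nb(model, "Chinese Chinese Chinese Tokyo Japan")
print(label)  # yes
```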
| | docID | words in document | in $c$ = China? |
|---|---|---|---|
| training set | 1 | Chinese Beijing Chinese | yes |
| | 2 | Chinese Chinese Shanghai | yes |
| | 3 | Chinese Macao | yes |
| | 4 | Tokyo Japan Chinese | no |
| test set | 5 | Chinese Chinese Chinese Tokyo Japan | ? |
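Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\bar{c}) = 1/4$ and the following conditional probabilities:

$$
\begin{aligned}
\hat{P}(\text{Chinese} \mid c) &= (5+1)/(8+6) = 6/14 = 3/7 \\
\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) &= (0+1)/(8+6) = 1/14 \\
\hat{P}(\text{Chinese} \mid \bar{c}) &= (1+1)/(3+6) = 2/9 \\
\hat{P}(\text{Tokyo} \mid \bar{c}) = \hat{P}(\text{Japan} \mid \bar{c}) &= (1+1)/(3+6) = 2/9
\end{aligned}
$$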
The denominators are $(8+6)$ and $(3+6)$ because the lengths of $\text{text}_c$ and $\text{text}_{\bar{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.
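We then get:

$$
\begin{aligned}
\hat{P}(c \mid d_5) &\propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\
\hat{P}(\bar{c} \mid d_5) &\propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001
\end{aligned}
$$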
Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
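The arithmetic of the worked example can be checked with exact fractions. The sketch below assumes only the quantities stated above (class text lengths 8 and 3, vocabulary size 6) plus the priors that follow from Table 13.1, where three of the four training documents are in China. The variable names are illustrative.

```python
from fractions import Fraction as F

# Priors: 3 of the 4 training documents belong to class c (China).
prior_c, prior_not_c = F(3, 4), F(1, 4)

# Add-one smoothed conditional probabilities: (count + 1) / (length + 6).
p_chinese_c = F(5 + 1, 8 + 6)
p_tokyo_c = p_japan_c = F(0 + 1, 8 + 6)
p_chinese_not_c = F(1 + 1, 3 + 6)
p_tokyo_not_c = p_japan_not_c = F(1 + 1, 3 + 6)

# Test document 5: Chinese Chinese Chinese Tokyo Japan.
score_c = prior_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not_c = prior_not_c * p_chinese_not_c**3 * p_tokyo_not_c * p_japan_not_c

print(score_c, round(float(score_c), 4))          # 81/268912 0.0003
print(score_not_c, round(float(score_not_c), 4))  # 8/59049 0.0001
print("China" if score_c > score_not_c else "not China")  # China
```

These are unnormalized scores proportional to the class posteriors, so only their ratio matters for the decision.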