[20160926] CS297 Weekly Report

  1. Get the 20 Newsgroups dataset information
  2. Use sklearn to vectorize the text data with TF-IDF
  3. Modify the conditional probability
  4. Build the multinomial Naive Bayes (MNB) classifier
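Steps 2 and 4 can be sketched roughly as follows. This is only a minimal illustration, assuming scikit-learn is installed; the tiny in-memory corpus and its labels are made up for the example and stand in for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (an assumption for illustration, not the report's data).
train_texts = [
    "the team won the match",
    "goal scored in the final game",
    "new gpu speeds up training",
    "python library for machine learning",
]
train_labels = ["sport", "sport", "tech", "tech"]

# TF-IDF turns each text into a weighted term vector; MultinomialNB with
# alpha=1.0 applies add-one (Laplace) smoothing to the per-class term weights.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_texts, train_labels)

prediction = model.predict(["the team scored a goal"])[0]
print(prediction)
```

Wrapping both steps in one pipeline keeps the vocabulary learned during training tied to the classifier, so new text is vectorized the same way at prediction time.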

Build MNB

A working example is shown below:

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

\begin{displaymath} \hat{P}(t\vert c) = \frac{T_{ct}+1}{\sum_{t' \in V} (T_{ct'}+1)} = \frac{T_{ct}+1}{(\sum_{t' \in V} T_{ct'})+B}, \end{displaymath} (119)

where $B=\vert V\vert$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation 116 on the document level.
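As a minimal sketch of the add-one estimate in Equation 119, the function below computes one smoothed conditional probability. The function name and the counts passed to it are illustrative assumptions; exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction


def smoothed_cond_prob(term_count, class_token_total, vocab_size):
    """Add-one estimate: (T_ct + 1) / (sum of T_ct' over the vocabulary + B)."""
    return Fraction(term_count + 1, class_token_total + vocab_size)


# Illustrative counts (assumed): a class containing 8 tokens, a vocabulary of 6 terms.
p_seen = smoothed_cond_prob(5, 8, 6)    # term that occurs 5 times in the class
p_unseen = smoothed_cond_prob(0, 8, 6)  # term that never occurs in the class

print(p_seen, p_unseen)
```

Without smoothing the unseen term would get probability 0 and wipe out the whole product for that class; with smoothing it gets a small nonzero value instead.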

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.

Table 13.1: Data for parameter estimation examples.

set           docID  words in document                    in $c$ = China?
training set  1      Chinese Beijing Chinese              yes
training set  2      Chinese Chinese Shanghai             yes
training set  3      Chinese Macao                        yes
training set  4      Tokyo Japan Chinese                  no
test set      5      Chinese Chinese Chinese Tokyo Japan  ?


Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

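\begin{eqnarray*}
\hat{P}(\mathrm{Chinese}\vert c) &=& (5+1)/(8+6) = 6/14 = 3/7 \\
\hat{P}(\mathrm{Tokyo}\vert c) = \hat{P}(\mathrm{Japan}\vert c) &=& (0+1)/(8+6) = 1/14 \\
\hat{P}(\mathrm{Chinese}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \\
\hat{P}(\mathrm{Tokyo}\vert\overline{c}) = \hat{P}(\mathrm{Japan}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9
\end{eqnarray*}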

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

\begin{eqnarray*}
\hat{P}(c\vert d_5) &\propto& 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\
\hat{P}(\overline{c}\vert d_5) &\propto& 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001.
\end{eqnarray*}

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
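The worked example can be reproduced with a short script. This is a sketch under stated assumptions, not the report's actual implementation: the helper names `train_mnb` and `score` are made up here, and exact fractions replace the log-probabilities a practical classifier would normally sum to avoid underflow.

```python
from collections import Counter
from fractions import Fraction

# Training data from Table 13.1; "yes" means the document is in class c = China.
TRAIN = [
    ("Chinese Beijing Chinese".split(), "yes"),
    ("Chinese Chinese Shanghai".split(), "yes"),
    ("Chinese Macao".split(), "yes"),
    ("Tokyo Japan Chinese".split(), "no"),
]
TEST_DOC = "Chinese Chinese Chinese Tokyo Japan".split()


def train_mnb(docs):
    """Return class priors, per-class term counts, and the vocabulary."""
    doc_counts = Counter(label for _, label in docs)
    priors = {c: Fraction(n, len(docs)) for c, n in doc_counts.items()}
    term_counts = {c: Counter() for c in doc_counts}
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return priors, term_counts, vocab


def score(tokens, label, priors, term_counts, vocab):
    """Unnormalized posterior: prior times the add-one smoothed term probabilities."""
    class_total = sum(term_counts[label].values())
    result = priors[label]
    for t in tokens:
        if t in vocab:  # terms never seen in training are skipped
            result *= Fraction(term_counts[label][t] + 1, class_total + len(vocab))
    return result


priors, term_counts, vocab = train_mnb(TRAIN)
scores = {c: score(TEST_DOC, c, priors, term_counts, vocab) for c in priors}
predicted = max(scores, key=scores.get)
print({c: round(float(s), 4) for c, s in scores.items()}, predicted)
```

The two scores round to 0.0003 for China and 0.0001 for not China, matching the values above.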

[20160910] Weekly Report of CS297

Proposed Project Name

Categorize Text with Naive Bayes and word2vec Word Embeddings

Literature Review




Yelp reviews -> predict the business category from the review text

pros: the dataset is easy to get

cons: the business category seems obvious

Twitter threads -> predict the category from the thread content

pros: category prediction makes more sense here

cons: a labeled dataset is hard to get

Implementation Plan

  1. Extract the Yelp dataset
    1. Visualize the dataset
    2. Explore the number of categories
  2. Use the Naive Bayes classifier from a Python package to classify businesses
  3. Implement a Naive Bayes classifier in Python
  4. Implement a word2vec-enhanced classifier
  5. Prototype classification by category with the classic Naive Bayes classifier
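For the category exploration in step 1, a first pass could simply count how often each category label appears. The sketch below is hypothetical: it assumes one JSON object per line with a `categories` list, and the sample records are invented stand-ins, so the field names should be checked against the actual Yelp files.

```python
import json
from collections import Counter

# Invented sample records (assumption): the real file layout may differ.
SAMPLE_LINES = [
    '{"business_id": "b1", "categories": ["Restaurants", "Pizza"]}',
    '{"business_id": "b2", "categories": ["Restaurants", "Mexican"]}',
    '{"business_id": "b3", "categories": ["Shopping"]}',
    '{"business_id": "b4", "categories": null}',
]


def count_categories(lines):
    """Count how many businesses carry each category label."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # Treat a missing or null category field as "no categories".
        counts.update(record.get("categories") or [])
    return counts


category_counts = count_categories(SAMPLE_LINES)
print(len(category_counts), category_counts.most_common(2))
```

With the real data, the same function could be fed an open file handle instead of the sample list, since iterating over a file yields one line at a time.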