[20160926] Weekly Report of CS297

  1. Get the 20 Newsgroups dataset
  2. Use sklearn to vectorize the document text with TF-IDF
  3. Modify the conditional probability
  4. Build the multinomial naive Bayes (MNB) classifier
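A minimal sketch of steps 1, 2, and 4 with scikit-learn. A tiny inline corpus (with made-up labels) stands in for the real data here, which `sklearn.datasets.fetch_20newsgroups` would supply:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the 20 Newsgroups data; labels are illustrative only.
train_texts = [
    "the team won the hockey game",
    "the goalie made a great save",
    "the new gpu renders graphics fast",
    "install the graphics driver update",
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey",
                "comp.graphics", "comp.graphics"]

vectorizer = TfidfVectorizer()                 # step 2: TF-IDF features
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB(alpha=1.0)                 # step 4: MNB with add-one smoothing
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["a fast gpu for graphics"])
print(clf.predict(X_test)[0])                  # predicts comp.graphics
```

`alpha=1.0` is exactly the Laplace smoothing discussed in the excerpt below.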

Build MNB (screenshot of the algorithm omitted)

A working example is shown below:

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2 ):

\begin{displaymath} \hat{P}(t\vert c) = \frac{T_{ct}+1}{\sum_{t' \in V}(T_{ct'}+1)} = \frac{T_{ct}+1}{(\sum_{t' \in V} T_{ct'})+B}, \end{displaymath} (119)

where $B=\vert V\vert$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation 116 on the document level.
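The smoothed estimate can be checked directly in pure Python, using the class-$c$ counts from Table 13.1 below (5 occurrences of Chinese, 8 tokens total, 6 vocabulary terms):

```python
# Add-one (Laplace) smoothing as in Equation (119).
T_c = {"Chinese": 5, "Beijing": 1, "Shanghai": 1,
       "Macao": 1, "Tokyo": 0, "Japan": 0}
B = len(T_c)                  # B = |V| = 6 terms in the vocabulary
total = sum(T_c.values())     # 8 tokens in class c

def p_hat(term):
    # (T_ct + 1) / ((sum over t' of T_ct') + B)
    return (T_c[term] + 1) / (total + B)

print(p_hat("Chinese"))   # 6/14 = 3/7
print(p_hat("Tokyo"))     # 1/14 -- nonzero despite a zero count
```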

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2 .

Table 13.1: Data for parameter estimation examples.
              docID  words in document                      in $c=$ China?
training set  1      Chinese Beijing Chinese                yes
              2      Chinese Chinese Shanghai               yes
              3      Chinese Macao                          yes
              4      Tokyo Japan Chinese                    no
test set      5      Chinese Chinese Chinese Tokyo Japan    ?

 

Worked example. For the example in Table 13.1 , the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

\begin{eqnarray*} \hat{P}(\term{Chinese}\vert c) &=& (5+1)/(8+6) = 6/14 = 3/7 \\ \hat{P}(\term{Tokyo}\vert c) = \hat{P}(\term{Japan}\vert c) &=& (0+1)/(8+6) = 1/14 \\ \hat{P}(\term{Chinese}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \\ \hat{P}(\term{Tokyo}\vert\overline{c}) = \hat{P}(\term{Japan}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \end{eqnarray*}

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

\begin{eqnarray*} \hat{P}(c\vert d_5) &\propto& 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\ \hat{P}(\overline{c}\vert d_5) &\propto& 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001. \end{eqnarray*}

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
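The arithmetic of the worked example can be verified in a few lines of Python, with the priors and conditional probabilities copied from the estimates above:

```python
# Re-checking the worked-example scores for test document d5.
p_c, p_not = 3/4, 1/4
cond_c   = {"Chinese": 3/7, "Tokyo": 1/14, "Japan": 1/14}
cond_not = {"Chinese": 2/9, "Tokyo": 2/9,  "Japan": 2/9}

d5 = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

score_c, score_not = p_c, p_not
for t in d5:
    score_c *= cond_c[t]      # prior times product of conditionals
    score_not *= cond_not[t]

print(round(score_c, 4), round(score_not, 4))   # 0.0003 0.0001
```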

[20160910] Weekly Report of CS297

Proposed Project Name

Categorize Text with Naive Bayes and word2vec word embedding

Literature Review

(screenshots of the reviewed papers omitted)

DataSet

Yelp reviews –> predict the business category from the review text

pros: the dataset is easy to obtain

cons: the business categories may be too obvious to predict

Twitter threads –> predict the category from the thread content

pros: category prediction makes more sense here

cons: a labeled dataset is hard to obtain

Implementation Plan

  1. Extract the Yelp dataset
    1. visualize the dataset
    2. explore the number of categories
  2. Use a naive Bayes classifier from a Python package to classify businesses
  3. Implement a naive Bayes classifier in Python
  4. Implement the word2vec-enhanced classifier
  5. Prototype classification by category with the classic naive Bayes classifier
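For step 3, a from-scratch multinomial naive Bayes might look like the sketch below. It is trained on the Table 13.1 data from the 20160926 report above, so the expected prediction matches that worked example:

```python
import math
from collections import Counter

# Minimal multinomial naive Bayes with add-one smoothing (illustrative sketch).
def train_mnb(docs, labels):
    vocab = {t for d in docs for t in d.split()}
    prior, cond = {}, {}
    for c in set(labels):
        c_docs = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = math.log(len(c_docs) / len(docs))
        counts = Counter(t for d in c_docs for t in d.split())
        total = sum(counts.values())
        cond[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                   for t in vocab}
    return prior, cond

def classify(doc, prior, cond):
    # Terms outside the vocabulary are simply ignored (log-prob 0.0).
    scores = {c: prior[c] + sum(cond[c].get(t, 0.0) for t in doc.split())
              for c in prior}
    return max(scores, key=scores.get)

docs = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
        "Chinese Macao", "Tokyo Japan Chinese"]
labels = ["yes", "yes", "yes", "no"]
prior, cond = train_mnb(docs, labels)
print(classify("Chinese Chinese Chinese Tokyo Japan", prior, cond))  # yes
```

Log probabilities are summed rather than multiplied to avoid floating-point underflow on longer documents.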

[20160812] Weekly Report of CS297

Reviewed one paper that uses Word2Vec to enhance a naive Bayes classifier. The title is shown below:

(screenshot of the paper title omitted)

 

Pros:

  • Introduced semantic analysis into text classification; Word2Vec is shown to improve classification accuracy.
  • Applied a distributed method for large-scale computation

Cons:

  • The same corpus is used for every class; if a separate corpus were found for each class, the results might be more accurate

[160729] Weekly Report of CS297

Reviewed one paper that uses word2vec to determine the characteristic vocabulary of a domain.

The title of the paper is shown below:

(screenshot of the paper title omitted)

The author proposes a workflow to detect the characteristic vocabulary of the domain in question by using (1) a crawler to gather the text and (2) word2vec to rank similar words. The schematic of the workflow is shown below:

(workflow schematic omitted)
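The similar-word ranking step can be illustrated with cosine similarity over a handful of made-up 3-d vectors; real vectors would come from a trained word2vec model (e.g. via gensim):

```python
import math

# Hypothetical word vectors, invented for illustration only.
vectors = {
    "diabetes": [0.9, 0.1, 0.2],
    "insulin":  [0.8, 0.2, 0.1],
    "glucose":  [0.7, 0.3, 0.2],
    "guitar":   [0.1, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Rank all other words by similarity to a seed term, most similar first.
seed = "diabetes"
ranked = sorted((w for w in vectors if w != seed),
                key=lambda w: cosine(vectors[seed], vectors[w]),
                reverse=True)
print(ranked)   # domain terms rank above the unrelated word
```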

Reference:

https://arxiv.org/abs/1605.09564

 

[160624] Weekly Report of CS297

Summary

Reviewed one paper focusing on text classification. The paper compares classification accuracy when using Word2Vec and Doc2Vec models versus a bag-of-words representation as features. The results show that the Word2Vec and Doc2Vec models achieve higher accuracy.

Process pipeline

Document –> text processing –> Word2Vec/Doc2Vec feature generation –> (IDF/TF-IDF weight adjustment) –> train classifier –> evaluate performance
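The feature-generation-plus-weighting step of this pipeline might be sketched as an IDF-weighted average of word vectors. The 2-d vectors and IDF values below are invented for illustration; real ones would come from Word2Vec training and corpus statistics:

```python
# Hypothetical embeddings and IDF weights (illustration only).
embeddings = {"good": [1.0, 0.0], "food": [0.5, 0.5], "slow": [0.0, 1.0]}
idf = {"good": 1.2, "food": 0.3, "slow": 1.5}

def doc_vector(tokens, dims=2):
    # IDF-weighted average of the word vectors of in-vocabulary tokens.
    acc = [0.0] * dims
    weight = 0.0
    for t in tokens:
        if t in embeddings:
            for i in range(dims):
                acc[i] += idf[t] * embeddings[t][i]
            weight += idf[t]
    return [a / weight for a in acc] if weight else acc

print(doc_vector(["good", "food"]))   # [0.9, 0.1]
```

High-IDF (rare, discriminative) words dominate the document vector, which is the point of combining TF-IDF weighting with word embeddings.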

Pros:

  • Introduced neural-network-based Word2Vec and Doc2Vec models
  • Coupled word vectors with weighting strategies like IDF and TF-IDF

Cons:

  • The training and test sets were drawn from the same dataset, so the reported accuracy is not persuasive
  • Only logistic regression is used; more classifiers could be tried
  • Only bag of words is used as the baseline; more models could be tried, e.g. LDA

Reference:

Jiang, Suqi, et al. “Integrating rich document representations for text classification.” 2016 IEEE Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2016.