[20161107] Weekly Report

Added stemmer: feature size went from 26879 to 25054, accuracy from 0.77848 to 0.77830 (with the original Naive Bayes).

Final: positiveCount: 582, zeroCount: 23395, oneCount: 13912, negativeCount: 17036552

max: 1.0

min: -0.303092024177

Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896

Stemmed: (0.9, 1) -> 132, (0.8, 0.9) -> 1032, (0.7, 0.8) -> 5040, (0.6, 0.7) -> 41750

Non-stemmed: F1: 0.77848

Clusters in (0.9, 1): num of clusters: 27, num of features: 77 -> F1: 0.7806

Clusters in (0.8, 0.9): num of clusters: 524, num of features: 1211 -> F1: 0.7754

Clusters in (0.7, 0.8): num of clusters: 1844, num of features: 5216 -> F1: 0.7771

Clusters in (0.6, 0.7): num of clusters: 1587, num of features: 10224 -> F1: 0.7525

Stemmed: F1: 0.77830

Clusters in (0.9, 1): num of clusters: 17, num of features: 54 -> F1: 0.77748

Clusters in (0.8, 1): num of clusters: 316, num of features: 746 -> F1: 0.778366

Clusters in (0.7, 1): num of clusters: 1122, num of features: 2982 -> F1: 0.77680

Clusters in (0.6, 1): num of clusters: 1444, num of features: 7659 -> F1: 0.75831


Messing around with word2vec

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w' are close to each other if the words that tend to show up near w also tend to show up near w'. (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you'd expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine similarity (the dot product of the two length-normalized vectors). It's positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.
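As a sanity check, the same number can be recomputed by hand. A minimal sketch, assuming model is the gensim word2vec model already loaded in the session:

import numpy as np

# cosine similarity by hand: dot product of the two length-normalized vectors
v, w = model['tremendous'], model['enormous']
print(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))  # should match model.similarity above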

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically, but both words are likely to occur in contexts where we're talking about size, and using long, Latinate words. 'Negligible' is actually one of the 500 words closest to 'tremendous' in the whole 3M-word database.
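That last claim is easy to check with gensim's most_similar. A sketch, assuming the pretrained Google News vectors sit at the usual filename (the local path is an assumption):

from gensim.models import KeyedVectors

# load the pretrained 3M-word Google News vectors (path assumed)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# check whether 'negligible' is among the 500 nearest neighbors of 'tremendous'
neighbors = [word for word, sim in model.most_similar('tremendous', topn=500)]
print('negligible' in neighbors)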

[20161022] CS297 Weekly Report

Original F1-score: 0.7785

Use only >0.9 similarities: merged 77 features into 27 cluster-features, total from 26879 to 26829, F1-score: 0.7806

Use only [0.8, 0.9) similarities: merged 1211 features into 524 cluster-features, total from 26879 to 26192, F1-score: 0.7754

Use only [0.7, 0.8) similarities: merged 5216 features into 1844 cluster-features, total from 26879 to 23507, F1-score: 0.7771

Use only [0.6, 0.7) similarities: merged 10224 features into 1587 cluster-features, total from 26879 to 18242, F1-score: 0.7525

Use >=0.7 similarities: merged features from 26879 to 23081, F1-score: 0.7754

Investigation of the >0.9 similarity group

0 revenue revenues
1 astounding astonishing
2 eighth seventh ninth sixth fifth
3 north west east south
4 concerning regarding
5 kilometers km kms
6 benefitted benefited
7 5pm 6pm
8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st
9 southern northern
10 forty thirty twenty
11 wj vlb dca
12 descendents descendants
13 fourth third
14 jr sr
15 photos pictures
16 hundreds thousands tens
17 four seven five six three eight nine two
18 totally completely
19 predominantly predominately
20 incredible amazing
21 humankind mankind
22 disappeared vanished
23 forbids prohibits
24 northeast southeast southwest
25 disappear vanish
26 horrible terrible

By removing clusters 9, 17, and 20, we get the original 0.7785.

[20161015] CS297 Weekly Report

Topic: Use Word2Vec to select features

  1. Analysis of features
    1. [0.9, 1): 80 pairs, [0.8, 0.9): 1570 pairs, [0.7, 0.8): 8660 pairs, [0.6, 0.7): 55896 pairs (a counting sketch follows below)
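A sketch of how these pair counts can be produced, assuming features is the NB vocabulary filtered to words the model knows (quadratic in vocabulary size, so slow at ~27k features):

import itertools
from collections import Counter

# count feature pairs falling into each similarity band
bands = Counter()
for w1, w2 in itertools.combinations(features, 2):
    s = model.similarity(w1, w2)
    if s >= 0.9:
        bands['[0.9, 1)'] += 1
    elif s >= 0.8:
        bands['[0.8, 0.9)'] += 1
    elif s >= 0.7:
        bands['[0.7, 0.8)'] += 1
    elif s >= 0.6:
        bands['[0.6, 0.7)'] += 1
print(bands)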


add up all features that have high similarity (>0.9)

But similar features are intertwined.

-> Use graph search to identify all connected components.


-> Add up the weights in each group to become new cluster-features (a sketch of the graph-search step follows below).
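A minimal sketch of the graph-search step, assuming pairs is the list of feature-index pairs whose similarity clears the threshold (plain BFS; union-find would work just as well):

from collections import defaultdict, deque

# adjacency list over the high-similarity pairs
adj = defaultdict(set)
for i, j in pairs:
    adj[i].add(j)
    adj[j].add(i)

# BFS from each unvisited node to collect connected components;
# each component becomes one cluster-feature
clusters, seen = [], set()
for start in adj:
    if start in seen:
        continue
    comp, queue = [], deque([start])
    seen.add(start)
    while queue:
        node = queue.popleft()
        comp.append(node)
        for nbr in adj[node] - seen:
            seen.add(nbr)
            queue.append(nbr)
    clusters.append(comp)

# the weight of each new cluster-feature is the sum over its members,
# e.g. new_weight = sum(weight[i] for i in comp)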

 

 

 

[20161009] CS297 Weekly Report

Incorporate Word2Vec Similarity into Vectors

  • Corpus used in Word2Vec: Google News, 3M words
  • Weight_i_new = Weight_i_old + SUM_j{ W_j_old * Similarity(i, j) }
  • ConditionalProbability_i = (W_i_new + 1) / (SUM(W_0 + W_1 + ... + W_n) + Vocab_size) (see the numpy sketch below)
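In matrix form the update is a single multiply plus add-one smoothing. A minimal numpy sketch, assuming W is the (classes x vocabulary) weight matrix and S the (vocabulary x vocabulary) similarity matrix, both precomputed:

import numpy as np

# Weight_i_new = Weight_i_old + SUM_j W_j_old * Similarity(i, j), for all i at once
W_new = W + W @ S

# ConditionalProbability_i = (W_i_new + 1) / (sum of class weights + Vocab_size)
cond_prob = (W_new + 1) / (W_new.sum(axis=1, keepdims=True) + W.shape[1])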

Issues

  1. Speed issue when multiplying the similarity matrix into the feature vectors (2E6 x 2E6 for only 4 categories)
    1. numpy matrix multiplication
      • Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
    2. Cython compiles the code into C
  2. The calculated similarity sum deviates too much from the original vector weights
    1. Use a factor to shrink the added weight -> W1 = W1 + factor * Wnew
  3. Some feature similarities are negative -> filter them out


Analysis of the Similarity Matrix:

38% Positive; 54% Zero; 8% Negative

 

Max: 2057.21140249

Min: 0.0

Use a scale factor to tune the vectors:

vector = W + scale_factor * W_similarity
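Putting the fixes from the issue list together (a scale factor, plus filtering out negative similarities), the tuned update is roughly:

import numpy as np

scale_factor = 0.1                      # tuning knob; the value here is an assumption
S_pos = np.where(S > 0, S, 0.0)         # filter out negative similarities
W_new = W + scale_factor * (W @ S_pos)  # vector = W + scale_factor * W_similarity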

Result

[results screenshot not preserved]

[20161002] Weekly Report

  1. Finished the naive Bayes classifier implementation with count-based and TF-IDF-based vectors.
  2. The accuracy is the same as the NB in sklearn (see the sklearn sketch below).
  3. The speed is far slower than the NB in sklearn.

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)
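For reference, the sklearn side of the comparison can be reproduced roughly as follows; the custom classifier is then swapped in for MultinomialNB (the macro-averaged F1 is an assumption):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)  # (2034, n_features)
X_test = vec.transform(test.data)        # (1353, n_features)

clf = MultinomialNB().fit(X_train, train.target)
print(f1_score(test.target, clf.predict(X_test), average='macro'))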

Customized vs. original (sklearn), for count and TF-IDF vectors: [timing/accuracy screenshots not preserved]

 

[20160926] CS297 Weekly Report

  1. Get the 20 Newsgroups dataset info
  2. Use sklearn to vectorize the text data with TF-IDF
  3. Modify the conditional probability
  4. Build MNB

Build MNB using add-one (Laplace) smoothing (Equation 119 below).

A working example is shown below:

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2 ):

\begin{displaymath} \hat{P}(t\vert c) = \frac{T_{ct}+1}{\sum_{t' \in V}(T_{ct'}+1)} = \frac{T_{ct}+1}{(\sum_{t' \in V} T_{ct'})+B} \end{displaymath} (119)

where $B=\vert V\vert$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation 116 on the document level.

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2 .

Table 13.1: Data for parameter estimation examples.

              docID  words in document                    in c = China?
training set  1      Chinese Beijing Chinese              yes
              2      Chinese Chinese Shanghai             yes
              3      Chinese Macao                        yes
              4      Tokyo Japan Chinese                  no
test set      5      Chinese Chinese Chinese Tokyo Japan  ?

 

Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

\begin{eqnarray*} \hat{P}(\term{Chinese}\vert c) &=& (5+1)/(8+6) = 6/14 = 3/7 \\ \hat{P}(\term{Tokyo}\vert c) = \hat{P}(\term{Japan}\vert c) &=& (0+1)/(8+6) = 1/14 \\ \hat{P}(\term{Chinese}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \\ \hat{P}(\term{Tokyo}\vert\overline{c}) = \hat{P}(\term{Japan}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \end{eqnarray*}

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

\begin{eqnarray*} \hat{P}(c\vert d_5) &\propto& 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\ \hat{P}(\overline{c}\vert d_5) &\propto& 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001 \end{eqnarray*}

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
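The worked example translates almost line for line into Python. A minimal sketch of the multinomial NB decision for d_5:

from collections import Counter

train = [('Chinese Beijing Chinese', 'yes'),
         ('Chinese Chinese Shanghai', 'yes'),
         ('Chinese Macao', 'yes'),
         ('Tokyo Japan Chinese', 'no')]
test_doc = 'Chinese Chinese Chinese Tokyo Japan'.split()

vocab = {w for doc, _ in train for w in doc.split()}
B = len(vocab)  # six terms in the vocabulary

for c in ('yes', 'no'):
    class_docs = [doc.split() for doc, label in train if label == c]
    prior = len(class_docs) / len(train)        # 3/4 for c, 1/4 for not-c
    counts = Counter(w for doc in class_docs for w in doc)
    total = sum(counts.values())                # 8 for c, 3 for not-c
    score = prior
    for w in test_doc:
        score *= (counts[w] + 1) / (total + B)  # add-one smoothing, Equation (119)
    print(c, score)                             # yes ~ 0.0003, no ~ 0.0001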

[20160910] Weekly Report of CS297

Proposed Project Name

Categorize Text with Naive Bayes and word2vec word embedding

Literature Review

[paper screenshots not preserved]

DataSet

Yelp reviews -> predict the business category from the review text

pros: easy to get the dataset

cons: the business category seems too obvious

Twitter threads -> predict the category from the thread content

pros: category prediction makes more sense here

cons: hard to get a labeled dataset

Implementation Plan

  1. extract yelp dataset
    1. visualize dataset
    2. explore the number of categories
  2. Use naive Bayes classifier from python package to classify business
  3. implement Naive Bayes Classifier using python
  4. implement word2vec enhanced classifier
  5. prototype classification by category with classic Naive Bayes Classifier