[20161107] Weekly Report

Added stemmer: feature size from 26879 to 25054, accuracy from 0.77848 to 0.77830 (with the original Naive Bayes).
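For reference, a minimal sketch of how the stemming step could collapse the vocabulary; the use of NLTK's PorterStemmer here is an assumption, the report does not name the stemmer:

from collections import defaultdict

from nltk.stem import PorterStemmer  # assumed; the actual stemmer used may differ

stemmer = PorterStemmer()

def stem_vocabulary(vocabulary):
    # Group original features (words) by their stem. Words sharing a stem
    # (e.g. 'revenue' / 'revenues') collapse into one feature, which is how
    # the feature size can drop from 26879 to 25054.
    groups = defaultdict(list)
    for word in vocabulary:
        groups[stemmer.stem(word)].append(word)
    return groups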

Final: positiveCount: 582, zeroCount: 23395, oneCount: 13912, negativeCount: 17036552

max: 1.0

min: -0.303092024177

Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896

Stemmed: (0.9, 1) -> 132, (0.8, 0.9) -> 1032, (0.7, 0.8) -> 5040, (0.6, 0.7) -> 41750
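A minimal sketch of how these bucket counts can be computed from the pairwise similarity matrix (variable names are illustrative, not from the report):

import numpy as np

def count_pairs_by_bucket(sim_matrix, buckets=((0.9, 1.0), (0.8, 0.9), (0.7, 0.8), (0.6, 0.7))):
    # sim_matrix: (n_features, n_features) symmetric Word2Vec similarity matrix.
    # Count each unordered feature pair once by looking only above the diagonal.
    upper = sim_matrix[np.triu_indices_from(sim_matrix, k=1)]
    return {(lo, hi): int(np.sum((upper >= lo) & (upper < hi))) for lo, hi in buckets}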

Non-stemmed: —> F1: 0.77848

Clusters-in-(0.9-1): Num of clusters: 27, num of features: 77 —> F1: 0.7806

Clusters-in-(0.8-0.9): Num of clusters: 524, num of features: 1211 —> F1: 0.7754

Clusters-in-(0.7-0.8): Num of clusters: 1844, num of features: 5216 —> F1: 0.7771

Clusters-in-(0.6-0.7): Num of clusters: 1587, num of features: 10224 —> F1: 0.7525

Stemmed: —> F1: 0.77830

Clusters-in-(0.9-1): Num of clusters: 17, num of features: 54 —> F1: 0.77748

Clusters-in-(0.8-1): Num of clusters: 316, num of features: 746 —> F1: 0.778366

Clusters-in-(0.7-1): Num of clusters: 1122, num of features: 2982 —> F1: 0.77680

Clusters-in-(0.6-1): Num of clusters: 1444, num of features: 7659 —> F1: 0.75831


https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’. (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors). It’s positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words.  ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.
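For context, the similarity numbers above come from the pretrained Google News vectors; a minimal loading sketch with gensim (the exact loading call and the file path are assumptions):

from gensim.models import KeyedVectors

# Pretrained Google News vectors: 3M words, 300 dimensions (path is an assumption).
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(model.similarity('tremendous', 'enormous'))    # ~0.74, near-synonyms
print(model.similarity('tremendous', 'negligible'))  # ~0.38, similar contexts, opposite meaning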

[20161022] CS297 Weekly Report

Original F1-score: 0.7785

Use only > 0.9 similarities: merged 77 features into 27 cluster-features, total from 26879 to 26829, F1-score: 0.7806

Use only [0.8, 0.9) similarities: merged 1211 features into 524 cluster-features, total from 26879 to 26192, F1-score: 0.7754

Use only [0.7, 0.8) similarities: merged 5216 features into 1844 cluster-features, total from 26879 to 23507, F1-score: 0.7771

Use only [0.6, 0.7) similarities: merged 10224 features into 1587 cluster-features, total from 26879 to 18242, F1-score: 0.7525

Use >= 0.7 similarities: merged features from 26879 to 23081, F1-score: 0.7754

Investigation of the > 0.9 similarity groups:

0 revenue revenues
1 astounding astonishing
2 eighth seventh ninth sixth fifth
3 north west east south
4 concerning regarding
5 kilometers km kms
6 benefitted benefited
7 5pm 6pm
8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st
9 southern northern
10 forty thirty twenty
11 wj vlb dca
12 descendents descendants
13 fourth third
14 jr sr
15 photos pictures
16 hundreds thousands tens
17 four seven five six three eight nine two
18 totally completely
19 predominantly predominately
20 incredible amazing
21 humankind mankind
22 disappeared vanished
23 forbids prohibits
24 northeast southeast southwest
25 disappear vanish
26 horrible terrible

By removing clusters 9, 17, and 20, we get back the original 0.7785.

[20161015] CS297 Weekly Report

Topic: Use Word2Vec to select features

  1. Analysis of features
    1. [0.9, 1): 80 pairs, [0.8, 0.9): 1570 pairs, [0.7, 0.8): 8660 pairs, [0.6, 0.7): 55896 pairs


Add up all features that have high similarity (> 0.9).

But similar features are intertwined.

—> Use graph search to identify all connected components:


—> Add up the weights for each group to become new features.
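A minimal sketch of this grouping step: treat every feature pair with similarity above the threshold as an edge and collect connected components (the actual implementation may differ, e.g. it could use union-find or networkx instead of this DFS):

from collections import defaultdict

def connected_components(similar_pairs):
    # similar_pairs: list of (i, j) feature-index pairs whose similarity is above
    # the chosen threshold (e.g. > 0.9). Returns a list of clusters of indices.
    adjacency = defaultdict(set)
    for i, j in similar_pairs:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], []
        while stack:                          # iterative DFS over the similarity graph
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adjacency[node] - seen)
        clusters.append(component)
    return clusters

# The weights of the features in each cluster are then summed into one new
# cluster-feature, e.g. new_weight = sum(old_weight[i] for i in cluster).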

 

 

 

[20161009] CS297 Weekly Report

Incorporate Word2Vec Similarity into Vectors

  • Corpus used in Word2Vec: Google News, 3M words
  • Weight_i_new = Weight_i_old + SUM{ W0_old * Similarity(i, 0) + W1_old * Similarity(i, 1) + … + Wn_old * Similarity(i, n) } (see the sketch below)
  • ConditionalProbability_i = (Wi_new + 1) / (SUM(W0 + W1 + … + Wn) + Vocab_size)
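A minimal numpy sketch of these two formulas for a single class (names like old_weights and sim are illustrative; the report leaves open whether the denominator sums the old or the new weights, so the old weights are used here to follow the formula as written):

import numpy as np

def similarity_smoothed_cond_prob(old_weights, sim, vocab_size):
    # old_weights: (n_features,) per-class feature weights W_i_old
    # sim:         (n_features, n_features) Word2Vec similarity matrix
    # Weight_i_new = Weight_i_old + SUM_j W_j_old * Similarity(i, j)
    new_weights = old_weights + sim @ old_weights
    # ConditionalProbability_i = (W_i_new + 1) / (SUM(W0 + ... + Wn) + Vocab_size)
    return (new_weights + 1.0) / (old_weights.sum() + vocab_size)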

Issues

  1. Speed issue when multiplying the similarity matrix with the feature vectors (2E6 x 2E6 for only 4 categories); see the sketch after this list
    1. numpy matrix multiplication
      • Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
    2. Cython compiles into C code
  2. The calculated similarity sum deviates too much from the original vector weights
    1. Use a factor to decrease the weight —> W1 = W1 + factor * Wnew
  3. The similarity of some features becomes negative —> filter them out
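A minimal sketch that combines the three fixes above: one numpy matrix multiplication, a damping factor, and dropping negative similarities (the variable names and the 0.1 default are illustrative assumptions):

import numpy as np

def tune_weights(old_weight_matrix, sim_matrix, factor=0.1):
    # old_weight_matrix: (n_classes, n_features) per-class feature weights
    # sim_matrix:        (n_features, n_features) Word2Vec similarity matrix
    sim = np.clip(sim_matrix, 0.0, None)        # issue 3: filter out negative similarities
    w_sim = old_weight_matrix @ sim             # issue 1: a single numpy matrix multiplication
    return old_weight_matrix + factor * w_sim   # issue 2: damp the similarity contribution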


Analysis of the similarity matrix:

38% Positive; 54% Zero; 8% Negative

 

Max: 2057.21140249

Min: 0.0

Use a scale factor to tune the vectors:

vector = W + scale_factor * Wsimilarity

Result: (screenshot)

[20161002] Weekly Report

  1. Completed the Naive Bayes classifier implementation with count-based and TF-IDF-based vectors.
  2. The accuracy is the same as the NB in sklearn.
  3. The speed is far slower than the NB in sklearn.

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)
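For comparison, a minimal sketch of the sklearn baseline on the same four 20 Newsgroups categories (vectorizer and scoring settings are assumptions and may not match the customized setup exactly):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)  # 2034 documents
test = fetch_20newsgroups(subset='test', categories=categories)    # 1353 documents

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)
    predictions = MultinomialNB().fit(X_train, train.target).predict(X_test)
    print(type(vectorizer).__name__, f1_score(test.target, predictions, average='macro'))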

Customized:

Count: (screenshot)

TF-IDF: (screenshot)

Original:

Count: (screenshot)

TF-IDF: (screenshot)