# Weekly Report CS297

# [20161107] Weekly Report

# [20161022]CS297 Weekly Report

# [20161015]CS297 Weekly Report

# [20161009]CS297 Weekly Report

# [20161002]Weekly Report

1. Use similarity in (0.7, 1) to form clusters

2. Cut all clusters that have more than 20 features

3. Use the merged result only on articles with word count > 80

The original result is: 0.7783

Confusion Matrix for original NB:

for NB with similarity in (0.8, 1)

for NB with similarity in (0.7, 1)

Added stemmer (with the original Naive Bayes): feature size goes from 26879 to 25054, accuracy from 0.77848 to 0.77830

Final: positiveCount: 582 zeroCount: 23395 oneCount: 13912 negativeCount: 17036552

max: 1.0

min: -0.303092024177

*Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896*

Stemmed: (0.9, 1) -> **132**, (0.8, 0.9) -> **1032**, (0.7, 0.8) -> **5040**, (0.6, 0.7) -> **41750**

Non-stemmed: -> **F1: 0.77848**

Clusters-in-(0.9-1): Num of clusters: **27**, num of features: **77** -> **F1: 0.7806**

Clusters-in-(0.8-0.9): Num of clusters: **524**, num of features: **1211** -> **F1: 0.7754**

Clusters-in-(0.7-0.8): Num of clusters: **1844**, num of features: **5216** -> **F1: 0.7771**

Clusters-in-(0.6-0.7): Num of clusters: **1587**, num of features: **10224** -> **F1: 0.7525**

Stemmed: -> **F1: 0.77830**

Clusters-in-(0.9-1): Num of clusters: **17**, num of features: **54** -> **F1: 0.77748**

Clusters-in-(0.8-1): Num of clusters: **316**, num of features: **746** -> **F1: 0.778366**

Clusters-in-(0.7-1): Num of clusters: **1122**, num of features: **2982** -> **F1: 0.77680**

Clusters-in-(0.6-1): Num of clusters: **1444**, num of features: **7659** -> **F1: 0.75831**

https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’. (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors.) It’s positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words. ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.
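The similarity discussed in the quoted post is the cosine of the angle between two word vectors, that is, their dot product after both are normalized to unit length. A minimal sketch in plain Python follows; `cosine_similarity` is a hypothetical helper written for illustration (not a gensim function), and the three-dimensional vectors are invented toy values, not real Word2Vec embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1, which matches the positive / near-zero / negative reading in the quote.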

Original **F1-score**: **0.7785**

Use only >0.9 similarities: **merged 77 features into 27 cluster-features**, total from 26879 to 26829, **F1-score**: **0.7806**

Use only [0.8, 0.9) similarities: **merged 1211 features into 524 cluster-features**, total from 26879 to 26192, **F1-score**: **0.7754**

Use only [0.7, 0.8) similarities: **merged 5216 features into 1844 cluster-features**, total from 26879 to 23507, **F1-score**: **0.7771**

Use only [0.6, 0.7) similarities: **merged 10224 features into 1587 cluster-features**, total from 26879 to 18242, **F1-score**: **0.7525**

Use >=0.7 similarities: **merged features from 26879 to 23081**, **F1-score**: **0.7754**

Investigation of the 0.9 similarity group

0 revenue revenues

1 astounding astonishing

2 eighth seventh ninth sixth fifth

3 north west east south

4 concerning regarding

5 kilometers km kms

6 benefitted benefited

7 5pm 6pm

8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st

9 southern northern

10 forty thirty twenty

11 wj vlb dca

12 descendents descendants

13 fourth third

14 jr sr

15 photos pictures

16 hundreds thousands tens

17 four seven five six three eight nine two

18 totally completely

19 predominantly predominately

20 incredible amazing

21 humankind mankind

22 disappeared vanished

23 forbids prohibits

24 northeast southeast southwest

25 disappear vanish

26 horrible terrible

By removing clusters 9, 17, and 20, we get back the original F1-score of 0.7785

Topic: Use Word2Vec to select features

- Analysis of features
  - [0.9, 1): **80** pairs, [0.8, 0.9): **1570** pairs, [0.7, 0.8): **8660** pairs, [0.6, 0.7): **55896** pairs

- [0.9, 1)

Add up all features that have high similarity (>0.9)

But similar features are intertwined

-> Use graph search to identify all connected components

-> Add up the weights of each group; the sums become the new features
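The two steps above can be sketched as follows. This is an illustrative sketch, not the code used for the reported results: `connected_components` and `merge_weights` are hypothetical helper names, the words come from the 0.9 group listed earlier, and the pairs and counts are made up.

```python
from collections import defaultdict

def connected_components(pairs):
    """Group words linked by any chain of high-similarity pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited word.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(group))
    return components

def merge_weights(weights, components):
    """Replace each component's features by one feature holding their summed weight."""
    merged = dict(weights)
    for group in components:
        total = sum(merged.pop(word, 0) for word in group)
        merged["+".join(group)] = total
    return merged

# Made-up similar pairs and counts for illustration.
pairs = [("north", "west"), ("west", "east"), ("east", "south"), ("photos", "pictures")]
weights = {"north": 3, "west": 1, "east": 2, "south": 4,
           "photos": 5, "pictures": 1, "revenue": 7}
clusters = connected_components(pairs)
merged = merge_weights(weights, clusters)
```

Here "north" and "south" are never paired directly, yet they end up in the same cluster because a chain of pairs connects them. That is the reason for a graph search rather than merging pair by pair.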

**Incorporate Word2Vec Similarity into Vectors**

- Corpus used in Word2Vec: Google News, 3M words
- Weight_i_new = Weight_i_old + SUM{W0_old * Similarity(i,0) + W1_old * Similarity(i,1) + … + Wn_old * Similarity(i,n)}
- ConditionalProbability_i = (Wi_new + 1) / (SUM(W0 + W1 + … + Wn) + Vocab_size)
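A small worked example of the two formulas above, in plain Python. Two points are assumptions rather than something the notes state: the diagonal Similarity(i, i) is set to 0 so a feature's own weight is not counted twice, and the sum in the denominator runs over the new weights. The function names and numbers are hypothetical.

```python
def expand_weights(old_weights, similarity):
    """Weight_i_new = Weight_i_old + sum_j W_j_old * Similarity(i, j)."""
    n = len(old_weights)
    return [
        old_weights[i] + sum(old_weights[j] * similarity[i][j] for j in range(n))
        for i in range(n)
    ]

def conditional_probabilities(new_weights):
    """Add-one smoothed estimate: (W_i_new + 1) / (sum of weights + vocabulary size)."""
    vocab_size = len(new_weights)
    total = sum(new_weights)  # assumption: the sum runs over the new weights
    return [(w + 1) / (total + vocab_size) for w in new_weights]

# Toy numbers. Assumption: Similarity(i, i) is 0 on the diagonal.
old = [2.0, 0.0, 1.0]
sim = [
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
new = expand_weights(old, sim)
probs = conditional_probabilities(new)
```

Feature 1 never occurs (old weight 0) but picks up 2.0 * 0.9 = 1.8 from its similar neighbour, so its smoothed probability ends up well above that of a plain unseen word. Under the stated assumption about the denominator, the probabilities still sum to 1.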

**Issue**

- Speed issue when multiplying the similarity matrix with the feature vector (2E6 x 2E6 for only 4 categories)
  - numpy matrix multiplication: Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
  - Cython compiles into C code
- The calculated similarity sum deviates too much from the original vector weight
  - use a factor to decrease the weight -> W1 = W1 + factor * Wnew
- Similarity of features becomes negative -> filter out
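The three fixes listed above can be combined in a few lines of numpy. This is a sketch under stated assumptions: "filter out" is read as setting negative similarities to zero, the factor value 0.5 is arbitrary, and `add_similarity_weight` is a hypothetical name; the matrices are toy values.

```python
import numpy as np

def add_similarity_weight(old_weight, similarity, factor=0.1):
    """Return W + factor * (W @ S) with negative similarities filtered out."""
    # Assumption: "filter out" means replacing negative similarities by zero.
    filtered = np.where(similarity > 0, similarity, 0.0)
    # One matrix product replaces the per-feature Python loop.
    w_new = old_weight @ filtered
    # Damp the similarity term so it does not swamp the original weights.
    return old_weight + factor * w_new

# Rows are categories, columns are features (toy values).
W = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0]])
S = np.array([[0.0, 0.8, -0.2],
              [0.8, 0.0, 0.1],
              [-0.2, 0.1, 0.0]])
result = add_similarity_weight(W, S, factor=0.5)
```

For a vocabulary-sized similarity matrix that is mostly zeros, a sparse representation would likely matter more for speed than the choice between numpy and Cython, but that is outside what these notes tested.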

Analysis on Similarity Matrix:

38% Positive; 54% Zero; 8% Negative

Max: 2057.21140249

Min: 0.0

Use scale factor to tune the vectors:

vector = W + scale_factor * Wsimilarity

Result

- Implemented a naive Bayes classifier with count-based and TF-IDF-based vectors.
- The accuracy is the same as that of the NB in sklearn
- The speed is far slower than the NB in sklearn
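For reference, a count-based multinomial naive Bayes classifier fits in a few dozen lines of plain Python. The sketch below is not the implementation reported here. It is a minimal illustration with add-one smoothing, a hypothetical class name, whitespace tokenization, and an invented toy corpus that borrows two of the category names.

```python
import math
from collections import Counter, defaultdict

class CountNaiveBayes:
    """Multinomial naive Bayes on raw word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of log conditional probabilities.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy corpus, invented for illustration.
train_docs = ["rocket launch orbit", "orbit satellite launch",
              "pixel render image", "image shader pixel"]
train_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]
model = CountNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("launch a satellite into orbit")
```

A pure-Python loop over dictionaries like this is expected to be much slower than a vectorized library implementation, which is consistent with the speed gap noted above.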

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)

Customized (Count, TF-IDF) and Original (Count, TF-IDF): result figures not preserved.