Weekly Report CS297

1. Use similarity in (0.7, 1) to form clusters.

2. Cut all clusters that have more than 20 features.

3. Apply the merged features only to articles with word count > 80 (all three steps are sketched below).
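A minimal sketch of the three steps, assuming a trained gensim word2vec model ("model") and the classifier's feature list ("vocab"); the union-find clustering and the quadratic pairwise scan are my own illustration, not necessarily the code used here:

import itertools
from collections import defaultdict

def build_clusters(model, vocab, lo=0.7, hi=1.0, max_size=20):
    # Step 1: connect any two features whose cosine similarity falls in
    # (lo, hi); clusters are the connected components (union-find).
    parent = {w: w for w in vocab}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for w1, w2 in itertools.combinations(vocab, 2):  # quadratic; a sketch only
        if lo < model.similarity(w1, w2) < hi:
            parent[find(w1)] = find(w2)

    groups = defaultdict(list)
    for w in vocab:
        groups[find(w)].append(w)

    # Step 2: cut clusters that have more than max_size features.
    return [c for c in groups.values() if 1 < len(c) <= max_size]

def cluster_map(clusters):
    # Map every word in a kept cluster to one representative feature.
    return {w: c[0] for c in clusters for w in c}

def merge_features(tokens, word_to_rep):
    # Step 3: only articles with word count > 80 get their features merged.
    if len(tokens) <= 80:
        return tokens
    return [word_to_rep.get(t, t) for t in tokens]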


The original result is 0.7783.


[20161107] Weekly Report

Added stemmer (with the original Naive Bayes classifier): feature size dropped from 26879 to 25054; accuracy went from 0.77848 to 0.77830.
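A minimal sketch of the stemming step, assuming NLTK's PorterStemmer (the report doesn't name the stemmer); collapsing inflected forms under one stem is what shrinks the feature set from 26879 to 25054:

from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_features(counts):
    # Merge the counts of features that share a stem,
    # e.g. 'tremendous'/'tremendously' collapse into one feature.
    stemmed = Counter()
    for word, n in counts.items():
        stemmed[stemmer.stem(word)] += n
    return stemmed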

Final: positiveCount: 582, zeroCount: 23395, oneCount: 13912, negativeCount: 17036552

max: 1.0

min: -0.303092024177

Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896

Stemmed: (0.9, 1) -> 132, (0.8, 0.9) -> 1032, (0.7, 0.8) -> 5040, (0.6, 0.7) -> 41750
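The bucket counts, max, and min above could be gathered in one pairwise scan; a sketch, reusing the assumed model and vocab from before:

import itertools
from collections import Counter

def bucket_similarities(model, vocab):
    buckets = Counter()
    s_max, s_min = float('-inf'), float('inf')
    for w1, w2 in itertools.combinations(vocab, 2):
        s = model.similarity(w1, w2)
        s_max, s_min = max(s_max, s), min(s_min, s)  # e.g. 1.0 and -0.3031
        if s >= 0.9:
            buckets['(0.9, 1)'] += 1
        elif s >= 0.8:
            buckets['(0.8, 0.9)'] += 1
        elif s >= 0.7:
            buckets['(0.7, 0.8)'] += 1
        elif s >= 0.6:
            buckets['(0.6, 0.7)'] += 1
    return buckets, s_max, s_min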

Non-stemmed -> F1: 0.77848

Clusters in (0.9, 1): num of clusters: 27, num of features: 77 -> F1: 0.7806

Clusters in (0.8, 0.9): num of clusters: 524, num of features: 1211 -> F1: 0.7754

Clusters in (0.7, 0.8): num of clusters: 1844, num of features: 5216 -> F1: 0.7771

Clusters in (0.6, 0.7): num of clusters: 1587, num of features: 10224 -> F1: 0.7525

Stemmed -> F1: 0.77830

Clusters in (0.9, 1): num of clusters: 17, num of features: 54 -> F1: 0.77748

Clusters in (0.8, 1): num of clusters: 316, num of features: 746 -> F1: 0.778366

Clusters in (0.7, 1): num of clusters: 1122, num of features: 2982 -> F1: 0.77680

Clusters in (0.6, 1): num of clusters: 1444, num of features: 7659 -> F1: 0.75831
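For context, a sketch of how each row above could be evaluated, assuming scikit-learn's MultinomialNB stands in for the Naive Bayes used here and word_to_rep maps every clustered word to one representative feature (all names are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

def evaluate(train_docs, y_train, test_docs, y_test, word_to_rep):
    # Rewrite each document so every clustered word becomes its
    # cluster representative, then train and score Naive Bayes as usual.
    def remap(doc):
        return ' '.join(word_to_rep.get(t, t) for t in doc.split())

    vec = CountVectorizer()
    X_train = vec.fit_transform(remap(d) for d in train_docs)
    X_test = vec.transform(remap(d) for d in test_docs)

    clf = MultinomialNB().fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test))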


Quoted from https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/:

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w' are close to each other if the words that tend to show up near w also tend to show up near w'. (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you'd expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine similarity (which is to say, the dot product of the vectors after normalizing each to unit length). It's positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.
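A quick sanity check (assuming numpy and the same gensim model) that the score really is the dot product of the unit-normalized vectors:

import numpy as np

v1, v2 = model['tremendous'], model['enormous']
# Cosine similarity: dot product after normalizing each vector to length 1.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# ~0.7443, matching model.similarity('tremendous', 'enormous')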

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we're talking about size, and using long, Latinate words. 'Negligible' is actually one of the 500 words closest to 'tremendous' in the whole 3-million-word database.
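That claim is easy to check with gensim's most_similar, which ranks the vocabulary by cosine similarity (assuming the same model):

# Top 500 nearest neighbors of 'tremendous' by cosine similarity.
neighbors = [w for w, _ in model.most_similar('tremendous', topn=500)]
print('negligible' in neighbors)  # True, per the post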