[20161022]CS297 Weekly Report

Original F1-score: 0.7785

Use only >0.9 similarities: merged 77 features into 27 cluster-features, total from 26879 to 26829, F1-score: 0.7806

Use only [0.8, 0.9) similarities: merged 1211 features into 524 cluster-features, total from 26879 to 26192, F1-score: 0.7754

Use only [0.7, 0.8) similarities: merged 5216 features into 1844 cluster-features, total from 26879 to 23507, F1-score: 0.7771

Use only [0.6, 0.7) similarities: merged 10224 features into 1587 cluster-features, total from 26879 to 18242, F1-score: 0.7525

Use >=0.7 similarities: merged features, total from 26879 to 23081, F1-score: 0.7754
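A minimal sketch of how the per-band experiments could be set up: keep only the pairs whose similarity falls in the chosen band before merging. The `pairs` list and the helper name are illustrative, not the report's actual code.

```python
def pairs_in_band(pairs, lo, hi):
    """Keep only the feature pairs whose similarity s satisfies lo <= s < hi."""
    return [(a, b, s) for (a, b, s) in pairs if lo <= s < hi]

# Toy input: (feature_a, feature_b, word2vec cosine similarity)
pairs = [("revenue", "revenues", 0.94), ("photos", "pictures", 0.91),
         ("forty", "thirty", 0.83)]
print(pairs_in_band(pairs, 0.9, 1.0))  # the two >0.9 pairs survive
print(pairs_in_band(pairs, 0.8, 0.9))  # only the [0.8, 0.9) pair
```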

Investigation of the >0.9 similarity groups:

0 revenue revenues
1 astounding astonishing
2 eighth seventh ninth sixth fifth
3 north west east south
4 concerning regarding
5 kilometers km kms
6 benefitted benefited
7 5pm 6pm
8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st
9 southern northern
10 forty thirty twenty
11 wj vlb dca
12 descendents descendants
13 fourth third
14 jr sr
15 photos pictures
16 hundreds thousands tens
17 four seven five six three eight nine two
18 totally completely
19 predominantly predominately
20 incredible amazing
21 humankind mankind
22 disappeared vanished
23 forbids prohibits
24 northeast southeast southwest
25 disappear vanish
26 horrible terrible

By removing groups 9, 17, and 20 from the merge, we get back the original F1-score of 0.7785.
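A hypothetical ablation helper for this finding: exclude the groups whose merging hurt the score before building cluster-features. The `clusters` dict is a stand-in for the report's actual group data.

```python
EXCLUDED_GROUPS = {9, 17, 20}  # groups whose removal restores the original score

def filtered_clusters(clusters):
    """Drop the excluded groups; the rest are merged as before."""
    return {gid: feats for gid, feats in clusters.items()
            if gid not in EXCLUDED_GROUPS}

clusters = {0: ["revenue", "revenues"],
            9: ["southern", "northern"],
            20: ["incredible", "amazing"]}
print(filtered_clusters(clusters))  # only group 0 remains
```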


[20161015]CS297 Weekly Report

Topic: Use Word2Vec to select features

  1. Analysis of features
    1. [0.9, 1): 80 pairs; [0.8, 0.9): 1570 pairs; [0.7, 0.8): 8660 pairs; [0.6, 0.7): 55896 pairs (counted as sketched below)
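A sketch of how the band counts could be produced with numpy's histogram; `sims` is a placeholder for the real array of pairwise similarities.

```python
import numpy as np

sims = np.random.rand(100000)          # placeholder for the real pair similarities
edges = [0.6, 0.7, 0.8, 0.9, 1.0]
counts, _ = np.histogram(sims, bins=edges)
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo}, {hi}): {n} pairs")
```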


Add up all features that have high similarity (>0.9).

But similar features are intertwined (one feature can appear in several similarity pairs).

-> Use graph search to identify all connected components.


-> Add up the weights within each group to form new cluster-features (sketched below).
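A minimal sketch of both steps with networkx, assuming `pairs` holds the >0.9 similarity pairs and `weights` the old feature weights (toy values here):

```python
import networkx as nx

pairs = [("disappear", "vanish"), ("vanish", "vanished"),  # intertwined chain
         ("photos", "pictures")]
weights = {"disappear": 3.0, "vanish": 1.0, "vanished": 2.0,
           "photos": 5.0, "pictures": 4.0}

g = nx.Graph()
g.add_edges_from(pairs)                   # one edge per similar pair
for group in nx.connected_components(g):  # each component -> one cluster-feature
    print(sorted(group), sum(weights[w] for w in group))
```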


[20161009]CS297 Weekly Report

Incorporate Word2Vec Similarity into Vectors

  • Corpus used in Word2Vec: Google News, 3M words
  • W_i_new = W_i_old + Σ_j (W_j_old × Similarity(i, j))
  • ConditionalProbability_i = (W_i_new + 1) / (Σ_j W_j_new + Vocab_size) (see the sketch below)
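A toy numpy version of the two formulas above, assuming `W` is one class's feature-weight vector and `S` the similarity matrix with Similarity(i, i) set to 0 so a word does not add itself twice; all values are illustrative.

```python
import numpy as np

W = np.array([3.0, 1.0, 2.0])        # old per-feature weights
S = np.array([[0.0, 0.9, 0.0],       # S[i, j] = Similarity(i, j)
              [0.9, 0.0, 0.7],
              [0.0, 0.7, 0.0]])

W_new = W + S.dot(W)                 # W_i_new = W_i_old + Σ_j W_j_old·Similarity(i, j)
vocab_size = len(W)
cond_prob = (W_new + 1) / (W_new.sum() + vocab_size)  # Laplace-smoothed
print(cond_prob)
```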

Issues

  1. Speed issue of multiplying the similarity matrix by the feature vectors (2E6 x 2E6, for only 4 categories); see the sketch after this list
    1. numpy matrix multiplication
      • Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
    2. Cython compiles into C code
  2. The calculated similarity sum deviates too much from the original vector weights
    1. Use a factor to damp the added weight -> W1 = W1 + factor * Wnew
  3. Similarity of some features becomes negative -> filter them out
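A sketch of the three fixes in numpy: one matrix product instead of per-feature loops, negative similarities zeroed out first, and a scale factor damping the added mass. The sizes and the 0.1 factor are illustrative, not the report's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((1000, 1000))           # stand-in similarity matrix
np.clip(S, 0.0, None, out=S)                    # fix 3: filter out negative similarities

W_old = np.abs(rng.standard_normal((4, 1000)))  # 4 categories x vocabulary weights
scale_factor = 0.1                              # fix 2: damp the similarity mass
W_sim = W_old @ S                               # fix 1: Matrix_old_weight * Matrix_Similarity
W_new = W_old + scale_factor * W_sim            # W1 = W1 + factor * Wnew
```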


Analysis of the Similarity Matrix:

38% Positive; 54% Zero; 8% Negative

Max: 2057.21140249

Min: 0.0
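These statistics can be read straight off the summed-similarity matrix; `M` below is a small stand-in.

```python
import numpy as np

M = np.array([[2.5, 0.0, -0.3],
              [0.0, 1.2, 0.0],
              [0.7, -0.1, 0.0]])
total = M.size
print("Positive:", (M > 0).sum() / total)
print("Zero:    ", (M == 0).sum() / total)
print("Negative:", (M < 0).sum() / total)
print("Max:", M.max(), "Min:", M.min())
```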

Use scale factor to tune the vectors:

vector = W + scale_factor * Wsimilarity

Result

[screenshot: results of scale-factor tuning]

[20161002]Weekly Report

  1. Implemented a naive Bayes classifier with count-based and TF-IDF-based vectors.
  2. Its accuracy matches the NB in sklearn.
  3. Its speed is far slower than the NB in sklearn.

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)
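For reference, the sklearn side of the comparison can be reproduced roughly like this (the custom classifier itself is not shown); only standard sklearn APIs are used.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

for vec in (CountVectorizer(), TfidfVectorizer()):
    X_train = vec.fit_transform(train.data)   # (2034, vocab)
    X_test = vec.transform(test.data)         # (1353, vocab)
    pred = MultinomialNB().fit(X_train, train.target).predict(X_test)
    print(type(vec).__name__, f1_score(test.target, pred, average='macro'))
```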

Customized:

Count: [screenshot]

TF-IDF: [screenshot]

Original (sklearn):

Count: [screenshot]

TF-IDF: [screenshot]