Weekly Report CS297

November 22, 2016 (updated November 23, 2016) ~ gelihao609

1. use similarity in (0.7, 1) to form clusters

2. cut all clusters that have more than 20 features

3. use the merged result only on articles with word count > 80

The original result is 0.7783

Weekly Report CS297

November 22, 2016 ~ gelihao609

Confusion Matrix for original NB:

for NB with similarity in (0.8, 1)

for NB with similarity in (0.7, 1)

[20161107] Weekly Report

November 7, 2016 ~ gelihao609

Added stemmer: feature size drops from 26879 to 25054, accuracy changes from 0.77848 to 0.77830 (with the original Naive Bayes)

Final: positiveCount: 582, zeroCount: 23395, oneCount: 13912, negativeCount: 17036552

max: 1.0

min: -0.303092024177

Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896

Stemmed: (0.9, 1) -> 132, (0.8, 0.9) -> 1032, (0.7, 0.8) -> 5040, (0.6, 0.7) -> 41750

Non-stemmed: —>F1:0.77848

Clusters-in-(0.9-1): Num of clusters: 27, num of features: 77 —>F1: 0.7806

Clusters-in-(0.8-0.9): Num of clusters: 524, num of features: 1211 —>F1: 0.7754

Clusters-in-(0.7-0.8): Num of clusters: 1844, num of features: 5216 —>F1: 0.7771

Clusters-in-(0.6-0.7): Num of clusters: 1587, num of features: 10224 —>F1: 0.7525

Stemmed: —> F1: 0.77830

Clusters-in-(0.9-1): Num of clusters: 17, num of features: 54 —> F1: 0.77748

Clusters-in-(0.8-1): Num of clusters: 316, num of features: 746 —>F1: 0.778366

Clusters-in-(0.7-1): Num of clusters: 1122, num of features: 2982 —>F1: 0.77680

Clusters-in-(0.6-1): Num of clusters: 1444, num of features: 7659 —>F1: 0.75831

Messing around with word2vec

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’. (This is probably an oversimplification, but see the paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine similarity (which is to say, the dot product of the vectors after scaling each to unit length). It’s positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words. ‘Negligible’ is actually one of the 500 words closest to ’tremendous’ in the whole 3m-word database.
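
The similarity score used above is the cosine of the angle between two word vectors. Here is a minimal sketch of that computation in plain Python. The three-dimensional vectors are made up for illustration and are not real Word2Vec embeddings, and `cosine_similarity` is a hypothetical helper, not part of the Word2Vec model API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by both lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1. That is why a score of about 0.38 for 'tremendous' and 'negligible' still counts as fairly close.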

[20161022]CS297 Weekly Report

October 18, 2016 (updated October 24, 2016) ~ gelihao609

Original F1-score: 0.7785

use only >0.9 similarities, merged 77 features into 27 cluster-features, total from 26879 to 26829, F1-score: 0.7806

use only [0.8,0.9) similarities, merged 1211 features into 524 cluster-features, total from 26879 to 26192, F1-score: 0.7754

use only [0.7,0.8) similarities, merged 5216 features into 1844 cluster-features, total from 26879 to 23507, F1-score: 0.7771

use only [0.6,0.7) similarities, merged 10224 features into 1587 cluster-features, total from 26879 to 18242, F1-score: 0.7525

use >=0.7 similarities, merged features from 26879 to 23081, F1-score: 0.7754
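
The feature totals reported for the four similarity bands follow from one rule: each cluster replaces its member features with a single cluster-feature. The sketch below checks that arithmetic against the figures above; `features_after_merge` is a hypothetical helper name.

```python
TOTAL_FEATURES = 26879  # feature count before merging, as reported above

def features_after_merge(total, merged_features, cluster_features):
    """Each cluster replaces its member features with one cluster-feature."""
    return total - merged_features + cluster_features

# band: (merged features, resulting cluster-features, reported new total)
reported = {
    ">0.9": (77, 27, 26829),
    "[0.8,0.9)": (1211, 524, 26192),
    "[0.7,0.8)": (5216, 1844, 23507),
    "[0.6,0.7)": (10224, 1587, 18242),
}

computed = {
    band: features_after_merge(TOTAL_FEATURES, merged, clusters)
    for band, (merged, clusters, _) in reported.items()
}
```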

Investigation on 0.9 similarities group

0 revenue revenues
1 astounding astonishing
2 eighth seventh ninth sixth fifth
3 north west east south
4 concerning regarding
5 kilometers km kms
6 benefitted benefited
7 5pm 6pm
8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st
9 southern northern
10 forty thirty twenty
11 wj vlb dca
12 descendents descendants
13 fourth third
14 jr sr
15 photos pictures
16 hundreds thousands tens
17 four seven five six three eight nine two
18 totally completely
19 predominantly predominately
20 incredible amazing
21 humankind mankind
22 disappeared vanished
23 forbids prohibits
24 northeast southeast southwest
25 disappear vanish
26 horrible terrible

By removing clusters 9, 17, and 20, we get back the original F1-score of 0.7785

[20161015]CS297 Weekly Report

October 15, 2016 (updated October 17, 2016) ~ gelihao609

Topic: Use Word2Vec to select features

Analysis of features
1. [0.9, 1): 80 pairs; [0.8, 0.9): 1570 pairs; [0.7, 0.8): 8660 pairs; [0.6, 0.7): 55896 pairs

Add up all features that have high similarity (> 0.9)

But similar features are intertwined: one word can appear in several high-similarity pairs

–> Use graph search to identify all connected components

–> Add up the weights within each group; each group becomes one new feature
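
The two steps above can be sketched as follows. This is only an illustration under assumptions, not the code used for the reported results. The word pairs are borrowed from the clusters listed in the later report, the counts are invented, and `connected_components` and `merge_features` are hypothetical helper names.

```python
from collections import defaultdict

def connected_components(pairs):
    """Group words linked by high-similarity pairs (iterative depth-first search)."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components

def merge_features(counts, components):
    """Sum the counts of each component into one cluster feature; keep the rest as-is."""
    merged = dict(counts)
    for component in components:
        total = sum(merged.pop(word, 0) for word in component)
        merged["+".join(sorted(component))] = total
    return merged

# Assumed pairs above the similarity threshold (illustrative only).
pairs = [("north", "west"), ("west", "east"), ("east", "south"), ("disappear", "vanish")]
components = connected_components(pairs)

# Invented per-document term counts.
counts = {"north": 2, "west": 1, "east": 0, "south": 3,
          "vanish": 4, "disappear": 1, "revenue": 5}
merged = merge_features(counts, components)
```

Here "north" and "south" are never paired directly, yet they land in the same cluster through "west" and "east". This chaining is the intertwining that the connected-component search resolves.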

[20161009]CS297 Weekly Report

October 10, 2016 (updated October 11, 2016) ~ gelihao609

Incorporate Word2Vec Similarity into Vectors

Corpus used for Word2Vec: Google News, 3M words
Weight_i_new = Weight_i_old + SUM{W0_old*Similarity(i,0) +W1_old*Similarity(i,1) +…+Wn_old*Similarity(i,n)}
ConditionalProbability_i = (Wi_new+1)/ (SUM(W0+W1+…Wn) + Vocab_size)
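
A small sketch of the two formulas above in plain Python follows. It rests on two assumptions that the formulas leave open: the similarity matrix has zeros on its diagonal, so a word's own weight is counted once, and the denominator sums the new weights, so the probabilities add up to 1. The numbers are invented.

```python
def update_weights(weights, similarity):
    """W_new[i] = W_old[i] + sum over j of W_old[j] * similarity[i][j]."""
    n = len(weights)
    return [
        weights[i] + sum(weights[j] * similarity[i][j] for j in range(n))
        for i in range(n)
    ]

def conditional_probabilities(new_weights):
    """Add-one smoothed P(term | class) computed from the updated weights."""
    vocab_size = len(new_weights)
    total = sum(new_weights)  # assumption: the sum runs over the new weights
    return [(w + 1) / (total + vocab_size) for w in new_weights]

# Invented term weights for one class and a toy similarity matrix
# (assumed to have zeros on the diagonal).
weights = [3.0, 1.0, 0.0]
similarity = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
new_weights = update_weights(weights, similarity)
probs = conditional_probabilities(new_weights)
```

In this toy case the second term gains more than its own original weight (1.0 becomes 2.5) because it is similar to a heavier term.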

Issue

Speed issue when multiplying the similarity matrix with the feature vector (2E6 x 2E6 for only 4 categories)
1. numpy matrix multiplication
  - Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
2. Cython compiles into C code
The calculated similarity sum deviates too much from the original vector weight
1. use a factor to decrease the weight –> W1 = W1 + factor * Wnew
Similarity between features can be negative –> filter those out

Analysis on Similarity Matrix:

38% Positive; 54% Zero; 8% Negative

Max: 2057.21140249

Min: 0.0

Use scale factor to tune the vectors:

vector = W +scale_factor * Wsimilarity
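
One way to read the scaled update, together with the filtering of negative similarities noted under "Issue", is sketched below. Clipping negative entries to zero and skipping the diagonal are assumptions, the numbers are invented, and `scaled_update` is a hypothetical name.

```python
def scaled_update(weights, similarity, scale_factor):
    """vector[i] = W[i] + scale_factor * (similarity-weighted sum of the other weights)."""
    n = len(weights)
    result = []
    for i in range(n):
        contribution = sum(
            weights[j] * max(similarity[i][j], 0.0)  # negative similarities are dropped
            for j in range(n)
            if j != i  # a word's own weight is already in W[i]
        )
        result.append(weights[i] + scale_factor * contribution)
    return result

weights = [3.0, 1.0, 0.0]
similarity = [
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
unchanged = scaled_update(weights, similarity, 0.0)  # factor 0 keeps the original vector
tuned = scaled_update(weights, similarity, 0.1)
```

A small factor keeps the vector close to the original counts, while a factor of 1 corresponds to the unscaled update.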

Result

[20161002]Weekly Report

October 2, 2016 ~ gelihao609

Implemented a naive Bayes classifier with count-based and TF-IDF-based vectors.
The accuracy is the same as that of the NB in sklearn.
The speed is far slower than that of the NB in sklearn.

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)

Customized:

Count:

TF-IDF:

Original:

count:

TF-IDF:

[20160926]CS297 Weekly Report

September 26, 2016 ~ gelihao609

Get the 20 Newsgroups dataset info
Use sklearn to vectorize the text data with TF-IDF
Modify the conditional probability
Build MNB

Build MNB using

A working example is shown below:

To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

$\begin{displaymath} \hat{P}(t\vert c) = \frac{T_{ct}+1}{\sum_{t' \in V}(T_{ct'}+1)} = \frac{T_{ct}+1}{(\sum_{t' \in V} T_{ct'})+B}, \end{displaymath}$

(119)

where $B=\vert V\vert$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term as opposed to the prior probability of a class which we estimate in Equation 116 on the document level.

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2.

**Table 13.1:** Data for parameter estimation examples.

| | docID | words in document | in $c$ = China? |
|---|---|---|---|
| training set | 1 | Chinese Beijing Chinese | yes |
| | 2 | Chinese Chinese Shanghai | yes |
| | 3 | Chinese Macao | yes |
| | 4 | Tokyo Japan Chinese | no |
| test set | 5 | Chinese Chinese Chinese Tokyo Japan | ? |

Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

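$\begin{eqnarray*} \hat{P}(\text{Chinese}\vert c) &=& (5+1)/(8+6) = 6/14 = 3/7 \\ \hat{P}(\text{Tokyo}\vert c) = \hat{P}(\text{Japan}\vert c) &=& (0+1)/(8+6) = 1/14 \\ \hat{P}(\text{Chinese}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \\ \hat{P}(\text{Tokyo}\vert\overline{c}) = \hat{P}(\text{Japan}\vert\overline{c}) &=& (1+1)/(3+6) = 2/9 \end{eqnarray*}$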

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

$\begin{eqnarray*} \hat{P}(c\vert d_5) &\propto& 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\ \hat{P}(\overline{c}\vert d_5) &\propto& 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001. \end{eqnarray*}$

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
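
The worked example can be reproduced with a short multinomial NB sketch in plain Python. It uses exact fractions so the intermediate values match the ones above. The function names are hypothetical, and a practical implementation would sum log probabilities rather than multiply, to avoid floating-point underflow on long documents.

```python
from collections import Counter
from fractions import Fraction

# Training and test data from Table 13.1 ("yes" means class China).
training = [
    ("Chinese Beijing Chinese", "yes"),
    ("Chinese Chinese Shanghai", "yes"),
    ("Chinese Macao", "yes"),
    ("Tokyo Japan Chinese", "no"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

def train_multinomial_nb(docs):
    """Estimate class priors and add-one smoothed term probabilities."""
    class_docs = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        term_counts[label].update(text.split())
    vocab = {term for counts in term_counts.values() for term in counts}
    priors = {c: Fraction(n, len(docs)) for c, n in class_docs.items()}
    cond = {}
    for c, counts in term_counts.items():
        denominator = sum(counts.values()) + len(vocab)  # class text length + B
        cond[c] = {t: Fraction(counts[t] + 1, denominator) for t in vocab}
    return priors, cond, vocab

def score(text, priors, cond, vocab):
    """Unnormalized posterior per class: prior times product of term probabilities."""
    scores = {}
    for c, prior in priors.items():
        s = prior
        for term in text.split():
            if term in vocab:  # terms unseen in training are skipped
                s *= cond[c][term]
        scores[c] = s
    return scores

priors, cond, vocab = train_multinomial_nb(training)
scores = score(test_doc, priors, cond, vocab)
prediction = max(scores, key=scores.get)
```

Here `cond["yes"]["Chinese"]` comes out as 3/7 and `cond["no"]["Tokyo"]` as 2/9, matching the hand calculation, and `prediction` is "yes" (China).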