2. cut all clusters that have more than 20 features

3. use merged result on only articles with word count > 80

The original result is: 0.7783

for NB with similarity in (0.8, 1)

for NB with similarity in (0.7, 1)

0.77848 to 0.77830

–with original Naive Bayes

Final: positiveCount: 582 zeroCount: 23395 oneCount: 13912 negativeCount: 17036552 l

max: 1.0

min: -0.303092024177

*Non-stemmed: (0.9, 1) -> 160, (0.8, 0.9) -> 1570, (0.7, 0.8) -> 8660, (0.6, 0.7) -> 55896*

Stemmed: (0.9, 1) -> **132**, (0.8, 0.9) -> **1032**, (0.7, 0.8) -> **5040**, (0.6, 0.7) -> **41750**

Non-stemmed: -> **F1: 0.77848**

Clusters-in-(0.9-1): Num of clusters: **27**, num of features: **77** -> **F1: 0.7806**

Clusters-in-(0.8-0.9): Num of clusters: **524**, num of features: **1211** -> **F1: 0.7754**

Clusters-in-(0.7-0.8): Num of clusters: **1844**, num of features: **5216** -> **F1: 0.7771**

Clusters-in-(0.6-0.7): Num of clusters: **1587**, num of features: **10224** -> **F1: 0.7525**

Stemmed: -> **F1: 0.77830**

Clusters-in-(0.9-1): Num of clusters: **17**, num of features: **54** -> **F1: 0.77748**

Clusters-in-(0.8-1): Num of clusters: **316**, num of features: **746** -> **F1: 0.778366**

Clusters-in-(0.7-1): Num of clusters: **1122**, num of features: **2982** -> **F1: 0.77680**

Clusters-in-(0.6-1): Num of clusters: **1444**, num of features: **7659** -> **F1: 0.75831**

https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/

Word2Vec distance isn’t semantic distance

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts; that is, w and w’ are close to each other if the words that tend to show up near w also tend to show up near w’. (This is probably an oversimplification, but see this paper of Levy and Goldberg for a more precise formulation.) If two words are very close to synonymous, you’d expect them to show up in similar contexts, and indeed synonymous words tend to be close:

>>> model.similarity('tremendous', 'enormous')

0.74432902555062841

The notion of similarity used here is just cosine distance (which is to say, dot product of vectors.) It’s positive when the words are close to each other, negative when the words are far. For two completely random words, the similarity is pretty close to 0.

On the other hand:

>>> model.similarity('tremendous', 'negligible')

0.37869063705009987

Tremendous and negligible are very far apart semantically; but both words are likely to occur in contexts where we’re talking about size, and using long, Latinate words. ‘Negligible’ is actually one of the 500 words closest to ‘tremendous’ in the whole 3m-word database.
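The similarity score in the quoted passage is the cosine of the angle between two word vectors, which equals the dot product once both vectors are scaled to unit length. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the three-dimensional toy vectors are illustrative assumptions, not real Word2Vec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (dot product of the unit vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only (not taken from any trained model).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, which matches the positive / near-zero / negative reading described above.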

**F1-score**: **0.7785**

use only >0.9 similarities, **merged 77 features into 27 cluster-features**, total from 26879 to 26829, **F1-score**: **0.7806**

use only [0.8, 0.9) similarities, **merged 1211 features into 524 cluster-features**, total from 26879 to 26192, **F1-score**: **0.7754**

use only [0.7, 0.8) similarities, **merged 5216 features into 1844 cluster-features**, total from 26879 to 23507, **F1-score**: **0.7771**

use only [0.6, 0.7) similarities, **merged 10224 features into 1587 cluster-features**, total from 26879 to 18242, **F1-score**: **0.7525**

use >=0.7 similarities, **merged features from 26879 to 23081**, **F1-score**: **0.7754**

Investigation on 0.9 similarities group

0 revenue revenues

1 astounding astonishing

2 eighth seventh ninth sixth fifth

3 north west east south

4 concerning regarding

5 kilometers km kms

6 benefitted benefited

7 5pm 6pm

8 6th 8th 4th 7th 9th 5th 3rd 2nd 1st

9 southern northern

10 forty thirty twenty

11 wj vlb dca

12 descendents descendants

13 fourth third

14 jr sr

15 photos pictures

16 hundreds thousands tens

17 four seven five six three eight nine two

18 totally completely

19 predominantly predominately

20 incredible amazing

21 humankind mankind

22 disappeared vanished

23 forbids prohibits

24 northeast southeast southwest

25 disappear vanish

26 horrible terrible

By removing clusters 9, 17, and 20, we get back the original F1-score of 0.7785


- Analysis of features
    - [0.9, 1): **80** pairs, [0.8, 0.9): **1570** pairs, [0.7, 0.8): **8660** pairs, [0.6, 0.7): **55896** pairs
- [0.9, 1)
    - add up all features that have high similarity (>0.9)
    - but similar features are intertwined
        - -> use graph search to identify all connected components
        - -> add up the weights of each group; each group becomes a new feature
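The merging step can be sketched as a connected-components search over the graph whose edges are the high-similarity pairs. This is only an illustration under assumptions: the function name `merge_similar_features`, the depth-first search, and the toy weights are hypothetical and not taken from the actual experiment code.

```python
from collections import defaultdict


def merge_similar_features(weights, similar_pairs):
    """Group features linked by similar_pairs and sum the weights per group.

    weights: dict mapping feature -> weight (e.g. a term count)
    similar_pairs: iterable of (a, b) pairs whose similarity passed the threshold
    Returns a list of (sorted member tuple, summed weight), one per component.
    """
    graph = defaultdict(set)
    for a, b in similar_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    merged = []
    for start in weights:
        if start in seen:
            continue
        # Depth-first search collects every feature reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen and neighbour in weights:
                    seen.add(neighbour)
                    stack.append(neighbour)
        total = sum(weights[f] for f in component)
        merged.append((tuple(sorted(component)), total))
    return merged


# Toy weights (made up); the pairs mimic intertwined similarities.
weights = {"north": 3, "south": 2, "east": 1, "west": 4,
           "photos": 5, "pictures": 2, "revenue": 7}
pairs = [("north", "south"), ("south", "east"), ("west", "east"),
         ("photos", "pictures")]
clusters = merge_similar_features(weights, pairs)
```

Here `north` and `west` are never paired directly, yet they land in the same cluster through `south` and `east`, which is the intertwining that the graph search resolves. Features without any similar partner stay as single-member groups.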


- Corpus used in Word2Vec: Google News, 3M words
- Weight_i_new = Weight_i_old + SUM{W0_old*Similarity(i,0) + W1_old*Similarity(i,1) + … + Wn_old*Similarity(i,n)}
- ConditionalProbability_i = (Wi_new + 1) / (SUM(W0 + W1 + … + Wn) + Vocab_size)
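A small sketch of the two formulas for a single class, in plain Python. Two points are assumptions because the notes leave them open: the denominator sums the *new* weights (so the probabilities add up to 1), and the toy similarity matrix has zeros on the diagonal so a feature does not reinforce itself. The function name and the numbers are hypothetical.

```python
def similarity_enhanced_probabilities(old_weights, similarity):
    """Apply the weight update and the add-one smoothed probability.

    old_weights: per-feature weights (e.g. term counts) for one class
    similarity: square list of lists, similarity[i][j] between features i and j
    """
    n = len(old_weights)
    new_weights = [
        old_weights[i]
        + sum(old_weights[j] * similarity[i][j] for j in range(n))
        for i in range(n)
    ]
    # Assumption: the denominator uses the new weights; n acts as Vocab_size.
    denominator = sum(new_weights) + n
    probabilities = [(w + 1) / denominator for w in new_weights]
    return new_weights, probabilities


# Toy data: feature 1 never occurs in this class but is similar to feature 0.
old_weights = [4.0, 0.0, 2.0]
similarity = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
new_weights, probabilities = similarity_enhanced_probabilities(
    old_weights, similarity
)
```

In this toy case the unseen feature 1 picks up half of feature 0's weight, so its probability rises above what add-one smoothing alone would give it.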

**Issue**

- speed issue of multiplying the similarity matrix by the feature vector (2E6 x 2E6 for only 4 categories)
    - numpy matrix multiplication: Matrix_new_weight = Matrix_old_weight * Matrix_Similarity
    - Cython compiles into C code
- the calculated similarity sum deviates too much from the original vector weight
    - use a factor to decrease the weight -> W1 = W1 + factor * Wnew
- similarity of features becomes negative -> filter out

Analysis on Similarity Matrix:

38% Positive; 54% Zero; 8% Negative

Max: 2057.21140249

Min: 0.0

Use scale factor to tune the vectors:

vector = W + scale_factor * Wsimilarity
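The scaled update can be sketched as follows, assuming that Wsimilarity is the weight vector multiplied by the similarity matrix and that negative similarities are filtered by clipping them to zero. Plain Python lists stand in for the numpy matrices so the example stays self-contained; the function name and the numbers are hypothetical.

```python
def scaled_similarity_update(weights, similarity, scale_factor):
    """Return W + scale_factor * (W x S), ignoring negative similarities."""
    n = len(weights)
    # Negative similarities are filtered out by treating them as zero.
    clipped = [[max(s, 0.0) for s in row] for row in similarity]
    # Row vector times matrix: w_similarity[j] = sum_i weights[i] * S[i][j]
    w_similarity = [
        sum(weights[i] * clipped[i][j] for i in range(n)) for j in range(n)
    ]
    return [w + scale_factor * ws for w, ws in zip(weights, w_similarity)]


# Toy data (made up): features 0 and 2 have a negative similarity.
weights = [2.0, 0.0, 1.0]
similarity = [
    [0.0, 0.5, -0.25],
    [0.5, 0.0, 0.0],
    [-0.25, 0.0, 0.0],
]
updated = scaled_similarity_update(weights, similarity, scale_factor=0.5)
```

A scale factor below 1 keeps the similarity contribution from swamping the original counts; with `scale_factor=0.5` only feature 1 changes here, since the negative link between features 0 and 2 is dropped.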

Result

categories: ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

training set: (2034,), testing set: (1353,)

Customized:

Count:

TF-IDF:

Original:

count:

TF-IDF:


Build MNB using

A working example is shown below:

To eliminate zeros, we use *add-one* or *Laplace smoothing*, which simply adds one to each count (cf. Section 11.3.2):

$$\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}$$

where $T_{ct}$ is the number of occurrences of term $t$ in training documents from class $c$ and $B = |V|$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a *term* as opposed to the prior probability of a *class* which we estimate in Equation 116 on the document level.

We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2 .

| | docID | words in document | in China? |
|---|---|---|---|
| training set | 1 | Chinese Beijing Chinese | yes |
| | 2 | Chinese Chinese Shanghai | yes |
| | 3 | Chinese Macao | yes |
| | 4 | Tokyo Japan Chinese | no |
| test set | 5 | Chinese Chinese Chinese Tokyo Japan | ? |

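**Worked example.** For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\overline{c}) = 1/4$ and the following conditional probabilities:

$$\begin{aligned}
\hat{P}(\text{Chinese} \mid c) &= (5+1)/(8+6) = 6/14 = 3/7 \\
\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) &= (0+1)/(8+6) = 1/14 \\
\hat{P}(\text{Chinese} \mid \overline{c}) &= (1+1)/(3+6) = 2/9 \\
\hat{P}(\text{Tokyo} \mid \overline{c}) = \hat{P}(\text{Japan} \mid \overline{c}) &= (1+1)/(3+6) = 2/9
\end{aligned}$$

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $text_c$ and $text_{\overline{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms.

We then get:

$$\begin{aligned}
\hat{P}(c \mid d_5) &\propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003 \\
\hat{P}(\overline{c} \mid d_5) &\propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001
\end{aligned}$$

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. **End worked example.**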
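The worked example can be reproduced with a few lines of plain Python. This is a minimal sketch of multinomial Naive Bayes with add-one smoothing on the five documents from the table; the variable names and the class label strings `"China"` / `"not China"` are illustrative choices, and exact fractions are used so the intermediate values can be compared by hand.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Chinese Beijing Chinese", "China"),
    ("Chinese Chinese Shanghai", "China"),
    ("Chinese Macao", "China"),
    ("Tokyo Japan Chinese", "not China"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

# Vocabulary, per-class document counts and per-class term counts.
vocabulary = {term for text, _ in training for term in text.split()}
doc_counts = Counter(label for _, label in training)
term_counts = {label: Counter() for label in doc_counts}
for text, label in training:
    term_counts[label].update(text.split())


def score(label):
    """Prior times the add-one smoothed term probabilities (unnormalised)."""
    result = Fraction(doc_counts[label], len(training))
    total_terms = sum(term_counts[label].values())
    for term in test_doc.split():
        result *= Fraction(
            term_counts[label][term] + 1, total_terms + len(vocabulary)
        )
    return result


scores = {label: score(label) for label in doc_counts}
predicted = max(scores, key=scores.get)
```

The score for `"China"` comes out as 81/268912 (about 0.0003) and the score for `"not China"` as 8/59049 (about 0.0001), so `predicted` is `"China"`. A practical implementation would usually sum log probabilities instead of multiplying, to avoid underflow on long documents.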

**1. Text categorization Benchmark**

1. **Reuters-21578** –> 21,578 docs, 135 different topics

http://www.daviddlewis.com/resources/testcollections/reuters21578/

2. **20 Newsgroups** –> 20,000 docs, 20 different topics

http://qwone.com/~jason/20Newsgroups/

**2. Practical text categorization**

**Literature**


Categorize Text with Naive Bayes and word2vec word embedding

**Literature Review**

**DataSet**

Yelp reviews –> predict the business category from the review text

pros: the dataset is easy to get;

cons: the business category tends to be obvious

Twitter threads –> predict the category from the thread content

pros: category prediction is more meaningful

cons: a labeled dataset is hard to get;

**Implementation Plan**

- extract the Yelp dataset
- visualize the dataset
- explore the number of categories

- use a Naive Bayes classifier from a Python package to classify businesses
- implement a Naive Bayes classifier in Python
- implement a word2vec-enhanced classifier
- prototype classification by category with the classic Naive Bayes classifier