- Get the 20 Newsgroups dataset
- Use sklearn to vectorize the text data with TF-IDF
- Modify the conditional probability
- Build a multinomial naive Bayes (MNB) classifier
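The steps above might be sketched with scikit-learn as follows. A tiny inline corpus stands in for the real 20 Newsgroups download (in practice, `sklearn.datasets.fetch_20newsgroups` would supply the data; the toy documents and labels here are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 20 Newsgroups data; in practice use
# sklearn.datasets.fetch_20newsgroups(subset="train").
docs = ["the rocket launch was delayed",
        "new graphics card drivers released",
        "orbital mechanics and satellite telemetry",
        "gpu rendering benchmarks and drivers"]
labels = ["sci.space", "comp.graphics", "sci.space", "comp.graphics"]

# TF-IDF vectorization feeding a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["satellite launch telemetry"]))  # expect 'sci.space'
```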
A working example is shown below:
To eliminate zeros, we use add-one or Laplace smoothing, which simply adds one to each count (cf. Section 11.3.2):

$$\hat{P}(t|c) = \frac{T_{ct}+1}{\sum_{t' \in V}(T_{ct'}+1)} = \frac{T_{ct}+1}{\left(\sum_{t' \in V} T_{ct'}\right)+B}$$

where $B = |V|$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term, as opposed to the prior probability of a class, which we estimate in Equation 116 on the document level.
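As a quick sketch of the smoothed estimate, where `term_count` is the count of term $t$ in class $c$, `total_count` the total token count of the class, and `vocab_size` the constant $B$:

```python
def smoothed_cond_prob(term_count, total_count, vocab_size):
    """Laplace (add-one) smoothed estimate of P(t|c):
    (T_ct + 1) / (sum_t' T_ct' + B)."""
    return (term_count + 1) / (total_count + vocab_size)

# Unsmoothed, a term never seen in the class would get probability 0;
# with add-one smoothing it gets a small nonzero mass instead.
print(smoothed_cond_prob(0, 8, 6))  # 1/14 for an unseen term
```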
We have now introduced all the elements we need for training and applying an NB classifier. The complete algorithm is described in Figure 13.2 .
Table 13.1: Data for parameter estimation examples.

| | docID | words in document | in c = China? |
|---|---|---|---|
| training set | 1 | Chinese Beijing Chinese | yes |
| | 2 | Chinese Chinese Shanghai | yes |
| | 3 | Chinese Macao | yes |
| | 4 | Tokyo Japan Chinese | no |
| test set | 5 | Chinese Chinese Chinese Tokyo Japan | ? |
Worked example. For the example in Table 13.1, the multinomial parameters we need to classify the test document are the priors $\hat{P}(c) = 3/4$ and $\hat{P}(\bar{c}) = 1/4$ and the following conditional probabilities:

$$\hat{P}(\text{Chinese}|c) = (5+1)/(8+6) = 6/14 = 3/7$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(8+6) = 1/14$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(3+6) = 2/9$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(3+6) = 2/9$$

The denominators are $(8+6)$ and $(3+6)$ because the lengths of $\text{text}_c$ and $\text{text}_{\bar{c}}$ are 8 and 3, respectively, and because the constant $B$ in Equation 119 is 6 as the vocabulary consists of six terms. We then get:

$$\hat{P}(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$$
$$\hat{P}(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$$

Thus, the classifier assigns the test document to $c$ = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in $d_5$ outweigh the occurrences of the two negative indicators Japan and Tokyo. End worked example.
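The worked example above can be reproduced with scikit-learn's `MultinomialNB`, whose default `alpha=1.0` is exactly the add-one smoothing used in the text. The third China document, `Chinese Macao`, brings the class length to 8 as stated:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["Chinese Beijing Chinese",
              "Chinese Chinese Shanghai",
              "Chinese Macao",
              "Tokyo Japan Chinese"]
train_labels = ["China", "not China", "China", "not China"]
train_labels = ["China", "China", "China", "not China"]

# Raw term counts (not TF-IDF) match the multinomial model in the text.
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# alpha=1.0 is the add-one (Laplace) smoothing of Equation 119.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, train_labels)

x_test = vec.transform(["Chinese Chinese Chinese Tokyo Japan"])
print(clf.predict(x_test))  # 'China', as in the worked example
```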
1. Text categorization benchmarks
1. Reuters-21578 –> 21,578 docs, 135 different topics
2. 20 Newsgroups –> 20,000 docs, 20 different topics
2. Practical text categorization
Proposed Project Name
Categorize Text with Naive Bayes and word2vec word embeddings
Yelp reviews –> predict the business category from the review text
pros: easy to get the dataset
cons: the business category may be too obvious from the review
Twitter threads –> predict a topic category from the thread content
pros: category prediction is more meaningful here
cons: hard to get a labeled dataset
- Extract the Yelp dataset
- Visualize the dataset
- Explore the number of categories
- Use a naive Bayes classifier from a Python package to classify businesses
- Implement a naive Bayes classifier in Python
- Implement a word2vec-enhanced classifier
- Prototype classification by category with the classic naive Bayes classifier
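The exploration and prototyping steps above might look like the following sketch. Inline toy reviews stand in for the real data (the Yelp Open Dataset ships as JSON-lines files that would need to be joined on business IDs; all review texts and categories below are made up):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for Yelp review text paired with business categories.
reviews = ["great tacos and salsa", "oil change was quick",
           "best burritos in town", "they rotated my tires",
           "amazing enchiladas", "engine repair done right"]
categories = ["Mexican", "Auto Repair", "Mexican",
              "Auto Repair", "Mexican", "Auto Repair"]

# Explore the category distribution before modeling.
print(Counter(categories))

# Prototype: TF-IDF features into a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, categories)
print(model.predict(["tacos and enchiladas"]))  # expect 'Mexican'
```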
Reviewed one paper using Word2Vec to enhance a naive Bayes classifier. The title is shown below:
- Introduced semantic analysis into text classification; Word2Vec is shown to improve classification accuracy
- Applied a distributed method for large-scale computation
- The same corpus is used for every class; training on a separate corpus for each class might make the results more accurate
Reviewed one paper using the tool (Word2Vec) to determine characteristic vocabulary.
The title of the paper is posted below:
The author proposed a workflow to detect the characteristic vocabulary of the domain in question by using 1. a crawler to gather the text and 2. word2vec to rank similar words. A schematic of the workflow can be seen below:
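The "rank similar words" step can be sketched as a cosine-similarity ranking over word vectors. The toy embedding table below stands in for vectors actually trained by word2vec (gensim's `Word2Vec` with `model.wv.most_similar` would be the real tool; all vector values here are made up):

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model.
vectors = {
    "car":    np.array([0.9, 0.1, 0.0]),
    "truck":  np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.9, 0.4]),
    "apple":  np.array([0.1, 0.8, 0.5]),
}

def most_similar(word, topn=2):
    """Rank other words by cosine similarity to `word`,
    mimicking gensim's model.wv.most_similar."""
    q = vectors[word]
    sims = []
    for w, v in vectors.items():
        if w == word:
            continue
        cos = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        sims.append((w, cos))
    return sorted(sims, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("car"))  # 'truck' should rank first
```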
Reviewed one paper focusing on text classification. The paper compares classification accuracy when using the Word2Vec and Doc2Vec models versus a bag-of-words representation as features. The results show that the Word2Vec and Doc2Vec models offer higher accuracy.
Document –> Text processing –> Word2Vec/Doc2Vec feature generation –> (IDF/TF-IDF weight adjustment) –> Train classifier –> Evaluate performance
- Introduced the neural-network-based Word2Vec and Doc2Vec models
- Coupled word vectors with weighting strategies like IDF and TF-IDF
- The training and test sets were drawn from the same dataset, so the reported accuracy is not persuasive
- Only logistic regression is used; more classifiers could be tried
- Only bag of words is used as the baseline; more models could be tried, e.g. LDA
Jiang, Suqi, et al. “Integrating rich document representations for text classification.” 2016 IEEE Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2016.