[160624] Weekly Report of CS297

Summary

Reviewed one paper on text classification. The paper compares classification accuracy when Word2Vec, Doc2Vec, and bag-of-words representations are used as features. The results show that the Word2Vec and Doc2Vec models offer higher accuracy than bag of words.

Process pipeline

Document –> Text processing –> Word2Vec/Doc2Vec feature generation –> (IDF/TF-IDF weight adjustment) –> Train classifier –> Evaluate performance
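The feature generation and weight adjustment steps of the pipeline can be sketched as a TF-IDF-weighted average of word vectors. This is only an illustration under stated assumptions: the paper's exact weighting scheme is not given in this report, and `toy_vectors`, `idf_weights`, and `doc_vector` are hypothetical names, with hand-written 2-d vectors standing in for a trained Word2Vec model.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Return idf(w) = log(N / df(w)) for every word in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

def doc_vector(tokens, word_vectors, idf, dim):
    """TF-IDF-weighted average of the word vectors in one document."""
    tf = Counter(t for t in tokens if t in word_vectors)
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        weight = count * idf.get(word, 0.0)
        weight_sum += weight
        for i, x in enumerate(word_vectors[word]):
            total[i] += weight * x
    if weight_sum == 0.0:
        return total  # no informative words: fall back to the zero vector
    return [x / weight_sum for x in total]

# Hypothetical 2-d vectors standing in for a trained Word2Vec model.
toy_vectors = {"good": [1.0, 0.0], "food": [0.0, 1.0], "the": [0.5, 0.5]}
docs = [["the", "good", "food"], ["the", "food"], ["the", "good", "good"]]

idf = idf_weights(docs)
features = [doc_vector(d, toy_vectors, idf, dim=2) for d in docs]
```

Because "the" occurs in every toy document, its IDF is zero and it drops out of the average. The resulting `features` list is what would be passed to the classifier in the next pipeline step.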

Pros:

  • Introduces the neural-network-based Word2Vec and Doc2Vec models
  • Couples word vectors with weighting strategies such as IDF and TF-IDF

Cons:

  • The training set and the testing set come from the same dataset, so the reported accuracy is not persuasive
  • Only logistic regression is used; more classifiers could be tried
  • Only bag of words is used as the baseline; more models could be tried, e.g. LDA
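One way to address the first concern is to evaluate on examples the classifier never saw during training. The sketch below is a minimal hold-out split in plain Python, not the paper's procedure; `train_test_split`, `accuracy`, and the toy data are hypothetical names chosen for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a held-out test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

# Toy (feature, label) pairs; a real run would use document vectors and class labels.
data = [(i, i % 2) for i in range(10)]
train, test = train_test_split(data)
```

A classifier would be fitted on `train` only, and `accuracy` would then be reported on `test`.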

Reference:

Jiang, Suqi, et al. “Integrating rich document representations for text classification.” 2016 IEEE Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2016.

[160603] Weekly Report of CS297

Summary

Reviewed one paper on the Yelp dataset. The paper aims to find valuable reviews on Yelp and focuses on feature extraction and selection.

Achievement

Learned feature extraction and selection techniques for the Yelp review dataset:

  1. LDA topic modeling technique to classify text into topics
  2. Personality analysis technique to classify text into personality traits

Paper Details

  1. The authors select the following attributes for the model:
    1. Reviewer's average star rating, retrieved directly from the dataset
    2. Topic-model clustering, which yields a vector of words with different weights
    3. Reviewer's personality profile, based on all of the reviewer's reviews
      1. Extraversion vs. Introversion (sociable, assertive, playful vs. aloof, reserved, shy)
      2. Emotional stability vs. Neuroticism (calm, unemotional vs. insecure, anxious)
      3. Agreeableness vs. Disagreeableness (friendly, cooperative vs. antagonistic, faultfinding)
      4. Conscientiousness vs. Unconscientiousness (self-disciplined, organized vs. inefficient, careless)
      5. Openness to experience (intellectual, insightful vs. shallow, unimaginative)
    4. Rating similarity:
      1. number of people who gave exactly the same rating (1 – 5 stars)
      2. number of people who gave a similar rating (+/-1 star)
    5. User geographical clusters: each geographical area forms a point on the map
      1. number of points
      2. number of clusters created by the points
      3. minimum and maximum cluster sizes
    6. Friendship strength: the number of places that both of two users have reviewed
      1. number of friends of the user
      2. minimum and maximum number of friends in this place
    7. Uniqueness score: the number of reviews written by a user multiplied by the inverse of the number of reviews written for that place
  2. The classifiers used are listed below. The authors use WEKA to evaluate them.
    1. Decision Trees
    2. Multilayer Perceptron
    3. Support Vector Machines
    4. Naive Bayes
    5. Multi-boost
  3. Evaluations and conclusions
    1. Naive Bayes and SVM produced the worst results, and decision trees the best
    2. LDA topic-model features can substitute for personality profile features
    3. The information gain produced by the decision tree and WEKA's attribute selection feature are used to reduce the number of attributes
    4. Final attributes are: extraversion, conscientiousness, openness to experience, number of friends, maximum number of friends in venue, close ratings, number of clusters, number of points, maximum cluster size
    5. The highest accuracy is 79%, using the J48 decision tree algorithm
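The information gain used for attribute reduction can be illustrated with a small calculation. The sketch below computes the textbook information gain of a discrete attribute; it is not WEKA's implementation, continuous attributes would first need to be binned, and the toy labels and attribute values are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a discrete attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Made-up example: four reviews, two valuable and two not.
labels = ["valuable", "valuable", "not", "not"]
informative = ["high", "high", "low", "low"]  # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]          # tells us nothing about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Attributes with a gain close to zero, like `uninformative` here, are the natural candidates for removal.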

Issues and Future Work

  1. Some features are not clearly explained, e.g. minimum and maximum cluster sizes, and maximum number of friends in venue.
  2. The authors use the “useful” field to decide whether a review is valuable: if the useful count is > 4, the review is valuable. This threshold can be tuned to see how the false positive and false negative rates change.
  3. The work on finding valuable reviews can be applied to filtering fake reviews. Fake reviews follow certain patterns and can be detected using linguistic and behavioral features. This will be reviewed next week.
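The threshold tuning suggested in item 2 can be sketched as a sweep over candidate thresholds. This is a simplified illustration, not the paper's method: the predictions are held fixed, whereas a full experiment would retrain the classifier for each threshold, and `error_rates` and the toy data are hypothetical.

```python
def error_rates(useful_counts, predicted, threshold):
    """Return (false positive rate, false negative rate) when 'valuable' means useful > threshold."""
    fp = fn = pos = neg = 0
    for count, pred in zip(useful_counts, predicted):
        actual = count > threshold
        if actual:
            pos += 1
            fn += not pred  # valuable review predicted as not valuable
        else:
            neg += 1
            fp += pred      # non-valuable review predicted as valuable
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return fpr, fnr

# Made-up useful counts and fixed classifier predictions for six reviews.
useful_counts = [0, 2, 5, 7, 10, 3]
predicted = [False, False, True, True, True, True]

for threshold in (2, 4, 6):
    fpr, fnr = error_rates(useful_counts, predicted, threshold)
    print(f"useful > {threshold}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Comparing the two rates across thresholds shows how sensitive the "valuable" label is to the cutoff of 4 used in the paper.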

Reference

1. J. Koven, H. Siadati and C. Y. Lin, “Finding Valuable Yelp Comments by Personality, Content, Geo, and Anomaly Analysis,” 2014 IEEE International Conference on Data Mining Workshop, Shenzhen, 2014, pp. 1215-1218. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7022737&isnumber=7022545