Yelp Review Dataset Sentiment Analysis:

About:

Library:

Loading Dataset:

I used a sample of the data in this Google Colab notebook because of resource constraints. The sampled CSV file contains 100k rows and 9 columns. Download link

Removing quotes from columns:

Exploratory Data Analysis:

Note: the EDA should be placed after pre-processing.

Stars Distribution:

Number of Unique Businesses Reviewed:

46,706 unique businesses are reviewed, and the most-reviewed business has 128 reviews. (The total number of reviews is 100,000.)
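A minimal sketch of how these two figures can be computed, assuming each review row carries a business identifier. The function name and the toy IDs are hypothetical; with pandas, the same numbers would come from `nunique()` and `value_counts().max()` on the ID column.

```python
from collections import Counter

def review_count_summary(ids):
    """Return (number of distinct ids, highest review count for a single id)."""
    counts = Counter(ids)
    most_reviews = counts.most_common(1)[0][1] if counts else 0
    return len(counts), most_reviews

# Toy data only; the notebook would pass the business ID column instead.
toy_business_ids = ["b1", "b2", "b1", "b3", "b1", "b2"]
n_unique, max_reviews = review_count_summary(toy_business_ids)
print(n_unique, max_reviews)  # 3 3
```

The same helper applies unchanged to the user ID column.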

Number of Unique Users:

82,135 unique users posted reviews, and the most active user posted 55 reviews. (The total number of reviews is 100,000.)

Labeling Positive and Negative Reviews:

In this part, we classify the review ratings into two classes, 0 (negative, for ratings less than or equal to 3) and 1 (positive, for ratings greater than 3), by creating a new column, "Target", for further processing.
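The labeling rule can be sketched as a small function. The function name and the toy ratings are illustrative assumptions, not the notebook's own code.

```python
def label_review(stars, threshold=3):
    """Return 1 (positive) if the rating is above the threshold, else 0 (negative)."""
    return 1 if stars > threshold else 0

toy_stars = [1, 2, 3, 4, 5]
targets = [label_review(s) for s in toy_stars]
print(targets)  # [0, 0, 0, 1, 1]
```

On a pandas DataFrame, assuming the rating column is named `stars`, the equivalent would be `df["Target"] = (df["stars"] > 3).astype(int)`.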

Word Cloud:

Positive Reviews:
Negative Reviews:

Preprocessing:

Library:

Cleaning:

Some of the cleaning functions are taken from this link and this link.

Removing HTML tags:

Removing Punctuation:

Removing Extra Whitespace:

All Lowercase:

Converting Accented Characters:

Expanding Contractions:

Applying Cleaning Functions:
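The cleaning steps listed above could be chained roughly as follows. This is a sketch using only the standard library, not the functions from the linked sources. The contraction map is a deliberately tiny, hypothetical sample, and the order of the steps is an assumption: contractions are expanded before punctuation is removed, because removing apostrophes first would break the lookup.

```python
import re
import string
import unicodedata

# Hypothetical, very small contraction map; a real one would be much longer.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
CONTRACTION_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_html_tags(text):
    # Replace anything that looks like <tag> with a space.
    return re.sub(r"<[^>]+>", " ", text)

def remove_accents(text):
    # Decompose accented characters, then drop everything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def expand_contractions(text):
    # Expects lowercase input, since the map keys are lowercase.
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)

def remove_punctuation(text):
    # Replace each ASCII punctuation character with a space.
    return text.translate(PUNCT_TABLE)

def remove_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def clean_text(text):
    text = remove_html_tags(text)
    text = text.lower()
    text = remove_accents(text)
    text = expand_contractions(text)
    text = remove_punctuation(text)
    return remove_extra_whitespace(text)

print(clean_text("<p>I DON'T like the   Café!</p>"))  # i do not like the cafe
```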

Labeling Positive and Negative Reviews:

Exploratory Data Analysis:

Stars Distribution:

Number of Unique Businesses Reviewed:

46,706 unique businesses are reviewed, and the most-reviewed business has 128 reviews. (The total number of reviews is 100,000.)

Number of Unique Users:

82,135 unique users posted reviews, and the most active user posted 55 reviews. (The total number of reviews is 100,000.)

Labeling Positive and Negative Reviews:

In this part, we classify the review ratings into two classes, 0 (negative, for ratings less than or equal to 3) and 1 (positive, for ratings greater than 3), by creating a new column, "Target", for further processing.

Word Cloud:

Positive Reviews:
Negative Reviews:

Distinct Words:

Preprocessing:

Tokenization:

Library:

Tokenizing and Creating Sequence:

The TensorFlow (Keras) Tokenizer class is used to automate the tokenization of our training data.
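To show what tokenizing and creating sequences produces, here is a simplified pure-Python stand-in. It is not the Keras class itself: the function names, the tie-breaking rule, and the padding choices are assumptions made for illustration. In Keras, the corresponding calls are `Tokenizer.fit_on_texts`, `Tokenizer.texts_to_sequences`, and `pad_sequences`.

```python
from collections import Counter

def fit_word_index(texts, num_words=None):
    """Map words to integer ids, most frequent first; id 0 is kept free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if num_words is not None:
        ranked = ranked[:num_words]
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_padded_sequences(texts, word_index, maxlen):
    """Convert texts to fixed-length id lists: unknown words are dropped,
    long texts keep their last maxlen tokens, short texts are left-padded with 0."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if len(ids) > maxlen:
            ids = ids[len(ids) - maxlen:]
        sequences.append([0] * (maxlen - len(ids)) + ids)
    return sequences

toy_texts = ["good food good service", "bad food"]
word_index = fit_word_index(toy_texts)
padded = texts_to_padded_sequences(toy_texts, word_index, maxlen=5)
print(word_index)  # {'food': 1, 'good': 2, 'bad': 3, 'service': 4}
print(padded)      # [[0, 2, 1, 2, 4], [0, 0, 0, 3, 1]]
```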

Vectorization:

Library:

Unigrams and bigrams are used; a bigram must occur at least 10 times to be included.
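A rough pure-Python sketch of this vocabulary rule follows. It assumes the threshold applies to the total number of occurrences across the corpus and that the features are raw counts; the notebook's actual vectorizer and weighting are not shown here, so treat both as assumptions. All names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (joined by a space) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(texts, min_bigram_count=10):
    """Keep every unigram, and only bigrams seen at least min_bigram_count times."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for text in texts:
        tokens = text.split()
        unigram_counts.update(ngrams(tokens, 1))
        bigram_counts.update(ngrams(tokens, 2))
    terms = sorted(unigram_counts)
    terms += sorted(b for b, c in bigram_counts.items() if c >= min_bigram_count)
    return {term: idx for idx, term in enumerate(terms)}

def vectorize(text, vocabulary):
    """Return a count vector for one text over the given vocabulary."""
    tokens = text.split()
    vector = [0] * len(vocabulary)
    for term in ngrams(tokens, 1) + ngrams(tokens, 2):
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector

# Toy corpus: "very good" and "good food" occur 10 times, "very bad" only once.
toy_texts = ["very good food"] * 10 + ["very bad service"]
vocab = build_vocabulary(toy_texts)
print(sorted(vocab, key=vocab.get))
# ['bad', 'food', 'good', 'service', 'very', 'good food', 'very good']
print(vectorize("very good food", vocab))  # [0, 1, 1, 0, 1, 1, 1]
```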

Applying Vectorization:

Training, Validation and Test Data Splitting:
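A three-way split can be sketched as below. The 80/10/10 proportions, the seed, and the function name are assumptions for illustration only; the proportions used in the notebook are not stated here. Applying scikit-learn's `train_test_split` twice is a common alternative.

```python
import random

def train_val_test_split(items, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle a copy of the items and cut it into train, validation, and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed keeps the split reproducible, so every model is compared on the same held-out rows.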

Model Building:

LSTM:

A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. This makes LSTMs well suited to classifying sequences such as review text.
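To make the roles of the cell and the three gates concrete, here is one time step of the standard LSTM update written for scalar values. Real layers use weight matrices and vectors; the scalar form, the parameter layout, and the toy weights are simplifications assumed for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and scalar state.
    params maps each name to a (w_x, w_h, bias) tuple."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much of the old cell value to keep
    i = sigmoid(pre_activation("input"))        # how much of the candidate to write
    o = sigmoid(pre_activation("output"))       # how much of the cell to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value for the cell
    c = f * c_prev + i * g                      # new cell state
    h = o * math.tanh(c)                        # new hidden state (the unit's output)
    return h, c

# Toy weights: all zeros, so every gate is 0.5 and the candidate is 0.
toy_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, params=toy_params)
print(c)  # 0.5
```

With these weights the forget gate halves the previous cell value and nothing new is written, which is why the cell state drops from 1.0 to 0.5.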

Library:

Train Test Splitting:

Building Model:

Optimization of parameters:

Model fit:

The number of epochs and the batch size were reduced after trial runs.

Model Evaluation:

Confusion Matrix:
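For the binary Target used here, a confusion matrix reduces to four counts. The sketch below uses made-up labels and hypothetical function names; rows are true classes and columns are predicted classes, the same layout that scikit-learn's `confusion_matrix` returns.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

def accuracy(matrix):
    """Share of predictions on the diagonal (TN + TP) among all predictions."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total if total else 0.0

# Made-up labels for illustration only; not results from the notebook.
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix_2x2(toy_true, toy_pred)
print(cm)  # [[2, 1], [1, 2]]
```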

GloVe:

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit linear substructures of the word vector space. I will use the GloVe Twitter model for this classification problem.

Library:

GloVe Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased; 25d, 50d, 100d, and 200d vectors; 1.42 GB download). The 100d vectors were selected by a grid search run in Microsoft Azure Studio.

Train Test Splitting:

Building Model:

Using glove.twitter.27B.100d

Creating Embedding Matrix:
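An embedding matrix places the pretrained vector of each vocabulary word in the row given by that word's integer id. The sketch below assumes the usual GloVe text layout (a word followed by its space-separated vector components on each line) and a `word_index` mapping like the one a tokenizer produces. The function names are hypothetical, and made-up 3-dimensional vectors stand in for the 100-dimensional file.

```python
import io

def load_glove(file_obj):
    """Parse GloVe text format into a {word: [floats]} dictionary."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with id i.
    Row 0 (padding) and words without a pretrained vector stay all-zero."""
    n_rows = max(word_index.values(), default=0) + 1
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is not None and len(vector) == dim:
            matrix[idx] = list(vector)
    return matrix

# Made-up vectors for illustration; the notebook would open the GloVe file instead.
toy_file = io.StringIO("food 0.1 0.2 0.3\ngood 0.4 0.5 0.6\n")
toy_word_index = {"food": 1, "good": 2, "yummy": 3}
embedding_matrix = build_embedding_matrix(toy_word_index, load_glove(toy_file), dim=3)
for row in embedding_matrix:
    print(row)
```

A matrix of this shape is what is typically passed as the initial weights of an embedding layer, with one row per word id.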

Optimization of parameters:

Model Fit:

The number of epochs and the batch size were reduced after trial runs.

Model Evaluation:

Confusion Matrix:

SVM:

Library:

Train Test Splitting:

Model Building:

Grid Search:
Models:

Model Evaluation:

Confusion Matrix:

Logistic Regression:

Library:

Train Test Splitting:

Model Building:

Grid Search:
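A grid search tries every combination of candidate parameter values and keeps the best-scoring one. The sketch below shows only that loop: the parameter names, the candidate values, and the scoring function are hypothetical stand-ins, not the grid used in this notebook. In scikit-learn this is handled by `GridSearchCV`, which scores each combination with cross-validation.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Evaluate score_fn on every parameter combination; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function standing in for cross-validated accuracy.
def toy_score(C, penalty):
    return -abs(C - 1.0) + (0.1 if penalty == "l2" else 0.0)

# Hypothetical grid; the values are chosen only to exercise the loop.
toy_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
best_params, best_score = grid_search(toy_score, toy_grid)
print(best_params)  # {'C': 1.0, 'penalty': 'l2'}
```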

Model Fit:

Model Evaluation:

Confusion Matrix:

Package Versions: