Yelp Review Dataset Sentiment Analysis:



Loading Dataset:

Due to resource constraints, I used a sample of the data in a Google Colab notebook. The sampled CSV file contains 100k rows and 9 columns. Download link
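
A minimal sketch of the loading step, assuming pandas is used. The file name `yelp_sample.csv`, the helper `load_reviews`, and the columns in the demo data are hypothetical; the real file name and schema are not given here.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read the sampled review CSV from a path or a file-like object."""
    df = pd.read_csv(source)
    print(f"{df.shape[0]} rows, {df.shape[1]} columns")
    return df


# Tiny in-memory stand-in for the real file, for demonstration only.
demo_csv = io.StringIO(
    "review_id,stars,text\n"
    "r1,5,Great food\n"
    "r2,1,Terrible service\n"
)
demo_df = load_reviews(demo_csv)

# With the real sample: df = load_reviews("yelp_sample.csv")  # hypothetical path
```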

Removing quotes from columns:

Exploratory Data Analysis:

Note: the EDA should be positioned after pre-processing.

Stars Distribution:

Number of Unique Businesses Reviewed:

46,706 unique businesses are reviewed, and the most-reviewed business has 128 reviews. (The total number of reviews is 100,000.)

Number of Unique Users:

82,135 unique users posted reviews, and the most active user posted 55 reviews. (The total number of reviews is 100,000.)
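
Counts like the ones for businesses and users can be obtained along these lines. This is only a sketch on toy data: the column names `business_id` and `user_id` and the helper `unique_summary` are assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy data; `business_id` and `user_id` are assumed column names.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3", "b1"],
    "user_id": ["u1", "u2", "u1", "u3", "u1"],
})


def unique_summary(df, column):
    """Return (number of unique values, largest review count for a single value)."""
    counts = df[column].value_counts()  # sorted in descending order by default
    return int(df[column].nunique()), int(counts.iloc[0])


n_businesses, max_business_reviews = unique_summary(reviews, "business_id")
n_users, max_user_reviews = unique_summary(reviews, "user_id")
```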

Labeling Positive and Negative Reviews:

In this part, we split the review ratings into two classes, 0 (negative, for ratings less than or equal to 3) and 1 (positive, for ratings greater than 3), and store the result in a new column, "Target", for further processing.
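
The labeling rule can be sketched as follows, assuming the rating lives in a column called `stars` (an assumed name); "Target" is the column name given above.

```python
import pandas as pd

reviews = pd.DataFrame({"stars": [1, 2, 3, 4, 5]})

# 1 (positive) when the rating is above 3, otherwise 0 (negative).
reviews["Target"] = (reviews["stars"] > 3).astype(int)
```

Note that a 3-star review ends up in the negative class under this rule.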

Word Cloud:

Positive Reviews:
Negative Reviews:




Some of the cleaning functions are adapted from this link and this link.

Removing HTML tags:

Removing Punctuation:

Removing Extra Whitespace:

All Lowercase:

Converting Accented Characters:

Expanding Contractions:

Applying Cleaning Functions:
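
The cleaning steps listed above could be chained roughly as in the sketch below, which uses only the standard library. All function names, the tiny `CONTRACTIONS` map, and the order of the steps are assumptions for illustration; in particular, contractions are expanded before punctuation is removed here, because the apostrophes would otherwise be lost.

```python
import re
import string
import unicodedata

# Tiny illustrative map; a real pipeline would need a much fuller list.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
CONTRACTION_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_html_tags(text):
    return re.sub(r"<[^>]+>", " ", text)


def remove_accents(text):
    # Decompose accented characters, then drop the combining marks.
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))


def expand_contractions(text):
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def remove_punctuation(text):
    # Replace each punctuation character with a space.
    return text.translate(PUNCT_TABLE)


def remove_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def clean_text(text):
    text = remove_html_tags(text)
    text = text.lower()
    text = remove_accents(text)
    text = expand_contractions(text)
    text = remove_punctuation(text)
    return remove_extra_whitespace(text)


sample = "<p>The café   can't be beat, it's GREAT!</p>"
cleaned = clean_text(sample)
```

On a DataFrame, the same function could be applied with something like `df["text"].apply(clean_text)`, where `text` is an assumed column name.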


Distinct Words:




Tokenizing and Creating Sequence:

The TensorFlow (Keras) Tokenizer class is used to automate the tokenization of our training data.
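
To show what this step produces, here is a dependency-free sketch of the same idea: build a frequency-ranked word index, map each text to integer ids, and pad to a fixed length. It is not the Keras class itself; the function names, the right-padding choice, and `max_len` are assumptions.

```python
from collections import Counter


def build_word_index(texts, num_words=None):
    """Map each word to an integer rank, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}


def texts_to_padded_sequences(texts, word_index, max_len):
    """Turn texts into fixed-length integer sequences, dropping unknown words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[:max_len]                      # truncate long reviews
        ids = ids + [0] * (max_len - len(ids))   # right-pad short reviews with 0
        sequences.append(ids)
    return sequences


train_texts = ["good food good service", "bad food"]
word_index = build_word_index(train_texts)
padded = texts_to_padded_sequences(train_texts, word_index, max_len=5)
```

The word index should be built from the training texts only, so that validation and test reviews are encoded with the same vocabulary.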



Unigrams and bigrams are used; a bigram must occur at least 10 times to be kept.
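
One way to read this rule is sketched below in plain Python: every unigram is kept, while a bigram is kept only if it is seen at least 10 times in the corpus. The function name and the toy documents are hypothetical, and the notebook may well use a library vectorizer instead.

```python
from collections import Counter


def ngram_vocabulary(tokenized_docs, min_bigram_count=10):
    """Collect all unigrams plus the bigrams seen at least `min_bigram_count` times."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    kept_bigrams = {" ".join(pair) for pair, n in bigrams.items() if n >= min_bigram_count}
    return set(unigrams) | kept_bigrams


# "not good" occurs 10 times and is kept; "very good" occurs 3 times and is dropped.
docs = [["not", "good"]] * 10 + [["very", "good"]] * 3
vocab = ngram_vocabulary(docs, min_bigram_count=10)
```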

Applying Vectorization:

Training, Validation and Test Data Splitting:
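
A minimal sketch of a three-way split, using only the standard library. The 80/10/10 proportions, the seed, and the function name are assumptions; the actual split sizes are not stated here, and this version does not stratify by class.

```python
import random


def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = train_val_test_split(range(100))
```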

Model Building:


A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. This makes LSTMs a good option for sequence classification problems such as sentiment analysis.
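
The gate mechanics described above can be written out as a single time step in NumPy. This is a sketch of the standard LSTM equations with random weights, not the Keras layer used for training; the weight layout (four stacked row blocks) and all names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,).
    The four row blocks hold the input, forget, and output gates and the candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: how much new information to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: how much of the old cell to keep
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: how much of the cell to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate values for the cell
    c = f * c_prev + i * g                  # updated cell state
    h = o * np.tanh(c)                      # updated hidden state
    return h, c


rng = np.random.default_rng(0)
input_dim, hidden = 3, 2
W = rng.normal(size=(4 * hidden, input_dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

# Run a toy sequence of 5 input vectors through the cell.
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x, h, c, W, U, b)
```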


Train Test Splitting:

Building Model:

Optimization of parameters:

Model fit:

The number of epochs and the batch size were reduced after trial runs.

Model Evaluation:

Confusion Matrix:
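
For reference, a binary confusion matrix can be computed as in this small sketch. The labels and predictions are made-up toy values, not results from the model, and the function name is hypothetical.

```python
def confusion_matrix_binary(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix_binary(y_true, y_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
```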


GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit linear substructures of the word vector space. I will use the GloVe Twitter model for this classification problem.


GloVe Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased; 25d, 50d, 100d, and 200d vectors; 1.42 GB download). The 100d vectors were selected by a grid search run in Microsoft Azure Studio.

Train Test Splitting:

Building Model:

Using glove.twitter.27B.100d

Creating Embedding Matrix:
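
A sketch of how such a matrix is commonly built, assuming the GloVe text format (one token per line followed by its vector components) and a word index that starts at 1. The function names and the three-dimensional demo vectors are made up; with the real vectors the dimension would be 100.

```python
import numpy as np


def parse_glove_lines(lines):
    """Parse GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")  # row 0 = padding
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Made-up 3-dimensional vectors standing in for the real 100d file.
demo_lines = ["good 0.1 0.2 0.3", "bad -0.1 -0.2 -0.3"]
word_index = {"good": 1, "bad": 2, "zzz": 3}
emb = build_embedding_matrix(word_index, parse_glove_lines(demo_lines), dim=3)

# With the real vectors, the lines would come from the downloaded text file
# (assumed to be named glove.twitter.27B.100d.txt) and dim would be 100.
```

The resulting matrix would typically be passed as the initial weights of the model's embedding layer.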

Optimization of parameters:

Model Fit:

The number of epochs and the batch size were reduced after trial runs.

Model Evaluation:

Confusion Matrix:



Train Test Splitting:

Model Building:

Grid Search:

Model Evaluation:

Confusion Matrix:

Logistic Regression:


Train Test Splitting:

Model Building:

Grid Search:

Model Fit:

Model Evaluation:

Confusion Matrix:

Package Versions: