Sentiment Analysis for Amazon Movie Reviews Dataset

15.05.2020
Anılcan Atik

The Amazon product dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 to July 2014. This project uses its Movies and TV reviews subset.

This dataset includes reviews (ratings, text, helpfulness votes) and product metadata (descriptions, category information, price, brand, and image features).

{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}

The UCSB webpage can be useful for more information about the dataset.

Converting .json into .csv format:

In [1]:
import csv
import json

# download link = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz"
input_file = "C:/Users/Joahn/Desktop/ML Intro Term Project/Movies_and_TV_5.json"
output_file = "C:/Users/Joahn/Desktop/ML Intro Term Project/Movies_and_TV_5.csv"

# newline="" stops the csv module from writing blank rows on Windows
with open(input_file, "r", encoding="utf-8") as input_json, \
        open(output_file, "w", encoding="utf-8", newline="") as output_csv:
    csv_writer = csv.writer(output_csv)
    header_written = False
    # the file holds one JSON object per line, so stream it line by line
    for line in input_json:
        dic = json.loads(line)
        # write the header row (the keys of the first record) once
        if not header_written:
            csv_writer.writerow(dic.keys())
            header_written = True
        csv_writer.writerow(dic.values())

print("Done")
              
Done
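
The conversion above writes each record's values in whatever order and number they appear, so it assumes every record carries the same keys as the first one. If some records lack a field (for example `reviewerName`, which is an assumption here and not something the output above confirms), their values shift into the wrong columns. A minimal sketch of a more defensive variant, using `csv.DictWriter` with a fixed column list; the function name `json_lines_to_csv` and the demo records are hypothetical:

```python
import csv
import io
import json

# Column order taken from the sample record shown earlier
FIELDS = ["reviewerID", "asin", "reviewerName", "helpful", "reviewText",
          "overall", "summary", "unixReviewTime", "reviewTime"]

def json_lines_to_csv(lines, out):
    """Write one CSV row per JSON line, leaving missing fields empty."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    for line in lines:
        writer.writerow(json.loads(line))

# Hypothetical demo records; the second one has no reviewerName
demo_lines = [
    '{"reviewerID": "A1", "asin": "0001", "reviewerName": "J. Doe", "overall": 5.0}',
    '{"reviewerID": "A2", "asin": "0002", "overall": 2.0}',
]
buffer = io.StringIO()
json_lines_to_csv(demo_lines, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[2])
```

With this approach every row has nine columns, and a missing field becomes an empty cell instead of shifting the rating into another column.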
              

Eliminating the neutral ratings:

Ratings run from 1 to 5: ratings of 1 and 2 indicate a negative response, whereas ratings of 4 and 5 indicate a positive response.
Thus we eliminate the neutral rating 3 and re-label positive ratings as +1 and negative ratings as -1.

In [77]:
import pandas as pd
import string

input_data = pd.read_csv("C:/Users/Joahn/Desktop/ML Intro Term Project/Movies_and_TV_5.csv")
input_data['overall'] = input_data['overall'].astype(object) # fix datatype error
input_data['reviewText'] = input_data['reviewText'].astype(object) # fix datatype error
              
In [78]:
input_data.head(3)
              
Out[78]:
   reviewerID      asin        reviewerName                         helpful  reviewText                                         overall  summary                                unixReviewTime  reviewTime
0  ADZPIG9QOCDG5   0005019281  Alice L. Larson "alice-loves-books"  [0, 0]   This is a charming version of the classic Dick...  4.0      good version of a classic              1203984000      02 26, 2008
1  A35947ZP82G7JH  0005019281  Amarah Strack                        [0, 0]   It was good but not as emotionally moving as t...  3.0      Good but not as moving                 1388361600      12 30, 2013
2  A3UORV8A9D5L2E  0005019281  Amazon Customer                      [0, 0]   Don't get me wrong, Winkler is a wonderful cha...  3.0      Winkler's Performance was ok at best!  1388361600      12 30, 2013
In [79]:
input_data.shape
              
Out[79]:
(1697533, 9)
In [80]:
dataset = {"reviewText": input_data["reviewText"], "overall": input_data["overall"]}
dataset = pd.DataFrame(data = dataset)
dataset = dataset.dropna()
              
In [81]:
dataset.head(3)
              
Out[81]:
   reviewText                                         overall
0  This is a charming version of the classic Dick...  4.0
1  It was good but not as emotionally moving as t...  3.0
2  Don't get me wrong, Winkler is a wonderful cha...  3.0
In [82]:
dataset.shape
              
Out[82]:
(1697472, 2)

Introducing Positive and Negative Labels:

After eliminating the neutral reviews rated 3,
the positive label is +1 and covers overall ratings of 4 and 5,
while the negative label is -1 and covers overall ratings of 1 and 2.

In [85]:
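# compare as strings so the filter works whether ratings were read as text or as floats
dataset = dataset[dataset["overall"].astype(str) != "3.0"].copy()
# lexicographic string comparison: "4.0" and "5.0" sort after "3", "1.0" and "2.0" do not
dataset["label"] = dataset["overall"].apply(lambda rating: +1 if str(rating) > '3' else -1)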
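
The labelling above relies on comparing ratings as strings, which works for well-formed values such as "4.0" but would also label any malformed value that happens to sort after "3" as positive. A minimal sketch of a numeric alternative, assuming malformed or neutral ratings should simply be dropped; the function name `rating_to_label` is hypothetical and not part of the notebook:

```python
def rating_to_label(rating):
    """Map a rating to +1 (4 or 5), -1 (1 or 2), or None (neutral/invalid)."""
    try:
        value = float(rating)
    except (TypeError, ValueError):
        # not a number at all, e.g. a value from a shifted column
        return None
    if value in (4.0, 5.0):
        return 1
    if value in (1.0, 2.0):
        return -1
    # rating 3, NaN, or anything outside the 1-5 scale
    return None

print(rating_to_label("5.0"), rating_to_label(2), rating_to_label("3.0"))
```

In pandas, a function like this could be applied with `dataset["overall"].map(rating_to_label)` followed by `dropna()` on the label column. Note that this would not necessarily reproduce the exact row counts shown in this notebook.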
              
In [86]:
dataset.head(3)
              
Out[86]:
   reviewText                                         overall  label
0  This is a charming version of the classic Dick...  4.0      1
3  Henry Winkler is very good in this twist on th...  5.0      1
4  This is one of the best Scrooge movies out. H...    4.0      1
In [87]:
dataset.shape
              
Out[87]:
(1496953, 3)
In [88]:
dataset.count()
              
Out[88]:
reviewText    1496953
overall       1496953
label         1496953
dtype: int64
In [89]:
print("Number of positive reviews are {}, while number of negative reviews are {} in the dataset".format((dataset.label == 1).sum(),(dataset.label == -1).sum()))
              
Number of positive reviews are 1291214, while number of negative reviews are 205739 in the dataset
              
  • There are far fewer negative reviews than positive ones in the data, which might bias a model towards predicting positive reviews.

  • I need to investigate further whether to train on a balanced or an imbalanced dataset.

  • dataset_i = imbalanced
  • dataset_b = balanced
  • Due to time and resource constraints, I will sample the dataset and use a small subset of the reviews in my models.
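
To make the imbalance concrete, the class shares can be worked out from the counts printed above. This is only a small arithmetic sketch; the variable names are illustrative:

```python
# Class counts reported in the notebook output above
n_pos = 1291214
n_neg = 205739
total = n_pos + n_neg

pos_share = n_pos / total
neg_share = n_neg / total

# A model that always predicts "positive" is correct for every positive
# review, so its accuracy on this data equals the positive share.
majority_baseline_accuracy = pos_share

print(f"positive share: {pos_share:.4f}")
print(f"negative share: {neg_share:.4f}")
print(f"positive reviews per negative review: {n_pos / n_neg:.2f}")
```

The positive share is also the accuracy of a trivial always-positive classifier, so accuracy alone can look strong on the imbalanced data without the model having learned anything about negative reviews.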

Dataset_i:

In [92]:
dataset_i = dataset.sample(frac = 0.03, replace = False, random_state=42)
              
In [93]:
dataset_i.count()
              
Out[93]:
reviewText    44909
overall       44909
label         44909
dtype: int64
In [94]:
print("Number of positive reviews are {}, while number of negative reviews are {} in the dataset.".format((dataset_i.label == 1).sum(),(dataset_i.label == -1).sum()))
              
Number of positive reviews are 38819, while number of negative reviews are 6090 in the dataset.
              

Dataset_b:

  • This dataset includes roughly 25,000 negative and 25,000 positive reviews, drawn with a separate sampling fraction for each class.
In [101]:
dataset_neg = dataset[dataset["label"] == -1]
dataset_pos = dataset[dataset["label"] == +1]
dataset_neg = dataset_neg.sample(frac = 0.1215, replace = False, random_state = 42)
dataset_pos = dataset_pos.sample(frac = 0.01936, replace = False, random_state = 42)
print("dataset_neg: {}, dataset_pos: {}.".format(dataset_neg.count(),dataset_pos.count()))
dataset_b = pd.concat([dataset_neg,dataset_pos])
print("dataset_b: {}".format(dataset_b.count()))
              
dataset_neg: reviewText    24997
overall       24997
label         24997
dtype: int64, dataset_pos: reviewText    24998
overall       24998
label         24998
dtype: int64.
dataset_b: reviewText    49995
overall       49995
label         49995
dtype: int64
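
The two sampling fractions used above are consistent with a target of about 25,000 reviews per class. That target is an inference from the output, not something the notebook states, and the helper name `frac_for_target` is hypothetical. A short arithmetic sketch of how such fractions can be derived from the class counts:

```python
def frac_for_target(target_size, class_size):
    """Fraction of a class to sample in order to get about target_size rows."""
    return target_size / class_size

# Class counts reported earlier in the notebook
n_neg = 205739
n_pos = 1291214
target = 25000  # assumed per-class target, inferred from the output above

frac_neg = frac_for_target(target, n_neg)
frac_pos = frac_for_target(target, n_pos)
print(f"frac_neg = {frac_neg:.4f}, frac_pos = {frac_pos:.5f}")

# Row counts implied by the rounded fractions used in the notebook
rows_neg = round(0.1215 * n_neg)
rows_pos = round(0.01936 * n_pos)
print(rows_neg, rows_pos)
```

pandas `DataFrame.sample` also accepts `n=` in place of `frac=`, which requests an exact number of rows and avoids hand-tuning fractions.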
              
In [104]:
dataset_b.head()
              
Out[104]:
         reviewText                                         overall  label
715454   I was fortunate enough to watch this with John...  2.0      -1
201768   Good movie, bad format. If you people would st...  1.0      -1
62207    Awful, horribly stupid movie. One of THE most ...  1.0      -1
1658239  I SHOULD NOT HAVE BOUGHT IT, I DID SO BECAUSE...   1.0      -1
570367   I have always been a fan of Charlie's Angels, ...  2.0      -1

Data Cleaning Process:

  • For data cleaning I used NLTK's WordNet resources, together with a set of cleaning functions that fit my pre-processing goals.
In [105]:
import string

from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(tag):
    # map a Penn Treebank POS tag to the WordNet POS constant, defaulting to noun
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# build the stop-word set and the lemmatizer once, not once per review or token
stop = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # lower text
    text = text.lower()
    # tokenize text and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text
    pos_tags = pos_tag(text)
    # lemmatize text
    text = [lemmatizer.lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all
    text = " ".join(text)
    return text

# clean text data_i
dataset_i["review_clean"] = dataset_i["reviewText"].apply(clean_text)
# clean text data_b
dataset_b["review_clean"] = dataset_b["reviewText"].apply(clean_text)
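
To see what the first cleaning steps do to a single sentence, here is a simplified, dependency-free sketch. It leaves out POS tagging and lemmatization, and `DEMO_STOP_WORDS` is a tiny stand-in list made up for this illustration (the notebook uses NLTK's full English stop-word list). The names `basic_clean` and `DEMO_STOP_WORDS` are hypothetical:

```python
import string

# Tiny stand-in stop-word list, for illustration only
DEMO_STOP_WORDS = {"i", "this", "it", "was", "the", "a", "in", "but", "and"}

def basic_clean(text):
    # lower-case, split on spaces, strip punctuation around each token
    tokens = [w.strip(string.punctuation) for w in text.lower().split(" ")]
    # drop tokens that contain a digit
    tokens = [w for w in tokens if not any(c.isdigit() for c in w)]
    # drop stop words, empty tokens, and one-letter tokens
    tokens = [w for w in tokens if w not in DEMO_STOP_WORDS and len(w) > 1]
    return " ".join(tokens)

sample = "I watched this in 2009, and it was GREAT!"
print(basic_clean(sample))  # prints: watched great
```

The full `clean_text` function would additionally reduce each remaining word to its lemma, as the `review_clean` column in the outputs shows (for example "gives" becoming "give").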
              
In [106]:
dataset_i.head()
              
Out[106]:
         reviewText                                         overall  label  review_clean
181173   The movie "Scream 2" is basically just like "S...  1.0      -1     movie scream basically like scream neither mov...
1459925  Good Movie, I liked it, Great FlimI like the d...  4.0      1      good movie like great flimi like depictation c...
56386    Oliver Stone gives Cruise his best role as Ron...  5.0      1      oliver stone give cruise best role ron kovic b...
1236023  It was difficult making this tv series into a ...  4.0      1      difficult make tv series movie series primaril...
272125   In 1946 John Wayne (1907-79) was a big star as...   2.0      -1     john wayne big star result film stagecoach dar...
In [107]:
dataset_b.head()
              
Out[107]:
         reviewText                                         overall  label  review_clean
715454   I was fortunate enough to watch this with John...  2.0      -1     fortunate enough watch john scott shepherd aut...
201768   Good movie, bad format. If you people would st...  1.0      -1     good movie bad format people would stop pay ju...
62207    Awful, horribly stupid movie. One of THE most ...  1.0      -1     awful horribly stupid movie one overrate movie...
1658239  I SHOULD NOT HAVE BOUGHT IT, I DID SO BECAUSE...   1.0      -1     buy mel gibson suppose stop watch min never see
570367   I have always been a fan of Charlie's Angels, ...  2.0      -1     always fan charlie's angel absolutely adore dr...

WordCloud for dataset_b

Most Used Words in Negative Reviews:

In [108]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud

neg_reviews = dataset_b[dataset_b.label == -1]
# join all cleaned negative reviews into one space-separated string
neg_string = neg_reviews.review_clean.str.cat(sep=' ')

wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
              

Most Used Words in Positive Reviews:

In [109]:
from matplotlib import pyplot as plt
from wordcloud import WordCloud

pos_reviews = dataset_b[dataset_b.label == +1]
# join all cleaned positive reviews into one space-separated string
pos_string = pos_reviews.review_clean.str.cat(sep=' ')

# generate the cloud from the positive reviews (pos_string, not neg_string)
wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(pos_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
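
A word cloud sizes words by how often they occur, so the same information can be read off as a plain frequency table. A minimal sketch using `collections.Counter`; the function name `top_words` and the demo reviews are hypothetical, standing in for the `review_clean` column:

```python
from collections import Counter

def top_words(cleaned_reviews, n=10):
    """Return the n most frequent words across already-cleaned reviews."""
    counts = Counter()
    for review in cleaned_reviews:
        counts.update(review.split())
    return counts.most_common(n)

# Hypothetical cleaned reviews, for illustration only
demo_reviews = ["bad movie bad acting", "boring movie"]
print(top_words(demo_reviews, n=2))  # prints: [('bad', 2), ('movie', 2)]
```

Applied to the notebook's data, this could be called as `top_words(neg_reviews.review_clean)` or `top_words(pos_reviews.review_clean)` to compare the two classes numerically.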