Yelp is a local search service for finding local businesses. People share reviews about their experiences with those businesses, which makes Yelp a very valuable source of information. Customer feedback can help identify and prioritize strengths and weaknesses for further development of the business. I am interested in the customer reviews for restaurants near me, in Chicago, IL.
Thanks to the internet, today we have access to numerous sources where people willingly share their experiences with different companies and services. We can use this opportunity to extract valuable information and derive actionable insights to deliver the best customer experience.
By scraping all those reviews we can collect a decent amount of quantitative and qualitative data, analyze it, and identify areas for improvement. Thankfully, Python provides libraries that make these tasks easy. I have no prior experience in web scraping, and I want to create my own data set to perform sentiment analysis. For web scraping I decided to use the requests library, which does the job and is very simple to use.
We can easily study the structure of the website by inspecting it in a web browser. After studying the structure of the Yelp website, I came up with a list of possible data variables to collect:
- Reviewer’s Name
- Review
- Date
- Star Rating
- Restaurant Name
The requests module lets you easily download files from the web. You can install the requests module using:
pip install requests
First, we go to the yelp website and search restaurants near me, location Chicago, IL.
Then, we will import all required libraries and create a pandas DataFrame.
import pandas as pd
import time as t
from lxml import html
import requests
reviews_df=pd.DataFrame()
Downloading the html page with requests.get()
import requests
searchlink = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Chicago,+IL'
user_agent = 'Enter your user agent here'
headers = {'User-Agent': user_agent}
You can find your user agent by searching for "what is my user agent" in your web browser.
You can reuse the same script to scrape restaurant reviews for any other location on this platform; all you need to do is change the search URL.
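For example, changing the find_loc parameter in the search URL points the scraper at a different city. The New York link below is only illustrative and follows the same pattern as the Chicago search above:
# Illustrative example: same search URL pattern, different location
searchlink = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New+York,+NY'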
page = requests.get(searchlink, headers = headers)
parser = html.fromstring(page.content)
requests.get() downloads the HTML page. Now we have to find the links to the individual restaurants on the page.
businesslink=parser.xpath('//a[@class="biz-name js-analytics-click"]')
links = [l.get('href') for l in businesslink]
These links are relative, so we will have to prepend the domain name to them.
u = []
for link in links:
    u.append('https://www.yelp.com' + str(link))
Now we have all the restaurant links from the first page; there are 30 search results on each page. Let's iterate through each restaurant page and get its reviews.
for item in u:
    page = requests.get(item, headers=headers)
    parser = html.fromstring(page.content)
The reviews are contained in a div with the class name "review review--with-sidebar". Let's grab all of these divs.
xpath_reviews = '//div[@class="review review--with-sidebar"]'
reviews = parser.xpath(xpath_reviews)
For each review we want to scrape the author name, review body, date, restaurant name, and star rating.
for review in reviews:
    temp = review.xpath('.//div[contains(@class, "i-stars i-stars--regular")]')
    rating = [td.get('title') for td in temp]
    xpath_author = './/a[@id="dropdown_user-name"]//text()'
    xpath_body = './/p[@lang="en"]//text()'
    author = review.xpath(xpath_author)
    date = review.xpath('.//span[@class="rating-qualifier"]//text()')
    body = review.xpath(xpath_body)
    heading = parser.xpath('//h1[contains(@class,"biz-page-title embossed-text-white")]')
    bzheading = [td.text for td in heading]
We will create a dictionary for all these items and then append this dictionary to a pandas DataFrame.
    review_dict = {'restaurant': bzheading,
                   'rating': rating,
                   'author': author,
                   'date': date,
                   'Review': body,
                   }
    reviews_df = reviews_df.append(review_dict, ignore_index=True)
Now we have all the reviews from one page. You can iterate through the pages by finding the maximum page number. The last page number is contained in an <a> tag with the class name "available-number pagination-links_anchor".
page_nums = '//a[@class="available-number pagination-links_anchor"]'
pg = parser.xpath(page_nums)
max_pg=len(pg)+1
Remember to add a sleep time inside each for loop to slow the script down and respect yelp.com's scraping policies. Sending too many requests can get your IP blocked.
import time as t
t.sleep(10)
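Putting these pieces together, a minimal sketch of the per-restaurant review loop could look like the code below. It assumes Yelp paginates reviews with a start query parameter in steps of 20; verify this against the current site structure before relying on it.
for pg_num in range(max_pg):
    # Request the next batch of reviews for this restaurant (offset of 20 per page, assumed)
    review_page = requests.get(item + '?start=' + str(20 * pg_num), headers=headers)
    parser = html.fromstring(review_page.content)
    reviews = parser.xpath(xpath_reviews)
    # ... extract rating, author, date, body, and restaurant name as shown above ...
    t.sleep(10)  # be polite between requests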
With the above script, I scraped a total of 23,869 reviews for 450 restaurants, with 20–60 reviews per restaurant.
Now let's open up a Jupyter notebook and perform text mining and sentiment analysis.
First import some necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
I have saved the data in a file named all.csv
data = pd.read_csv('all.csv')
Now let's look at the head and tail of the data frame.
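A quick look at the first and last few rows gives a sense of the raw data (a minimal check; the exact output depends on your scraped data):
data.head()
data.tail()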
We have 23,869 records with 5 columns. As we can see the data needs formatting. All the unnecessary symbols, tags, and spaces should be removed. There are also some Null/NaN values.
Drop all Null/NaN values from the data frame. Note that dropna() returns a new DataFrame rather than modifying it in place, so we assign the result back.
data = data.dropna()
Now using string slicing, we will remove the unnecessary symbols and spaces.
data['Review']=data.Review.str[2:-2]
data['author']=data.author.str[2:-2]
data['date']=data.date.str[12:-8]
data['rating']=data.rating.str[2:-2]
data['restaurant']=data.restaurant.str[16:-12]
data['rating']=data.rating.str[:1]
Now let's explore the data further.
The 5 star rating has the highest number of reviews, with a total of 11,859 records, and the 1 star rating has the lowest, with 979 records. But there are also some records with an unknown rating 't'. These records should be dropped.
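These counts can be checked with a simple value count (the exact numbers will depend on your own scraped data):
data['rating'].value_counts()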
data.drop(data[data.rating=='t'].index , inplace =True)
To understand the data better, we can create a new feature called review_length. This column will store the number of characters in each review, excluding white spaces.
data['review_length'] = data['Review'].apply(lambda x: len(x) - x.count(' '))
Now let's plot some graphs and understand the data.
hist = sns.FacetGrid(data=data, col='rating')
hist.map(plt.hist, 'review_length', bins=50)
We see that there is a higher number of 4 and 5 star reviews. The distribution of review lengths is very similar for all ratings.
Now let's create a box plot for the same.
sns.boxplot(x='rating', y='review_length', data=data)
Box plot for rating vs review length
From the box plot it looks like the 2 and 3 star ratings have longer reviews than the 5 star ratings. But there are many outliers for each star rating, which is evident from the number of dots above the boxes. So review length won't be a very useful feature for our sentiment analysis.
In order to determine whether a review is positive or negative, we will focus only on the 1 star and 5 star ratings. Let's create a new data frame to store the 1 and 5 star ratings.
df = data[(data['rating'] == 1) | (data['rating'] == 5)]
df.shape
Output: (12838, 6)
Out of 23,869 records we now have 12,838 records for 1 and 5 star ratings.
In order to use these reviews for analysis, the review text must be formatted properly. Let's check a sample to understand what we are dealing with.
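Printing one raw review shows the kind of noise we need to clean up (the exact output depends on your data):
df['Review'].iloc[0]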
Looks like there are a lot of punctuation symbols and some unknown codes like '\xa0'. '\xa0' is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space. Now let's create a function to remove all the punctuation and stop words, and then perform lemmatization of the text.
A common method for text pre-processing in natural language processing is bag of words. A bag-of-words model represents text as a collection of its words, disregarding grammar and word order. It is commonly used where the occurrence of each word serves as a feature for training a classifier; a toy illustration follows below.
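As a toy illustration (not part of the original script), counting word occurrences in two short sentences produces exactly the kind of matrix a classifier can consume:
from sklearn.feature_extraction.text import CountVectorizer
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(['the food was great', 'the service was slow'])
print(toy_vect.vocabulary_)    # each unique word maps to a column index
print(toy_counts.toarray())    # each row counts the words of one sentence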
Lemmatization: Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma. Lemmatization will always return the dictionary form of a word. For example, the words type, typed, and typing will be treated as the single word 'type'. This will be very useful for our analysis.
import string # Imports the library
import nltk # Imports the natural language toolkit
nltk.download('stopwords') # Download the stopwords dataset
nltk.download('wordnet')
wn=nltk.WordNetLemmatizer()
from nltk.corpus import stopwords
stopwords.words('english')[0:10]
Output: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
These stop words are often repeated and are neutral in nature. They don’t represent any positive or negative value and can be ignored.
def text_process_lemmatize(revw):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stopwords
    3. Creates a list of the cleaned text
    4. Returns a lemmatized version of the list
    """
    # Replace the xa0 with a space
    revw = revw.replace('xa0', ' ')
    # Check characters to see if they are in punctuation
    nopunc = [char for char in revw if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    token_text = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    # Perform lemmatization of the above list
    cleantext = ' '.join(wn.lemmatize(word) for word in token_text)
    return cleantext
Now let's apply the function we just created to the review column.
df['LemmText']=df['Review'].apply(text_process_lemmatize)
We need to convert the lemma lists in df['LemmText'] into vectors that a machine learning algorithm can use and understand. This process is known as vectorizing. It creates a matrix with each review as a row and each unique lemma as a column, containing the number of occurrences of each lemma. We will use the CountVectorizer and N-grams from the scikit-learn library, focusing only on unigrams.
from sklearn.feature_extraction.text import CountVectorizer
ngram_vect = CountVectorizer(ngram_range=(1,1))
X_counts = ngram_vect.fit_transform(df['LemmText'])
Now let's create training and testing data sets using train_test_split from scikit-learn. Here X is the vectorized review matrix X_counts and y is the rating column.
from sklearn.model_selection import train_test_split
X = X_counts
y = df['rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# test_size=0.3 means the training set will be 70% and the test set will be 30%
We will use Multinomial Naive Bayes to predict the sentiment: 1 star ratings represent negative reviews and 5 star ratings represent positive reviews. Let's create a MultinomialNB model and fit it with X_train and y_train.
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train,y_train)
Now let's predict on the test set X_test.
NBpredictions = nb.predict(X_test)
Now let's evaluate our model's predictions against the actual star ratings from y_test.
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,NBpredictions))
print('\n')
print(classification_report(y_test,NBpredictions))
The model has an accuracy of 97%, which is very good. This model can predict whether a customer liked or disliked the restaurant from the review they posted.
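As a quick illustration of how the fitted pieces could be reused on unseen text (a minimal sketch, assuming the vectorizer, cleaning function, and model from above are still in memory; the sample review is made up):
new_review = 'The food was amazing and the staff were very friendly!'
new_counts = ngram_vect.transform([text_process_lemmatize(new_review)])
print(nb.predict(new_counts))  # predicted rating: 1 (negative) or 5 (positive)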
By Kaustubh Borole on October 9, 2018.