
Purchase_Analysis

Table of Contents

  • Overview of the Analysis
  • Results
  • Summary
  • Contact Information

Overview of the Analysis

Purpose:

This project aims to analyze the Google Analytics dataset to predict whether a visitor will make a purchase while active on the website. The machine learning model has been created within BigQuery on Google Cloud Platform (GCP), using SQL.

About the Dataset:

The dataset is the Google Analytics dataset taken from the public datasets available in BigQuery. It contains data spanning web sessions for each day from August 01, 2016 to August 01, 2017. The dataset contains almost a million entries (903,653 to be exact); each entry denotes an active web session for a visitor to the website.

Tools Used:

  • Google Cloud Platform (GCP) - BigQuery, Vertex AI
  • SQL
  • Python
  • Vertex AI API
  • Notebooks API
  • Python Notebook
  • BigQuery Client

Description:

Part one of this project runs multiple queries on the dataset to retrieve some general information from Google Analytics. Part two of the project comprises building a machine learning model to help predict whether a visitor will make a purchase while active on the website. The relevant queries for each part can be found in the Queries folder.

Part ONE: Running Queries

Queries were run on the dataset to calculate the percentage of purchasers versus non-purchasers, top 10 products by revenue, top 10 countries by visitors, and number of visitors by Channel Grouping (also referred to as simply "Channel" herein) type.

Some queries were run in the BigQuery Editor, while others were run in a Python notebook using the Vertex AI API and Notebooks API. To run a query within a notebook, the BigQuery Client was used to send queries to and receive results from BigQuery. The query results were then displayed in a dataframe, and the table pivoted to generate a bar plot.

Percentage of Purchasers versus Non-Purchasers

Subqueries were written for unique visitors, purchasers, and non-purchasers. The number of unique visitors was obtained by COUNTing all DISTINCT values of the fullVisitorId field. The number of purchasers was obtained by COUNTing all DISTINCT fullVisitorId values where total transactions were NOT NULL. Conversely, the number of non-purchasers was obtained by counting all distinct fullVisitorId values where total transactions were NULL.

These values were then used to calculate the percentage of purchasers and non-purchasers in the dataset. The following query was run in the BigQuery Editor.

(Image: purchasers_per)
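Since the original query appears only as a screenshot, below is a hedged sketch of what such a query might look like. It assumes the public `bigquery-public-data.google_analytics_sample.ga_sessions_*` tables and is a reconstruction, not necessarily the author's exact query:

```sql
-- Sketch: counts and percentages of purchasers vs. non-purchasers
-- (assumes the public GA sample tables; not the author's exact query)
WITH visitors AS (
  SELECT COUNT(DISTINCT fullVisitorId) AS total_visitors
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
),
purchasers AS (
  SELECT COUNT(DISTINCT fullVisitorId) AS total_purchasers
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE totals.transactions IS NOT NULL
),
non_purchasers AS (
  SELECT COUNT(DISTINCT fullVisitorId) AS total_non_purchasers
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE totals.transactions IS NULL
)
SELECT
  total_visitors,
  total_purchasers,
  total_non_purchasers,
  ROUND(total_purchasers / total_visitors * 100, 1) AS purchasers_pct,
  ROUND(total_non_purchasers / total_visitors * 100, 1) AS non_purchasers_pct
FROM visitors, purchasers, non_purchasers;
```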

Top 10 Products by Revenue

The top 10 products by revenue were found by first UNNESTing the Hits array and then UNNESTing the Products array to get the fields for product name, product quantity, and local product revenue. The result was then GROUPed BY product name, so as to give the revenue for each particular product, and ORDERed BY product revenue in DESCending order. This gave a list of product names with their revenues in descending order, i.e. the product with the largest revenue listed at the top.

The following query and the subsequent notebook code were run in a Python notebook.
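The original notebook code is not reproduced here; the sketch below shows what the underlying query might look like. It assumes the public GA sample tables and that localProductRevenue is stored in micros (revenue × 10^6), and is not necessarily the author's exact query:

```sql
-- Sketch: top 10 products by revenue
-- (localProductRevenue is assumed to be stored in micros, hence the
--  division by 1,000,000)
SELECT
  p.v2ProductName AS product_name,
  SUM(p.productQuantity) AS product_quantity,
  SUM(p.localProductRevenue) / 1000000 AS product_revenue
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS h,
  UNNEST(h.product) AS p
WHERE p.localProductRevenue IS NOT NULL
GROUP BY product_name
ORDER BY product_revenue DESC
LIMIT 10;
```

In the notebook, a query string like this would typically be passed to the BigQuery Client, the result loaded into a dataframe, and the dataframe pivoted to produce the bar plot described above.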

The plot generated via the above code can be seen under the "Results" section.

Top 10 Countries by Visitors

The query written for the top 10 countries by visitors involved COUNTing all DISTINCT fullVisitorId values for each country (via GROUP BY), and then ordering the results in descending order to list the country with the highest number of visitors at the top.

(Image: top10countries)
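A hedged sketch of such a query (again assuming the public GA sample tables; not necessarily the author's exact query):

```sql
-- Sketch: top 10 countries by number of unique visitors
SELECT
  geoNetwork.country AS country,
  COUNT(DISTINCT fullVisitorId) AS visitors
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY country
ORDER BY visitors DESC
LIMIT 10;
```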

Visitors by Channel

The query written for visitors by Channel Grouping involves COUNTing all DISTINCT fullVisitorId values for each Channel Grouping (via GROUP BY), and then ORDERing BY the visitor count in descending order, so that the Channel Grouping with the highest number of visitors is listed at the top.
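A hedged sketch of such a query (same assumptions as above; not necessarily the author's exact query):

```sql
-- Sketch: unique visitors per Channel Grouping, highest first
SELECT
  channelGrouping,
  COUNT(DISTINCT fullVisitorId) AS visitors
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
GROUP BY channelGrouping
ORDER BY visitors DESC;
```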

Part TWO: Machine Learning Model

The machine learning model created in BigQuery using SQL consisted of training the model, followed by evaluating the model, and finally making predictions for whether a visitor will make a purchase while active on the website. For training, evaluating, and making predictions with the model, the first 9 months, the following 2 months, and the last 1 month of data were used, respectively.

Training the Model

The Train_query.sql file contains the SQL script which trains the model. Since our model aims to predict whether a visitor will make a purchase or not, we use logistic regression. To train the model, the first 9 months of data, from August 01, 2016 to April 30, 2017, was used.

The label created for the model was named 'purchased', and gives a value of 0 when no purchase was made and 1 if a purchase was made. This variable was codified to give a value of 1 if the total transactions made were greater than 0, and 0 if they equated to 0, followed by a GROUP BY on fullVisitorId.

A variable 'unique_identifier' was created by concatenating the variables fullVisitorId and visitID, to give a completely unique ID. The variable visitID, which is an INTEGER, was first CAST into a STRING.
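A minimal sketch of how these two columns might be derived, assuming the public GA sample tables and the 9-month training window (a reconstruction, not the author's exact Train_query.sql):

```sql
-- Sketch: the 'purchased' label and the per-session unique identifier
-- (the author's query additionally applies a GROUP BY on fullVisitorId
--  when deriving the label)
SELECT
  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_identifier,
  CASE WHEN totals.transactions > 0 THEN 1 ELSE 0 END AS purchased
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170430';  -- first 9 months
```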

The variables used from the dataset thus far are given below:

  • fullVisitorId - STRING - A unique visitor ID
  • visitID - INTEGER - Identifies a particular session
  • totals.transactions - INTEGER - Total number of transactions made within a session

The features selected for the model were variables that seem to relate most closely to the activity of making a purchase, and pertain to traffic type, browser type, device type, operating system, geographic location, activity on site, and advertisements. The list of feature variables after formatting is given below, followed by a description of each and a sketch of how they might feed into the training query:

(Image: variables_list)

  • trafficSource.isTrueDirect AS traffic_direct - BOOLEAN - whether the traffic was direct (i.e. visitor typed the name of website in browser or arrived via a previously bookmarked page) or not
  • trafficSource.medium AS traffic_medium - STRING - medium of the traffic, e.g. referral, organic, etc
  • trafficSource.source AS traffic_source - STRING - source of the traffic, e.g. hostname, search engine name, URL, etc
  • channelGrouping - STRING - default channel group for a visitor's session, e.g. organic search, paid search, display, social, etc
  • device.browser AS device_browser - STRING - the browser used, e.g. Chrome or Firefox
  • device.deviceCategory AS device_category - STRING - type of device, e.g. mobile, desktop, or tablet
  • device.operatingSystem AS device_OS - STRING - operating system of the device being used, e.g. Windows or Macintosh
  • geoNetwork.region AS region - STRING - region from which the session originated
  • totals.bounces AS bounces - INTEGER - total number of bounced sessions. Value for a bounced session is 1, and NULL otherwise
  • totals.newVisits AS new_visits - INTEGER - Total number of new users visiting in a session. Value is 1 for a new visit, and NULL otherwise
  • totals.pageviews AS page_views - INTEGER - Total number of page views in a session
  • totals.timeOnSite AS time_on_site - INTEGER - Total time of a session recorded in seconds
  • trafficSource.adwordsClickInfo.gclId AS ad_id - STRING - Google Click ID
  • trafficSource.adwordsClickInfo.isVideoAd AS ad_video - BOOLEAN - whether it is a video ad
  • trafficSource.adwordsClickInfo.page AS ad_on_pg_num - INTEGER - webpage number where the ad was displayed for the visitor to come across
  • trafficSource.adwordsClickInfo.slot AS ad_slot - STRING - position on the webpage where the ad was placed
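The sketch below shows how these features, together with the label described above, might feed into a BigQuery ML training statement. The dataset and model names (purchase_analysis.purchase_model) are illustrative assumptions; this is not the author's exact Train_query.sql:

```sql
-- Sketch: training a logistic regression model in BigQuery ML
-- (dataset and model names are assumed for illustration)
CREATE OR REPLACE MODEL `purchase_analysis.purchase_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['purchased']
) AS
SELECT
  CASE WHEN totals.transactions > 0 THEN 1 ELSE 0 END AS purchased,  -- label
  trafficSource.isTrueDirect AS traffic_direct,
  trafficSource.medium AS traffic_medium,
  trafficSource.source AS traffic_source,
  channelGrouping,
  device.browser AS device_browser,
  device.deviceCategory AS device_category,
  device.operatingSystem AS device_OS,
  geoNetwork.region AS region,
  totals.bounces AS bounces,
  totals.newVisits AS new_visits,
  totals.pageviews AS page_views,
  totals.timeOnSite AS time_on_site,
  trafficSource.adwordsClickInfo.gclId AS ad_id,
  trafficSource.adwordsClickInfo.isVideoAd AS ad_video,
  trafficSource.adwordsClickInfo.page AS ad_on_pg_num,
  trafficSource.adwordsClickInfo.slot AS ad_slot
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20160801' AND '20170430';  -- Aug 01, 2016 - Apr 30, 2017
```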

Evaluating the Model

The Evaluate_query.sql file contains the SQL script which evaluates the model created above. The model was evaluated on two months' worth of data, and produced the following metrics:

  • precision
  • recall
  • accuracy
  • f1_score
  • log_loss
  • roc_auc

The ROC (Receiver Operating Characteristic) curve helps visualize the relationship between the false positive rate (predicted 'True' but actually 'False') and the true positive rate (predicted 'True' and actually 'True'). The roc_auc metric gives the area under this curve: the larger the area, the better the model distinguishes purchasers from non-purchasers. For the purposes of this evaluation, roc_auc was codified as shown below:

(Image: roc_auc_code)
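Since the evaluation script itself is not reproduced here, the sketch below shows what such an evaluation might look like with BigQuery ML's ML.EVALUATE function. It reuses the illustrative model name from the training sketch and assumes the two-month evaluation window runs from May 01, 2017 to June 30, 2017; it is not the author's exact Evaluate_query.sql:

```sql
-- Sketch: evaluating the model on the following 2 months of data
SELECT
  precision,
  recall,
  accuracy,
  f1_score,
  log_loss,
  roc_auc
FROM ML.EVALUATE(
  MODEL `purchase_analysis.purchase_model`,
  (
    SELECT
      CASE WHEN totals.transactions > 0 THEN 1 ELSE 0 END AS purchased,  -- label
      trafficSource.isTrueDirect AS traffic_direct,
      trafficSource.medium AS traffic_medium,
      trafficSource.source AS traffic_source,
      channelGrouping,
      device.browser AS device_browser,
      device.deviceCategory AS device_category,
      device.operatingSystem AS device_OS,
      geoNetwork.region AS region,
      totals.bounces AS bounces,
      totals.newVisits AS new_visits,
      totals.pageviews AS page_views,
      totals.timeOnSite AS time_on_site,
      trafficSource.adwordsClickInfo.gclId AS ad_id,
      trafficSource.adwordsClickInfo.isVideoAd AS ad_video,
      trafficSource.adwordsClickInfo.page AS ad_on_pg_num,
      trafficSource.adwordsClickInfo.slot AS ad_slot
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20170501' AND '20170630'  -- following 2 months
  )
);
```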

Making Predictions

The Predict_query.sql file contains the SQL script that has the model make predictions on the remaining data. The model was used to make predictions on data from sessions spanning the last month. It predicts whether a particular visitor will end up making a purchase on the website (given the value 1, otherwise 0).
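A hedged sketch of what the prediction step might look like with ML.PREDICT, again using the illustrative model name and assuming the last month runs from July 01, 2017 to August 01, 2017 (not the author's exact Predict_query.sql):

```sql
-- Sketch: predicting purchases for sessions in the last month of data
SELECT
  unique_identifier,
  predicted_purchased,
  predicted_purchased_probs
FROM ML.PREDICT(
  MODEL `purchase_analysis.purchase_model`,
  (
    SELECT
      CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_identifier,
      trafficSource.isTrueDirect AS traffic_direct,
      trafficSource.medium AS traffic_medium,
      trafficSource.source AS traffic_source,
      channelGrouping,
      device.browser AS device_browser,
      device.deviceCategory AS device_category,
      device.operatingSystem AS device_OS,
      geoNetwork.region AS region,
      totals.bounces AS bounces,
      totals.newVisits AS new_visits,
      totals.pageviews AS page_views,
      totals.timeOnSite AS time_on_site,
      trafficSource.adwordsClickInfo.gclId AS ad_id,
      trafficSource.adwordsClickInfo.isVideoAd AS ad_video,
      trafficSource.adwordsClickInfo.page AS ad_on_pg_num,
      trafficSource.adwordsClickInfo.slot AS ad_slot
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'  -- last month
  )
);
```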

Results

Part ONE: Running Queries

The results for the queries run earlier are given below.

Percentage of Purchasers versus Non-Purchasers

Out of a total of 714,167 visitors, there are 10,002 purchasers and 710,928 non-purchasers. This works out to about 1.4% purchasers and 99.5% non-purchasers (since a visitor can have both purchasing and non-purchasing sessions, the two groups overlap slightly and their counts sum to more than the total).

(Image: r_purchasers_per)

Top 10 Products by Revenue

The top 10 products by revenue are listed below. The product generating the highest revenue is "Google Men's Zip Hoodie".

(Image: r_top10products)

Top 10 Countries by Visitors

The top 10 countries by visitors are listed below. The country with the highest number of visitors is the United States, with 251,830 visitors.

(Image: r_top10countries)

Visitors by Channel

The number of visitors for each Channel Grouping is displayed below. The highest number of visitors, 311,607, reach the website through the Channel Grouping "Organic Search". This is followed by "Social" with 212,374 visitors and "Direct" with 109,830 visitors.

(Image: r_visitors_channel)

Part TWO: Machine Learning Model

Training the Model

The performance of the "loss" metric while training the model can be seen below. Around the 5th iteration, the loss has been minimized as much as possible, given the model specifics.

(Image: loss_function)

The duration of each iteration and the learning rate, each plotted against the iteration number, are shown below:

(Image: duration_learning_rate)

Evaluating the Model

(Image: eval_results)

The roc_auc gives a value of 0.92 (i.e. 92%), which indicates "great" model performance. The model has an accuracy level of 95.5%.

Making Predictions

Making predictions on the data produces 3 new fields (a sketch of how the purchase probability can be read from these fields follows the list):

  • predicted_purchased: whether the visitor makes a purchase (given the value of 1, and 0 otherwise)
  • predicted_purchased.label: the class label ("1" for a purchase and "0" for no purchase)
  • predicted_purchased.prob: the probability of the particular event occurring as predicted by the model
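In BigQuery ML's output these typically surface as a predicted_purchased column plus a predicted_purchased_probs array of (label, prob) structs, so the exact naming may differ slightly from the listing above. A hedged sketch of reading out the probability of a purchase, assuming the prediction output was saved to an illustrative table named purchase_analysis.predictions:

```sql
-- Sketch: probability that each session results in a purchase (label = 1),
-- assuming the ML.PREDICT output was saved to an illustrative table
SELECT
  unique_identifier,
  predicted_purchased,
  probs.prob AS purchase_probability
FROM `purchase_analysis.predictions`,
  UNNEST(predicted_purchased_probs) AS probs
WHERE probs.label = 1
ORDER BY purchase_probability DESC;
```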

Some example predictions are shown below:

(Image: predictions_0)

(Image: predictions_1)

Summary

This project explored the Google Analytics dataset in two parts. The first part comprised running queries on the dataset to find the percentage of purchasers versus non-purchasers, the top 10 products by revenue, the top 10 countries by visitors, and visitors by Channel Grouping. The second part involved building a machine learning model in BigQuery which predicts whether a visitor on the website will make a purchase or not. The model has great performance, with a roc_auc (Receiver Operating Characteristic Area Under Curve) value of 92% and an accuracy level of 95.5%.

Contact Information

Email: st.sohatariq@gmail.com
