Dataset of Wine Reviews from Wine Enthusiast Magazine 🍇 🍷 🌏

Winemag Dataset

This is the web spider for the Winemag Reviews Dataset, built with Scrapy.

Data

At this stage, the following attributes are collected.

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| url | str | Full URL to the review | https://www.winemag.com/buying-guide/laurent-...-morgon/ |
| title | str | Title/name of the wine. WARNING: May include scraping errors. | Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon) |
| rating | int | Wine rating on the 100-point scale | 91 |
| description | str | Review of the wine | Wood aging has given spice to this rich, structured wine. Tannins and generous black fruits show through the still-young structure. This powerful wine, from one of the top vineyards in Morgon, will age well. Drink from 2020. |
| price | float, NULL | Price in $ | 25 |
| designation | str, NULL | Quality level of wine | Vieilles Vignes Côte du Py |
| varietal | str | Grape varietal/blend name | Gamay |
| country | str | Name of country | France |
| region | str, NULL | Region within a country | Beaujolais |
| subregion | str, NULL | Sub-region within a region | Morgon |
| subsubregion | str, NULL | Detailed region | |
| winery | str | Name of producer/winery | Laurent Gauthier |
| vintage | int, NULL | Vintage (year) of production | 2016 |
| alcohol | float, NULL | Alcohol by volume (ABV) in % | 13.5 |
| category | str | Category of wine | Red |
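The nullable fields in the schema (price, vintage, alcohol, etc.) appear as empty strings in the scraped CSV. A minimal sketch of loading them with the documented types — the inline rows and column subset are illustrative, not real dataset output:

```python
import csv
import io

# Illustrative sample shaped like the scraped CSV; nullable fields
# (price, vintage, alcohol) may be empty.
RAW = """rating,price,vintage,alcohol,title,country
91,25,2016,13.5,Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon),France
88,,,13.0,Some NV Sparkling Blend,France
"""

def coerce(row):
    """Apply the documented field types, mapping empty strings to None."""
    def opt(cast, value):
        return cast(value) if value else None
    return {
        **row,
        "rating": int(row["rating"]),           # int, required
        "price": opt(float, row["price"]),      # float, NULL
        "vintage": opt(int, row["vintage"]),    # int, NULL
        "alcohol": opt(float, row["alcohol"]),  # float, NULL
    }

rows = [coerce(r) for r in csv.DictReader(io.StringIO(RAW))]
```

For the real data, replace the `io.StringIO` buffer with an open file handle on the downloaded CSV.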

Dependencies

  • Miniconda (4.5+)

  • Install the environment using

    conda env create -f environment.yaml

NOTE: Feel free to use any package manager as long as the dependencies are satisfied.

Usage

Start the crawler using:

scrapy crawl winemag -a start_page=1 -a end_page=10 \
                     -o winemag-1-10.csv

See Scrapy Command Line for more details.

This command will scrape pages 1 to 10 of the reviews.

WARNING: Be careful with the scraping limits. You are advised to scrape only a few pages per spider per session.

With the current settings.py, collecting ~250k reviews takes about 320 hours in total (~20 hours each for 16 spiders).
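Since each spider takes a `start_page`/`end_page` range, the page space can be partitioned across spiders or sessions. A small sketch that generates one crawl command per chunk — the total page count and 16-way split here are illustrative:

```python
# Split a page range into chunks and emit one `scrapy crawl winemag`
# command per chunk, matching the CLI shown above.
def crawl_commands(total_pages, chunks):
    size = -(-total_pages // chunks)  # ceiling division
    cmds = []
    for i in range(chunks):
        start = i * size + 1
        end = min((i + 1) * size, total_pages)
        if start > end:
            break
        cmds.append(
            f"scrapy crawl winemag -a start_page={start} "
            f"-a end_page={end} -o winemag-{start}-{end}.csv"
        )
    return cmds

for cmd in crawl_commands(total_pages=160, chunks=16):
    print(cmd)
```

Each chunk writes to its own CSV, so partial runs remain usable and re-runnable.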

Access Data

Download the raw data here.

Notes

  • Some vintages have been wrongly parsed in situations where the title included more than one four-digit number.
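One way to flag titles at risk of this mis-parse is to count the four-digit years they contain — a hedged sketch; the 1900–2099 year window is an assumption, not part of the scraper:

```python
import re

def candidate_vintages(title):
    """Return all plausible vintage years (1900-2099) found in a title.

    More than one candidate means the parsed `vintage` field may be wrong
    and is worth checking by hand.
    """
    return [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", title)]
```

Titles where `len(candidate_vintages(title)) > 1` are the ambiguous cases described above.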

License

MIT