This is the web spider for the Winemag Reviews Dataset, built with Scrapy.
At this stage, the following attributes are collected:
Field | Type | Description | Example |
---|---|---|---|
url | str | Full URL to the review | https://www.winemag.com/buying-guide/laurent-...-morgon/ |
title | str | Title/name of the wine. WARNING: May include scraping errors. | Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon) |
rating | int | Wine rating on the 100-point scale | 91 |
description | str | Review of the wine | Wood aging has given spice to this rich, structured wine. Tannins and generous black fruits show through the still-young structure. This powerful wine, from one of the top vineyards in Morgon, will age well. Drink from 2020. |
price | float, NULL | Price in $ | 25 |
designation | str, NULL | Quality level of wine | Vieilles Vignes Côte du Py |
varietal | str | Grape varietal/blend name | Gamay |
country | str | Name of country | France |
region | str, NULL | Region within a country | Beaujolais |
subregion | str, NULL | Sub-region within a region | Morgon |
subsubregion | str, NULL | Detailed region | |
winery | str | Name of producer/winery | Laurent Gauthier |
vintage | int, NULL | Vintage (year) of production | 2016 |
alcohol | float, NULL | Alcohol by volume (ABV) in % | 13.5 |
category | str | Category of wine | Red |
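
The schema above can be mirrored in code when loading an exported file. The following is a minimal sketch, not part of the spider: the `Review` class and the `parse_row` helper are hypothetical names, and it assumes that NULL values appear as empty cells in the CSV and that integer fields are written without a decimal point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    """One review row, typed according to the table above (hypothetical class)."""
    url: str
    title: str
    rating: int
    description: str
    price: Optional[float]
    designation: Optional[str]
    varietal: str
    country: str
    region: Optional[str]
    subregion: Optional[str]
    subsubregion: Optional[str]
    winery: str
    vintage: Optional[int]
    alcohol: Optional[float]
    category: str


def _opt(value, cast=str):
    """Return None for an empty or missing cell, otherwise the cast value."""
    value = (value or "").strip()
    return cast(value) if value else None


def parse_row(row):
    """Convert one csv.DictReader row (all strings) into a typed Review."""
    return Review(
        url=row["url"],
        title=row["title"],
        rating=int(row["rating"]),
        description=row["description"],
        price=_opt(row.get("price"), float),
        designation=_opt(row.get("designation")),
        varietal=row["varietal"],
        country=row["country"],
        region=_opt(row.get("region")),
        subregion=_opt(row.get("subregion")),
        subsubregion=_opt(row.get("subsubregion")),
        winery=row["winery"],
        vintage=_opt(row.get("vintage"), int),
        alcohol=_opt(row.get("alcohol"), float),
        category=row["category"],
    )
```

Rows read with `csv.DictReader` from a file such as `winemag-1-10.csv` can be passed to `parse_row` one at a time.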
- Miniconda (4.5+)
- Install the environment using:
conda env create -f environment.yaml
NOTE: Feel free to use any package manager as long as the dependencies are satisfied.
Start the crawler using:
scrapy crawl winemag -a start_page=1 -a end_page=10 \
-o winemag-1-10.csv
See Scrapy Command Line for more details.
This command will scrape pages 1 to 10 of the reviews.
WARNING: Be careful with the scraping limits. You are advised to scrape only a few pages per spider per session.
With the current settings.py, it takes about 320 hours (about 20 hours for each of 16 spiders) to collect ~250k reviews.
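
One way to divide the work between several spiders is to split the page range into contiguous batches and start one crawl per batch. The sketch below is only an illustration: `split_pages` and `crawl_command` are hypothetical helpers, the page numbers in the usage note are made up, and the generated command simply follows the `scrapy crawl winemag` invocation shown above.

```python
def split_pages(first_page, last_page, n_spiders):
    """Split [first_page, last_page] into at most n_spiders contiguous ranges."""
    total = last_page - first_page + 1
    if total <= 0 or n_spiders <= 0:
        raise ValueError("need a non-empty page range and at least one spider")
    base, extra = divmod(total, n_spiders)
    ranges = []
    start = first_page
    for i in range(n_spiders):
        # The first `extra` spiders take one page more than the rest.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more spiders than pages
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def crawl_command(start_page, end_page):
    """Build the crawl command for one batch, using the README's CLI form."""
    return (
        f"scrapy crawl winemag -a start_page={start_page} "
        f"-a end_page={end_page} -o winemag-{start_page}-{end_page}.csv"
    )
```

For example, `[crawl_command(a, b) for a, b in split_pages(1, 160, 16)]` yields sixteen commands of ten pages each. Keep the batches small enough to respect the scraping limits noted above.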
Download the raw data here.
- Some vintages have been wrongly parsed where the title includes more than one four-digit number.
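
A possible way to detect the affected rows after the fact is sketched below. This is not the parsing logic used by the spider: `vintage_candidates` and `parse_vintage` are hypothetical helpers, and the plausible-year window (1900 to the current year by default) is an assumption. The idea is to accept a vintage only when exactly one plausible year appears in the title and to return `None` for ambiguous titles so they can be reviewed by hand.

```python
import datetime
import re

# Four digits that are not part of a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def vintage_candidates(title, min_year=1900, max_year=None):
    """Return every four-digit number in the title that is a plausible year."""
    if max_year is None:
        max_year = datetime.date.today().year
    years = [int(y) for y in YEAR_RE.findall(title)]
    return [y for y in years if min_year <= y <= max_year]


def parse_vintage(title, min_year=1900, max_year=None):
    """Return the vintage if it is unambiguous, otherwise None."""
    candidates = set(vintage_candidates(title, min_year, max_year))
    if len(candidates) == 1:
        return candidates.pop()
    return None  # no plausible year, or several competing ones
```

Comparing `parse_vintage(title)` with the stored `vintage` column highlights rows where the two disagree or where the title is ambiguous.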
MIT