Winemag Dataset

This is the web spider, built with Scrapy, that collects the Winemag Reviews Dataset.

Data

At this stage, the following attributes are being collected.

| Field | Type | Description | Example |
|---|---|---|---|
| url | `str` | Full URL to the review | https://www.winemag.com/buying-guide/laurent-...-morgon/ |
| title | `str` | Title/Name of the wine. WARNING: May include scraping errors. | Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon) |
| rating | `int` | Wine rating on the 100-point scale | 91 |
| description | `str` | Review of the wine | Wood aging has given spice to this rich, structured wine. Tannins and generous black fruits show through the still-young structure. This powerful wine, from one of the top vineyards in Morgon, will age well. Drink from 2020. |
| price | `float`, `NULL` | Price in $ | 25 |
| designation | `str`, `NULL` | Quality level of wine | Vieilles Vignes Côte du Py |
| varietal | `str` | Grape Varietal/Blend name | Gamay |
| country | `str` | Name of Country | France |
| region | `str`, `NULL` | Region within a Country | Beaujolais |
| subregion | `str`, `NULL` | Sub-region within a region | Morgon |
| subsubregion | `str`, `NULL` | Detailed region | |
| winery | `str` | Name of producer/winery | Laurent Gauthier |
| vintage | `int`, `NULL` | Vintage (Year) of production | 2016 |
| alcohol | `float`, `NULL` | Alcohol By Volume (ABV) in % | 13.5 |
| category | `str` | Category of wine | Red |
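
As a minimal sketch of how the fields above could be consumed, the snippet below parses CSV rows into a typed record using only the standard library. The `Review` class and the `parse_row` helper are hypothetical names, not part of this repository, and the code assumes that `NULL` values are exported as empty strings.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    """Hypothetical typed view of a subset of the scraped fields."""
    title: str
    rating: int
    price: Optional[float]
    vintage: Optional[int]
    alcohol: Optional[float]


def _optional(value, cast):
    # Assumption: NULL values appear as empty strings in the CSV export.
    value = (value or "").strip()
    return cast(value) if value else None


def parse_row(row):
    """Convert one csv.DictReader row (all strings) into a Review."""
    return Review(
        title=row["title"],
        rating=int(row["rating"]),
        price=_optional(row.get("price"), float),
        vintage=_optional(row.get("vintage"), int),
        alcohol=_optional(row.get("alcohol"), float),
    )


# In-memory sample: the first row reuses the example column above,
# the second row is made up to show how missing values are handled.
sample = io.StringIO(
    "title,rating,price,vintage,alcohol\n"
    "Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon),91,25,2016,13.5\n"
    "Made-up Wine Without Details,88,,,\n"
)
reviews = [parse_row(r) for r in csv.DictReader(sample)]
print(reviews[0])
```

Keeping the nullable fields as `Optional` makes missing prices or vintages explicit instead of silently turning them into zeros.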

Dependencies

Miniconda (4.5+)

Install the environment using:
```
conda env create -f environment.yaml
```

NOTE: Feel free to use any package manager as long as the dependencies are satisfied.

Usage

Start the crawler using:

```
scrapy crawl winemag -a start_page=1 -a end_page=10 \
                     -o winemag-1-10.csv
```

See the Scrapy command-line documentation for more details.

This command will scrape pages 1 to 10 of the reviews.

WARNING: Be careful with the scraping limits. You are advised to scrape only a few pages per spider per session.

With the current settings.py, collecting ~250k reviews takes roughly 320 hours in total (~20 hours for each of 16 spiders).
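
The advice above (only a few pages per spider per session) can be sketched as a small planner that splits a page range into chunks and prints one `scrapy crawl` command per chunk. The helper names `plan_sessions` and `crawl_command` and the chunk size are illustrative assumptions; only the `-a start_page`, `-a end_page`, and `-o` arguments come from the usage shown earlier.

```python
def plan_sessions(first_page, last_page, pages_per_session):
    """Split an inclusive page range into (start, end) chunks."""
    if pages_per_session < 1:
        raise ValueError("pages_per_session must be at least 1")
    chunks = []
    start = first_page
    while start <= last_page:
        # The last chunk may be shorter than pages_per_session.
        end = min(start + pages_per_session - 1, last_page)
        chunks.append((start, end))
        start = end + 1
    return chunks


def crawl_command(start, end):
    """Build the command line for one chunk, mirroring the usage example."""
    return (
        f"scrapy crawl winemag -a start_page={start} -a end_page={end} "
        f"-o winemag-{start}-{end}.csv"
    )


# Example: pages 1..25 in sessions of 10 pages (the chunk size is an assumption).
sessions = plan_sessions(1, 25, 10)
for start, end in sessions:
    print(crawl_command(start, end))
```

Each printed command writes to its own output file, so an interrupted session only costs that chunk.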

Access Data

Download the raw data here.

Notes

Some vintages have been wrongly parsed in cases where the title includes more than one four-digit number.
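
To illustrate this failure mode, the sketch below collects every plausible four-digit year in a title and returns a vintage only when exactly one candidate exists. This is not the parser used by the spider; `extract_vintage` and the 1900-2099 year window are assumptions chosen for the example.

```python
import re

# Four-digit tokens that look like years; the 1900-2099 window is an assumption.
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_vintage(title):
    """Return (vintage, candidates); vintage is None when ambiguous or absent."""
    candidates = [int(y) for y in YEAR_RE.findall(title)]
    vintage = candidates[0] if len(candidates) == 1 else None
    return vintage, candidates


# Title taken from the example in the Data section: one candidate.
print(extract_vintage("Laurent Gauthier 2016 Vieilles Vignes Côte du Py (Morgon)"))
# Made-up title with two four-digit numbers, to show the ambiguous case.
print(extract_vintage("Example Cellars 1905 Block 2014 Red"))
```

Returning the full candidate list alongside the result makes it easy to flag ambiguous rows for manual review when cleaning the raw data.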

License

MIT
