The goal of this project is to create a dataset of online retail transactions and customers for demonstrating streaming data applications. This project starts with a real dataset of online transactions (see credits section below). The data volumes in the original dataset are quite low and would not be very interesting for a streaming demo - you may have to wait for a long time to see a transaction. For a streaming demo, it is more interesting if user's can see transactions happening every second, therefore, the original data is modified.
The main processing logic is:
- remove records with missing customer ids
- remove records with negative quantities (see here for a better approach)
- add a line item number for each record
- create a copy of the entire data
- The first copy represents UK timezone
- add a suffix to the invoice number of
1
- The second copy repsenets US timezone
- add a suffix to the invoice number of
2
- shift the invoice datetime by 12 hours
- create customer details for each customer in the dataset
The original dataset transactions span one year (1st Dec 2010 to 9th Dec 2011). Note that we haven't changed the date of the transaction records. We can change the date of the record when we load the record (e.g. into Kafka). We do this at time of load because we want to set the date to the load date rather than the date at which the dataset is created.
The dataset can be created with:
- local python
- dockerised python
Note that the processing is performed entirely in memory.
# First grab this project
git clone https://github.com/ibm-cloud-streaming-retail-demo/dataset-generator
cd dataset-generator
# setup a virtualenv (optional but recommended)
cd dataset-generator
virtualenv venv
source venv/bin/activate
# setup the python libary dependency
$ pip3 install -r requirements.txt
# download and prepare the dataset
$ python3 create_dataset.py
docker run -it --rm -v "$PWD":/usr/src/myapp -w /usr/src/myapp amancevice/pandas:0.20.2-python3 bash -c "pip3 install -r requirements.txt && python3 create_dataset.py"
Some example records from the original dataset:
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 08:26 | 2.55 | 17850 | United Kingdom |
536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 08:26 | 3.39 | 17850 | United Kingdom |
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 08:26 | 2.75 | 17850 | United Kingdom |
The full dataset is available here: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx
See:
- research paper (see website for research paper license)
- website
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).