We have been using datasets throughout the course that are either model/toy datasets or were collected in contexts with little local relevance. With your newfound API and web-scraping skills, this project challenges you to create a dataset around a topic, problem, or theme of your choice, clean it, document it properly, and submit it to the Kaggle dataset repository. Curating and sharing a dataset is an integral part of your skills and practice as a data scientist and should not be overlooked!
For project 3, your goal is three-fold:
- Define a domain, issue, and problem that you are interested in (preferably with local/regional relevance).
- Collect, clean, and submit the data to the Kaggle datasets repository under the course's organization.
- Submit your data collection scripts and a public starter kernel associated with the published dataset.
To help you get started, please read this blog post by Kaggle.
For some good examples, tutorials, and steps to publish your dataset, read this page.
For inspiration on how a company successfully published a dataset on Kaggle, read this story.
For more information and documentation on the Kaggle datasets platform, see this page.
Your dataset submission must meet the following requirements:
- Gather and prepare your data using an API or web scraping (see the first sketch after this list). A ready-made dataset is NOT allowed.
- Make your data accessible and readable by using common open file formats like CSV.
- Take the time to describe your dataset thoroughly (see the metadata sketch after this list).
- Pick a clear, open license that ensures your dataset is reusable.
- Publish a kernel on your dataset to help others learn how they can work with the data. The kernel should explore the dataset with a plot or two showcasing some of its variables, and it should raise potentially interesting questions that could be answered using the dataset (a minimal kernel sketch follows this list).
- Put your data collection and cleaning scripts in a repo.
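To illustrate the data-gathering requirement, here is a minimal web-scraping sketch in Python using `requests` and `BeautifulSoup`, writing the result to CSV. The URL, the `.listing`, `.title`, and `.price` selectors, and the output file name are all hypothetical placeholders to adapt to your own source.

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/listings"  # hypothetical source page

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")

# Extract one record per item; the CSS selectors are hypothetical.
rows = []
for item in soup.select(".listing"):
    title = item.select_one(".title")
    price = item.select_one(".price")
    if title is None or price is None:
        continue  # skip malformed items
    rows.append({
        "title": title.get_text(strip=True),
        "price": price.get_text(strip=True),
    })

# Save as CSV, a common open format that Kaggle previews nicely.
with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
```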
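For the description and license requirements, Kaggle looks for a `dataset-metadata.json` file next to your data files (the same file that `kaggle datasets init` generates). Below is a sketch of writing one from Python; the title, slug, and license are assumptions to replace with your own choices.

```python
import json

# Hypothetical metadata; "id" is "<your-kaggle-username>/<dataset-slug>".
metadata = {
    "title": "My Local Dataset",
    "id": "your-username/my-local-dataset",
    "licenses": [{"name": "CC0-1.0"}],  # a permissive open license
}

with open("dataset-metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

With this file in the same folder as your CSV, the Kaggle CLI (once installed and authenticated) should let you publish the folder with `kaggle datasets create -p <folder>`.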
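For the starter kernel, a sketch along these lines is enough to get readers oriented. It assumes the hypothetical `listings.csv` and `price` column from the scraping sketch above; swap in your own file and variables, and end the kernel with the open questions your dataset could answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("listings.csv")  # hypothetical file from the scraping sketch

print(df.head())                   # preview the first few rows
print(df.describe(include="all"))  # quick summary of every column

# Showcase one variable: coerce the hypothetical 'price' column to
# numeric (scraped text often needs cleaning) and plot its distribution.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["price"].plot.hist(bins=30, title="Distribution of price")
plt.xlabel("price")
plt.show()
```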
Your final submission should include:
- The link to the dataset (and the associated starter kernel) on Kaggle.
- The link to the repo which includes the scripts you used to collect and clean the data.