This data analysis project consists of two parts: this repo, which scrapes data from public sources and models it into a database, and a visualization tool on the front end.
The setup follows an idea by Simon Willison, who published a blog post about his experience of using GitHub Actions as a scraping pipeline.
The data from the CORDIS database (named "H2020" in this repo) is subject to the copyright © European Union, 2022.
The data from the BMBF is subject to the copyright © Bundesministerium für Bildung und Forschung (German Federal Ministry of Education and Research).
This repository does the following:
- scrapes data regularly from known sources via cron-timed GitHub Actions
- pushes the data through a minimal ETL pipeline (see the sketch after this list)
- uploads the latest records to Neo4j AuraDB
- keeps track of data changes via the Git history
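
As a rough illustration of this pattern (not the repo's actual code), one scheduled run could look like the following Python sketch; the source URL, record fields, and output path are placeholders:

```python
# Minimal sketch of one scrape-and-commit run (hypothetical source and fields).
# A cron-timed GitHub Action would run this script and then commit the output
# file, so every data change shows up in the Git history.
import json
from pathlib import Path

import requests

SOURCE_URL = "https://example.org/api/projects"  # placeholder, not the real source
OUTPUT_FILE = Path("data/latest.json")           # placeholder path


def extract(url: str) -> list[dict]:
    """Fetch the raw records from the public source."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def transform(records: list[dict]) -> list[dict]:
    """Minimal ETL step: keep only the fields the data model needs."""
    return [
        {"id": r.get("id"), "title": r.get("title"), "funding": r.get("funding")}
        for r in records
    ]


def load(records: list[dict]) -> None:
    """Write the records with stable formatting so Git diffs stay readable."""
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_FILE.write_text(json.dumps(records, indent=2, sort_keys=True))


if __name__ == "__main__":
    load(transform(extract(SOURCE_URL)))
```

The workflow would invoke such a script on a `schedule` trigger with a `cron` expression and commit the output file afterwards, which is what makes every data change visible in the Git history.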
To run this yourself, you need to set up a (free) Neo4j AuraDB instance and put the credentials into your repository secrets (`NEO4J_USER`, `NEO4j_PASSWORD`, `NEO4j_URL`). Furthermore, you have to adapt the workflows under `.github/workflows/...` to your needs.
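
For the upload step, a minimal sketch using the official `neo4j` Python driver might look like this, assuming the workflow exposes the secrets as environment variables (the node label, properties, and file path are illustrative, not the repo's actual data model):

```python
import json
import os
from pathlib import Path

from neo4j import GraphDatabase

# Assumed env var names; the workflow would map the repository secrets
# (NEO4J_USER, NEO4j_PASSWORD, NEO4j_URL) to these.
uri = os.environ["NEO4J_URL"]
auth = (os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"])

# Placeholder path for the file produced by the ETL step.
records = json.loads(Path("data/latest.json").read_text())

with GraphDatabase.driver(uri, auth=auth) as driver:
    with driver.session() as session:
        for record in records:
            # MERGE keeps the upload idempotent: re-running the Action
            # updates existing nodes instead of duplicating them.
            session.run(
                "MERGE (p:Project {id: $id}) "
                "SET p.title = $title, p.funding = $funding",
                id=record["id"],
                title=record["title"],
                funding=record["funding"],
            )
```

In the workflow YAML, this mapping would typically be done via the `env:` key, e.g. `NEO4J_URL: ${{ secrets.NEO4j_URL }}`.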