This data analysis project consists of two parts: this repo, which scrapes data from public sources and models it into a database, and a visualization tool on the front end.
The setup follows an idea by Simon Willison, who published a blog post about his experience of using GitHub Actions as a scraping pipeline.
The data from the CORDIS database (named "H2020" in this repo) is subject to the copyright © European Union, 2022.
The data from the BMBF is subject to the copyright © Bundesministerium für Bildung und Forschung (German Federal Ministry of Education and Research).
This repository does the following:
- scrapes data regularly from known sources via cron-timed GitHub Actions
- pushes the data through a minimal ETL pipeline (see the sketch after this list)
- uploads the latest records to Neo4j AuraDB
- keeps track of data changes via the Git history
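
As a rough illustration of this pattern (not the repo's actual code), one scheduled run could look like the following Python sketch; the source URL, record fields, and output path are placeholders:

```python
# Minimal sketch of one scrape-and-commit run (hypothetical source and fields).
# A cron-timed GitHub Action would run this script and then commit the output
# file, so every data change shows up in the Git history.
import json
from pathlib import Path

import requests

SOURCE_URL = "https://example.org/api/projects"  # placeholder, not the real source
OUTPUT_FILE = Path("data/latest.json")           # placeholder path


def extract(url: str) -> list[dict]:
    """Fetch the raw records from the public source."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def transform(records: list[dict]) -> list[dict]:
    """Minimal ETL step: keep only the fields the data model needs."""
    return [
        {"id": r.get("id"), "title": r.get("title"), "funding": r.get("funding")}
        for r in records
    ]


def load(records: list[dict]) -> None:
    """Write the records with stable formatting so Git diffs stay readable."""
    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT_FILE.write_text(json.dumps(records, indent=2, sort_keys=True))


if __name__ == "__main__":
    load(transform(extract(SOURCE_URL)))
```

The workflow would invoke such a script on a `schedule` trigger with a `cron` expression and commit the output file afterwards, which is what makes every data change visible in the Git history.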
To run this yourself, you need to set up a (free) Neo4j AuraDB instance and put the credentials into your repository secrets (`NEO4J_USER`, `NEO4j_PASSWORD`, `NEO4j_URL`). Furthermore, you have to adapt the workflows under `.github/workflows/...` to your needs.
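
For the upload step, a minimal sketch using the official `neo4j` Python driver might look like this, assuming the workflow exposes the secrets as environment variables (the node label, properties, and file path are illustrative, not the repo's actual data model):

```python
import json
import os
from pathlib import Path

from neo4j import GraphDatabase

# Assumed env var names; the workflow would map the repository secrets
# (NEO4J_USER, NEO4j_PASSWORD, NEO4j_URL) to these.
uri = os.environ["NEO4J_URL"]
auth = (os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"])

# Placeholder path for the file produced by the ETL step.
records = json.loads(Path("data/latest.json").read_text())

with GraphDatabase.driver(uri, auth=auth) as driver:
    with driver.session() as session:
        for record in records:
            # MERGE keeps the upload idempotent: re-running the Action
            # updates existing nodes instead of duplicating them.
            session.run(
                "MERGE (p:Project {id: $id}) "
                "SET p.title = $title, p.funding = $funding",
                id=record["id"],
                title=record["title"],
                funding=record["funding"],
            )
```

In the workflow YAML, this mapping would typically be done via the `env:` key, e.g. `NEO4J_URL: ${{ secrets.NEO4j_URL }}`.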