CSET OpenAlex metadata

This project augments OpenAlex works with additional metadata prepared by CSET. The dataset is available on Zenodo.

Metadata fields

For the complete list of metadata fields and their types, see schemas/metadata.json. The sections below describe how each field was collected.
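As a rough sketch of how the schema file could be inspected, the snippet below maps field names to types. It assumes the file is a JSON list of objects with "name" and "type" keys (the layout used by BigQuery schema files), which this README does not confirm. The helper name field_types and the sample types are hypothetical.

```python
import json


def field_types(schema_text: str) -> dict:
    """Map each field name to its declared type.

    Assumption: the schema is a JSON list of {"name": ..., "type": ...}
    objects. Adjust the keys if schemas/metadata.json is laid out differently.
    """
    fields = json.loads(schema_text)
    return {field["name"]: field["type"] for field in fields}


# Hypothetical sample; the real types live in schemas/metadata.json.
SAMPLE_SCHEMA = """
[
  {"name": "is_ai", "type": "BOOLEAN"},
  {"name": "is_cyber", "type": "BOOLEAN"}
]
"""

if __name__ == "__main__":
    for name, type_name in field_types(SAMPLE_SCHEMA).items():
        print(f"{name}: {type_name}")
```

To run it against the repository file, pass the contents of schemas/metadata.json instead of the sample string.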

Language ID

Our article linkage pipeline generates language ID labels for titles and abstracts using pycld2. We include a language ID only where pycld2 returned a language and marked the result as reliable.
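The reliability filter described above might look roughly like the following. This is a minimal sketch, not the pipeline's actual code: the helper reliable_language is hypothetical, and it assumes the detector returns a tuple shaped like the result of pycld2.detect, that is (is_reliable, bytes_found, details) with each details entry starting with a language name and a language code. Whether the dataset stores the name or the code is also an assumption.

```python
from typing import Callable, Optional


def reliable_language(text: Optional[str], detect: Callable) -> Optional[str]:
    """Return a language code only when detection succeeded and was reliable.

    `detect` is expected to behave like pycld2.detect and return
    (is_reliable, bytes_found, details). Hypothetical helper for illustration.
    """
    if not text or not text.strip():
        return None
    try:
        is_reliable, _bytes_found, details = detect(text)
    except Exception:
        # Treat any detector failure (for example, on malformed input) as "no label".
        return None
    if not is_reliable or not details:
        return None
    code = details[0][1]  # top-ranked guess: (name, code, percent, score)
    # Assumption: "un" is the code the detector uses for an unknown language.
    return None if code == "un" else code
```

With pycld2 installed, a call would be reliable_language(title, pycld2.detect). Passing the detector as an argument keeps the filter logic easy to test without the library.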

Subject relevance predictions

For more information on how the subject classifiers were trained and deployed, see our documentation. We share their outputs in the following fields:

is_cv - True if a computer vision classifier predicted the work was relevant
is_nlp - True if a natural language processing classifier predicted the work was relevant
is_robotics - True if a robotics classifier predicted the work was relevant
is_ai - True if an artificial intelligence classifier predicted the work was relevant, or if any of the computer vision, natural language processing, or robotics classifiers predicted the work was relevant
is_cyber - True if a cybersecurity classifier predicted the work was relevant
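The relationship between is_ai and the three subfield flags can be sketched as a simple logical OR. The function name derive_is_ai and its parameter names are hypothetical; only the rule itself comes from the field descriptions above.

```python
def derive_is_ai(ai_pred: bool, cv_pred: bool, nlp_pred: bool, robotics_pred: bool) -> bool:
    """Combine classifier outputs the way the is_ai field is described.

    is_ai is True if the AI classifier fired, or if any of the computer
    vision, NLP, or robotics classifiers fired. Hypothetical helper.
    """
    return bool(ai_pred or cv_pred or nlp_pred or robotics_pred)


if __name__ == "__main__":
    # A work flagged only by the computer vision classifier still counts as AI.
    print(derive_is_ai(False, True, False, False))
```

One consequence of this rule is that is_cv, is_nlp, or is_robotics being True always implies is_ai is True, while is_cyber is independent of the other four flags.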

Updating the dataset

The dataset is updated monthly through the pipeline in cset_openalex_augmentation_dag.py. This pipeline runs the query in sql/metadata.sql to aggregate CSET metadata associated with each OpenAlex work, backs the results up within our internal data warehouse, and updates the data on Zenodo.

(For CSET staff) To update the artifacts used by this pipeline, run bash push_to_airflow.sh.

georgetown-cset/cset_openalex
