Article Linking

At CSET, we aim to produce a more comprehensive set of scholarly literature by ingesting multiple sources and then deduplicating articles. This repository contains CSET's current method of cross-dataset article linking. Note that we use "article" very loosely, although in a way that to our knowledge is fairly consistent across the datasets we draw from. Books, for example, are included. We currently include articles from arXiv, Web of Science, Papers With Code, Semantic Scholar, The Lens, and OpenAlex. Some of these sources are largely duplicative (e.g. arXiv is well covered by other corpora) but are included to aid in linking to additional metadata (e.g. arXiv fulltext).

For more information about the overall merged academic corpus, which is produced using several data pipelines including article linkage, see the ETO documentation.

Matching articles

To match articles, we need to extract the data that we want to use in matching and put it in a consistent format. The SQL queries specified in the sequences/generate_{dataset}_data.tsv files are run in the order they appear in those files. For OpenAlex, we exclude documents with a type of Dataset, Peer Review, or Grant. Additionally, we take every combination of Web of Science titles, abstracts, and publication years, so that a match on any one of these combinations results in a match on the shared WOS id. Finally, for Semantic Scholar, we exclude any documents whose (non-null) publication type is Dataset, Editorial, LettersAndComments, News, or Review.
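
As an illustration of the extraction step, a small driver might read each sequence file and run its queries in order. The sketch below is a rough illustration only: the TSV layout (query name plus SQL file path per row) and the BigQuery destination are assumptions, not a description of the repository's actual driver.

```python
# Rough illustration only: run one dataset's extraction queries in sequence.
# The TSV layout and the use of the BigQuery client here are assumptions,
# not the repository's actual driver.
import csv
from pathlib import Path

from google.cloud import bigquery


def run_sequence(dataset: str, sequences_dir: Path = Path("sequences")) -> None:
    client = bigquery.Client()
    sequence_file = sequences_dir / f"generate_{dataset}_data.tsv"
    with sequence_file.open() as f:
        for row in csv.reader(f, delimiter="\t"):
            # Hypothetical layout: first column names the step, second points at the SQL.
            name, sql_path = row[0], row[1]
            sql = Path(sql_path).read_text()
            print(f"Running {name} for {dataset}")
            client.query(sql).result()  # wait for each query before starting the next


if __name__ == "__main__":
    run_sequence("openalex")
```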

For each article in arXiv, Web of Science, Papers With Code, Semantic Scholar, The Lens, and OpenAlex, we normalized titles, abstracts, and author last names to remove whitespace, punctuation, and other artifacts unlikely to be useful for linking. For the purpose of matching, we filtered out titles, abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles, within or across datasets, that shared at least one of the following (non-null) metadata fields:

  • Normalized title
  • Normalized abstract
  • Citations
  • DOI

together with a match on at least one additional field from the list above, or on

  • Publication year
  • Normalized author last names

to correspond to one article in the merged dataset. We also link articles based on vendor-provided cross-dataset links.
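
The normalization and matching rule described above can be summarized, in simplified form, as follows. This is a sketch of the logic, not the pipeline's implementation: the field names, the exact normalization, and the citation comparison are all simplified assumptions.

```python
# Simplified sketch of the matching rule above. Field names, normalization,
# and the citation comparison are illustrative, not the pipeline's exact logic.
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional


def normalize(text: Optional[str]) -> Optional[str]:
    """Lowercase and strip whitespace, punctuation, and other non-word characters."""
    if not text:
        return None
    return re.sub(r"[\W_]+", "", text.lower()) or None


@dataclass
class Article:
    title: Optional[str]                 # normalized title
    abstract: Optional[str]              # normalized abstract
    citations: Optional[FrozenSet[str]]
    doi: Optional[str]
    pub_year: Optional[int]
    last_names: Optional[str]            # normalized author last names


STRONG_FIELDS = ("title", "abstract", "citations", "doi")
WEAK_FIELDS = ("pub_year", "last_names")


def is_match(a: Article, b: Article) -> bool:
    """Match if the articles share a non-null strong field plus at least one
    additional field (strong or weak). Over-frequent titles, abstracts, and
    DOIs are assumed to have been filtered out beforehand."""
    strong = sum(
        getattr(a, f) is not None and getattr(a, f) == getattr(b, f)
        for f in STRONG_FIELDS
    )
    weak = sum(
        getattr(a, f) is not None and getattr(a, f) == getattr(b, f)
        for f in WEAK_FIELDS
    )
    return strong >= 1 and strong + weak >= 2
```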

Generating merged articles

Given a set of articles that have been matched together, we generate a single "merged id" that is linked to all the "original" (vendor) ids of those articles. Some points from our implementation:

  • If articles that have been seen in a previous run and were previously assigned to different merged ids are now matched together, we assign them to a new merged id.
  • If a set of articles previously assigned to a given merged id loses articles (either because an article is now assigned to a different merged id, or because it has been deleted from one of the input corpora), we give this set of articles a new merged id.
  • If a set of articles previously assigned to a given merged id gains articles without losing any old articles, we keep the old merged id for these articles.

This implementation is meant to ensure that downstream pipelines (e.g. model inference, canonical metadata assignment) always reflect the current metadata for a given merged article: if the set of articles behind a merged id changes in a way that could invalidate earlier outputs, the merged id itself changes, regardless of how the downstream pipeline is implemented.
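
A rough sketch of these rules, assuming we know each original id's merged id from the previous run and the matched groups from the current run (the names and data structures below are hypothetical, not the code in this repository):

```python
# Illustrative sketch of the merged id rules above; names are hypothetical.
# prev_assignment maps original (vendor) ids to last run's merged ids, and
# new_groups holds the matched-article groups from the current run.
from typing import Callable, Dict, List, Set


def assign_merged_ids(
    new_groups: List[Set[str]],
    prev_assignment: Dict[str, str],
    mint_merged_id: Callable[[], str],
) -> Dict[str, str]:
    # Reconstruct last run's groups from the previous assignment.
    prev_groups: Dict[str, Set[str]] = {}
    for orig_id, merged_id in prev_assignment.items():
        prev_groups.setdefault(merged_id, set()).add(orig_id)

    assignment: Dict[str, str] = {}
    for group in new_groups:
        old_merged = {prev_assignment[a] for a in group if a in prev_assignment}
        if len(old_merged) == 1:
            merged_id = old_merged.pop()
            if prev_groups[merged_id] <= group:
                # The old group only gained articles: keep its merged id.
                assignment.update({a: merged_id for a in group})
                continue
        # Articles from different old merged ids were matched together, the old
        # group lost articles, or the group is entirely new: mint a new merged id.
        new_id = mint_merged_id()
        assignment.update({a: new_id for a in group})
    return assignment
```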

Automation and output tables

We automate article linkage using Apache Airflow. linkage_dag.py contains our current implementation.
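
Schematically, an Airflow DAG chains the extraction, matching, and id-assignment steps. The sketch below only illustrates that general pattern with placeholder tasks; it is not a summary of the tasks actually defined in linkage_dag.py.

```python
# Generic Airflow pattern with placeholder tasks; not a summary of linkage_dag.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="article_linkage_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_metadata", bash_command="echo extract")
    match = BashOperator(task_id="match_articles", bash_command="echo match")
    assign = BashOperator(task_id="assign_merged_ids", bash_command="echo assign")
    export = BashOperator(task_id="export_tables", bash_command="echo export")

    extract >> match >> assign >> export
```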

The DAG generates two tables of analytic significance:

  • staging_literature.all_metadata_with_cld2_lid - captures metadata for all unmerged articles in a standard format. It also contains language ID predictions for titles and abstracts based on CLD2.
  • literature.sources - contains pairs of merged ids and original (vendor) ids linked to those merged ids.

Metadata selection for each merged article happens in a downstream DAG.
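
As an example of how these tables fit together, merged ids can be joined back to per-source metadata roughly as follows. The column names (merged_id, orig_id, id) and the placeholder merged id are assumptions about the table schemas, not documented guarantees.

```python
# Hypothetical example of joining the two output tables; column names and the
# placeholder merged id are assumptions about the schemas.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT s.merged_id, s.orig_id, m.title, m.abstract
FROM literature.sources AS s
JOIN staging_literature.all_metadata_with_cld2_lid AS m
  ON s.orig_id = m.id
WHERE s.merged_id = @merged_id
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("merged_id", "STRING", "example_merged_id")
        ]
    ),
)
for row in job.result():
    print(row.orig_id, row.title)
```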