This repo contains materials for the fields of study project (as described in "Multi-Label Field Classification for Scientific Documents using Expert and Crowd-sourced Knowledge"). This involves the following:
- Our starting point is our merged corpus of publications, specifically its English-language text. We use it to learn FastText and tf-idf word vectors.
- The second fundamental input is a taxonomy that defines fields of study, in a hierarchy of broad areas like "computer science" and more granular subfields like "machine learning". We derived the top level of this taxonomy from the taxonomy previously used by MAG, and created the lower layers ourselves (as described in the paper). For current purposes it's static. We call this the field taxonomy.
- For each field in the taxonomy, we have various associated text extracted from Wikipedia (pages and their references). Using this field content and the word vectors learned from the merged corpus, we create embeddings for each field. We refer to these as FastText and tf-idf field embeddings.
- We then identify in the field content every mention of another field. (For instance, the "computer science" content mentions "artificial intelligence," "machine learning," and many other fields.) The averages of the FastText field embeddings for these mentioned fields are the entity embeddings for each field.
- Next, for each English publication in the merged corpus we create publication embeddings: specifically, for each publication a FastText embedding, a tf-idf embedding, and a FastText field mention embedding (as immediately above, but for fields mentioned in the publication text).
- Lastly, scoring: we compute the cosine similarities of the embeddings for publications and fields. This yields up to three cosine similarities (FastText, tf-idf, and mention FastText) for a publication-field pair. We average them to get a publication's field score, as sketched below.
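For concreteness, here is a minimal sketch of that last step, assuming the publication and field embeddings are already in memory as NumPy arrays keyed by embedding type; the helper names are ours, not the pipeline's.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def field_score(pub_embs, field_embs):
    """Average the available cosine similarities (FastText, tf-idf, mention FastText)
    for a single publication-field pair. Embedding types missing on either side
    are simply skipped."""
    sims = [
        cosine(pub_embs[kind], field_embs[kind])
        for kind in ("fasttext", "tfidf", "mention_fasttext")
        if pub_embs.get(kind) is not None and field_embs.get(kind) is not None
    ]
    return sum(sims) / len(sims)
```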
Clone:
git clone --recurse-submodules https://github.com/georgetown-cset/fields-of-study-pipeline.git
On Linux with Python 3.8.10 via Miniconda, create a virtual environment and install the requirements:
cd fields-of-study-pipeline
~/miniconda3/bin/python3 -m venv venv
source venv/bin/activate
sudo apt-get install build-essential -y
pip install -r requirements.txt
Some assets are large, so we use DVC. GitHub tracks the asset metadata in .dvc files, and DVC stores the assets themselves in GCS. Retrieve them with dvc pull:
dvc pull
cd assets/scientific-lit-embeddings/ && dvc pull && cd ../..
We have an instance named fields in us-east1-c, set up as above. DVC is backed by storage in the gs://fields-of-study-model bucket.
When retrieving the merged corpus (fos/corpus.py), we use the field_model_replication BQ dataset and the gs://fields-of-study bucket.
Retrieve English text in our merged corpus:
# writes 'assets/corpus/{lang}_corpus-*.jsonl.gz'
PYTHONPATH=. python scripts/download_corpus.py en
Embed the publication text:
# reads 'assets/corpus/{lang}_corpus-*.jsonl.gz'
# writes 'assets/corpus/{lang}_embeddings.jsonl'
PYTHONPATH=. python scripts/embed_corpus.py en
Calculate field scores from the publication embeddings:
# reads 'assets/corpus/{lang}_embeddings.jsonl'
# writes 'assets/corpus/{lang}_scores.tsv'
PYTHONPATH=. python scripts/score_embeddings.py en
Alternatively, embed + score without writing the publication embeddings to disk:
# reads 'assets/corpus/{lang}_corpus-*.jsonl.gz'
# writes 'assets/corpus/{lang}_scores.jsonl'
PYTHONPATH=. python scripts/score_corpus.py en
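If you want to inspect the downloaded corpus before embedding it, something like the sketch below works. It assumes the shards are gzipped JSONL; the record keys used here are hypothetical, so check the actual files for the real schema.

```python
import glob
import gzip
import json

# Peek at the first few records of the English corpus shards.
# NOTE: "merged_id" and "text" are illustrative key names, not a documented schema.
for path in sorted(glob.glob("assets/corpus/en_corpus-*.jsonl.gz")):
    with gzip.open(path, "rt") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            print(record.get("merged_id"), str(record.get("text"))[:80])
            if i >= 2:
                break
    break  # only look at the first shard
```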
We start by retrieving the English text in the merged corpus.
PYTHONPATH=. python scripts/download_corpus.py en
We learned English FastText and tf-idf vectors from these corpora. Documentation for this is in assets/scientific-lit-embeddings.
Outputs (annually):
- FastText vectors: assets/{en,zh}_merged_fasttext.bin
- tf-idf vectors and vocab: assets/{en,zh}_merged_tfidf.bin and TODO
Outputs (~weekly):
- Preprocessed corpus: assets/corpus/{lang}_corpus-*.jsonl.gz
The field taxonomy defines fields of study: their names, the parent/child relations among fields, and other metadata. At time of writing, we're using a field taxonomy derived from MAG. In the future, we might extend or otherwise update the taxonomy.
Outputs (static):
- Field taxonomy: wiki-field-text/fields.tsv
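A minimal sketch of loading the taxonomy into name and parent/child maps; the column names below are hypothetical, so check the header of fields.tsv before relying on them.

```python
import csv
from collections import defaultdict

names = {}                 # field id -> display name
parents = {}               # field id -> parent field id
children = defaultdict(list)

with open("wiki-field-text/fields.tsv", newline="") as f:
    # Column names ("id", "name", "parent_id") are assumptions for illustration.
    for row in csv.DictReader(f, delimiter="\t"):
        names[row["id"]] = row["name"]
        parent = row.get("parent_id")
        if parent:
            parents[row["id"]] = parent
            children[parent].append(row["id"])
```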
For each field in the field taxonomy, we identified associated text (page content and references) in Wikipedia. This is documented in the wiki-field-text repo. Using this content and the word vectors learned from the merged corpus, we created embeddings for each field.
Outputs (annually):
- FastText field embeddings: assets/{en,zh}_field_fasttext.bin
- tf-idf field embeddings: assets/{en,zh}_field_tfidf.bin
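As an illustration of the FastText side, a field embedding can be built from the merged-corpus word vectors and the field's Wikipedia text roughly as below. This is a sketch of the idea, not necessarily the exact recipe used in wiki-field-text.

```python
import fasttext  # pip install fasttext

# Word vectors learned from the merged corpus (see Outputs above).
ft = fasttext.load_model("assets/en_merged_fasttext.bin")

def field_fasttext_embedding(field_text: str):
    """Embed a field by embedding its (preprocessed) Wikipedia text.
    get_sentence_vector averages the normalized word vectors of the text."""
    return ft.get_sentence_vector(field_text.replace("\n", " "))

ml_vec = field_fasttext_embedding(
    "Machine learning is the study of algorithms that improve through experience ..."
)
```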
We identify in the field content every mention of another field. For instance, the "computer science" content mentions "artificial intelligence," "machine learning," etc. We average over the FastText embeddings for mentioned fields to generate FastText entity embeddings. This is documented in the wiki-field-text repo.
Outputs (annually):
- FastText entity embeddings: assets/{en,zh}_field_mention_fasttext.bin
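Assuming the FastText field embeddings are available in memory as a dict of NumPy arrays, the entity (mention) embedding construction looks roughly like this:

```python
import numpy as np

def entity_embedding(mentioned_fields, field_fasttext_embs):
    """Average the FastText field embeddings of the fields mentioned in some text.
    `field_fasttext_embs` maps field name -> np.ndarray."""
    vectors = [field_fasttext_embs[name] for name in mentioned_fields
               if name in field_fasttext_embs]
    return np.mean(vectors, axis=0) if vectors else None

# e.g. for the "computer science" content, which mentions (among many others):
# cs_entity_vec = entity_embedding(
#     ["artificial intelligence", "machine learning"], field_fasttext_embs)
```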
We embed each English publication in the preprocessed corpora (1), using the FastText and tf-idf vectors (2) and averaging the entity embeddings (3) for each field mentioned in the publication text.
PYTHONPATH=. python scripts/embed_corpus.py en
Outputs (~weekly):
- Publication embeddings: assets/corpus/{lang}_embeddings.jsonl
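Putting the pieces together, each publication gets up to three embeddings. The sketch below shows the FastText and mention parts with the fasttext package; the tf-idf part (built from the learned tf-idf vectors and vocab) is omitted, and the naive substring mention detection is purely illustrative.

```python
import fasttext
import numpy as np

ft = fasttext.load_model("assets/en_merged_fasttext.bin")

def embed_publication(text, field_names, field_fasttext_embs):
    """Return FastText and field-mention embeddings for one publication.
    `field_names` is an iterable of field names to look for in the text."""
    clean = text.replace("\n", " ").lower()
    mentioned = [name for name in field_names if name in clean]
    mention_vecs = [field_fasttext_embs[name] for name in mentioned]
    return {
        "fasttext": ft.get_sentence_vector(clean),
        "mention_fasttext": np.mean(mention_vecs, axis=0) if mention_vecs else None,
        # "tfidf": built from the tf-idf vectors/vocab, not shown here.
    }
```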
For each publication-field pair, we take the cosine similarity of the corresponding publication and field embeddings, then average these cosine similarities to yield the field score.
PYTHONPATH=. python scripts/score_embeddings.py en
Outputs (~weekly):
- Publication field scores: assets/corpus/{lang}_scores.jsonl
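The scoring itself is the cosine-similarity average sketched in the overview above. To consume the scores file, something like the following works, assuming one JSON record per publication; the key names are hypothetical.

```python
import json

# Print the five highest-scoring fields per publication. The record layout
# ({"merged_id": ..., "fields": {field name: score, ...}}) is an assumption.
with open("assets/corpus/en_scores.jsonl") as f:
    for line in f:
        record = json.loads(line)
        top = sorted(record["fields"].items(), key=lambda kv: kv[1], reverse=True)[:5]
        print(record["merged_id"], top)
```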
To add new queries to the sequence that is run after scores are ingested into BQ, add the query to the sql directory and put the filename in query_sequence.txt in the position where the query should be run. You can reference the staging and production datasets in your query using {{staging_dataset}} and {{production_dataset}}. Because the production dataset contains old data until the sequence of queries finishes running, you normally want to reference the staging dataset.
To update the artifacts used by Airflow, run bash push_to_airflow.sh.
To view the DAG, visit this link. To trigger a run on only new or modified data, trigger the DAG without configuration parameters. To rerun on all data, trigger the DAG with the configuration {"rerun": true}.