This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.

This is the official crawler implementation for the D3 Dataset, written in almost pure Python. The crawler is also used for the cs-insights project.

Starting with version 1.0.2, this project uses semantic versioning and supports SemanticScholar. For more information about the supported features, see the releases.

Installation & Setup

First, install the package manager Poetry:

pip install poetry

Then run:

poetry install

To start the crawling process, run:

poetry run cli main --s2_use_papers --s2_use_abstracts --s2_filter_dblp

For help run:

poetry run cli main --help

Code quality and tests

To maintain a consistent and well-tested repository, we run unit tests, linting, and type checks through GitHub Actions. We use pytest for testing, pylint for linting, and pyright for type checking. Every time code is pushed to the repository, these checks are executed and have to fulfill certain requirements before the code can be merged into the master branch.

Whenever you create a pull request against the default branch, GitHub Actions will create a CI job that executes the unit tests and linting.

To run all checks that CI executes locally, run:

poetry run poe alltest
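
The individual tools can also be run on their own inside the Poetry environment. A minimal sketch, assuming the package directory is named cs_insights_crawler (this name is an assumption; adjust it to the actual source directory):

poetry run pytest
poetry run pylint cs_insights_crawler
poetry run pyright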

Contributing

Fork the repo, make changes and send a PR. We'll review it together!

Commit messages should follow Angular's conventions.
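
For example, a commit message following that convention might look like this (the scope and description are illustrative only):

feat(crawler): add abstract extraction for SemanticScholar papers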

License

This project is licensed under the terms of the MIT license. For more information, please see the LICENSE file.

Citation

If you use this repository, or use our tool for analysis, please cite our work:

@inproceedings{Wahle2022c,
  title        = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
  author       = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
  year         = {2022},
  month        = {July},
  booktitle    = {Proceedings of The 13th Language Resources and Evaluation Conference},
  publisher    = {European Language Resources Association},
  address      = {Marseille, France},
  doi          = {},
}

Also make sure to cite the following papers if you use SemanticScholar data:

@inproceedings{ammar-etal-2018-construction,
    title = "Construction of the Literature Graph in Semantic Scholar",
    author = "Ammar, Waleed  and
      Groeneveld, Dirk  and
      Bhagavatula, Chandra  and
      Beltagy, Iz",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans - Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-3011",
    doi = "10.18653/v1/N18-3011",
    pages = "84--91",
}
@inproceedings{lo-wang-2020-s2orc,
    title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
    author = "Lo, Kyle  and Wang, Lucy Lu  and Neumann, Mark  and Kinney, Rodney  and Weld, Daniel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.447",
    doi = "10.18653/v1/2020.acl-main.447",
    pages = "4969--4983"
}
