PubFetcher

A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages). It can also extract content from general web pages.

Fetch publications

Europe PMC is the main resource. When it cannot provide some part of the required content, other repositories are consulted. As a last resort, journal articles can be scraped directly from publisher web sites: over 50 site scraping rules are built in, mainly for journals in the biomedical and life sciences fields. To avoid overburdening the APIs and sites it uses, PubFetcher is best suited to medium-scale processing, where the number of publications is in the thousands rather than the millions, but where the content of those few thousand publications should be as complete as possible.
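The fallback order described above can be illustrated with a minimal sketch. This is not PubFetcher's actual API: the `Source` record, the `fetchAbstract` method and the stubbed lookups are all hypothetical, and no network access takes place. It only shows the idea of trying sources in priority order and stopping at the first one that supplies the part.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: none of these names come from PubFetcher itself.
// Requires Java 16 or later (records).
public class FallbackFetchSketch {

    // Hypothetical source: a display name plus a lookup that may return nothing.
    record Source(String name, Function<String, Optional<String>> lookup) {}

    // Try each source in priority order; return the first non-blank result,
    // prefixed with the name of the source that supplied it.
    static Optional<String> fetchAbstract(String id, List<Source> sourcesInPriorityOrder) {
        for (Source source : sourcesInPriorityOrder) {
            Optional<String> part = source.lookup().apply(id);
            if (part.isPresent() && !part.get().isBlank()) {
                return Optional.of(source.name() + ": " + part.get());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stubbed lookups standing in for real requests.
        List<Source> sources = List.of(
            new Source("Europe PMC", id -> Optional.empty()),            // pretend the abstract is missing
            new Source("PubMed", id -> Optional.of("Stub abstract")),     // first source with content
            new Source("Publisher site", id -> Optional.of("Scraped abstract")));

        System.out.println(fetchAbstract("12345", sources).orElse("not found"));
    }
}
```

Keeping the sources in an ordered list means the priority can be changed without touching the loop, and the most expensive option (scraping) is only reached when everything before it came back empty.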

In addition to the main content of publications (title, abstract, full text), PubFetcher supports several kinds of keywords: the user-assigned keywords of the article, MeSH terms from PubMed, and GO/EFO terms as mined by Europe PMC. Some extra metadata is saved, such as journal title and publication date; the list of authors, however, is currently missing. Content from higher-quality resources is prioritised, and publication parts that are already good enough are not re-fetched. JavaScript is supported while scraping, and content can be extracted from PDF files. Downloaded publications can be persisted to an on-disk key-value store for later analysis, or exported to JSON.
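One way to picture the prioritisation of higher-quality resources and the skipping of good-enough parts is the sketch below. It is an assumption-laden illustration, not PubFetcher's implementation: the `Quality` ranking, the `Part` class and the length-based "good enough" test are all invented here for the example.

```java
// Illustrative only: the names and the length threshold are assumptions.
public class PartPrioritySketch {

    // Hypothetical quality ranking; constants declared later rank higher.
    enum Quality { SCRAPED_PAGE, SECONDARY_API, PRIMARY_API }

    static final class Part {
        private String content = "";
        private Quality quality = null;

        String content() {
            return content;
        }

        // Assumption: a part is "good enough" once it reaches a minimum length,
        // so only shorter parts are worth another fetch.
        boolean needsFetching(int minLength) {
            return content.length() < minLength;
        }

        // Accept the candidate only if it is non-blank and comes from a strictly
        // higher-ranked source than the current content. Returns true if stored.
        boolean offer(String candidate, Quality candidateQuality) {
            if (candidate == null || candidate.isBlank()) {
                return false;
            }
            if (quality != null && candidateQuality.compareTo(quality) <= 0) {
                return false;
            }
            content = candidate;
            quality = candidateQuality;
            return true;
        }
    }

    public static void main(String[] args) {
        Part title = new Part();
        title.offer("Short", Quality.SCRAPED_PAGE);
        System.out.println(title.needsFetching(10));   // still too short

        title.offer("A sufficiently long title", Quality.PRIMARY_API);
        System.out.println(title.needsFetching(10));   // now long enough

        // A lower-ranked source does not overwrite higher-ranked content.
        System.out.println(title.offer("Another title", Quality.SECONDARY_API));
        System.out.println(title.content());
    }
}
```

A caller would check `needsFetching` before contacting any resource at all, which is what keeps already complete parts from generating further requests.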

Fetch web pages

In addition to publications, PubFetcher can scrape general web pages. This functionality is geared towards web pages containing descriptions and documentation of software tools (GitHub, BioConductor, etc.): PubFetcher has around 25 built-in rules for extracting content from such pages, and it has fields to store the software license and programming language. If no rules are defined for a given web page, automatic extraction of the page's main content is attempted.
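The choice between a built-in rule and automatic extraction could look roughly like the following sketch. Everything in it is hypothetical: the host names, the selectors and the `strategyFor` method are made up for illustration and do not reflect PubFetcher's real rules or API.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

// Illustrative only: hosts, selectors and method names are invented.
public class ScrapeRuleSketch {

    // Hypothetical rule table mapping a host to a selector for its main content.
    static final Map<String, String> RULES = Map.of(
        "code.example.org", "#readme",
        "docs.example.org", "div.main-content");

    // Look up a rule by the host part of the URL, if one is defined.
    static Optional<String> selectorFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(RULES.get(host.toLowerCase()));
    }

    // Use the site-specific rule when present, otherwise fall back to
    // automatic extraction of the main content.
    static String strategyFor(String url) {
        return selectorFor(url)
            .map(selector -> "rule: " + selector)
            .orElse("automatic extraction");
    }

    public static void main(String[] args) {
        System.out.println(strategyFor("https://code.example.org/some/tool"));
        System.out.println(strategyFor("https://unknown.example.com/page"));
    }
}
```

Keying the rules on the host keeps the lookup cheap; a real rule set would likely need to match on more than the host, which this sketch leaves out.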

Use as a library

In addition to being a command-line tool, PubFetcher can be incorporated as a library into other projects. The library is used in EDAMmap and Pub2Tools.

Install

Installation instructions can be found in INSTALL.md.

Documentation

Documentation for PubFetcher can be found at https://pubfetcher.readthedocs.io/en/latest/.
