Skip to content

Latest commit

 

History

History
114 lines (87 loc) · 6.31 KB

README.md

File metadata and controls

114 lines (87 loc) · 6.31 KB

KGAP_LINCS-IDG

Knowledge graph analytics platform (KGAP) for LINCS and IDG Common Fund dataset integration. The initial biomedical application is Parkinson's disease drug target discovery, in alignment with the IDSL PRIDE initiative (Parkinson's Research through Integrative Data Experiments).

See also:

Publication LINK: "Knowledge graph analytics platform with LINCS and IDG for Parkinson's disease target illumination", by Jeremy Yang, Christopher Gessner, Joel Duerksen, Daniel Biber, Jessica Binder, Brian Foote, Robin McEntire, Kyle Stirling, Ying Ding and David Wild, BMC Bioinformatics, 2022.

Graph database instructions

Dependencies for fast path

  • Neo4j Desktop Version 4.2 or newer (select version 4.2 when creating database, and plugin versions are automatically determined by Neo4j version)
  • 8GB+ ram
  • 5GB+ free disk space
  • datasets listed below

Dependencies in addition, for long path

  • Knime Version 4.3 (Knime will notify for automatic download any nodes required but not yet installed)
  • ability to run, load from dump and host temporarily Postgres 10.11 database
  • Docker version 20.10.1 or newer, if running the example script

Fast path

Note: These steps should work with little or no modifications on Macintosh, Unix, and Windows. Process was tested on a MacBookPro 2017 running MacOS Catalina

To restore this graph database from a dump file download this dump file, install the Neo4j Desktop Client Neo4j and launch it. Create a new 4.2 database (do not start the database). Click on the ... and then click Manage in the pop up menu, and then click on Open Terminal. Restore the database from the dump file with the command, changing PATHTOFILE as needed.

bin/neo4j-admin load --from=PATHTOFILE/dclneodb.dump --database=neo4j

Next exit the terminal and click Start to start the database. Click Open to use built in Neo4j browser to query and explore the database.

The long path (and more technical), to recreate the graph database from source datasets

Step 1: create node and relationship files

Note: Developed and used on Ubuntu 20.04, and then these directions were tested on a MacbookPro 2017 running MacOS Catalina.

  • Clone this repository in your home directory, the following process assumes ~/kgap_lincs-idg/ exists and is populated
  • Download, restore and bring online drugcentral_lincs
  • Download and install Knime Note: The workflow was created with version 4.3
  • Import the knime workflow kgap_lincs-idg/opt1_step1_create_neo4j_files, you will see and safely ignore this dialog message.
  • This workflow extracts and transforms data from three datasets, the file tcrd_targets.tsv in this repository, the online DrugCentral database, and the drugcentral_lincs database (loaded above).
  • In the knime workflow find the drugcentral_lincs PostgresSQL Connector and change the address/port to point to the drugcentral_lincs database where you are hosting it.
  • In Knime "Excecute all executable nodes" for this workflow, the relationship and node files (and the header files) will be stored in opt1_step2_docker_neo4j_load/e and opt1_step2_docker_neo4j_load/v respectively. The script opt1_step2_docker_neo4j_load/clean_ev is a utility script to delete the relationship and node files.

For a full view of the workflow see KGAP-ETL_KNIME_workflow.png.

Step 2: load relationships and nodes into Neo4j

Analytics

Searching for disease associated genes using KGAP.

Dependencies

  • Python 3.7+
  • Python packages neo4j, sklearn, psycopg2, matplotlib, pandas
  • Local instance of KGAP_LINCS-IDG graph database as described above.

The command-line program kgap_analysis.py may be used to query the database with a disease term, generate associated, scored genes, and perform a ROC validation analysis vs known genes. Credentials for the public instance of DrugCentral are included as defaults, while the Neo4j credentials must be valid for your local instance.

kgap_analysis.py -h
usage: kgap_analysis.py [-h] --indication_query INDICATION_QUERY
                        [--atc_query ATC_QUERY] [--odir ODIR]
                        [--dc_dbhost DC_DBHOST] [--dc_dbport DC_DBPORT]
                        [--dc_dbname DC_DBNAME] [--dc_dbusr DC_DBUSR]
                        [--dc_dbpw DC_DBPW] [--neo4j_uri NEO4J_URI]
                        [--neo4j_paramfile NEO4J_PARAMFILE] [-v]

KGAP LINCS-IDG search and ROC analysis

optional arguments:
  -h, --help            show this help message and exit
  --indication_query INDICATION_QUERY
                        DrugCentral indication query
  --atc_query ATC_QUERY
                        DrugCentral ATC L1 query
  --odir ODIR           output dir
  --dc_dbhost DC_DBHOST
                        DrugCentral DBHOST
  --dc_dbport DC_DBPORT
                        DrugCentral DBPORT
  --dc_dbname DC_DBNAME
                        DrugCentral DBNAME
  --dc_dbusr DC_DBUSR   DrugCentral DBUSR
  --dc_dbpw DC_DBPW     DrugCentral DBPW
  --neo4j_uri NEO4J_URI
                        Neo4j DB URI
  --neo4j_paramfile NEO4J_PARAMFILE
                        Neo4j parameter file
  -v, --verbose

The specific command used to generate the PD results in our paper is as follows.

kgap_analysis.py --indication_query "Parkinson" --atc_query "NERVOUS SYSTEM"