This repository has been archived by the owner on May 21, 2019. It is now read-only.

NLP-Exploration-in-Bioinformatics

A Natural Language Processing (NLP) exploration of the bioinformatics literature, carried out in the autumn semester of 2016. To try to answer the question "which gene associates with which disease through which bioevent", the Turku Event Extraction System (TEES) was used to extract triples of gene name, disease name, and bioevent. The derived database can then be searched by gene, disease, or bioevent, and the matching results displayed.

Bioevents

  • Gene_expression
  • Regulation (Positive_regulation or Negative_regulation)
  • Binding
  • Localization
  • Transcription
  • Phosphorylation

Procedure

Extract PubMed_ID, Title, and Abstract into a SQLite database from raw CSV file.

For how to download abstracts from PubMed, a previous exercise is a good illustration.
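A minimal sketch of the CSV-to-SQLite step, using only the Python standard library. The column headers (`PubMed_ID`, `Title`, `Abstract`), the table name `abstracts`, and the function name `load_abstracts` are assumptions made for illustration; the repository's own script may differ.

```python
import csv
import sqlite3


def load_abstracts(csv_path, db_path):
    """Copy PubMed_ID, Title and Abstract from a CSV file into SQLite.

    Assumes the CSV has a header row with exactly those column names.
    Returns the number of rows stored in the table afterwards.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts ("
        "pubmed_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [
            (row["PubMed_ID"], row["Title"], row["Abstract"])
            for row in csv.DictReader(handle)
        ]
    # INSERT OR REPLACE keeps the load idempotent if the script is re-run.
    conn.executemany("INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?)", rows)
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM abstracts").fetchone()[0]
    conn.close()
    return count
```

Using the PubMed ID as the primary key means that loading the same CSV twice does not duplicate abstracts.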

Read abstracts from SQLite database, then transform them to txt files for Named Entity Recognition (NER) by ABNER.
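One way this export could look, as a sketch: one text file per abstract, named after its PubMed ID. The table layout (`abstracts` with `pubmed_id` and `abstract` columns), the one-file-per-abstract choice, and the function name `export_abstracts` are assumptions, not necessarily what the original scripts do.

```python
import sqlite3
from pathlib import Path


def export_abstracts(db_path, out_dir):
    """Write each abstract to <out_dir>/<pubmed_id>.txt and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    written = []
    query = "SELECT pubmed_id, abstract FROM abstracts ORDER BY pubmed_id"
    for pubmed_id, abstract in conn.execute(query):
        path = out / f"{pubmed_id}.txt"
        # A NULL abstract becomes an empty file rather than raising an error.
        path.write_text(abstract or "", encoding="utf-8")
        written.append(path)
    conn.close()
    return written
```

Keeping the PubMed ID in the file name makes it straightforward to map the NER output back to its database row.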

Use ABNER to recognize named entities.

ABNER writes its NER output as SGML files that use 5 tags in total:

  • PROTEIN
  • DNA
  • RNA
  • CELL_TYPE
  • CELL_LINE

Only abstracts containing DNA/gene information were needed, so abstracts without a "DNA" tag were filtered out.
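The DNA filter can be sketched with a regular expression. This assumes the SGML output wraps entities as `<DNA> ... </DNA>`; the helper names `dna_mentions` and `keep_dna_abstracts` are hypothetical and not part of ABNER.

```python
import re

# Non-greedy match so that several DNA entities in one abstract stay separate.
DNA_TAG = re.compile(r"<DNA>(.*?)</DNA>", re.DOTALL)


def dna_mentions(sgml_text):
    """Return the text of every DNA-tagged entity in one SGML document."""
    return [mention.strip() for mention in DNA_TAG.findall(sgml_text)]


def keep_dna_abstracts(tagged):
    """Keep only abstracts with at least one DNA tag.

    `tagged` maps a PubMed ID to the SGML text produced for that abstract.
    """
    return {pid: text for pid, text in tagged.items() if DNA_TAG.search(text)}
```

The extracted mentions are raw surface strings, so they still need to be mapped to official gene names before use.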

Build a SQLite database that associates each GeneID with the official Ensembl gene name, for gene name normalization.

Filter out abstracts without official gene names.

Prepare files for bioevent parsing with TEES.
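A rough sketch of the normalization lookup follows. The schema (`genes` with `gene_id` and `official_name`), the case-insensitive exact match, and both function names are assumptions for illustration; the real project may match names differently.

```python
import sqlite3


def build_gene_db(pairs, db_path=":memory:"):
    """Create a GeneID -> official name table from (gene_id, name) pairs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genes ("
        "gene_id TEXT PRIMARY KEY, official_name TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO genes VALUES (?, ?)", pairs)
    conn.commit()
    return conn


def official_names(conn, mentions):
    """Return the official names matched by the mentions, without duplicates.

    An abstract whose mentions yield an empty list would be filtered out.
    """
    found = []
    for mention in mentions:
        row = conn.execute(
            "SELECT official_name FROM genes "
            "WHERE official_name = ? COLLATE NOCASE",
            (mention.strip(),),
        ).fetchone()
        if row and row[0] not in found:
            found.append(row[0])
    return found
```

Exact matching is deliberately strict: a mention such as "p53 gene" does not resolve to TP53 unless an alias table is added.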

Step 8. S8_UsingTEES

Use TEES to parse bioevents.

Parse the output of TEES and save related information into a SQLite database.

P.S.

  • The gene names in the SQLite database are official gene names.
  • Positive_regulation and Negative_regulation were merged into Regulation.
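The merge of the two regulation subtypes amounts to a small mapping applied before the triples are stored. The sketch below assumes triples are `(gene, disease, bioevent)` tuples; the function names are hypothetical.

```python
# Event types that are collapsed into a single category before storage.
MERGED_EVENTS = {
    "Positive_regulation": "Regulation",
    "Negative_regulation": "Regulation",
}


def normalize_event(event_type):
    """Map both regulation subtypes to Regulation; leave other types as-is."""
    return MERGED_EVENTS.get(event_type, event_type)


def normalize_triples(triples):
    """Apply normalize_event to the bioevent field of each triple."""
    return [
        (gene, disease, normalize_event(event))
        for gene, disease, event in triples
    ]
```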

Retrieve data from built SQLite database by any of the genes, diseases or bioevents.
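Retrieval by any combination of gene, disease, and bioevent could be sketched as a parameterized query. The `triples` table, its column names, and the `search` signature are assumptions for illustration only.

```python
import sqlite3


def create_triples_table(conn):
    """Create the (hypothetical) table holding the extracted triples."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        "pubmed_id TEXT, gene TEXT, disease TEXT, bioevent TEXT)"
    )


def search(conn, gene=None, disease=None, bioevents=None):
    """Return rows matching every filter that is given.

    `bioevents` is an optional list of event types; a row matches if its
    bioevent is any one of them. With no filters, all rows are returned.
    """
    clauses, params = [], []
    if gene is not None:
        clauses.append("gene = ?")
        params.append(gene)
    if disease is not None:
        clauses.append("disease = ?")
        params.append(disease)
    if bioevents:
        marks = ", ".join("?" for _ in bioevents)
        clauses.append(f"bioevent IN ({marks})")
        params.extend(bioevents)
    sql = "SELECT pubmed_id, gene, disease, bioevent FROM triples"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY pubmed_id, bioevent"
    return conn.execute(sql, params).fetchall()
```

For instance, `search(conn, gene="TP53", bioevents=["Gene_expression", "Regulation"])` would express a query for one gene restricted to two bioevent types. Values are always passed as bound parameters, so user-supplied search terms are never spliced into the SQL text.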

Results

Flowchart of NLP Procedures

Flowchart

A Search Example

Gene name: TP53

Bioevents: Gene_expression & Regulation

Search Results