A Natural Language Processing (NLP) exploration of the bioinformatics literature, carried out in the autumn semester of 2016. To answer the question "which gene is associated with which disease through which bioevent", the Turku Event Extraction System (TEES) was used to extract (Gene Name, Disease Name, Bioevent) triples. The resulting database can then be queried by gene, disease, or bioevent, and the matching records displayed.
The bioevent types covered are:
- Gene_expression
- Regulation (Positive_regulation or Negative_regulation)
- Binding
- Localization
- Transcription
- Phosphorylation
Step 1. S1_ParseTextToSQL.py
Extract PubMed_ID, Title, and Abstract from a raw CSV file into a SQLite database.
For how to download abstracts from PubMed, a previous exercise is a good illustration.
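Step 1 could be sketched roughly as follows. This is a minimal illustration, not the actual `S1_ParseTextToSQL.py`: the table name `abstracts`, its column names, and the CSV header names `PubMed_ID`, `Title`, `Abstract` are assumptions.

```python
import csv
import sqlite3


def load_abstracts(conn, csv_lines):
    """Copy PubMed_ID, Title and Abstract from CSV lines into SQLite.

    `csv_lines` is any iterable of CSV text lines (for example an open file).
    Table and column names here are hypothetical, chosen for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts ("
        "pubmed_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    reader = csv.DictReader(csv_lines)  # assumes the CSV has a header row
    rows = [(r["PubMed_ID"], r["Title"], r["Abstract"]) for r in reader]
    # INSERT OR REPLACE keeps the load idempotent if the script is re-run.
    conn.executemany("INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With a real file, the call would look like `load_abstracts(sqlite3.connect("pubmed.db"), open("raw.csv", newline="", encoding="utf-8"))`, where both file names are placeholders.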
Step 2. S2_PrepareForNER.py
Read abstracts from the SQLite database, then write them to .txt files for Named Entity Recognition (NER) with ABNER.
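A possible shape for this export step is shown below. It assumes one plain-text file per abstract, named after the PubMed ID, and a hypothetical `abstracts` table with `pubmed_id` and `abstract` columns; the real script may lay out its files differently.

```python
import sqlite3
from pathlib import Path


def export_abstracts(conn, out_dir):
    """Write each non-empty abstract to <out_dir>/<pubmed_id>.txt."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    query = "SELECT pubmed_id, abstract FROM abstracts"  # hypothetical schema
    for pubmed_id, abstract in conn.execute(query):
        if not abstract:
            continue  # nothing to tag in an empty abstract
        target = out_path / f"{pubmed_id}.txt"
        target.write_text(abstract.strip() + "\n", encoding="utf-8")
        written.append(target)
    return written
```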
Step 3. S3_UsingABNER
Use ABNER to recognize named entities.
Step 4. S4_ParseNERResult.py
ABNER writes its NER output as SGML files that use 5 tags in total:
- PROTEIN
- DNA
- RNA
- CELL_TYPE
- CELL_LINE
Only abstracts containing DNA/gene information are needed, so abstracts without a "DNA" tag are filtered out.
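The DNA filter can be illustrated with a small regular-expression sketch. It assumes the SGML output marks entities inline as `<TAG> text </TAG>` with no nesting; the function names are hypothetical and are not taken from `S4_ParseNERResult.py`.

```python
import re

ABNER_TAGS = ("PROTEIN", "DNA", "RNA", "CELL_TYPE", "CELL_LINE")
# Matches <TAG> ... </TAG> for any of the five tags (assumes no nesting).
ENTITY_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(ABNER_TAGS), re.DOTALL)


def extract_entities(sgml_text):
    """Return (tag, entity_text) pairs in order of appearance."""
    return [
        (tag, " ".join(text.split()))  # collapse padding and line breaks
        for tag, text in ENTITY_RE.findall(sgml_text)
    ]


def has_dna_entity(sgml_text):
    """True if the tagged abstract contains at least one DNA entity."""
    return any(tag == "DNA" for tag, _ in extract_entities(sgml_text))
```

Abstracts for which `has_dna_entity` returns `False` would be dropped before the gene name normalization steps.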
Step 5. S5_ParseHumanGene.py
Build a SQLite database that maps each GeneID to the official Ensembl gene name, for gene name normalization.
Step 6. S6_NormGeneName.py
Filter out abstracts that contain no official gene names.
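Steps 5 and 6 together amount to a lookup table plus a filter. The sketch below assumes a hypothetical `genes(gene_id, official_name)` table and a simple case-insensitive exact match; the matching rule used by `S6_NormGeneName.py` may well be more elaborate.

```python
import sqlite3


def build_gene_table(conn, genes):
    """Store (gene_id, official_name) pairs; the schema is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS genes ("
        "gene_id TEXT PRIMARY KEY, official_name TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO genes VALUES (?, ?)", genes)
    conn.commit()


def official_gene_names(conn, mentions):
    """Return the official names matched by the given mentions, in order."""
    found = []
    for mention in mentions:
        row = conn.execute(
            "SELECT official_name FROM genes "
            "WHERE official_name = ? COLLATE NOCASE",
            (mention.strip(),),
        ).fetchone()
        if row and row[0] not in found:
            found.append(row[0])
    return found
```

An abstract would be kept only if `official_gene_names` returns a non-empty list for its DNA mentions.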
Step 7. S7_PrepareForEventFinding.py
Prepare input files for bioevent parsing with TEES.
Step 8. S8_UsingTEES
Use TEES to parse bioevents.
Step 9. S9_ParseEventResult.py
Parse the output of TEES and save related information into a SQLite database.
P.S.
- Gene names in the SQLite database are official gene names.
- Positive_regulation and Negative_regulation are merged into Regulation.
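The merge of the two regulation subtypes can be expressed as a small mapping. The names `MERGED_TYPES` and `normalize_event_type` are illustrative only.

```python
# Event types that are collapsed into a single label before storage.
MERGED_TYPES = {
    "Positive_regulation": "Regulation",
    "Negative_regulation": "Regulation",
}


def normalize_event_type(event_type):
    """Map regulation subtypes to Regulation; leave other types unchanged."""
    return MERGED_TYPES.get(event_type, event_type)
```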
Step 10. S10_RetrieveData.py
Retrieve data from the resulting SQLite database by gene, disease, or bioevent.
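A query helper for this step might look like the sketch below. It assumes a hypothetical `events(gene, disease, bioevent, pubmed_id)` table; the schema actually written by `S9_ParseEventResult.py` is not shown here and may differ.

```python
import sqlite3


def search_events(conn, gene=None, disease=None, bioevent=None):
    """Return rows matching every filter that is given (None means 'any')."""
    clauses, params = [], []
    filters = (("gene", gene), ("disease", disease), ("bioevent", bioevent))
    for column, value in filters:
        if value is not None:
            clauses.append(f"{column} = ?")  # column names are fixed above
            params.append(value)
    sql = "SELECT gene, disease, bioevent, pubmed_id FROM events"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql + " ORDER BY pubmed_id", params).fetchall()
```

Searching by gene alone would then be `search_events(conn, gene="TP53")`, and the filters can be combined freely.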
Gene name: TP53