NLP-Exploration-in-Bioinformatics

A Natural Language Processing (NLP) exploration of the bioinformatics literature, carried out in the autumn semester of 2016. To answer the question "which gene associates with which disease through which bioevent", the Turku Event Extraction System (TEES) was used to extract triples of gene name, disease name, and bioevent. The derived database can then be searched by any gene, disease, or bioevent, and the search results displayed.

Bioevents

Gene_expression (expression)
Regulation (Positive_regulation or Negative_regulation)
Binding
Localization
Transcription
Phosphorylation

Procedure

Step 1. S1_ParseTextToSQL.py

Extract PubMed_ID, Title, and Abstract from the raw CSV file into a SQLite database.

For how to download abstracts from PubMed, a previous exercise is a good illustration.
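The extraction in Step 1 can be sketched as follows. This is a minimal illustration, not the contents of S1_ParseTextToSQL.py: the table name `abstracts`, its column names, and the CSV header names are assumptions.

```python
import csv
import io
import sqlite3


def load_abstracts(csv_file, conn):
    """Copy the PubMed_ID, Title and Abstract columns of a CSV file into SQLite."""
    # Hypothetical schema: one row per abstract, keyed by PubMed ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts ("
        "pubmed_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["PubMed_ID"], r["Title"], r["Abstract"]) for r in reader]
    conn.executemany("INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up one-row CSV, used only to show the call.
sample = io.StringIO(
    "PubMed_ID,Title,Abstract\n"
    "1,Example title,BRCA1 expression is altered in breast cancer.\n"
)
conn = sqlite3.connect(":memory:")
n_loaded = load_abstracts(sample, conn)
```

For a real run, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.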

Step 2. S2_PrepareForNER.py

Read abstracts from the SQLite database, then write them to txt files for Named Entity Recognition (NER) with ABNER.
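One way to sketch Step 2 is to write each abstract to its own text file. The one-file-per-abstract layout, the file naming, and the `abstracts` table are assumptions made for illustration; the actual script may batch the abstracts differently.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_abstracts(conn, out_dir):
    """Write every abstract to <out_dir>/<pubmed_id>.txt and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    query = "SELECT pubmed_id, abstract FROM abstracts"  # hypothetical table
    for pubmed_id, abstract in conn.execute(query):
        path = out_dir / f"{pubmed_id}.txt"
        path.write_text(abstract, encoding="utf-8")
        written.append(path)
    return written


# Made-up in-memory database, used only to show the call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abstracts (pubmed_id TEXT, title TEXT, abstract TEXT)")
conn.execute(
    "INSERT INTO abstracts VALUES (?, ?, ?)",
    ("1", "Example title", "BRCA1 expression is altered in breast cancer."),
)
files = export_abstracts(conn, tempfile.mkdtemp())
```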

Step 3. S3_UsingABNER

Use ABNER to recognize named entities.

Step 4. S4_ParseNERResult.py

The NER output from ABNER is a set of SGML files with five tags in total:

PROTEIN
DNA
RNA
CELL_TYPE
CELL_LINE

Only abstracts containing DNA/gene information are needed, so abstracts without a "DNA" tag are filtered out.
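The filtering in Step 4 can be sketched with a regular expression. This assumes the SGML output marks entities inline as `<TAG> mention </TAG>`; the function names are hypothetical and the sample sentence is made up.

```python
import re

# Matches the five ABNER tags listed above, written inline as <TAG> ... </TAG>.
TAG_PATTERN = re.compile(
    r"<(PROTEIN|DNA|RNA|CELL_TYPE|CELL_LINE)>(.*?)</\1>", re.DOTALL
)


def extract_entities(sgml_text):
    """Group the tagged mentions of one abstract by tag name."""
    entities = {}
    for tag, mention in TAG_PATTERN.findall(sgml_text):
        entities.setdefault(tag, []).append(mention.strip())
    return entities


def has_dna(sgml_text):
    """Return True if the abstract contains at least one DNA-tagged mention."""
    return "DNA" in extract_entities(sgml_text)


tagged = "<PROTEIN> p53 </PROTEIN> binds the <DNA> BRCA1 promoter </DNA> ."
entities = extract_entities(tagged)
keep = has_dna(tagged)
```

Abstracts for which `has_dna` returns `False` would be dropped before the gene name normalization steps.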

Step 5. S5_ParseHumanGene.py

Build a SQLite database that associates each GeneID with the official Ensembl gene name, for gene name normalization.

Step 6. S6_NormGeneName.py

Filter out abstracts without official gene names.
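Steps 5 and 6 together can be sketched as a lookup table plus a filter. The `genes` table, the case-insensitive exact match, and the sample mentions are assumptions for illustration; the real scripts may match gene names more loosely.

```python
import sqlite3

# Hypothetical lookup table mapping a gene ID to its official name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY, official_name TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("672", "BRCA1"), ("7157", "TP53")],  # illustrative rows only
)


def official_names(mentions, conn):
    """Return the official gene names matched by a list of tagged mentions."""
    found = set()
    for mention in mentions:
        row = conn.execute(
            "SELECT official_name FROM genes "
            "WHERE official_name = ? COLLATE NOCASE",
            (mention.strip(),),
        ).fetchone()
        if row is not None:
            found.add(row[0])
    return found


# Made-up DNA mentions per abstract, keyed by PubMed ID.
abstract_mentions = {"1": ["brca1", "promoter region"], "2": ["unknown locus"]}

# Keep only abstracts with at least one official gene name.
kept = {}
for pubmed_id, mentions in abstract_mentions.items():
    names = official_names(mentions, conn)
    if names:
        kept[pubmed_id] = sorted(names)
```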

Step 7. S7_PrepareForEventFinding.py

Prepare the input files for bioevent parsing with TEES.

Step 8. S8_UsingTEES

Use TEES to parse bioevents.

Step 9. S9_ParseEventResult.py

Parse the output of TEES and save the related information into a SQLite database.

P.S.

The gene names in the SQLite database are official gene names.
Positive_regulation and Negative_regulation are merged as Regulation.
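The merging rule in the P.S. can be sketched as below. The sketch does not parse TEES output; it starts from already extracted (PubMed ID, gene, disease, bioevent) tuples, and the `events` table and sample rows are assumptions made for illustration.

```python
import sqlite3

# Both regulation subtypes collapse into the single label "Regulation".
MERGED = {
    "Positive_regulation": "Regulation",
    "Negative_regulation": "Regulation",
}


def normalize_event(event_type):
    """Map regulation subtypes to "Regulation"; leave other types unchanged."""
    return MERGED.get(event_type, event_type)


def store_triples(conn, triples):
    """Save (pubmed_id, gene, disease, bioevent) rows with merged event types."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "pubmed_id TEXT, gene TEXT, disease TEXT, bioevent TEXT)"
    )
    rows = [
        (pid, gene, disease, normalize_event(event))
        for pid, gene, disease, event in triples
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return rows


# Made-up triples, used only to show the call.
conn = sqlite3.connect(":memory:")
stored = store_triples(
    conn,
    [
        ("1", "BRCA1", "breast cancer", "Positive_regulation"),
        ("2", "TP53", "lung cancer", "Binding"),
    ],
)
```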

Step 10. S10_RetrieveData.py

Retrieve data from the built SQLite database by any gene, disease, or bioevent.
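A retrieval of this kind can be sketched as a query with optional filters. The `events` table, its columns, and the sample rows are assumptions for illustration and may differ from what S10_RetrieveData.py actually uses.

```python
import sqlite3


def search(conn, gene=None, disease=None, bioevent=None):
    """Return event rows matching every filter that is not None."""
    clauses, params = [], []
    # Column names are fixed here; only the values come from the caller.
    for column, value in (("gene", gene), ("disease", disease), ("bioevent", bioevent)):
        if value is not None:
            clauses.append(f"{column} = ? COLLATE NOCASE")
            params.append(value)
    sql = "SELECT pubmed_id, gene, disease, bioevent FROM events"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql + " ORDER BY pubmed_id", params).fetchall()


# Made-up rows, used only to show the call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (pubmed_id TEXT, gene TEXT, disease TEXT, bioevent TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("1", "BRCA1", "breast cancer", "Gene_expression"),
        ("2", "TP53", "lung cancer", "Regulation"),
    ],
)
by_gene = search(conn, gene="brca1")
by_event = search(conn, bioevent="Regulation")
```

Passing values as query parameters rather than formatting them into the SQL string keeps the search safe for arbitrary user input.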

Results

Flowchart of NLP Procedures

A Search Example
