A semi-automated curation tool to identify and harmonise erroneous attributes in the BioSamples database.
- Python 3.6 (the exact version is needed, otherwise dependencies break)
- Neo4j 4
python3 -m venv env
pip3 install -r requirements.txt
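These commands create the environment but do not activate it; a minimal end-to-end setup sketch (assuming the `env` directory name used above):

```sh
python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
```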
- Create a `.env` file in the root directory and copy the content of `.env.docker` into the `.env` file
- Run `collection/biosamples_crawler.py` to download all the samples to your local disk
- Run `preprocess/transform.py` to generate downstream files. This generates the `unique_attributes`, `unique_values`, etc. files
- Run `preprocess/clean.py` to normalise attributes and recalculate coexistence values
- Run `analysis/summary.py` to visualise attribute counts
- Run `analysis/pair_matching_edit_distance.py` to generate syntactically similar pairs based on edit distance
- Run `analysis/pair_matching_dictionary.py` to filter out pairs that have one dictionary-matched attribute
- Run `analysis/pair_matching_word_base.py` to generate syntactically similar pairs based on word base format
- Run `analysis/pair_matching_values.py` to generate similar pairs based on values
- Run `analysis/pair_matching_synonym.py` to generate semantically similar pairs based on synonyms
- Run `analysis/attribute_coexistence.py` to calculate coexistence probabilities for attributes
- Run `integration/graph_builder.py build_gephi_coexistence_graph()` to generate the Gephi input file
- Use the generated file to analyse attribute connections in Gephi
- Run `preprocess/generate_features.py` to generate the feature file for clustering
- Run `analysis/cluster.py` to visualise clusters
- Run `run_docker_compose.sh`. This will spin up Neo4j and the Flask webapp
- Load data into Neo4j by running `analysis/graph_build.py build_neo4j_curation_graph()`
- Create a user in Neo4j by running `web/auth_service.py`
- Go to `localhost:5000` in a browser and log in with the created user and password
Similar to the previous version, we start by collecting data. Collecting data from the wwwdev environment is quite slow, and if we try to increase the number of threads the server simply stops responding. Therefore, having a way to continue from the last stopped position is handy. Currently we need to provide the last collected file as an argument to achieve this.
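A minimal sketch of how resuming could work, assuming the crawler writes one JSON file per batch into a download directory; the directory, file-naming scheme and resume argument are assumptions, not the crawler's actual interface:

```python
import os

def find_last_collected(download_dir):
    """Return the most recently collected file so crawling can resume after it (hypothetical helper)."""
    files = sorted(f for f in os.listdir(download_dir) if f.endswith(".json"))
    return files[-1] if files else None

# Hypothetical usage: pass the last collected file to the crawler as its resume argument.
last_file = find_last_collected("data/samples")
print(f"Resume crawling after: {last_file}")
```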
First, retrieve all the data through the JSON API and save it to the local file system.
We will save the whole sample without changing its data.
- Collect data in small files and combine them into bigger files for easier handling
- Generate summary statistics of the data, to understand the data
First, we will try to preprocess the data (attributes) as much as possible to combine them. Sample attributes come in different representations:
- Camel case attributes (collection date, collection_date, Collection_Date, COLLECTION_DATE, Collection date)
- Snake case attributes
- Attribute words separated by space
- Different uses of non-word characters (Age, 'Age, cell-treatment, bmi (kg/m2), lot, lot#, cd8ccr7 cd45, %cd8ccr7+ cd45+, %cd8ccr7+ cd45-, %cd8ccr7-cd45 -, %cd8ccr7-cd45+)
- Different uses of metrics in attributes (weight kg, weight_kg, height m, height_m, Survival (days), Overall Survival (months), overall survival (months), Time (min), Age (years))
We will convert attributes into two representations and compare them to filter the best attributes (a sketch of the two conversions follows the list below):
- Simple case conversion
  - Remove leading non-word characters (only - and ')
  - Replace dash with underscore
  - Remove double quotes
  - Replace backward slash with forward slash and remove spaces around the forward slash
  - Replace underscore with space
  - Strip leading, trailing and extra spaces
  - Remove spaces inside parentheses/brackets
- Camel case to space-separated words
  - Convert camel case to snake case
  - Then apply all other steps from the previous process
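A minimal sketch of the two conversions, with function names and regular expressions of our own choosing (an approximation of the steps above, not the code in `preprocess/clean.py`):

```python
import re

def simple_clean(attr):
    """Simple case conversion: lower-case, unify separators, strip stray characters."""
    attr = attr.lower().strip("-'")                       # remove leading/trailing - and '
    attr = attr.replace('"', '')                          # remove double quotes
    attr = attr.replace('\\', '/')                        # backward slash -> forward slash
    attr = re.sub(r'\s*/\s*', '/', attr)                  # remove spaces around forward slash
    attr = attr.replace('-', '_').replace('_', ' ')       # dash -> underscore -> space
    attr = re.sub(r'\(\s*([^)]*?)\s*\)', r'(\1)', attr)   # trim spaces just inside parentheses
    return re.sub(r'\s+', ' ', attr).strip()              # strip leading, trailing and extra spaces

def camel_clean(attr):
    """Camel case conversion: split camelCase into words, then apply the simple steps."""
    attr = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', attr)
    return simple_clean(attr)

print(simple_clean("Collection_Date"))   # collection date
print(camel_clean("collectionDate"))     # collection date
```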
Then compare the results of the two processes for further matches.
PIPELINE (automatic attribute curation): transform each attribute using the two methods and compare the two transformed entities. If they are the same, accept the result. If they are not, check for an existing attribute matching either of them; if one exists, select it. If there is no existing attribute, send both through the dictionary test; if the dictionary test passes, use that result, and if it fails, use the original attribute (should we try to correct the spelling?).
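A minimal sketch of that decision logic, assuming helper functions (`simple_clean` and `camel_clean` from the sketch above, plus lookups against known attributes and a word dictionary) that stand in for the real pre-processing and lookup code:

```python
def curate(attr, existing_attributes, dictionary):
    """Hypothetical automatic curation pipeline for a single attribute name."""
    a = simple_clean(attr)
    b = camel_clean(attr)
    if a == b:                       # both transforms agree: accept the result
        return a
    if a in existing_attributes:     # prefer a transform that matches a known attribute
        return a
    if b in existing_attributes:
        return b
    for candidate in (a, b):         # dictionary test: every word must be a dictionary word
        if all(word in dictionary for word in candidate.split()):
            return candidate
    return attr                      # nothing passed: fall back to the original attribute
```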
After the pre-processing step we have attributes that adhere to a similar format. In the analysis step we first need to find syntactically similar pairs, to identify spelling mistakes, pluralisation, etc.
- Use an edit distance function to find syntactically similar pairs
- Use a dictionary to find spelling mistakes among syntactically similar pairs (see the sketch below)
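A minimal sketch of the edit-distance pairing plus the dictionary check, using only the Python standard library (the similarity threshold and the small word list are assumptions):

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(attributes, threshold=0.9):
    """Yield attribute pairs whose similarity ratio is at or above the threshold."""
    for a, b in combinations(attributes, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            yield a, b

dictionary = {"organism", "tissue", "treatment"}   # assumed word list

for a, b in similar_pairs(["organism", "orgnism", "tissue", "tisue"]):
    # keep pairs where exactly one side is dictionary matched: likely a misspelling
    if (a in dictionary) != (b in dictionary):
        good, bad = (a, b) if a in dictionary else (b, a)
        print(f"possible misspelling: {bad} -> {good}")
```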
Add different scoring mechanisms and combine the scores to filter out the best suggestions. Analyse important attributes: principal component analysis. Cluster samples (first into large clusters, then into finer-grained clusters inside them). User interface.
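One way the scores could be combined is a weighted sum over the individual matching signals; the signal names and weights below are illustrative assumptions, not the project's actual scoring:

```python
def combined_score(scores, weights=None):
    """Combine per-method similarity scores (each in [0, 1]) into one ranking score."""
    weights = weights or {"edit_distance": 0.4, "word_base": 0.2, "values": 0.2, "synonym": 0.2}
    return sum(weights.get(name, 0.0) * score for name, score in scores.items())

# Hypothetical scores for one candidate pair from the different matching scripts
pair_scores = {"edit_distance": 0.93, "word_base": 1.0, "values": 0.6, "synonym": 0.0}
print(combined_score(pair_scores))   # 0.692
```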
- attributes with a metric in parentheses [age vs age (days)]
- abbreviated values
In the characteristics (attributes) section we have key-value pairs. Once we ignore the IRI, we have an attribute and its related values. Our main goal is to minimise the number of attributes and values.
- Have a style guide (e.g. instead of using "-" to separate words, use " ")
- Misspelled words: rename back to the correct spelling, no abbreviations
- Find duplicated attributes: this might require a look into the corresponding values
- Ontology mapping: use Zooma to get the ontology; this might be a long shot
- Find co-occurrence of attributes (a counting sketch follows this list)
- Find synonyms, improve with the ontology
- Cluster similar samples
- Learn from curated samples? What type of curation?
- Have an attribute and value dictionary
- Identify the submitter, give feedback about the errors, require changes?
- Can we identify metrics?
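A minimal sketch of how attribute co-occurrence probabilities could be counted, treating each sample as a set of attribute names (an illustration, not necessarily what `analysis/attribute_coexistence.py` does):

```python
from collections import Counter
from itertools import combinations

def coexistence_probabilities(samples):
    """P(b | a): fraction of samples containing attribute a that also contain attribute b."""
    attr_counts, pair_counts = Counter(), Counter()
    for attrs in samples:
        attr_counts.update(attrs)
        pair_counts.update(combinations(sorted(attrs), 2))
    probs = {}
    for (a, b), n in pair_counts.items():
        probs[(a, b)] = n / attr_counts[a]
        probs[(b, a)] = n / attr_counts[b]
    return probs

samples = [{"organism", "tissue", "age"}, {"organism", "age"}, {"organism", "sex"}]
probs = coexistence_probabilities(samples)
print(probs[("age", "organism")])   # 1.0       (every sample with age also has organism)
print(probs[("organism", "age")])   # 0.666...  (2 of the 3 samples with organism have age)
```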
We have 40,000 different attributes. With 40k attributes, clustering becomes almost impossible.
Dimensionality reduction is one popular technique to remove noisy (i.e. irrelevant) and redundant attributes (also known as features). Dimensionality reduction techniques can be categorised mainly into feature extraction and feature selection. In the feature extraction approach, features are projected into a new space with lower dimensionality; examples include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Singular Value Decomposition (SVD), to name a few. The feature selection approach, on the other hand, aims to select a small subset of features that minimises redundancy and maximises relevance to the target (i.e. the class label); popular techniques include Information Gain, Relief, Chi-square, Fisher Score and Lasso, to name a few.
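With around 40,000 mostly binary attribute features, a sparse matrix plus a projection is a natural first step; a minimal sketch using scikit-learn's DictVectorizer and TruncatedSVD (the feature encoding here is an assumption, not what `preprocess/generate_features.py` produces):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import TruncatedSVD

# Each sample represented as a bag of attributes (1 = attribute present)
samples = [
    {"organism": 1, "tissue": 1, "age": 1},
    {"organism": 1, "age": 1},
    {"organism": 1, "sex": 1},
]

vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(samples)   # sparse samples-by-attributes matrix

svd = TruncatedSVD(n_components=2)      # project onto a low-dimensional space before clustering
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)                  # (3, 2)
```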
source venv/bin/activate
pip3 freeze > requirements.txt
pip3 install -r requirements.txt
python3 setup.py sdist