Source code and resources for Sequence Tagging for Biomedical Extractive Question Answering.
Please cite:
```bibtex
@article{yoon-etal-2022-sequence,
    author  = {Yoon, Wonjin and Jackson, Richard and Lagerberg, Aron and Kang, Jaewoo},
    title   = "{Sequence tagging for biomedical extractive question answering}",
    journal = {Bioinformatics},
    year    = {2022},
    month   = {06},
    issn    = {1367-4803},
    doi     = {10.1093/bioinformatics/btac397},
}
```
Wonjin Yoon, Richard Jackson, Aron Lagerberg, Jaewoo Kang, Sequence tagging for biomedical extractive question answering, Bioinformatics, 2022, btac397, https://doi.org/10.1093/bioinformatics/btac397
In the paper, we investigated the characteristics of naturally posed biomedical questions as a preliminary study. In the hope of exploring the nature of biomedical question answering together with the community, we have made the collection of biomedical questions available for download in this repository.
See the online preview of the questions here.
The questions come from the Clinical Questions Collection (CQC) (D’Alessandro et al., 2004; Ely et al., 1999, 1997) and PubMed queries (Herskovic et al., 2007), filtered with our query screening algorithm (please see Section 3 of our paper and the appendix).
In our analysis of a sampled question subset (with an equal number of questions randomly drawn from each of the two sources), about half of the questions fall into the "Others" category, meaning they are not answerable under the current settings. The reasons vary and include:
- One or more conditions are missing, as suggested in ConditionalQA (Sun et al., 2021).
- The question is too abstract or requires a long answer that cannot be extracted from a paragraph (e.g., "what is the history of calcium?" or "what can i do?").
- Incompleteness of our query screening algorithm (e.g., "who diagnostic criteria" or "Will and Joondeph").
As future work, I am interested in further exploring the characteristics of biomedical questions and in building QA datasets of naturally posed biomedical questions.
The work on the characteristics of naturally posed biomedical questions in this paper is a preliminary study, and I believe there is considerable room for improvement.
Researchers who want to collaborate with me on this topic are welcome! To discuss, please contact me at wonjin.info (_at_) gmail.com. If you are planning to attend the 10th BioASQ (2022), we can also discuss it there!
Datasets (including the biomedical questions) can be downloaded from here.
To use the BioASQ dataset, you need to register on the BioASQ website, which authorizes the use of the dataset.
For an example training and evaluation bash script, please see train_list_example.sh.
We tested our code using Python 3.8 on a GPU-enabled Linux (Ubuntu 18.04) machine.
Please install PyTorch (>=1.7.1, <2.0) from https://pytorch.org/get-started/previous-versions/ before following this section.
```bash
pip install --use-feature=2020-resolver -r requirements.txt
```
Note that pip updated its dependency resolver in October 2020 (see the Announcement). Please include --use-feature=2020-resolver to make the installation work. (On recent pip versions, the new resolver is the default, so the flag can be omitted.)
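As a concrete example, the following sketch installs a PyTorch build in the supported range; the CUDA 11.1 wheel here is an assumption, so pick the build that matches your hardware from the link above.

```bash
# A possible PyTorch install in the supported range (>=1.7.1, <2.0);
# the +cu111 build is an assumption -- choose the wheel matching your CUDA version.
pip install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
```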
The following instructions are example scripts for training a language model (LM) on the BioASQ 8b dataset.
In the first part, we set environment variables. (We assume that you have downloaded and decompressed the datasets into $HOME/DATA.)
```bash
export SEED=0
export VERSION=8
# To train LMs on the list-question dataset
export LANGMODEL_DIR=<Path to BioLM: it should be trained on SQuAD>
export DATA_DIR=$HOME/DATA/BioASQ-SeqTagQA/${VERSION}b-list-20201030
export MAX_EPOCH=400
export LEARN_RATE=5e-6
export OUTPUT_DIR=bioasq8b-list
mkdir $OUTPUT_DIR
```
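Optionally, verify that the dataset files used in the steps below are in place (the file names are taken from the training and detokenization commands that follow):

```bash
# Sanity check (optional): the following steps expect these files to exist
ls ${DATA_DIR}/train_dev.tsv ${DATA_DIR}/test.tsv ${DATA_DIR}/original_test.json
```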
The following lines will run the training of the model:
```bash
python run_eqa.py --model_struct=linear \
 --do_train=true --do_eval=true --do_predict=false \
 --model_name_or_path=$LANGMODEL_DIR \
 --per_device_train_batch_size=18 --max_seq_length=512 \
 --train_file=${DATA_DIR}/train_dev.tsv --validation_file=${DATA_DIR}/test.tsv --test_file=${DATA_DIR}/test.tsv \
 --output_dir=$OUTPUT_DIR --save_total_limit=100 \
 --learning_rate=${LEARN_RATE} --seed=${SEED} \
 --num_train_epochs=${MAX_EPOCH} \
 --save_steps=4000
```
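With --save_steps=4000, checkpoints are written into $OUTPUT_DIR as checkpoint-<step> directories (the standard Hugging Face Trainer naming); you can list them with:

```bash
# Each saved checkpoint is a directory named checkpoint-<global step>
ls -d ${OUTPUT_DIR}/checkpoint-*
```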
Then select the latest checkpoint and run prediction on the test dataset using that checkpoint.
```bash
export CKPT=84000
export SUB_OUTPUT_DIR=${OUTPUT_DIR}/checkpoint-${CKPT}
```
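The checkpoint number 84000 above is only an example. Here is a small sketch, assuming the checkpoint-<step> naming shown earlier, that picks the highest-numbered checkpoint automatically:

```bash
# Select the largest checkpoint step found under $OUTPUT_DIR (assumes checkpoint-<step> naming)
export CKPT=$(ls -d ${OUTPUT_DIR}/checkpoint-* | sed 's/.*checkpoint-//' | sort -n | tail -1)
export SUB_OUTPUT_DIR=${OUTPUT_DIR}/checkpoint-${CKPT}
```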
```bash
python run_eqa.py --model_struct=linear \
 --do_train=false --do_eval=false --do_predict=true \
 --model_name_or_path=$SUB_OUTPUT_DIR \
 --per_device_train_batch_size=18 --per_device_eval_batch_size=32 --max_seq_length=512 \
 --test_file=${DATA_DIR}/test.tsv \
 --output_dir=$SUB_OUTPUT_DIR --save_total_limit=50 \
 --learning_rate=${LEARN_RATE} --seed=${SEED} \
 --save_steps=4000
```
Please note that the model_name_or_path parameter is changed to the selected checkpoint.
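After prediction finishes, the raw predictions should be in the checkpoint directory (this is the file the detokenization step below reads):

```bash
# Peek at the raw token-level predictions consumed by detokenize-list.py
head -n 5 ${SUB_OUTPUT_DIR}/predictions.txt
```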
Then detokenize the output and convert it to the BioASQ format.
```bash
# To detokenize the output
python detokenize-list.py \
 --test_path=${DATA_DIR}/test.tsv \
 --predictions_path=${SUB_OUTPUT_DIR}/predictions.txt \
 --original_test_path=${DATA_DIR}/original_test.json \
 --output_dir=${SUB_OUTPUT_DIR}
```
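This produces NER_result_BioASQ.json, the file the evaluation step below expects. As a quick sanity check, assuming the standard BioASQ JSON layout with a top-level "questions" list:

```bash
# Count the questions in the converted output (assumes a top-level "questions" list)
python -c "import json; print(len(json.load(open('${SUB_OUTPUT_DIR}/NER_result_BioASQ.json'))['questions']))"
```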
Finally, evaluate the output with the official BioASQ evaluation library. Please download it from here. The library requires Java.
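For reference, a sketch of fetching the library, assuming it is the Evaluation-Measures repository on the BioASQ GitHub organization:

```bash
# Assumption: the official evaluation code lives in the BioASQ/Evaluation-Measures repository
git clone https://github.com/BioASQ/Evaluation-Measures.git
```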
```bash
# Example script using the official BioASQ evaluation library
export EVAL_LIB=<Cloned official eval code path>/Evaluation-Measures
java -Xmx10G -cp $CLASSPATH:$EVAL_LIB/flat/BioASQEvaluation/dist/BioASQEvaluation.jar evaluation.EvaluatorTask1b -phaseB -e 5 \
 $HOME/DATA/BioASQ-golden/${VERSION}B_total_golden.json ${SUB_OUTPUT_DIR}/NER_result_BioASQ.json >> $OUTPUT_DIR/total_BioASQ_eval.log
```
For inquiries, please contact wonjin.info (_at_) gmail.com.