This is the code for the paper "Evaluating Transfer Learning for Simplifying GitHub READMEs", published in the proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023).
Our preprint is available on arXiv.
In this paper, we harvested a new software-related simplification dataset. A Transformer model was trained on both the Wikipedia-to-Simple-Wikipedia dataset and our newly proposed dataset. We experimented with transfer learning, which yields better results.
To run the code, please install the following packages (an example pip command is given after the list):
- PyGithub
- nltk
- pytorch
- numpy
- scipy
- BeautifulSoup
- pytorch-transformers
- pytorch-beam-search
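As a starting point, the packages can typically be installed with pip; the PyPI names torch and beautifulsoup4 (for pytorch and BeautifulSoup) are assumptions, and no versions are pinned here:

pip install PyGithub nltk torch numpy scipy beautifulsoup4 pytorch-transformers pytorch-beam-search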
This folder contains code for harvesting data from GitHub.
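For orientation, below is a minimal sketch of fetching READMEs with PyGithub. It is not the exact harvesting script: the token placeholder, the search query, and the repository limit are illustrative assumptions, not the selection criteria used in the paper.

from github import Github

# Illustrative only: token placeholder and query are assumptions.
g = Github("YOUR_GITHUB_TOKEN")
repos = g.search_repositories(query="stars:>1000", sort="stars")
for repo in repos[:5]:
    try:
        readme = repo.get_readme()  # raises if the repository has no README
        text = readme.decoded_content.decode("utf-8")
        print(repo.full_name, len(text))
    except Exception:
        continue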
aligner/
contains all steps for preprocessing the collected data and performing the alignment task.
The BERT checkpoint for the alignment task is from Jiang et al. (Neural CRF Model for Sentence Alignment in Text Simplification); we made only minor modifications to fit our case. The checkpoint can be accessed through this link.
For example, use the following command to align sentences:
python main.py --ipath=../data/db_eliminated_duplicate.txt --bert=../BERT_wiki --opath="../data/output.txt"
simplification/
contains the training, evaluation, and generation steps.
To train the model, use the command:
python3 train.py --config=training_config.json --model=model_config.json --save_path=to_path --data_source=wiki
You can specify the model configuration in the model_config.json file, and hyperparameters are adjustable in training_config.json. Two data sources are available for training: wiki and our software simplification corpus.
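For instance, to train on the software simplification corpus instead of wiki, the command would presumably look like the following (the exact value accepted by --data_source should be checked in train.py; software here is an assumption):

python3 train.py --config=training_config.json --model=model_config.json --save_path=to_path --data_source=software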
To generate simplified sentences using a model checkpoint, use the command:
python3 generate.py --model=model_checkpoint --path=src_sentence_file --beam=5 --to_path=write_path
After generating the simplified sentences, you can use the BLEU score to evaluate model performance with the command:
python3 evaluate.py --candidate=generated_sentences_file --reference=reference_sentences_file
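As a rough illustration of what a corpus-level BLEU evaluation involves (not necessarily evaluate.py's exact implementation; the file names below are placeholders), the score can be computed with NLTK:

# Illustrative sketch only; evaluate.py's exact implementation may differ.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

with open("generated_sentences_file") as f:
    candidates = [line.split() for line in f]
with open("reference_sentences_file") as f:
    references = [[line.split()] for line in f]  # one reference per candidate

smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short sentences
print("Corpus BLEU:", corpus_bleu(references, candidates, smoothing_function=smooth))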
To access the model checkpoints used in our survey, follow this link.
Since the BLEU score alone is not ideal for evaluating simplification systems, we also conducted a survey in this research. The survey annotations and scores can be found in the corresponding folder.
If you want to use our findings, please cite our paper:
@inproceedings{gao2023evaluating,
title={Evaluating Transfer Learning for Simplifying GitHub READMEs},
author={Gao, Haoyu and Treude, Christoph and Zahedi, Mansooreh},
booktitle={Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
pages={1548--1560},
year={2023}
}