This repository contains the code and data for the paper:
| | |
|---|---|
| Title | The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues |
| Authors | Anaïs Tack & Chris Piech |
| Abstract | How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: ∆ ability = −0.75; GPT-3: ∆ ability = −0.93). |
The code in this repository depends on the ParlAI framework, the OpenAI API, the Hugging Face transformers library, and the Stan library. Install the dependencies with:

```
pip install -r src/requirements.txt
```
The data in this repository depends on student-teacher utterances drawn from two datasets. For copyright reasons, these texts were removed from the repository and replaced by the tag `{COPYRIGHTED-TEXT}`.

To repopulate the data:
1. Download the Teacher-Student Chatroom Corpus and put the `*.tsv` files into `data/0_datasets/tscc/`.
2. Download the Educational Uptake Dataset and put `uptake_data.csv` into `data/0_datasets/uptake/`.
3. Run the following commands to repopulate the data with the missing utterances and prompts.

```
python -m src.utils.repopulate -t TSCC -d data/0_datasets/tscc
python -m src.utils.repopulate -t EduUptake -d data/0_datasets/uptake
```
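The repopulation step replaces each `{COPYRIGHTED-TEXT}` placeholder with the original utterance looked up in the downloaded corpus. A minimal sketch of that idea (the function and the `"id"`/`"text"` field names are illustrative assumptions; the actual logic lives in `src.utils.repopulate`):

```python
PLACEHOLDER = "{COPYRIGHTED-TEXT}"

def repopulate(record: dict, corpus: dict) -> dict:
    """Return a copy of `record` with the copyright placeholder replaced
    by the original utterance, looked up by identifier in `corpus`.
    Field names here are illustrative, not the repository's actual schema."""
    if record["text"] == PLACEHOLDER:
        return {**record, "text": corpus[record["id"]]}
    return record

# Example: one redacted record and its original utterance from the corpus.
corpus = {"tscc-001": "Well done! Can you use that verb in a sentence?"}
record = {"id": "tscc-001", "text": PLACEHOLDER}
print(repopulate(record, corpus)["text"])
```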
> **Note**
> Please cite both datasets when using the data in your research. See `data/0_datasets/tscc/` and `data/0_datasets/uptake/`.
Download the pre-trained models into `downloads/models/`.

```
python -m src.parlai.scripts.download_models downloads/ blender/blender_90M blender/blender_400Mdistill blender/blender_3B blender/blender_9B
```
Run a Blender model on the data. For example:

```
python -m src.parlai.scripts.run -t TSCC -d data/0_datasets/tscc/ -M downloads/models -m blender/blender_9B -O results/
python -m src.parlai.scripts.run -t EduUptake -d data/0_datasets/uptake/ -M downloads/models -m blender/blender_9B -O results/
```
Run a GPT-3 model on the data. For example:

```
python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t TSCC -d data/0_datasets/tscc/ -O results/
python -m src.parlai.scripts.run -m src.parlai.models.gpt3:GPT3Davinci -o src/parlai/opts/gpt3.json -t EduUptake -d data/0_datasets/uptake/ -O results/
```
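The `GPT3Davinci` agent wraps a completion model behind ParlAI's agent interface, so at each turn the dialogue history must be rendered as a text prompt ending with the teacher's turn for the model to complete. A hypothetical sketch of such prompt construction (not the repository's actual implementation; the real prompt options live in `src/parlai/opts/gpt3.json`):

```python
def build_prompt(history, next_speaker="Teacher"):
    """Render a student-teacher dialogue history as a completion prompt.

    `history` is a list of (speaker, utterance) pairs. The prompt ends with
    an open teacher turn, so the completion model generates the teacher's
    reply. This format is an illustrative assumption.
    """
    lines = [f"{speaker}: {utterance}" for speaker, utterance in history]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)

prompt = build_prompt([
    ("Student", "I don't get the past tense."),
    ("Teacher", "Which verb is giving you trouble?"),
    ("Student", "The verb 'to go'."),
])
# The resulting prompt would then be sent to the OpenAI completion endpoint.
print(prompt)
```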
Detect outliers among human raters.

```
python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --per-rater
```
Estimate pedagogical abilities after outlier removal.

```
python -m src.stan.bradley_terry data/2_comparisons/items.jsonl --outliers data/2_comparisons/outliers.yaml
```
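The ability estimates come from a Bradley-Terry model of the pairwise comparative judgments: the probability that one response is preferred over another is a logistic function of the difference in their latent abilities. A minimal Python sketch of that likelihood (the repository fits the full model with Stan and Bayesian sampling):

```python
import math

def p_win(ability_i: float, ability_j: float) -> float:
    """Bradley-Terry: probability that item i is judged better than item j,
    given their latent abilities on a logit scale."""
    return 1.0 / (1.0 + math.exp(-(ability_i - ability_j)))

# Equal abilities imply a coin flip between the two responses.
print(p_win(0.0, 0.0))  # 0.5

# With the paper's helpfulness gap of ∆ ability = −0.75 for Blender,
# the human teacher's reply would be preferred about 68% of the time.
print(round(p_win(0.0, -0.75), 2))
```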
More information can be found in the paper. When using the data or code in your research or publications, please cite the paper as well.
```bibtex
@inproceedings{tack_ai_2022,
  title = {The {{AI Teacher Test}}: {{Measuring}} the {{Pedagogical Ability}} of {{Blender}} and {{GPT-3}} in {{Educational Dialogues}}},
  booktitle = {The 15th {{International Conference}} on {{Educational Data Mining}}},
  author = {Tack, Ana{\"i}s and Piech, Chris},
  year = {2022},
  pages = {accepted},
  copyright = {All rights reserved}
}
```
This research was funded by a fellowship of the BAEF (Belgian American Educational Foundation) and by a grant from Stanford HAI.
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added
  - Publication of data and code for the EDM 2022 conference