This repository contains code for using GROOViST: A Metric for Grounding Objects in Visual Storytelling—In proceedings of EMNLP 2023.
Evaluating the degree to which textual stories are grounded in the corresponding image sequences is essential for the visual storytelling task. We propose GROOViST, a metric based on insights obtained from existing open-source metrics (CLIPScore, RoViST-VG). Our analyses show that GROOViST effectively measures the extent to which a story is grounded in an image sequence.
Currently, GROOViST can be used off-the-shelf for evaluating <image-sequence, story> pairs from three visual storytelling datasets: VIST, AESOP, and VWP. For a new/custom dataset, the following steps can be adapted accordingly.
Install Python (e.g., 3.11) and the dependencies listed in requirements.txt, e.g., using:

```shell
pip install -r requirements.txt
```
For the sequence(s) of interest, GROOViST requires B image regions per image (e.g., B=10). Please refer to this doc for preparing them.
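As an illustration only, the per-image regions might be stored as a JSON file mapping each image id to its B detected regions. The schema below (keys `box`, the image ids, and the file name) is an assumption for this sketch, not the repo's actual format; follow the linked doc for the real preparation steps.

```python
import json

B = 10  # number of regions kept per image (example value from the README)

def make_region(x, y, w, h):
    """A single detected region: bounding-box coordinates.
    (Real pipelines would also store a visual feature vector per region.)"""
    return {"box": [x, y, w, h]}

# One entry per image in the sequence; each holds up to B regions.
# Image ids and coordinates here are placeholders.
sequence_regions = {
    "img_001": [make_region(10, 20, 50, 60) for _ in range(B)],
    "img_002": [make_region(0, 0, 100, 80) for _ in range(B)],
}

with open("sample_regions.json", "w") as f:
    json.dump(sequence_regions, f)
```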
For the sequence(s) of interest, GROOViST works with the noun phrases in the stories. Extract them using:

```shell
python extract_nphrases.py --input_file data/sample_stories.json --output_file data/sample_nphrases.json
```
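Conceptually, noun-phrase extraction groups determiner/adjective/noun runs that end in a noun. The toy chunker below only illustrates this idea on pre-tagged tokens; the repo's extract_nphrases.py presumably uses a full NLP pipeline rather than this hand-rolled heuristic.

```python
def extract_noun_phrases(tagged_tokens):
    """Collect contiguous DET/ADJ/NOUN runs that end in a NOUN."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
        else:
            if current and current[-1][1] == "NOUN":
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1] == "NOUN":
        phrases.append(" ".join(w for w, _ in current))
    return phrases

# Example story fragment with hand-assigned POS tags.
story = [("the", "DET"), ("little", "ADJ"), ("dog", "NOUN"),
         ("ran", "VERB"), ("to", "ADP"), ("a", "DET"), ("park", "NOUN")]
print(extract_noun_phrases(story))  # → ['the little dog', 'a park']
```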
Finally, compute GROOViST scores (here, for the VIST dataset):

```shell
python groovist.py --dataset VIST --input_file data/sample_nphrases.json --output_file data/sample_scores.json
```
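Once scoring finishes, the output file can be inspected like any JSON file. The schema sketched here (one score per story id) and the score values are assumptions for illustration; higher scores indicate stronger grounding.

```python
# Toy stand-in for the contents of data/sample_scores.json
# (assumed schema: story id -> GROOViST score; values are made up).
scores = {"story_001": 0.62, "story_002": -0.13}

# Rank stories from most to least grounded.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # → [('story_001', 0.62), ('story_002', -0.13)]
```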
🔗 If you find this work useful, please consider citing it:
```bibtex
@inproceedings{surikuchi-etal-2023-groovist,
    title = "{GROOV}i{ST}: A Metric for Grounding Objects in Visual Storytelling",
    author = "Surikuchi, Aditya and Pezzelle, Sandro and Fern{\'a}ndez, Raquel",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.202",
    pages = "3331--3339"
}
```