Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we make the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and establish a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existing models, and despite being simple, our results show that they can remarkably improve performance, especially on tail classes.
Watch our video below:
Example results from the GQA dataset.
This is a PyTorch implementation of Exploring Long Tail Visual Relationship Recognition with Large Vocabulary. The paper was accepted at ICCV 2021.
This code is for the GQA-LT and VG8K-LT datasets.
We borrowed the framework from Detectron.pytorch and Large-scale Visual Relationship Understanding for this project, so there is a lot of overlap between those repositories and ours.
Method | Backbone | many | medium | few | all |
---|---|---|---|---|---|
Baseline [1] | VGG16 | 70.5 | 36.2 | 3.5 | 51.9 |
Baseline [1] + ViLHub | VGG16 | 69.8 | 42.1 | 9.5 | 53.9 |
Focal Loss [2] | VGG16 | 69.6 | 38.0 | 4.7 | 52.1 |
Focal Loss [2] + ViLHub | VGG16 | 69.8 | 41.7 | 8.1 | 53.7 |
WCE [2] | VGG16 | 39.3 | 36.5 | 16.2 | 35.5 |
WCE [2] + ViLHub | VGG16 | 35.2 | 39.5 | 18.8 | 34.2 |
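For intuition about the ViLHub rows above: the hubless idea is to discourage a few head classes from dominating predictions, i.e., to push the model's output to be spread more evenly over the subject/object/relation vocabularies. The snippet below is only a minimal conceptual sketch of such a regularizer, added here for illustration; the function name, the uniform-prior penalty, and the scale value are assumptions, not the exact ViLHub loss used in this codebase.

```python
import torch
import torch.nn.functional as F

def hubless_penalty_sketch(logits, scale=1.0):
    """Illustrative hubless-style regularizer (NOT the exact ViLHub loss).

    logits: (batch, num_classes) scores for subjects, objects, or relations.
    Penalizes the deviation of the batch-averaged prediction distribution
    from a uniform prior, discouraging a few head classes from acting as hubs.
    """
    probs = F.softmax(logits, dim=1)          # per-sample class distribution
    mean_probs = probs.mean(dim=0)            # average prediction mass per class
    uniform = torch.full_like(mean_probs, 1.0 / logits.size(1))
    return scale * torch.sum((mean_probs - uniform) ** 2)

# Example usage: add the penalty on top of a standard cross-entropy loss.
# loss = F.cross_entropy(logits, labels) + hubless_penalty_sketch(logits, scale=0.1)
```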
- Python 3
- Python packages
- pytorch 0.4.0 or 0.4.1.post2 (not guaranteed to work on newer versions)
- torchvision 0.1.8
- cython
- matplotlib
- numpy
- scipy
- opencv
- pyyaml
- packaging
- pycocotools
- tensorboardX
- tqdm
- pillow
- scikit-image
- gensim
- An NVIDIA GPU and CUDA 8.0 or higher. Some operations only have a GPU implementation.
To make things easier, we provide the environment file `environment.yml`, which was created by running `conda env export -f environment.yml`. To clone the environment, simply run `conda env create -f environment.yml` from the project root directory.
Compile the CUDA code in the Detectron submodule and in the repo:
```
cd $ROOT/lib
sh make.sh
```
Create a data folder at the top-level directory of the repository:
```
# ROOT=path/to/cloned/repository
cd $ROOT
mkdir data
```
Download the GQA-LT annotations here. Unzip them under the `data` folder; you should see a `gvqa` folder there. It contains a seed folder called `seed0` with the .json annotations expected by the dataloader used in this repo.
Download the VG8K-LT annotations here. Unzip them under the `data` folder; you should see a `vg8k` folder there. It contains a seed folder called `seed3` with the .json annotations expected by the dataloader used in this repo.
Create a folder named `word2vec_model` under `data`. Download the Google word2vec vocabulary from here. Unzip it under the `word2vec_model` folder and you should see `GoogleNews-vectors-negative300.bin` there.
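As an optional sanity check (not part of the official setup), you can verify that the vectors load correctly using gensim, which is already listed in the requirements; the path below assumes the layout described above.

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News word2vec vectors (binary format).
w2v = KeyedVectors.load_word2vec_format(
    'data/word2vec_model/GoogleNews-vectors-negative300.bin', binary=True)

print(w2v.vector_size)                    # expected: 300
print(w2v.similarity('rabbit', 'grass'))  # any float here means loading worked
```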
Create a folder for the GQA images:
```
# ROOT=path/to/cloned/repository
cd $ROOT/data/gvqa
mkdir images
```
Download the GQA images from here.
Create a folder for the Visual Genome images:
```
# ROOT=path/to/cloned/repository
cd $ROOT/data/vg8k
mkdir VG_100K
```
Download the Visual Genome images from the official page. Unzip all images (part 1 and part 2) into `VG_100K/`. There should be a total of 108249 files.
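As an optional sanity check (not an official step), you can confirm the image count from Python; the path below assumes the layout described above.

```python
import os

num_images = len(os.listdir('data/vg8k/VG_100K'))
print(num_images)  # should print 108249 if both parts were unzipped correctly
```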
Download the pre-trained object detection models here. Unzip it under the root directory and you should see a `detection_models` folder there.
Download our trained models here. Unzip it under the root folder and you should see a `trained_models` folder there.
The final directories for data and detection models should look like:
```
|-- data
|   |-- vg8k
|   |   |-- VG_100K      <-- (contains Visual Genome images)
|   |   |-- seed3        <-- (contains annotations)
|   |   |   |-- rel_annotations_train.json
|   |   |   |-- rel_annotations_val.json
|   |   |   |-- ...
|   |-- gvqa
|   |   |-- images       <-- (contains GQA training images)
|   |   |-- seed0        <-- (contains annotations)
|   |   |   |-- annotations_train.json
|   |   |   |-- annotations_val.json
|   |   |   |-- ...
|   |-- word2vec_model
|   |   |-- GoogleNews-vectors-negative300.bin
|-- trained_models
|   |-- e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only
|   |   |-- gvqa
|   |   |   |-- Mar02-02-16-02_gpu214-10_step_with_prd_cls_v3
|   |   |   |   |-- ckpt
|   |   |   |   |   |-- best.pth
|   |-- ...
```
DO NOT CHANGE anything in the provided config files (configs/xx/xxxx.yaml), even if you want to test with fewer or more than 8 GPUs. Use the environment variable `CUDA_VISIBLE_DEVICES` to control how many and which GPUs to use. Remove the `--multi-gpu-test` flag for single-GPU inference.
NOTE: Evaluating on the GQA test set may require at least 64 GB of RAM.
We use the following evaluation metrics:
- Per-class accuracy (sbj, obj, rel)
- Overall accuracy (sbj, obj, rel)
- Overall triplet accuracy
- Accuracy over frequency bands (many, medium, few, and all) using exact matching
- Accuracy over frequency bands (many, medium, few, and all) using synset matching
- Average word similarity between GT and detection for [word2vec_gn, word2vec_vg, lch, wup, lin, path, res, jcn] similarities
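For reference, an illustrative (not taken from this repo) way to compute such word-level similarities is with gensim for word2vec and NLTK's WordNet interface for the WordNet-based measures; the example words and synsets below are assumptions.

```python
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# word2vec similarity between a ground-truth label and a predicted label
w2v = KeyedVectors.load_word2vec_format(
    'data/word2vec_model/GoogleNews-vectors-negative300.bin', binary=True)
print(w2v.similarity('rabbit', 'hare'))

# WordNet-based similarities between the corresponding synsets
gt, pred = wn.synset('rabbit.n.01'), wn.synset('hare.n.01')
print(gt.wup_similarity(pred))    # Wu-Palmer (wup)
print(gt.path_similarity(pred))   # shortest path (path)
print(gt.lch_similarity(pred))    # Leacock-Chodorow (lch)
```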
To evaluate the trained model on GQA-LT using ground-truth boxes and labels, run:
```
python tools/test_net_rel.py --dataset gvqa --cfg configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness.yaml --do_val --load_ckpt Outputs/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness/gvqa/Mar11-07-01-07_gpu210-18_step_with_prd_cls_v3/ckpt/best.pth --use_gt_boxes --use_gt_labels --seed 0
```
NOTE: Evaluating on the Visual Genome test set may require at least 64 GB of RAM.
We use the following evaluation metrics:
- Per-class accuracy (sbj, obj, rel)
- Overall accuracy (sbj, obj, rel)
- Overall triplet accuracy
- Accuracy over frequency bands (many, medium, few, and all) using exact matching
To evaluate the trained model on VG8K-LT using ground-truth boxes and labels, run:
```
python tools/test_net_rel.py --dataset vg8k --cfg configs/vg8k/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_hubness.yaml --do_val --load_ckpt Outputs/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_hubness/vg8k/Mar11-07-01-07_gpu210-18_step_with_prd_cls_v3/ckpt/best.pth --use_gt_boxes --use_gt_labels --seed 0
```
This section provides the command-line arguments to train our relationship detection models, given the pre-trained object detection models described above.

DO NOT CHANGE the variable `NUM_GPUS` in the provided config files (configs/xx/xxxx.yaml), even if you want to train with fewer or more than 8 GPUs. Use the environment variable `CUDA_VISIBLE_DEVICES` to control how many and which GPUs to use.

With the following command lines, the training results (models and logs) will be saved in `$ROOT/Outputs/xxx/`, where `xxx` is the name of the .yaml file used in the command, without the ".yaml" extension. If you want to test with your trained models, simply run the test commands described above, setting `--load_ckpt` to the path of your trained model.
To train our relationship network on GQA-LT using a VGG16 backbone with the ViL-Hubless loss, run:
```
python tools/train_net_step_rel.py --dataset gvqa --cfg configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness100k.yaml --nw 8 --use_tfboard --seed 0
```
To train our relationship network on GQA-LT using a VGG16 backbone without the ViL-Hubless loss (baseline), run:
```
python tools/train_net_step_rel.py --dataset gvqa --cfg configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_baseline.yaml --nw 8 --use_tfboard --seed 0
```
To train our relationship network on GQA-LT using a VGG16 backbone with the RelMix augmentation, run:
```
python tools/train_net_step_rel.py --dataset gvqa --cfg configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_baseline_relmix.yaml --nw 8 --use_tfboard --seed 0
```
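For intuition, RelMix follows the general Mixup recipe applied to relationship samples. The sketch below is illustrative only: it shows standard Mixup-style mixing of relation features and one-hot labels, and the function name, `alpha` parameter, and tensor shapes are assumptions rather than the exact RelMix implementation in this repo.

```python
import numpy as np
import torch
import torch.nn.functional as F

def relmix_style_mix(feats, labels, num_classes, alpha=1.0):
    """Illustrative Mixup-style mixing of relation features and labels
    (NOT the exact RelMix implementation from this repo).

    feats:  (batch, dim) relation features.
    labels: (batch,) relation class ids.
    """
    lam = float(np.random.beta(alpha, alpha))        # mixing coefficient
    perm = torch.randperm(feats.size(0))             # random pairing within the batch
    mixed_feats = lam * feats + (1.0 - lam) * feats[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_feats, mixed_labels
```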
To run models with different ViL-Hubless scales, create a new config file under `configs/gvqa/` (by copying the file `configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness.yaml`) and change the variable `TRAIN.HUBNESS_SCALE` to the desired value. Also confirm that the ViL-Hubless loss is activated by making sure the variable `TRAIN.HUBNESS` is set to `True`.
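If you prefer to script the copy-and-edit step, the sketch below uses PyYAML (already in the requirements); the output file name is hypothetical and it assumes the config parses as plain YAML with the nested TRAIN section described above.

```python
import yaml

src = 'configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness.yaml'
dst = 'configs/gvqa/e2e_relcnn_VGG16_8_epochs_gvqa_y_loss_only_hubness_scale10k.yaml'  # hypothetical name

with open(src) as f:
    cfg = yaml.safe_load(f)

# Assumes the nested TRAIN.HUBNESS / TRAIN.HUBNESS_SCALE keys described above.
cfg['TRAIN']['HUBNESS'] = True
cfg['TRAIN']['HUBNESS_SCALE'] = 10000  # desired ViL-Hubless scale

with open(dst, 'w') as f:
    yaml.safe_dump(cfg, f)
```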
To train our relationship network on VG8K-LT using a VGG16 backbone with the ViL-Hubless loss, run:
```
python tools/train_net_step_rel.py --dataset vg8k --cfg configs/vg8k/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_hubness100k.yaml --nw 8 --use_tfboard --seed 3
```
To train our relationship network on VG8K-LT using a VGG16 backbone without the ViL-Hubless loss (baseline), run:
```
python tools/train_net_step_rel.py --dataset vg8k --cfg configs/vg8k/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_baseline.yaml --nw 8 --use_tfboard --seed 3
```
To train our relationship network on VG8K-LT using a VGG16 backbone with the RelMix augmentation, run:
```
python tools/train_net_step_rel.py --dataset vg8k --cfg configs/vg8k/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_baseline_relmix.yaml --nw 8 --use_tfboard --seed 3
```
To run models with different ViL-Hubless scales, create a new config file under `configs/vg8k/` (by copying the file `configs/vg8k/e2e_relcnn_VGG16_8_epochs_vg8k_y_loss_only_hubness.yaml`) and change the variable `TRAIN.HUBNESS_SCALE` to the desired value. Also confirm that the ViL-Hubless loss is activated by making sure the variable `TRAIN.HUBNESS` is set to `True`.
This repository uses code based on the Large-scale Visual Relationship Understanding source code by Zhang Ji, as well as code from the Detectron.pytorch repository by Roy Tseng.
If you use this code in your research, please use the following BibTeX entry.
```
@misc{abdelkarim2020longtail,
    title={Exploring Long Tail Visual Relationship Recognition with Large Vocabulary},
    author={Sherif Abdelkarim and Aniket Agarwal and Panos Achlioptas and Jun Chen and Jiaji Huang and Boyang Li and Kenneth Church and Mohamed Elhoseiny},
    year={2020},
    eprint={2004.00436},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```