[CVPR 24] This repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloading the trained model checkpoints, and example notebooks and a Gradio demo that show how to use the model.

xk-huang/segment-caption-anything

Segment and Caption Anything

This repository contains the official implementation of "Segment and Caption Anything".

Project Page, Paper

[Figure: teaser, showing (a) SAM and (b) SCA]

tl;dr

  1. Despite the absence of semantic labels in the training data, SAM implies high-level semantics sufficient for captioning.
  2. SCA (b) is a lightweight augmentation of SAM (a) with the ability to generate regional captions.
  3. On top of the SAM architecture, we add a fixed pre-trained language model and an optimizable, lightweight hybrid feature mixer whose training is cheap and scalable.

[Figures: anything-mode-00, anything-mode-01, anything-mode-02, anything-mode-03]
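
The split in point 3 between a fixed language model and an optimizable feature mixer can be illustrated with a minimal PyTorch sketch. This is not the repository's code: `ToyRegionCaptioner`, its layer sizes, and the single-layer stand-ins for the SAM image encoder and the language model are all hypothetical, and freezing the image encoder is an assumption made here for illustration. The sketch only shows the general pattern of freezing the large pre-trained parts and passing the remaining parameters to the optimizer.

```python
import torch
from torch import nn


class ToyRegionCaptioner(nn.Module):
    """Hypothetical toy model: frozen encoder and LM stand-ins, trainable mixer."""

    def __init__(self, feat_dim=32, lm_dim=64, vocab_size=100, num_queries=4):
        super().__init__()
        # Stand-in for a pre-trained image encoder (assumed frozen in this sketch).
        self.image_encoder = nn.Linear(16, feat_dim)
        # Trainable parts: query tokens, cross-attention mixer, projection to LM width.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.mixer = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(feat_dim, lm_dim)
        # Stand-in for the fixed pre-trained language model.
        self.language_model = nn.Linear(lm_dim, vocab_size)

        # Freeze the pre-trained stand-ins so they receive no gradient updates.
        for module in (self.image_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, patches):
        # patches: (batch, num_patches, 16)
        feats = self.image_encoder(patches)
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        # Queries attend to the image features (cross-attention).
        mixed, _ = self.mixer(q, feats, feats)
        # Output logits: (batch, num_queries, vocab_size)
        return self.language_model(self.proj(mixed))


model = ToyRegionCaptioner()
# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

logits = model(torch.randn(2, 10, 16))
n_train = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_train} of {n_total}")
```

In a real setup the frozen parts would hold almost all of the parameters, which is why training only the mixer is cheap; the toy sizes above do not reflect that ratio. For the actual training entry points and configuration, see docs/USAGE.md.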

News

  • [01/31/2024] Updated the paper and the supplementary material. Released code v0.0.2: bumped transformers to 4.36.2; added support for the Mistral series, Phi-2, and Zephyr; added experiments on SAM + Image Captioner + V-CoT, and more.
  • [12/05/2023] Released the paper, code v0.0.1, and the project page!

Environment Preparation

Please check docs/ENV.md.

Model Zoo

Please check docs/MODEL_ZOO.md.

Gradio Demo

Please check docs/DEMO.md.

Running Training and Inference

Please check docs/USAGE.md.

Experiments and Evaluation

Please check docs/EVAL.md.

License

The trained weights are licensed under the Apache 2.0 license.

Acknowledgement

We deeply appreciate these wonderful open-source projects: transformers, accelerate, deepspeed, detectron2, hydra, timm, gradio.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

@inproceedings{huang2024segment,
  title={Segment and caption anything},
  author={Huang, Xiaoke and Wang, Jianfeng and Tang, Yansong and Zhang, Zheng and Hu, Han and Lu, Jiwen and Wang, Lijuan and Liu, Zicheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13405--13417},
  year={2024}
}
