[NeurIPS2022] Official implementation of the paper 'Green Hierarchical Vision Transformer for Masked Image Modeling'.

GreenMIM

This is the official PyTorch implementation of the NeurIPS 2022 paper Green Hierarchical Vision Transformer for Masked Image Modeling. GreenMIM consists of two key designs, Group Window Attention and Sparse Convolution. It offers 2.7x faster pre-training and competitive performance on hierarchical vision transformers, e.g., Swin/Twins Transformers.

Group Attention Scheme.

Method Overview.
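The grouping idea can be sketched as follows. This is a toy illustration under stated assumptions, not the repository's actual implementation: the function `group_visible_tokens` and the group size are invented for this example. After masking, only the visible patch tokens are kept, and they are partitioned into equal-size groups so that each group can run dense attention like an ordinary window.

```python
import torch

def group_visible_tokens(tokens, mask, group_size):
    """Gather the visible (unmasked) tokens and split them into
    equal-size groups for dense attention.

    tokens: (N, C) patch tokens; mask: (N,) bool, True = masked out.
    Returns a (num_groups, group_size, C) tensor.
    """
    visible = tokens[~mask]                      # drop masked tokens entirely
    pad = (-visible.size(0)) % group_size        # pad so groups divide evenly
    if pad:
        visible = torch.cat([visible, visible.new_zeros(pad, visible.size(1))])
    return visible.view(-1, group_size, visible.size(1))

# Toy example: 16 tokens, 12 masked, groups of 2.
tokens = torch.randn(16, 8)
mask = torch.zeros(16, dtype=torch.bool)
mask[:12] = True                                 # mask the first 12 tokens
groups = group_visible_tokens(tokens, mask, group_size=2)
print(groups.shape)  # torch.Size([2, 2, 8]): two groups of two visible tokens
```

Because attention only ever runs over these small dense groups, the encoder never touches the masked tokens, which is where the pre-training speed-up comes from.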

Citation

If you find our work interesting or use our code/models, please cite:

@article{huang2022green,
  title={Green Hierarchical Vision Transformer for Masked Image Modeling},
  author={Huang, Lang and You, Shan and Zheng, Mingkai and Wang, Fei and Qian, Chen and Yamasaki, Toshihiko},
  journal={Thirty-Sixth Conference on Neural Information Processing Systems},
  year={2022}
}

News

  • 2023.01: We have refactored the structure of this codebase, supporting most, if not all, vision transformer backbones with various input resolutions. Check out our implementation of GreenMIM with Twins Transformer here.

Catalogs

  • Pre-trained checkpoints
  • Pre-training code for Swin Transformer and Twins Transformer
  • Fine-tuning code

Pre-trained Models

|                        | Swin-Base (Window 7x7) | Swin-Base (Window 14x14) | Swin-Large (Window 14x14) |
|------------------------|------------------------|--------------------------|---------------------------|
| Pre-trained checkpoint | Download               | Download                 | Download                  |
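A downloaded checkpoint can be inspected and loaded along these lines (a minimal sketch; the filename `greenmim_swin_base.pth` and the `'model'` key are assumptions, so check the actual keys of the checkpoint you downloaded):

```python
import torch

def load_pretrained_weights(path):
    """Load a pre-training checkpoint and return its state dict.

    Pre-training checkpoints commonly nest the weights under a
    'model' key (an assumption here); fall back to the raw dict.
    """
    ckpt = torch.load(path, map_location="cpu")
    return ckpt.get("model", ckpt)

# Hypothetical usage -- substitute the file you downloaded above:
# state_dict = load_pretrained_weights("greenmim_swin_base.pth")
# your_swin_model.load_state_dict(state_dict, strict=False)
```

`strict=False` is typically needed when transferring encoder weights into a fine-tuning model, since decoder parameters from pre-training have no counterpart there.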

Pre-training

The pre-training scripts are given in the scripts/ folder. Scripts whose names start with 'run' are for non-Slurm users; the others are for Slurm users.

For Non-Slurm Users

To pre-train a Swin-B on a single node with 8 GPUs:

PORT=23456 NPROC=8 bash scripts/run_greenmim_swin_base.sh

For Slurm Users

To pre-train a Swin-B on a single node with 8 GPUs:

bash scripts/srun_greenmim_swin_base.sh [Partition] [NUM_GPUS] 

Fine-tuning on ImageNet-1K

| Model                 | #Params | Pre-train Resolution | Fine-tune Resolution | Config | Acc@1 (%) |
|-----------------------|---------|----------------------|----------------------|--------|-----------|
| Swin-B (Window 7x7)   | 88M     | 224x224              | 224x224              | Config | 83.8      |
| Swin-L (Window 14x14) | 197M    | 224x224              | 224x224              | Config | 85.1      |

Currently, we directly use the code of SimMIM for fine-tuning; please follow their instructions to use the configs. NOTE that, due to limited computing resources, we use a batch size of 768 (48 x 16) for fine-tuning.

Acknowledgement

This code is based on the implementations of MAE, SimMIM, BEiT, SwinTransformer, Twins Transformer, and DeiT.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
