This repository summarizes work on Temporal Sentence Grounding in Videos (TSGV) from 2017 to the present. TSGV is also known as Natural Language Video Localization (NLVL) and Video Moment Retrieval (VMR). Given a natural language description, the task is to localize, within an untrimmed long video, the video segment described by that sentence.
- Datasets
- Related Work
- References
| Dataset | Video Source | Domain |
| --- | --- | --- |
| TACoS | Kitchen | Cooking |
| Charades-STA | Homes | Indoor Activity |
| ActivityNet Captions | YouTube | Open |
| DiDeMo | Flickr | Open |
| MAD | Movie | Open |
- A survey of temporal activity localization via language in untrimmed videos. in ICCST 2020
- A survey on natural language video localization. in ArXiv 2021
- A survey on temporal sentence grounding in videos. in ArXiv 2021
- The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions. in ArXiv 2022
Sliding window-based methods adopt multi-scale sliding windows (SW) to generate proposal candidates (a minimal sketch follows the paper list below).
- CTRL: TALL: Temporal activity localization via language query. in ICCV 2017. code
- MCN: Localizing moments in video with natural language. in ICCV 2017
- ROLE: Crossmodal moment localization in videos. in ACM MM 2018
- ACRN: Attentive moment retrieval in videos. in SIGIR 2018
- MAC: MAC: Mining activity concepts for language-based temporal localization. in WACV 2019. code
- MCF: Multi-modal circulant fusion for video-to-language and backward. in IJCAI 2018
- MLLC: Localizing moments in video with temporal language. in EMNLP 2018
- TCMN: Exploiting temporal relationships in video moment localization with natural language. in ACM MM 2019. code
- ASST: An attentive sequence to sequence translator for localizing video clips by natural language. in TMM 2020. code
- SLTA: Cross-modal video moment retrieval with spatial and language-temporal attention. in ICMR 2019. code
- MMRG: Multi-modal relational graph for cross-modal video moment retrieval. in CVPR 2021
- I$^2$N: Interaction-integrated network for natural language moment localization. in TIP 2021
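
As a rough illustration of the sliding-window scheme above (not taken from any specific paper; the window sizes and stride ratio are placeholder values), the sketch below enumerates multi-scale candidate segments over a video of `num_units` snippets; each candidate would then be paired with the query and scored by a matching module.

```python
# Hypothetical multi-scale sliding-window proposal generation: enumerate
# windows of several lengths, each shifted by a fixed fraction of its length.
def sliding_window_proposals(num_units, window_sizes=(16, 32, 64), stride_ratio=0.5):
    proposals = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, max(num_units - w, 0) + 1, stride):
            proposals.append((start, min(start + w, num_units)))  # [start, end) in snippet indices
    return proposals

# Example: a 128-snippet video yields overlapping candidates at three scales.
print(len(sliding_window_proposals(128)))
```
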
Proposal-generated (PG) methods alleviate the computational burden of SW-based methods by generating proposals conditioned on the query (see the sketch after this list).
- Text-to-Clip: Text-to-clip video retrieval with early fusion and re-captioning. in ArXiv 2018
- QSPN: Multilevel language and vision integration for text-to-clip retrieval. in AAAI 2019
- SAP: Semantic proposal for activity localization in videos via sentence query. in AAAI 2019
- BPNet: Boundary proposal network for two-stage natural language video localization. in AAAI 2021
- APGN: Adaptive proposal generation network for temporal sentence localization in videos. in EMNLP 2021
- LP-Net: Natural language video localization with learnable moment proposals. in EMNLP 2021
- CMHN: Video moment localization via deep cross-modal hashing. in TIP 2021
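
A minimal sketch of the query-conditioned idea above, under the simplifying assumption that proposals can be read off thresholded snippet-query similarities; the threshold, feature sizes, and function name are illustrative, not from any cited method.

```python
# Hypothetical query-conditioned proposal generation: contiguous runs of
# snippets whose similarity to the query exceeds a threshold become proposals,
# so the candidate set depends on the query rather than on a fixed window grid.
import torch

def query_conditioned_proposals(snippet_feats, query_feat, threshold=0.5):
    # snippet_feats: (T, D), query_feat: (D,); cosine similarity per snippet
    sims = torch.nn.functional.cosine_similarity(snippet_feats, query_feat.unsqueeze(0), dim=1)
    keep = (sims > threshold).tolist()
    proposals, start = [], None
    for t, flag in enumerate(keep + [False]):   # sentinel closes the last run
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t))        # [start, end) snippet span
            start = None
    return proposals

print(query_conditioned_proposals(torch.randn(20, 8), torch.randn(8)))
```
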
Anchor-based methods incorporate proposal generation into answer prediction and maintain the proposals with various learning modules (see the sketch after this list).
- TGN: Temporally grounding natural sentence in video. in EMNLP 2018. code
- MAN: MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. in CVPR 2019
- SCDM: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. in NeurIPS 2019. code
- CMIN: Cross-modal interaction networks for query-based moment retrieval in videos. in SIGIR 2019. code
- SCDM$^*$: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. in TPAMI 2020
- CMIN$^*$: Moment retrieval via cross-modal interaction networks with query reconstruction. in TIP 2020
- CBP: Temporally grounding language queries in videos by contextual boundary-aware prediction. in AAAI 2020. code
- FIAN: Fine-grained iterative attention network for temporal language localization in videos. in ACM MM 2020
- HDRR: Hierarchical deep residual reasoning for temporal moment localization. in ACM MM Asia 2021
- MIGCN: Multi-modal interaction graph convolutional network for temporal language localization in videos. in TIP 2021
- CSMGAN: Jointly cross- and self-modal graph attention network for query-based moment localization. in ACM MM 2020
- RMN: Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. in COLING 2020
- IA-Net: Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. in EMNLP 2021
- DCT-Net: DCT-Net: A deep co-interactive transformer network for video temporal grounding. in IVC 2021
- 2D-TAN: Learning 2d temporal adjacent networks for moment localization with natural language. in AAAI 2020. code
- MATN: Multi-stage aggregated transformer network for temporal language localization in videos. in CVPR 2021
- SMIN: Structured multi-level interaction network for video moment localization via language query. in CVPR 2021
- RaNet: Relation-aware video reading comprehension for temporal language grounding. in EMNLP 2021. code
- FVMR: Fast video moment retrieval. in ICCV 2021
- MS-2D-TAN: Multi-scale 2d temporal adjacency networks for moment localization with natural language. in TPAMI 2021. code
- PLN: Progressive localization networks for language-based moment localization. in ArXiv 2021
- CLEAR: Coarse-to-fine semantic alignment for cross-modal moment localization. in TIP 2021
- STCM-Net: STCM-Net: A symmetrical one-stage network for temporal language localization in videos. in Neurocomputing 2022
- VLG-Net: VLG-Net: Video-language graph matching network for video grounding. in ICCV Workshop 2021
- SV-VMR: Diving into the relations: Leveraging semantic and visual structures for video moment retrieval. in ICME 2021
- MMN: Negative sample matters: A renaissance of metric learning for temporal grounding. in AAAI 2022. code
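
The sketch below illustrates one common way anchor-style methods lay out their candidates, loosely in the spirit of 2D temporal-map approaches such as 2D-TAN: every (start, end) snippet pair with start ≤ end is an anchor to be scored jointly. The map size is arbitrary and the scoring module is omitted; this is an illustration, not any paper's implementation.

```python
# Hypothetical 2D anchor map: the upper triangle marks valid (start, end)
# candidate moments; a learned module would later score each valid cell
# against the query.
import torch

def temporal_anchor_map(num_clips):
    valid = torch.triu(torch.ones(num_clips, num_clips)).bool()   # True iff start <= end
    anchors = [(s, e) for s in range(num_clips) for e in range(s, num_clips)]
    return valid, anchors

valid, anchors = temporal_anchor_map(8)
print(int(valid.sum()), "candidate anchors:", len(anchors))
```
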
Regression-based methods compute a time pair $(t_s, t_e)$ and compare it with the ground-truth pair $(\tau_s, \tau_e)$ for model optimization (see the sketch after this list).
- ABLR: To find where you talk: Temporal sentence localization in video with attention based location regression. in AAAI 2019. code
- ExCL: ExCL: Extractive Clip Localization Using Natural Language Descriptions. in NAACL 2019
- DEBUG: DEBUG: A dense bottom-up grounding approach for natural language video localization. in EMNLP 2019
- GDP: Rethinking the bottom-up framework for query-based video localization. in AAAI 2020
- CMA: A simple yet effective method for video temporal grounding with cross-modality attention. in ArXiv 2020
- DRN: Dense regression network for video grounding. in CVPR 2020. code
- LGI: Local-global video-text interactions for temporal grounding. in CVPR 2020. code
- CPNet: Proposal-free video grounding with contextual pyramid network. in AAAI 2021
- DeNet: Embracing uncertainty: Decoupling and de-bias for robust temporal grounding. in CVPR 2021
- SSMN: Single-shot semantic matching network for moment localization in videos. in ACM TOMCCAP 2021
- HVTG: Hierarchical visual-textual graph for temporal activity localization via language. in ECCV 2020. code
- PMI: Learning modality interaction for temporal sentence localization and event captioning in videos. in ECCV 2020
- DRFT: End-to-end multi-modal video temporal grounding. in NeurIPS 2021
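
A minimal sketch of the regression-based objective: a small head predicts a normalized pair $(t_s, t_e)$ that is compared with the ground-truth $(\tau_s, \tau_e)$. The tiny regressor and the smooth-L1 boundary loss are illustrative assumptions, not any cited architecture.

```python
# Hypothetical boundary regressor: maps a fused video-query feature to a
# normalized (t_s, t_e) pair and optimizes a smooth-L1 loss against the
# ground-truth (tau_s, tau_e).
import torch
import torch.nn as nn

class BoundaryRegressor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, fused_feat):                    # fused feature: (B, dim)
        return torch.sigmoid(self.head(fused_feat))   # (B, 2): normalized (t_s, t_e)

model = BoundaryRegressor()
fused = torch.randn(4, 256)
target = torch.tensor([[0.1, 0.4], [0.2, 0.9], [0.0, 0.3], [0.5, 0.8]])  # (tau_s, tau_e)
loss = nn.functional.smooth_l1_loss(model(fused), target)
loss.backward()
print(loss.item())
```
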
Span-based methods predict the probability of each video snippet/frame being the start or end position of the target moment (see the sketch after this list).
- ExCL: ExCL: Extractive Clip Localization Using Natural Language Descriptions. in NAACL 2019
- L-Net: Localizing natural language in videos. in AAAI 2019
- VSLNet: Span-based localizing network for natural language video localization. in ACL 2020. code
- VSLNet-L$^*$: Natural language video localization: A revisit in span-based question answering framework. in TPAMI 2021
- TMLGA: Proposal-free temporal moment localization of a natural-language query in video using guided attention. in WACV 2020. code
- SeqPAN: Parallel attention network with sequence matching for video grounding. in Findings of ACL 2021. code
- CPN: Cascaded prediction network via segment tree for temporal video grounding. in CVPR 2021
- IVG: Interventional video grounding with dual contrastive learning. in CVPR 2021. code
- Local-enhanced interaction for temporal moment localization. in ICMR 2021
- CI-MHA: Cross interaction network for natural language guided video moment retrieval. in SIGIR 2021
- MQEI: Multi-level query interaction for temporal language grounding. in TITS 2021
- ACRM: Frame-wise cross-modal matching for video moment retrieval. in TMM 2021
- ABIN: Temporal textual localization in video via adversarial bi-directional interaction networks. in TMM 2021
- CSTI: Collaborative spatial-temporal interaction for language-based moment retrieval. in WCSP 2021
- DORi: DORi: Discovering object relationships for moment localization of a natural language query in a video. in WACV 2021
- CBLN: Context-aware biaffine localizing network for temporal sentence grounding. in CVPR 2021. code
- PEARL: Natural language video moment localization through query-controlled temporal convolution. in WACV 2022
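
A minimal sketch of span-based prediction, mirroring extractive question answering: per-snippet start/end logits are turned into probabilities, and the highest-scoring valid span (start ≤ end) is selected. Shapes and names are illustrative, not from any cited model.

```python
# Hypothetical span selection: combine start/end probabilities into a joint
# score matrix, mask out invalid spans (end before start), and take the argmax.
import torch

def best_span(start_logits, end_logits):
    # start_logits, end_logits: (T,) scores for each snippet being the boundary
    p_start = torch.softmax(start_logits, dim=0)
    p_end = torch.softmax(end_logits, dim=0)
    joint = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))  # keep start <= end
    idx = torch.argmax(joint).item()
    T = start_logits.size(0)
    return divmod(idx, T)   # (start_index, end_index)

print(best_span(torch.randn(16), torch.randn(16)))
```
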
RL-based methods formulate TSGV as a sequential decision-making problem and solve it with deep reinforcement learning (see the sketch after this list).
- RWM-RL: Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. in AAAI 2019
- SM-RL: Language-driven temporal activity localization: A semantic matching reinforcement learning model. in CVPR 2019
- TSP-PRL: Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. in AAAI 2020
- AVMR: Adversarial video moment retrieval by jointly modeling ranking and localization. in ACM MM 2020
- STRONG: STRONG: Spatio-temporal reinforcement learning for cross-modal video moment localization. in ACM MM 2020
- TripNet: Tripping through time: Efficient localization of activities in videos. in BMVC 2020
- MABAN: MABAN: Multi-agent boundary-aware network for natural language moment retrieval. in TIP 2021
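
A minimal sketch of the sequential decision-making view: an agent iteratively shifts or scales a temporal window and is rewarded by the IoU gain with the ground-truth moment. The random policy and the action set below are placeholders, not any cited method's design.

```python
# Hypothetical grounding episode: a (random) policy adjusts a normalized
# [start, end] window; the reward at each step is the IoU improvement.
import random

ACTIONS = {"shift_left": (-0.05, -0.05), "shift_right": (0.05, 0.05),
           "expand": (-0.05, 0.05), "shrink": (0.05, -0.05), "stop": (0.0, 0.0)}

def iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

window, target = [0.2, 0.6], (0.35, 0.75)
for step in range(10):                                     # one episode
    name, (ds, de) = random.choice(list(ACTIONS.items()))  # a learned policy would choose here
    if name == "stop":
        break
    new_window = [min(max(window[0] + ds, 0.0), 1.0), min(max(window[1] + de, 0.0), 1.0)]
    reward = iou(new_window, target) - iou(window, target)  # reward = IoU improvement
    window = new_window
print("final window:", window, "IoU:", round(iou(window, target), 3))
```
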
- FIFO: Find and focus: Retrieve and localize video events with natural language queries. in ECCV 2018. code
- DPIN: Dual path interaction network for video moment localization. in ACM MM 2020
- SSCS: Support-set based cross-supervision for video grounding. in ICCV 2021
- DepNet: Dense events grounding in video. in AAAI 2021
- SNEAK: SNEAK: Synonymous sentences-aware adversarial attack on natural language video localization. in ArXiv 2021. code
- BSP: Boundary-sensitive pre-training for temporal localization in videos. in ICCV 2021. code
- GTR: On pursuit of designing multi-modal transformer for video grounding. in EMNLP 2021. code
Under the weakly-supervised setting, TSGV methods require only video-query pairs, without start/end time annotations (a minimal sketch follows at the end of the lists below).
- TGA: Weakly supervised video moment retrieval from text queries. in CVPR 2019
- WSLLN: WSLLN: Weakly supervised natural language localization networks. in EMNLP 2019
- Coarse-to-Fine: Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. in ArXiv 2020
- VLANet: VLANet: Video-language alignment network for weakly-supervised video moment retrieval. in ECCV 2020
- BAR: Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. in ACM MM 2020
- CCL: Counterfactual contrastive learning for weakly-supervised vision-language grounding. in NeurIPS 2020
- AsyNCE: AsyNCE: Disentangling false-positives for weakly-supervised video grounding. in ACM MM 2021
- Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. in ACM MM 2021
- FSAN: Fine-grained semantic alignment network for weakly supervised temporal language grounding. in Findings of EMNLP 2021
- CRM: Cross-sentence temporal and semantic relations in video activity localisation. in ICCV 2021
- LCNet: Local correspondence network for weakly supervised temporal sentence grounding. in TIP 2021
- Regularized two granularity loss function for weakly supervised video moment retrieval. in TMM 2021
- WSTAN: Weakly supervised temporal adjacent network for language grounding. in TMM 2021
- LoGAN: LoGAN: Latent graph co-attention network for weakly-supervised video moment retrieval. in WACV 2021
- Weakly supervised dense event captioning in videos. in NeurIPS 2018
- SCN: Weakly-supervised video moment retrieval via semantic completion network. in AAAI 2020
- EC-SL: Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. in CVPR 2021
- MARN: Weakly-supervised multilevel attentional reconstruction network for grounding textual queries in videos. in ArXiv 2020
- Towards bridging video and language by caption generation and sentence localization. in ACM MM 2021
- RTBPN: Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. in ACM MM 2020. code
- S$^4$TLG: Self-supervised learning for semi-supervised temporal language grounding. in ArXiv 2021
- PSVL: Zero-shot natural language video localization. in ICCV 2021. code
- Learning video moment retrieval without a single annotated video. in TCSVT 2021
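
To make the weakly-supervised setting above concrete, the sketch below follows a common multiple-instance-learning-style recipe (not any single cited paper): proposal-query scores are aggregated into a video-level matching score, and training uses only the video-query pairing, via a ranking loss against a mismatched query. Feature sizes, the margin, and the aggregation choice are illustrative assumptions.

```python
# Hypothetical MIL-style weak supervision: no moment boundaries are used, only
# whether a query belongs to a video. Proposal scores are softly aggregated
# into a video-level score, and the matched query must outscore a negative one.
import torch
import torch.nn.functional as F

def video_query_score(proposal_feats, query_feat):
    # proposal_feats: (N, D) candidate moment features; query_feat: (D,)
    scores = F.cosine_similarity(proposal_feats, query_feat.unsqueeze(0), dim=1)  # (N,)
    weights = torch.softmax(scores, dim=0)          # soft attention over proposals
    return (weights * scores).sum()                 # video-level alignment score

props = torch.randn(30, 128, requires_grad=True)
pos_q, neg_q = torch.randn(128), torch.randn(128)
# Ranking loss: the paired query should score higher than a mismatched one by a margin.
loss = F.relu(0.2 - video_query_score(props, pos_q) + video_query_score(props, neg_q))
loss.backward()
print(loss.item())
```
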