TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

1University of Melbourne    2HeyGen Research    3Nanyang Technological University
*Work done during an internship at HeyGen Research.    Corresponding author.
This work has been deployed to production. For more related research, please visit HeyGen Research and HeyGen Avatar-V.

Highlights

Below are TransVLM detection results across all evaluated datasets. Non-transition segments are played at normal speed; transition segments are slowed to 1/5× for clarity. Each video overlays four colored bars on the frames, indicating per-method detection results.

Top Left: Ground Truth · Top Right: TransNetV2 [9] · Bottom Left: AutoShot [1] · Bottom Right: TransVLM (Ours)

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions because it formulates the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous cut points, STD explicitly detects the continuous temporal segments of transitions. To tackle this task, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike standard VLMs, which predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine that synthesizes diverse transition videos for robust training, and we build a comprehensive STD benchmark. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs.

Teaser

Limitations of existing shot transition detection methods

Limitations of existing shot transition detection methods. Predicted transitions are denoted by colored lines: gray for ground truth, red for a state-of-the-art SBD method (AutoShot [1]), blue for a top-tier VLM (Qwen3-VL [2]), and green for our proposed TransVLM. SBD models excel at detecting normal cuts (a) but fail on gradual and special transitions (c, d, e), whereas VLMs perceive gradual and special transitions yet miss normal cuts. Neither can detect subtle cuts (b). In contrast, TransVLM robustly detects all of these transitions.

Framework

TransVLM Framework Overview

Overview of the TransVLM framework. Our framework comprises three core components. (a) Model Architecture: we explicitly inject optical flow as a motion prior via a parameter-efficient strategy. By exclusively expanding the input projection weights of the Vision Patch Embed Layer, the model directly processes concatenated frames of color and optical flow. Crucially, this extracts joint appearance-motion representations without inflating the visual token sequence length, thereby incurring zero additional computational burden on the language model. (b) Data Synthesis: given an arbitrary sequence of clean shots, our scalable data engine automatically synthesizes videos with diverse transitions, simultaneously generating precisely aligned segment-level JSON labels for model training. (c) Arbitrary Video Inference: a temporal sliding-window strategy partitions the input stream into overlapping segments to generate local predictions, which are subsequently aggregated into a continuous global prediction via temporal Non-Maximum Suppression (NMS).
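To make the channel-expansion idea in (a) concrete, here is a minimal PyTorch sketch. It assumes a ViT-style vision tower whose patch embedding is a single Conv2d over 3-channel RGB input (actual VLM patch embeddings may differ, e.g. 3D convolutions over frame stacks), and the helper name `expand_patch_embed` is ours, not from the released code. The pretrained RGB weights are copied and the new flow channels are zero-initialized, so the expanded layer initially reproduces the pretrained behavior while keeping the token grid, and hence the visual sequence length, unchanged.

```python
import torch
import torch.nn as nn

def expand_patch_embed(conv: nn.Conv2d, extra_in: int = 3) -> nn.Conv2d:
    """Widen a pretrained patch-embed conv from C to C + extra_in input
    channels (illustrative helper; not the released implementation)."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_in,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()                               # flow channels start at zero
        new_conv.weight[:, : conv.in_channels] = conv.weight  # reuse pretrained RGB weights
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# 6-channel input: RGB frames concatenated with a 3-channel flow rendering.
patch_embed = expand_patch_embed(nn.Conv2d(3, 1152, kernel_size=14, stride=14))
x = torch.randn(1, 6, 448, 448)    # [RGB | optical flow] along the channel axis
print(patch_embed(x).shape)        # torch.Size([1, 1152, 32, 32]), same grid as RGB-only
```

The aggregation step in (c) can be sketched similarly. The greedy temporal NMS below assumes window-level predictions of the form (start_s, end_s, score) and a temporal-IoU threshold of 0.5; both the representation and the threshold are illustrative assumptions rather than the paper's exact settings.

```python
def temporal_nms(segments, iou_thresh=0.5):
    """Greedy temporal NMS: keep the highest-scoring segment, drop any
    lower-scoring segment whose temporal IoU with a kept one exceeds
    iou_thresh (threshold value is illustrative)."""
    kept = []
    for start, end, score in sorted(segments, key=lambda s: s[2], reverse=True):
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0.0, min(end, ke) - max(start, ks))
            union = (end - start) + (ke - ks) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append((start, end, score))
    return sorted(kept)   # continuous global prediction, ordered by start time
```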

STD Benchmark

Unlike SBD datasets, which provide only ambiguous point-wise shot boundaries, our benchmark re-annotates six public datasets and combines them with synthesized videos, yielding a comprehensive STD benchmark of 5,215 videos (~100.3 hours, 45,239 transitions) that covers diverse transition types. All transitions carry segment-level ground-truth labels. We categorize transitions by duration: Cut (< 0.1 s), Normal (0.1–1 s), and Long (> 1 s).

| Dataset | Domain | Original Label | Videos | Video Dur. (h) | Transitions | Transition Dur. (s) | Cut | Normal | Long |
|---|---|---|---|---|---|---|---|---|---|
| RAI [3] | TV Shows | Point | 10 | 1.64 | 1,036 | 304.4 | 757 | 188 | 91 |
| BBC [4] | Documentaries | Point | 11 | 9.00 | 4,943 | 703.7 | 4,255 | 582 | 106 |
| AutoShot (Test) [1] | Short Videos | Point | 200 | 2.01 | 2,065 | 545.0 | 1,008 | 1,004 | 53 |
| ClipShots (Test) [5] | Web Videos | Point | 500 | 32.85 | 6,923 | 1,548.1 | 4,798 | 1,830 | 295 |
| MovieShots2 (Test) [6] | Movies | Point | 282 | 20.72 | 14,767 | 9,566.4 | 13,436 | 710 | 621 |
| SportsShot (Val) [7] | Sports | Point | 240 | 9.37 | 5,045 | 944.4 | 3,899 | 1,064 | 82 |
| STD Synthesis Data | Web Videos | Segment | 3,972 | 24.67 | 10,460 | 18,249.3 | 3,593 | 1,615 | 5,252 |
| STD Benchmark (total) | Diverse | Segment | 5,215 | 100.26 | 45,239 | 31,861.3 | 31,746 | 6,993 | 6,500 |
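As an illustration of the segment-level annotation format and the duration buckets above, here is a small Python sketch. The field names in the label dict are hypothetical (the released JSON schema may differ); only the duration thresholds come from the benchmark definition.

```python
# Hypothetical segment-level label (field names are illustrative; the
# benchmark's released JSON schema may differ).
label = {
    "video": "clip_00042.mp4",
    "transitions": [
        {"start_s": 3.84, "end_s": 4.36, "type": "dissolve"},
        {"start_s": 9.12, "end_s": 9.16, "type": "hard_cut"},
    ],
}

def duration_category(start_s: float, end_s: float) -> str:
    """Bucket a transition by duration using the benchmark's thresholds:
    Cut (< 0.1 s), Normal (0.1-1 s), Long (> 1 s)."""
    d = end_s - start_s
    if d < 0.1:
        return "Cut"
    return "Normal" if d <= 1.0 else "Long"

for t in label["transitions"]:
    print(duration_category(t["start_s"], t["end_s"]))  # Normal, then Cut
```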

Comparison with Baselines

Evaluation Metrics

We propose a rigorous evaluation suite comprising both segment-level and frame-level metrics, parameterized by a temporal tolerance τ (0.0–0.5 s) to handle annotator subjectivity at gradual transition boundaries: Segment F1 (instance retrieval), Frame F1 (temporal coverage), Absolute Boundary Error (ABE, in seconds), and Real-Time Factor (RTF, inference efficiency).
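To fix ideas, the sketch below shows one plausible way to compute tolerance-parameterized Segment F1; the exact matching rule used by the benchmark may differ. It assumes a prediction matches an unused ground-truth segment when both of its boundaries lie within τ seconds of that segment's boundaries, with one-to-one greedy matching; the function name `segment_f1` and the example values are ours.

```python
def segment_f1(preds, gts, tau=0.25):
    """Segment-level F1 under boundary tolerance tau (seconds).
    Matching rule is an assumption, not the paper's exact definition:
    a prediction matches an unused ground-truth segment when both of
    its boundaries fall within tau of the ground truth's."""
    matched, tp = set(), 0
    for ps, pe in preds:
        for i, (gs, ge) in enumerate(gts):
            if i not in matched and abs(ps - gs) <= tau and abs(pe - ge) <= tau:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# One hit within tau, one prediction and one ground truth unmatched -> P = R = F1 = 0.5.
print(segment_f1([(1.0, 1.5), (7.0, 7.4)], [(1.1, 1.6), (9.0, 9.3)], tau=0.25))
```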

Quantitative Comparison

We compare TransVLM against five paradigms: PySceneDetect [8], TransNetV2 [9], AutoShot [1], the Gemini series [10], and the Qwen3-VL series [2]. We report segment- and frame-level metrics averaged over the defined τ values, separately for the public-data and synthetic-data portions of the benchmark; lower is better for ABE and RTF, and RTF (reported once per method) appears in the synthetic-data table. TransVLM substantially outperforms heuristic methods, specialized networks, and significantly larger general-purpose VLMs. Bold = best, italic = second-best.

Public Data

| Method | Segment P | Segment R | Segment F1 | Frame P | Frame R | Frame F1 | ABE (s) |
|---|---|---|---|---|---|---|---|
| PySceneDetect (Adaptive) [8] | 0.656 | 0.716 | 0.684 | 0.645 | 0.372 | 0.457 | 1.93 |
| PySceneDetect (Content) | 0.627 | 0.759 | 0.686 | 0.603 | 0.383 | 0.453 | *1.08* |
| PySceneDetect (Hash) | 0.566 | 0.778 | 0.654 | 0.549 | 0.416 | 0.457 | 2.11 |
| PySceneDetect (Hist) | 0.452 | 0.741 | 0.559 | 0.419 | 0.398 | 0.394 | 1.19 |
| PySceneDetect (Threshold) | 0.364 | 0.031 | 0.057 | 0.306 | 0.014 | 0.027 | 3.08 |
| TransNetV2 [9] | *0.727* | 0.780 | *0.752* | **0.731** | 0.427 | 0.528 | 1.87 |
| AutoShot [1] | 0.707 | *0.804* | 0.751 | *0.709* | *0.440* | *0.532* | 1.78 |
| Gemini 2.5 Pro [10] | 0.558 | 0.527 | 0.542 | 0.453 | 0.361 | 0.401 | 3.62 |
| Gemini 3 Pro | 0.527 | 0.573 | 0.549 | 0.469 | 0.343 | 0.393 | 2.18 |
| Qwen3-VL Instruct 4B [2] | 0.235 | 0.088 | 0.124 | 0.134 | 0.174 | 0.148 | 55.00 |
| Qwen3-VL Instruct 8B | 0.222 | 0.297 | 0.246 | 0.132 | 0.307 | 0.184 | 34.39 |
| Qwen3-VL Instruct 32B | 0.309 | 0.473 | 0.370 | 0.214 | 0.279 | 0.241 | 3.96 |
| Qwen3-VL Instruct 30B-A3B (MoE) | 0.218 | 0.300 | 0.242 | 0.171 | 0.222 | 0.192 | 9.61 |
| Qwen3-VL Thinking 4B [2] | 0.449 | 0.079 | 0.134 | 0.265 | 0.051 | 0.086 | 1.50 |
| Qwen3-VL Thinking 8B | 0.450 | 0.144 | 0.217 | 0.297 | 0.094 | 0.143 | 4.25 |
| Qwen3-VL Thinking 32B | 0.403 | 0.260 | 0.315 | 0.308 | 0.160 | 0.209 | 1.09 |
| Qwen3-VL Thinking 30B-A3B (MoE) | 0.374 | 0.217 | 0.275 | 0.291 | 0.127 | 0.175 | **0.98** |
| TransVLM (Ours) | **0.762** | **0.806** | **0.783** | 0.574 | **0.562** | **0.568** | 1.58 |

Synthetic Data

| Method | Segment P | Segment R | Segment F1 | Frame P | Frame R | Frame F1 | ABE (s) | RTF |
|---|---|---|---|---|---|---|---|---|
| PySceneDetect (Adaptive) [8] | **0.924** | 0.172 | 0.290 | 0.922 | 0.036 | 0.069 | *0.28* | 0.09 |
| PySceneDetect (Content) | *0.909* | 0.128 | 0.225 | **0.968** | 0.029 | 0.056 | 0.61 | 0.09 |
| PySceneDetect (Hash) | 0.880 | 0.118 | 0.208 | 0.940 | 0.027 | 0.052 | 0.85 | 0.09 |
| PySceneDetect (Hist) | 0.577 | 0.379 | 0.455 | 0.762 | 0.117 | 0.197 | 1.04 | 0.09 |
| PySceneDetect (Threshold) | 0.773 | 0.039 | 0.074 | 0.749 | 0.008 | 0.016 | 1.46 | 0.09 |
| TransNetV2 [9] | 0.275 | 0.149 | 0.194 | 0.417 | 0.034 | 0.063 | 0.61 | *0.07* |
| AutoShot [1] | 0.379 | 0.248 | 0.299 | 0.530 | 0.058 | 0.102 | 0.51 | **0.03** |
| Gemini 2.5 Pro [10] | 0.338 | *0.851* | 0.465 | 0.638 | *0.760* | *0.686* | 0.82 | 0.81 |
| Gemini 3 Pro | 0.479 | 0.768 | 0.588 | 0.711 | 0.482 | 0.568 | 0.88 | 1.32 |
| Qwen3-VL Instruct 4B [2] | 0.717 | 0.306 | 0.428 | 0.458 | 0.618 | 0.525 | 8.88 | 0.31 |
| Qwen3-VL Instruct 8B | 0.586 | 0.597 | 0.591 | 0.741 | 0.204 | 0.315 | 1.60 | 0.34 |
| Qwen3-VL Instruct 32B | 0.895 | 0.623 | *0.735* | 0.911 | 0.361 | 0.516 | 1.35 | 0.98 |
| Qwen3-VL Instruct 30B-A3B (MoE) | 0.806 | 0.593 | 0.683 | 0.749 | 0.355 | 0.482 | 1.86 | 0.35 |
| Qwen3-VL Thinking 4B [2] | 0.839 | 0.218 | 0.346 | 0.748 | 0.087 | 0.156 | 1.68 | 1.03 |
| Qwen3-VL Thinking 8B | 0.854 | 0.389 | 0.534 | 0.923 | 0.147 | 0.252 | 1.29 | 1.31 |
| Qwen3-VL Thinking 32B | 0.900 | 0.608 | 0.726 | 0.943 | 0.323 | 0.479 | 1.16 | 3.30 |
| Qwen3-VL Thinking 30B-A3B (MoE) | 0.890 | 0.610 | 0.724 | 0.915 | 0.331 | 0.485 | 1.18 | 0.92 |
| TransVLM (Ours) | 0.908 | **0.882** | **0.895** | *0.946* | **0.930** | **0.938** | **0.11** | 0.50 |

Per-τ Comparison & Curves

Use the panels below to inspect detailed per-method numbers for each dataset and temporal tolerance τ, and to visualize the corresponding metric curves as a function of τ. In each per-τ table, bold marks the best result and underline marks the second-best.

Per-τ Table

Metric Curve vs. τ

metric curve

References

  1. Zhu et al. AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection. CVPRW 2023.
  2. Bai et al. Qwen3-VL Technical Report. 2025.
  3. Baraldi et al. A Deep Siamese Network for Scene Detection in Broadcast Videos (RAI dataset). ACM MM 2015.
  4. Baraldi et al. Shot and Scene Detection via Hierarchical Clustering for Re-using Broadcast Video (BBC). CAIP 2015.
  5. Tang et al. Fast Video Shot Transition Localization with Deep Structured Models (ClipShots). ACCV 2018.
  6. Rao et al. A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation (MovieShots). CVPR 2020.
  7. MCG. SportsShot: A Large-Scale Sports Shot Boundary Detection Dataset. 2024.
  8. Castellano et al. PySceneDetect. Open-source software, 2014–.
  9. Souček and Lokoč. TransNet V2: An Effective Deep Network Architecture for Fast Shot Transition Detection. 2020.
  10. Google DeepMind. Gemini Technical Report. 2023.
  11. Zhang et al. NeuFlow v2: Push High-Efficiency Optical Flow to the Limit. 2024.