IntentVC Challenge at ACM MM 2025
Date: Aug 15, 2025
IntentVC Challenge at ACM MM 2025 - Second place winner

The IntentVC Challenge aims to solve a core problem in traditional video captioning: captions are too general and lack specificity.
Traditional approach
- Standard methods generate “one-size-fits-all” captions that just describe the overall scene.
- Example (Figure a):
"A child rides a bicycle with an adult walking alongside on a sunny day in a neighborhood."
→ This does describe the video, but it doesn’t highlight any particular details that a user might care about.
IntentVC approach
- The challenge introduces user-controllable, intention-oriented captions.
- This means the system can generate captions based on what the user wants to focus on:
- Example (Figure b), focus on Object = Person →
"A young child wearing a helmet learns to ride a bicycle, guided by an adult for support."
- Example (Figure c), focus on Object = Bicycle →
"A small bicycle with training wheels is ridden by a child, carefully supported by an adult along a sidewalk."
Why it matters
- Personalization: Users can get captions tailored to their needs.
- Control: Captions can highlight specific objects, actions, or contexts.
- Practicality: This makes the system more useful for real applications like accessibility tools, education, or healthcare.
Result: Second Place Winner

The ranking is based on average scores across multiple evaluation metrics for video captioning:
- BLEU@4 – measures n-gram overlap with reference captions.
- METEOR – captures synonym/semantic matching.
- CIDEr – measures consensus with multiple human references.
- ROUGE-L – measures longest common subsequence overlap.
- Our model outperformed most other entries on METEOR and ROUGE-L, which suggests its captions were fluent and semantically close to the human references.
- Our model also achieved a BLEU@4 score of 53.08, indicating close agreement with the references in wording and phrasing, and a CIDEr score of 248.58, indicating that the captions capture the details human annotators agree on.
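To make one of these metrics concrete, here is a minimal sketch of ROUGE-L for a single caption pair. It is an illustration only, not the challenge's official scorer: the whitespace tokenization, the lowercasing, and the `beta=1.2` weighting are assumptions, and the function names are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score between a candidate and one reference caption.

    beta > 1 weights recall more than precision (assumed value).
    """
    c = candidate.lower().split()
    r = reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    cand = "A child rides a bicycle next to an adult"
    ref = "A young child rides a small bicycle beside an adult"
    print(round(rouge_l(cand, ref), 3))
```

A real evaluation would score against all reference captions and use the tokenization that the challenge toolkit prescribes.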
Our Solution
What problem does it solve?
Large vision-language models (LVLMs) can handle spatial grounding (in images) and temporal understanding (in videos) separately, but they struggle to track a fine-grained target over time and to produce captions focused on that target according to the user's intent. IntentVCNet directly tackles this spatiotemporal disconnect.
Core method (two tricks + one ensemble)
IntentVCNet bridges the “spatiotemporal gap” in LVLMs by combining text prompts that carry box coordinates with visual box prompts, and by adding a lightweight Box Adapter, to generate intent-controlled, object-focused video captions. It outperforms strong baselines on the IntentVC benchmark.

- Prompt Combination
- Text side: Write the target’s normalized box coordinates per frame into the instruction (e.g., [x1,y1,x2,y2]) so the LLM “knows where it is” over time.
- Vision side: Overlay a red bounding box on each frame so the vision encoder looks exactly at the target.
These two prompts jointly enforce target grounding and intent focus.
- Box Adapter (parameter-efficient, pluggable)
- Insert a module in deeper ViT layers: extract RoI features (RoI-Align), then cross-attend them back into global features.
- Freeze the vision backbone; train only the Box Adapter and LoRA on the LLM. This keeps general knowledge while adding fine-grained spatial interaction.
- Multi-model Ensemble (consensus voting)
- Use two bases: InternVL3 (great for high-res, shorter clips) and InternVideo2.5 (token compression, better for longer clips).
- Route short vs. long videos (e.g., threshold ≈ 74 frames), then vote by sentence similarity to fuse outputs.
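The Prompt Combination step above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the actual pipeline: the instruction template, the three-decimal rounding, the box thickness, and the representation of a frame as nested lists of `[r, g, b]` pixels are all hypothetical choices made so the example runs without extra libraries.

```python
def normalize_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to coordinates in [0, 1]."""
    x1, y1, x2, y2 = box
    return [round(x1 / width, 3), round(y1 / height, 3),
            round(x2 / width, 3), round(y2 / height, 3)]


def build_instruction(boxes, width, height, target="object"):
    """Write per-frame normalized boxes into a text prompt (hypothetical template)."""
    lines = [f"Frame {i}: {normalize_box(box, width, height)}"
             for i, box in enumerate(boxes)]
    return (f"The target {target} is located at these [x1,y1,x2,y2] boxes:\n"
            + "\n".join(lines)
            + "\nDescribe the video with a focus on the target.")


def draw_red_box(frame, box, thickness=2):
    """Return a copy of the frame with a red rectangle outline drawn on it.

    frame is a list of rows, each row a list of [r, g, b] pixels.
    """
    height, width = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    out = [[list(px) for px in row] for row in frame]
    for y in range(max(0, y1), min(height, y2 + 1)):
        for x in range(max(0, x1), min(width, x2 + 1)):
            on_border = (x - x1 < thickness or x2 - x < thickness
                         or y - y1 < thickness or y2 - y < thickness)
            if on_border:
                out[y][x] = [255, 0, 0]
    return out


if __name__ == "__main__":
    print(build_instruction([(112, 56, 336, 224)], 448, 448, target="person"))
```

In practice the overlay would be drawn with an image library on the decoded frames; the nested-list version only shows the idea.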
Why it works (intuition)
- Spatiotemporal alignment: Coordinates tell the LLM where each frame’s target is; red boxes force the vision encoder to look there.
- Minimal-intrusion tuning: Freeze the backbones; train small adapters to add fine-grained object interaction without losing generality.
- Specialize, then agree: Let the short-clip and long-clip experts do their thing, then pick the consensus.
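The "specialize, then agree" idea can be sketched as follows. This is a simplified stand-in, not the actual ensemble: the post does not say which sentence-similarity measure is used, so token-level Jaccard overlap is swapped in here, and the model labels and the `<=` comparison at the 74-frame threshold are assumptions.

```python
def route(num_frames, threshold=74):
    """Pick an expert by clip length (labels are illustrative only)."""
    return "short_clip_model" if num_frames <= threshold else "long_clip_model"


def jaccard(a, b):
    """Token-set overlap between two sentences, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_vote(candidates, similarity=jaccard):
    """Return the caption most similar, in total, to all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate caption")

    def agreement(i):
        return sum(similarity(candidates[i], other)
                   for j, other in enumerate(candidates) if j != i)

    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


if __name__ == "__main__":
    captions = [
        "a child rides a bicycle",
        "a child rides a small bicycle",
        "an adult walks on a sidewalk",
    ]
    print(route(48), "->", consensus_vote(captions))
```

An embedding-based similarity could be passed in through the `similarity` argument without changing the voting logic.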

Training & evaluation (key setup)
- Frozen vision backbone; LLM uses LoRA (rank=128).
- AdamW, init LR 2e-5, cosine decay.
- Resolution: 448×448. Frames: train 32–48 (random), infer 48 fixed.
- Env: PyTorch 2.1.1 / CUDA 12.1 / 4×H100 80 GB.
- Metrics: BLEU@4, METEOR, CIDEr, ROUGE-L (per IntentVC challenge).
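Two pieces of this setup, frame sampling and the cosine learning-rate decay, can be sketched in a few lines. The frame counts (32 to 48 in training, 48 at inference) and the initial rate of 2e-5 come from the list above; the evenly spaced sampling, the absence of warmup, and the final rate of zero are assumptions made for the example.

```python
import math
import random


def sample_frame_indices(total_frames, training, rng=None):
    """Pick evenly spaced frame indices: 32-48 (random) in training, 48 at inference."""
    rng = rng or random.Random()
    n = rng.randint(32, 48) if training else 48
    n = min(n, total_frames)  # short clips: use every available frame
    # Take the midpoint of each of n equal-length segments.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(len(sample_frame_indices(300, training=False)))
    print(cosine_lr(0, 1000), cosine_lr(500, 1000), cosine_lr(1000, 1000))
```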
Results (high level)
- Beats strong baselines (VAST, Qwen2.5-VL, InternVideo2.5, InternVL3) on all four metrics; compared with InternVideo2.5, CIDEr improves by 37.71 points.
Ablations
- Text coordinates alone and the visual red box alone both help; using both in one model adds only a small extra gain (the two cues are partly redundant), so it works better to keep them in separate models and fuse the outputs in the ensemble.
- Box Adapter matters: placing it in the last 5 ViT layers works best (CIDEr ≈ 223.01); adding it to more layers hurts performance.
- Short/long routing + fusion > any single model.
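For readers who want the Box Adapter's "RoI features cross-attended back into global features" step in symbols, one plausible formulation is given below. Everything in it is an assumption for illustration: the post does not give the exact equations, so the residual connection, the projection matrices $W_Q, W_K, W_V$, and the scaling by $\sqrt{d}$ are the standard cross-attention form rather than a confirmed description of the module. Here $F \in \mathbb{R}^{N \times d}$ denotes the global patch tokens of a ViT layer and $R \in \mathbb{R}^{M \times d}$ the tokens extracted by RoI-Align inside the target box.

```latex
\begin{aligned}
Q &= F W_Q, \qquad K = R W_K, \qquad V = R W_V, \\
F' &= F + \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V .
\end{aligned}
```

Under this reading, every global token queries the box tokens, so information about the target is injected into the whole feature map while the frozen backbone weights stay unchanged.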
Please do not hesitate to contact me if you have any questions.