IntentVC Challenge at ACM MM 2025

date

Aug 15, 2025

summary

IntentVC Challenge at ACM MM 2025 - Second place winner

The IntentVC Challenge aims to solve a core problem in traditional video captioning: captions are too general and lack specificity.

Traditional approach

Standard methods generate “one-size-fits-all” captions that just describe the overall scene.

Example (Figure a):

"A child rides a bicycle with an adult walking alongside on a sunny day in a neighborhood."

→ This does describe the video, but it doesn’t highlight any particular details that a user might care about.

IntentVC approach

The challenge introduces user-controllable, intention-oriented captions.

This means the system can generate captions based on what the user wants to focus on:

Example (Figure b), focus on Object = Person →

"A young child wearing a helmet learns to ride a bicycle, guided by an adult for support."

Example (Figure c), focus on Object = Bicycle →

"A small bicycle with training wheels is ridden by a child, carefully supported by an adult along a sidewalk."

Why it matters

Personalization: Users can get captions tailored to their needs.

Control: Captions can highlight specific objects, actions, or contexts.

Practicality: This makes the system more useful for real applications like accessibility tools, education, or healthcare.

Result: Second Place Winner

The ranking is based on the average score across four standard video-captioning metrics:

BLEU@4 – measures n-gram precision (up to 4-grams) against the reference captions.

METEOR – aligns words between caption and reference, also crediting stems and synonyms.

CIDEr – measures TF-IDF-weighted n-gram consensus with multiple human references.

ROUGE-L – measures longest common subsequence overlap.
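
To make the ROUGE-L description concrete, here is a minimal sketch of an LCS-based F-score. The whitespace tokenization and the `beta = 1.2` weighting are assumptions for illustration (that value is commonly used in COCO-style caption evaluation); the official challenge scorer may differ in both.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score between a candidate and one reference caption.

    beta > 1 weights recall more than precision (assumed value, see lead-in).
    """
    c = candidate.lower().split()
    r = reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    ref = "A young child wearing a helmet learns to ride a bicycle"
    cand = "A child wearing a helmet rides a bicycle"
    print(round(rouge_l(cand, ref), 3))
```

Because the score depends only on the longest in-order word match, it rewards captions that keep the reference's word order even when some words are dropped.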

Our model outperformed most other entries on METEOR and ROUGE-L, which suggests its captions were fluent and close to the human references in both content and word order.

Our model also achieved a BLEU@4 score of 53.08, indicating close agreement with the references in wording and phrasing, and a CIDEr score of 248.58, indicating that it captured the details human annotators consistently mention.

Our Solution

Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5374737

GitHub: https://github.com/thqiu0419/IntentVCNet

What problem does it solve?

Large vision-language models (LVLMs) can handle spatial grounding (images) and temporal understanding (videos) separately, but they struggle to track a fine-grained target over time and to produce captions focused on that target according to user intent. IntentVCNet directly tackles this spatiotemporal disconnect.

Core method (two tricks + one ensemble)

IntentVCNet bridges the “spatiotemporal gap” in LVLMs by combining text prompts with coordinates and visual box prompts, plus a lightweight Box Adapter, to generate intent-controlled, object-focused video captions; it tops strong baselines on the IntentVC benchmark.

Prompt Combination

Text side: Write the target’s normalized box coordinates per frame into the instruction (e.g., [x1,y1,x2,y2]) so the LLM “knows where it is” over time.

Vision side: Overlay a red bounding box on each frame so the vision encoder looks exactly at the target.

These two prompts jointly enforce target grounding and intent focus.
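
The two prompts above can be sketched as follows. This is an illustration only, not the authors' code: the prompt wording, the function names, the rounding to three decimals, and the nested-list frame representation are all assumptions made to keep the example dependency-free.

```python
def normalize_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to coordinates in [0, 1]."""
    x1, y1, x2, y2 = box
    return [round(x1 / width, 3), round(y1 / height, 3),
            round(x2 / width, 3), round(y2 / height, 3)]


def build_text_prompt(pixel_boxes, width, height, target="person"):
    """Text side: write per-frame normalized boxes into the instruction.

    The instruction wording here is hypothetical.
    """
    coords = "; ".join(
        f"frame {i}: {normalize_box(b, width, height)}"
        for i, b in enumerate(pixel_boxes)
    )
    return f"Describe the {target} located at these boxes [x1,y1,x2,y2]. {coords}"


def draw_red_box(frame, box, thickness=1):
    """Vision side: return a copy of the frame with a red box outline.

    frame is an H x W nested list of [r, g, b] pixels; box uses inclusive
    pixel coordinates and must lie inside the frame.
    """
    x1, y1, x2, y2 = box
    out = [[list(px) for px in row] for row in frame]
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            on_border = (x - x1 < thickness or x2 - x < thickness or
                         y - y1 < thickness or y2 - y < thickness)
            if on_border:
                out[y][x] = [255, 0, 0]
    return out


if __name__ == "__main__":
    print(build_text_prompt([(112, 224, 336, 448)], 448, 448))
    frame = [[[0, 0, 0] for _ in range(6)] for _ in range(6)]
    boxed = draw_red_box(frame, (1, 1, 4, 4))
    print(boxed[1][1], boxed[2][2])
```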

Box Adapter (parameter-efficient, pluggable)

Insert a module in deeper ViT layers: extract RoI features (RoI-Align), then cross-attend them back into global features.

Freeze the vision backbone; train only the Box Adapter and LoRA on the LLM. This keeps general knowledge while adding fine-grained spatial interaction.
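
The data flow of the Box Adapter can be illustrated with a toy, dependency-free sketch. It is deliberately simplified and is not the paper's module: RoI-Align is replaced by selecting the patch tokens whose centers fall inside the box, the cross-attention has no learned projections, and all function names are hypothetical.

```python
import math


def roi_token_indices(box, grid):
    """Indices of patch tokens whose centers lie inside a normalized box.

    Stand-in for RoI-Align (assumption): real RoI-Align samples features
    with bilinear interpolation instead of picking whole patches.
    """
    x1, y1, x2, y2 = box
    idx = []
    for r in range(grid):
        for c in range(grid):
            cx, cy = (c + 0.5) / grid, (r + 0.5) / grid
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                idx.append(r * grid + c)
    return idx


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(global_tokens, roi_tokens):
    """Each global token attends over the RoI tokens; the attended context
    is added back as a residual. No learned weights (toy simplification)."""
    d = len(global_tokens[0])
    out = []
    for q in global_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in roi_tokens]
        weights = softmax(scores)
        ctx = [sum(w * v[j] for w, v in zip(weights, roi_tokens))
               for j in range(d)]
        out.append([qj + cj for qj, cj in zip(q, ctx)])
    return out


if __name__ == "__main__":
    grid = 4
    tokens = [[float(i), 1.0] for i in range(grid * grid)]
    roi = [tokens[i] for i in roi_token_indices((0.0, 0.0, 0.5, 0.5), grid)]
    fused = cross_attend(tokens, roi)
    print(len(fused), len(fused[0]))
```

In a trainable version, only the adapter's own parameters (and the LoRA weights on the LLM) would receive gradients, while the backbone stays frozen as described above.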

Multi-model Ensemble (consensus voting)

Use two bases: InternVL3 (great for high-res, shorter clips) and InternVideo2.5 (token compression, better for longer clips).

Route short vs. long videos (e.g., threshold ≈ 74 frames), then vote by sentence similarity to fuse outputs.
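
A minimal sketch of the routing and voting steps is shown below. Several details are assumptions: whether the threshold is inclusive, the use of `difflib.SequenceMatcher` over word tokens as the sentence-similarity measure (the paper may use a different one), and the idea that the voted candidates come from several runs of the routed model.

```python
from difflib import SequenceMatcher


def route(num_frames, threshold=74):
    """Pick the base model by clip length (threshold from the text;
    treating it as inclusive is an assumption)."""
    return "InternVL3" if num_frames <= threshold else "InternVideo2.5"


def similarity(a, b):
    """Word-level similarity in [0, 1]; a stand-in for sentence similarity."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def consensus_caption(candidates):
    """Return the candidate with the highest mean similarity to the others."""
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


if __name__ == "__main__":
    print(route(48), route(120))
    captions = [
        "a child rides a bicycle",
        "a child rides a small bicycle",
        "a dog runs on grass",
    ]
    print(consensus_caption(captions))
```

Picking the caption that agrees most with the rest filters out outliers without needing reference captions at inference time.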

Why it works (intuition)

Spatiotemporal alignment: Coordinates tell the LLM where each frame’s target is; red boxes force the vision encoder to look there.

Minimal-intrusion tuning: Freeze the backbones; train small adapters to add fine-grained object interaction without losing generality.

Specialize, then agree: Let the short-clip and long-clip experts do their thing, then pick the consensus.

Training & evaluation (key setup)

Frozen vision backbone; LLM uses LoRA (rank=128).

AdamW, init LR 2e-5, cosine decay.

Resolution: 448×448. Frames: train 32–48 (random), infer 48 fixed.

Env: PyTorch 2.1.1 / CUDA 12.1 / 4×H100 80 GB.

Metrics: BLEU@4, METEOR, CIDEr, ROUGE-L (per IntentVC challenge).
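
Two parts of this setup are easy to show in code: the cosine learning-rate decay and the frame-count policy. The sketch below assumes no warmup and decay to zero, neither of which is stated above; the function names are hypothetical.

```python
import math
import random


def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps.

    No warmup and min_lr = 0 are assumptions.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def frames_per_clip(training, rng=random):
    """Random 32 to 48 frames during training, fixed 48 at inference."""
    return rng.randint(32, 48) if training else 48


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, f"{cosine_lr(step, 1000):.2e}")
    print(frames_per_clip(training=True), frames_per_clip(training=False))
```

Varying the frame count during training exposes the model to different temporal densities, while the fixed count at inference keeps evaluation deterministic.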

Results (high level)

Beats strong baselines (VAST, Qwen2.5-VL, InternVideo2.5, InternVL3) across the four metrics; vs. InternVideo2.5, CIDEr improves by +37.71.

Ablations

Text coordinates alone and the visual red box alone both help; using both together adds only a small extra gain (some redundancy), so it is better to keep them in separate models and fuse them in the ensemble.

Box Adapter matters: placing it in the last 5 ViT layers works best (CIDEr ≈ 223.01); adapting more layers hurts performance.

Short/long routing plus fusion outperforms any single model.

Please do not hesitate to contact me if you have any questions.