CaReBench

Fine-grained Benchmark for Video Captioning and Retrieval


¹State Key Laboratory for Novel Software Technology, Nanjing University · ²Shanghai AI Laboratory


Introduction

🌟 CaReBench is a fine-grained benchmark of 1,000 high-quality videos with detailed human-annotated captions. Spatial and temporal descriptions are manually separated for each video, so spatial and temporal bias can be evaluated independently.
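Concretely, each video thus carries a full caption plus its separated spatial and temporal parts. A minimal illustration of what one annotation record might look like (the field names and example text below are assumptions for illustration, not the dataset's actual schema):

```python
import json

# Hypothetical annotation record: the benchmark pairs each video with a
# detailed caption and manually separated spatial/temporal descriptions.
# Field names and example text are illustrative assumptions only.
record = {
    "video_id": "example_0001",
    "caption": "A man in a red jacket chops vegetables, then rinses them in the sink.",
    "spatial": "A man in a red jacket stands at a kitchen counter with vegetables.",
    "temporal": "He first chops the vegetables, then rinses them.",
}
print(json.dumps(record, indent=2))
```

Keeping the spatial and temporal fields separate is what allows a retrieval model to be queried with only one type of description at a time, which is how the spatial and temporal leaderboard columns below are produced.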


📊 ReBias and CapST are metrics designed specifically for the retrieval and captioning tasks, respectively, and together provide a comprehensive framework for evaluating spatiotemporal understanding in video-language models.


⚡ CaRe is a unified baseline for fine-grained video retrieval and captioning that achieves competitive performance through two-stage Supervised Fine-Tuning (SFT). CaRe excels both at generating detailed video descriptions and at extracting robust video features.
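A common recipe for making a single autoregressive MLLM serve as a retriever is to pool one embedding per input from the decoder's final hidden states, e.g. by taking the last non-padded token and L2-normalizing it. The sketch below shows that pooling step under this assumption; whether CaRe uses exactly this recipe is not stated here, so treat it as one plausible design, not the model's actual code:

```python
import numpy as np

def last_token_embedding(hidden_states: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Pool one embedding per sequence from decoder hidden states.

    hidden_states: (batch, seq_len, dim) final-layer states of an MLLM.
    lengths: (batch,) true sequence lengths before padding.

    Last-token pooling is a common way to turn an autoregressive model
    into an embedding model; this is an illustrative assumption, not
    necessarily CaRe's exact feature-extraction recipe.
    """
    # Pick the hidden state of the last real token in each sequence.
    emb = hidden_states[np.arange(len(lengths)), lengths - 1]  # (batch, dim)
    # L2-normalize so dot products between embeddings are cosine similarities.
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)
```

Normalized embeddings like these are what the retrieval leaderboard below scores: text and video embeddings are compared by cosine similarity, and Recall@K is computed over the ranked results.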


🚀 State-of-the-art performance on both detailed video captioning and fine-grained video retrieval: CaRe outperforms CLIP-based models on retrieval and popular MLLMs on captioning.


Benchmark Data

Video Retrieval Leaderboard

R1: Recall@1 · R5: Recall@5 · R10: Recall@10

This leaderboard is sorted by R@1 score.
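Recall@K here follows the standard definition: a query counts as a hit if its ground-truth match appears among the top-K retrieved items. A minimal sketch (the similarity matrix and the convention that query i matches gallery item i are illustrative assumptions, not the benchmark's actual evaluation code):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Recall@K for a (num_queries, num_gallery) similarity matrix.

    Assumes query i's ground-truth match is gallery item i, the usual
    pairing convention for text-to-video / video-to-text retrieval.
    """
    # Gallery indices sorted by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # Rank of the ground-truth item for each query (0 = retrieved first).
    gt_rank = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1)
    return {f"R{k}": float(100.0 * np.mean(gt_rank < k)) for k in ks}

# Toy example: identity similarities mean every query retrieves its
# ground-truth match first.
print(recall_at_k(np.eye(3)))  # {'R1': 100.0, 'R5': 100.0, 'R10': 100.0}
```

Text-to-video and video-to-text scores simply swap which modality plays the query role (i.e., transpose the similarity matrix).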

Each cell lists R1 / R5 / R10. T2V: Text-to-Video · V2T: Video-to-Text.

| # | Model | Params | Date | General T2V | General V2T | Spatial T2V | Spatial V2T | Temporal T2V | Temporal V2T |
|---|-------|--------|------|-------------|-------------|-------------|-------------|--------------|--------------|
| 1 | CaRe (Ours) | 7B | 2025/3/15 | 77.0 / 95.6 / 98.7 | 79.0 / 96.8 / 99.1 | 76.8 / 96.3 / 98.7 | 78.1 / 95.8 / 99.3 | 50.7 / 85.3 / 94.4 | 53.4 / 86.3 / 94.0 |
| 2 | Qwen2-VL* (Alibaba) | 7B | 2024/8/30 | 76.6 / 95.3 / 98.7 | 77.4 / 95.6 / 98.7 | 78.2 / 95.5 / 98.5 | 75.4 / 95.0 / 98.1 | 51.9 / 84.8 / 94.9 | 52.7 / 85.4 / 95.2 |
| 3 | InternVideo2 (stage2) (Shanghai AI Lab) | 1B | 2024/4/25 | 72.5 / 93.7 / 97.3 | 69.5 / 94.6 / 97.8 | 72.4 / 94.2 / 97.4 | 62.7 / 90.5 / 95.9 | 46.0 / 80.8 / 91.9 | 46.6 / 82.5 / 92.5 |
| 4 | InternVL2* (Shanghai AI Lab) | 8B | 2024/7/4 | 72.1 / 92.6 / 96.8 | 73.6 / 93.4 / 97.4 | 76.8 / 94.2 / 97.7 | 75.7 / 95.2 / 98.0 | 48.1 / 76.8 / 89.0 | 47.6 / 78.2 / 90.3 |
| 5 | Tarsier* (ByteDance Research) | 7B | 2024/7/4 | 71.0 / 93.8 / 97.8 | 70.6 / 94.2 / 98.0 | 70.2 / 94.0 / 98.2 | 67.4 / 93.5 / 97.4 | 50.1 / 84.1 / 92.8 | 50.0 / 84.7 / 94.9 |
| 6 | MiniCPM-V 2.6* (OpenBMB) | 8B | 2024/8/6 | 71.0 / 92.2 / 97.0 | 69.3 / 92.8 / 97.1 | 71.7 / 93.6 / 98.0 | 67.6 / 92.3 / 97.7 | 50.5 / 82.9 / 92.1 | 46.1 / 80.9 / 93.3 |
| 7 | LLaVA NeXT Video* (LLaVA NeXT Team) | 7B | 2024/5/10 | 66.9 / 89.4 / 96.0 | 62.7 / 89.2 / 95.4 | 68.0 / 92.0 / 96.2 | 65.0 / 90.0 / 95.9 | 43.3 / 76.9 / 88.9 | 40.1 / 75.4 / 88.7 |
| 8 | LanguageBind (Peking University) | 528M | 2023/10/7 | 64.3 / 91.0 / 96.3 | 59.5 / 88.0 / 95.0 | 64.7 / 90.8 / 96.8 | 61.0 / 87.2 / 94.5 | 39.8 / 77.3 / 90.5 | 42.2 / 77.6 / 91.7 |
| 9 | Long-CLIP L/14 (Shanghai AI Lab) | 428M | 2024/3/22 | 62.7 / 88.8 / 95.7 | 60.3 / 88.8 / 94.9 | 65.6 / 90.9 / 96.0 | 61.0 / 88.3 / 94.4 | 33.2 / 68.8 / 81.6 | 34.5 / 71.9 / 86.6 |
| 10 | Long-CLIP B/14 (Shanghai AI Lab) | 150M | 2024/3/22 | 59.2 / 85.3 / 92.1 | 55.8 / 84.7 / 92.9 | 62.5 / 86.0 / 92.7 | 53.8 / 84.1 / 92.7 | 32.0 / 65.4 / 79.3 | 29.7 / 67.3 / 84.1 |
| 11 | CLIP L/14 (OpenAI) | 428M | 2021/2/26 | 51.2 / 83.4 / 90.6 | 54.7 / 86.9 / 93.6 | 49.0 / 81.9 / 91.4 | 55.4 / 85.6 / 93.0 | 33.5 / 70.3 / 84.0 | 39.7 / 76.2 / 87.9 |
| 12 | CLIP B/16 (OpenAI) | 150M | 2021/2/26 | 45.7 / 79.6 / 89.1 | 48.4 / 82.4 / 90.8 | 45.6 / 79.0 / 89.2 | 47.6 / 80.9 / 90.8 | 30.3 / 65.1 / 79.8 | 35.8 / 71.0 / 85.8 |
| 13 | InternVL2 (Shanghai AI Lab) | 8B | 2024/7/4 | 34.6 / 67.1 / 80.2 | 35.1 / 68.5 / 82.0 | 40.4 / 72.9 / 83.8 | 40.3 / 73.0 / 85.7 | 29.3 / 62.5 / 77.4 | 27.1 / 59.8 / 75.9 |
| 14 | Qwen2-VL (Alibaba) | 7B | 2024/8/30 | 30.9 / 64.7 / 79.1 | 32.9 / 69.6 / 82.7 | 28.1 / 61.3 / 76.1 | 31.6 / 65.6 / 80.4 | 24.3 / 61.5 / 78.4 | 26.4 / 59.2 / 76.1 |
| 15 | Tarsier (ByteDance Research) | 7B | 2024/7/4 | 26.8 / 64.6 / 83.5 | 32.3 / 68.0 / 84.4 | 40.5 / 74.0 / 88.1 | 41.9 / 75.0 / 87.4 | 26.8 / 64.6 / 83.5 | 32.3 / 68.0 / 84.4 |
| 16 | LLaVA NeXT Video (LLaVA NeXT Team) | 7B | 2024/5/10 | 22.4 / 51.5 / 65.3 | 25.2 / 54.4 / 67.7 | 34.1 / 63.1 / 76.0 | 31.1 / 63.7 / 75.1 | 18.6 / 48.1 / 62.4 | 20.7 / 47.1 / 62.4 |
| 17 | MiniCPM-V 2.6 (OpenBMB) | 8B | 2024/8/6 | 8.2 / 26.9 / 38.4 | 16.7 / 39.9 / 55.8 | 6.6 / 25.2 / 35.7 | 13.3 / 38.2 / 53.5 | 11.8 / 35.8 / 52.2 | 16.6 / 47.4 / 64.4 |

Date indicates the open-source release date of each model. * marks contrastively trained MLLMs.

Video Captioning Leaderboard

F1: F1 Score · R: Recall · P: Precision

This leaderboard is sorted by Overall Action F1 score.
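The F1 scores are the harmonic mean of precision and recall over elements (actions or objects) matched between a generated caption and the reference. How CapST extracts and matches those elements is defined in the paper; as a rough set-overlap illustration only (the sets below are made up):

```python
def prf(pred: set, ref: set) -> tuple:
    """Precision, recall, and F1 between predicted and reference element sets.

    A simplified set-overlap stand-in for CapST's matching, which the
    paper defines precisely; treat this as illustration, not the metric.
    """
    hits = len(pred & ref)                       # correctly predicted elements
    p = hits / len(pred) if pred else 0.0        # precision
    r = hits / len(ref) if ref else 0.0          # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return f1, r, p

# Toy example: two of three predicted objects appear in the reference.
f1, r, p = prf({"cup", "table", "spoon"}, {"cup", "table", "sink", "towel"})
```

Note in the tables that precision generally runs well above recall: models tend to mention correct elements but miss many that the detailed references contain.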

Each cell lists F1 / R / P.

| # | Model | Params | Date | Overall Action | Overall Object | Personal Care Action | Personal Care Object | Socializing & Relaxing Action | Socializing & Relaxing Object | Sports & Exercise Action | Sports & Exercise Object | Household Activities Action | Household Activities Object |
|---|-------|--------|------|----------------|----------------|----------------------|----------------------|-------------------------------|-------------------------------|--------------------------|--------------------------|-----------------------------|-----------------------------|
| 1 | CaRe (stage-I) (Ours) | 7B | 2025/3/15 | 35.3 / 26.9 / 51.3 | 32.4 / 22.9 / 55.7 | 33.9 / 25.4 / 50.8 | 32.1 / 22.6 / 55.3 | 32.4 / 24.0 / 49.8 | 31.3 / 22.2 / 53.1 | 42.8 / 33.7 / 58.5 | 33.2 / 23.2 / 58.4 | 31.5 / 24.4 / 44.7 | 33.6 / 23.8 / 57.1 |
| 2 | CaRe (Ours) | 7B | 2025/3/15 | 35.1 / 26.6 / 51.4 | 31.7 / 21.8 / 57.8 | 34.4 / 25.6 / 52.6 | 30.9 / 21.1 / 57.2 | 32.2 / 24.0 / 48.8 | 31.5 / 21.9 / 55.6 | 42.3 / 33.3 / 58.1 | 31.8 / 21.3 / 62.6 | 30.9 / 23.4 / 45.3 | 32.6 / 23.0 / 55.8 |
| 3 | MiniCPM-V 2.6 (OpenBMB) | 7B | 2024/8/6 | 31.1 / 22.3 / 51.2 | 30.5 / 21.9 / 50.5 | 30.2 / 21.3 / 52.0 | 28.9 / 19.7 / 53.6 | 26.9 / 18.6 / 48.8 | 29.4 / 21.0 / 48.8 | 38.1 / 29.7 / 53.1 | 32.0 / 23.7 / 49.3 | 28.5 / 20.0 / 49.5 | 32.2 / 23.3 / 52.1 |
| 4 | Qwen2-VL (Alibaba) | 72B | 2024/8/30 | 30.5 / 22.6 / 47.1 | 24.2 / 15.8 / 51.9 | 29.6 / 22.1 / 45.0 | 24.5 / 16.3 / 49.4 | 28.1 / 20.6 / 44.2 | 22.5 / 14.7 / 47.8 | 37.3 / 28.5 / 53.9 | 24.6 / 15.8 / 56.3 | 26.4 / 18.6 / 45.4 | 26.5 / 17.4 / 55.7 |
| 5 | Qwen2-VL (Alibaba) | 7B | 2024/8/30 | 28.8 / 22.9 / 39.0 | 24.0 / 15.9 / 49.1 | 28.4 / 23.9 / 34.9 | 23.7 / 15.8 / 47.7 | 27.5 / 20.8 / 40.3 | 23.0 / 15.1 / 47.8 | 33.0 / 26.6 / 43.6 | 24.9 / 16.2 / 53.1 | 25.7 / 20.2 / 35.1 | 24.8 / 16.8 / 47.2 |
| 6 | InternVL2.5 (Shanghai AI Lab) | 72B | 2024/7/4 | 28.2 / 20.3 / 46.4 | 30.5 / 24.8 / 39.5 | 24.6 / 16.7 / 46.7 | 28.7 / 22.4 / 40.0 | 25.9 / 18.3 / 44.4 | 28.6 / 23.3 / 37.3 | 36.0 / 27.8 / 51.0 | 34.0 / 28.2 / 42.7 | 24.9 / 17.5 / 43.2 | 30.8 / 25.7 / 38.5 |
| 7 | Tarsier (ByteDance) | 7B | 2024/3/26 | 27.1 / 18.4 / 51.1 | 31.1 / 23.4 / 46.5 | 25.4 / 16.5 / 55.0 | 30.0 / 22.2 / 45.9 | 26.5 / 18.0 / 50.4 | 30.0 / 22.6 / 44.4 | 32.0 / 22.8 / 53.3 | 33.4 / 24.9 / 50.7 | 22.8 / 15.3 / 44.7 | 31.2 / 23.9 / 45.1 |
| 8 | LLaVA NeXT Video (LLaVA) | 7B | 2023/12/21 | 26.6 / 18.7 / 45.9 | 24.7 / 17.9 / 39.8 | 27.5 / 20.1 / 43.7 | 21.7 / 15.5 / 36.2 | 25.0 / 17.4 / 44.1 | 24.1 / 17.3 / 39.9 | 29.4 / 21.1 / 48.4 | 26.8 / 19.6 / 42.3 | 24.3 / 16.2 / 48.1 | 26.3 / 19.5 / 40.4 |
| 9 | InternVL2.5 (Shanghai AI Lab) | 7B | 2024/7/4 | 26.0 / 18.6 / 43.2 | 29.1 / 23.5 / 38.2 | 22.0 / 15.1 / 41.1 | 26.4 / 20.4 / 37.2 | 24.0 / 16.8 / 41.6 | 28.4 / 22.7 / 37.9 | 34.0 / 26.1 / 48.8 | 31.6 / 26.4 / 39.4 | 22.3 / 15.3 / 40.6 | 29.6 / 24.4 / 37.7 |
| 10 | InternVL2 (Shanghai AI Lab) | 7B | 2024/4/25 | 23.3 / 18.8 / 30.7 | 22.9 / 17.1 / 34.9 | 22.2 / 18.4 / 28.0 | 20.4 / 15.1 / 31.6 | 23.0 / 17.9 / 32.3 | 23.1 / 17.3 / 34.6 | 27.9 / 23.4 / 34.5 | 24.9 / 18.3 / 38.7 | 18.4 / 14.7 / 24.8 | 22.7 / 17.1 / 33.8 |

Date indicates the open-source release date of each model.

BibTeX

@misc{xu2025carebenchfinegrainedbenchmarkvideo,
  title={CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval}, 
  author={Yifan Xu and Xinhao Li and Yichun Yang and Desen Meng and Rui Huang and Limin Wang},
  year={2025},
  eprint={2501.00513},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.00513}, 
}