Fine-grained Benchmark for Video Captioning and Retrieval

Introduction

🌟 CaReBench is a fine-grained benchmark comprising 1,000 high-quality videos with detailed human-annotated captions, including manually separated spatial and temporal descriptions for independent spatiotemporal bias evaluation.

📊 ReBias and CapST are metrics designed specifically for retrieval and captioning, respectively. Together they provide a comprehensive evaluation framework for spatiotemporal understanding in video-language models.

⚡ CaRe is a unified baseline for fine-grained video retrieval and captioning that achieves competitive performance through two-stage Supervised Fine-Tuning (SFT). It both generates detailed video descriptions and extracts robust video features.

🚀 State-of-the-art performance on both detailed video captioning and fine-grained video retrieval. CaRe outperforms CLIP-based retrieval models and popular MLLMs in captioning tasks.

Video Retrieval Leaderboard

R1: Recall@1, R5: Recall@5, R10: Recall@10

This leaderboard is sorted by General Text-to-Video R1 in descending order.

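Model	Params	Date	General Text-to-Video R1	General Text-to-Video R5	General Text-to-Video R10	General Video-to-Text R1	General Video-to-Text R5	General Video-to-Text R10	Spatial Text-to-Video R1	Spatial Text-to-Video R5	Spatial Text-to-Video R10	Spatial Video-to-Text R1	Spatial Video-to-Text R5	Spatial Video-to-Text R10	Temporal Text-to-Video R1	Temporal Text-to-Video R5	Temporal Text-to-Video R10	Temporal Video-to-Text R1	Temporal Video-to-Text R5	Temporal Video-to-Text R10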
CaRe Ours	7B	2025/3/15	77.0	95.6	98.7	79.0	96.8	99.1	76.8	96.3	98.7	78.1	95.8	99.3	50.7	85.3	94.4	53.4	86.3	94.0
Qwen2-VL* Alibaba	7B	2024/8/30	76.6	95.3	98.7	77.4	95.6	98.7	78.2	95.5	98.5	75.4	95.0	98.1	51.9	84.8	94.9	52.7	85.4	95.2
InternVideo2_stage2 Shanghai AI Lab	1B	2024/4/25	72.5	93.7	97.3	69.5	94.6	97.8	72.4	94.2	97.4	62.7	90.5	95.9	46.0	80.8	91.9	46.6	82.5	92.5
InternVL2* Shanghai AI Lab	8B	2024/7/4	72.1	92.6	96.8	73.6	93.4	97.4	76.8	94.2	97.7	75.7	95.2	98.0	48.1	76.8	89.0	47.6	78.2	90.3
Tarsier* ByteDance Research	7B	2024/7/4	71.0	93.8	97.8	70.6	94.2	98.0	70.2	94.0	98.2	67.4	93.5	97.4	50.1	84.1	92.8	50.0	84.7	94.9
MiniCPM-V 2.6* OpenBMB	8B	2024/8/6	71.0	92.2	97.0	69.3	92.8	97.1	71.7	93.6	98.0	67.6	92.3	97.7	50.5	82.9	92.1	46.1	80.9	93.3
LLaVA NeXT Video* LLaVA NeXT Team	7B	2024/5/10	66.9	89.4	96.0	62.7	89.2	95.4	68.0	92.0	96.2	65.0	90.0	95.9	43.3	76.9	88.9	40.1	75.4	88.7
LanguageBind Peking University	528M	2023/10/7	64.3	91.0	96.3	59.5	88.0	95.0	64.7	90.8	96.8	61.0	87.2	94.5	39.8	77.3	90.5	42.2	77.6	91.7
Long-CLIP L/14 Shanghai AI Lab	428M	2024/3/22	62.7	88.8	95.7	60.3	88.8	94.9	65.6	90.9	96.0	61.0	88.3	94.4	33.2	68.8	81.6	34.5	71.9	86.6
Long-CLIP B/16 Shanghai AI Lab	150M	2024/3/22	59.2	85.3	92.1	55.8	84.7	92.9	62.5	86.0	92.7	53.8	84.1	92.7	32.0	65.4	79.3	29.7	67.3	84.1
CLIP L/14 OpenAI	428M	2021/2/26	51.2	83.4	90.6	54.7	86.9	93.6	49.0	81.9	91.4	55.4	85.6	93.0	33.5	70.3	84.0	39.7	76.2	87.9
CLIP B/16 OpenAI	150M	2021/2/26	45.7	79.6	89.1	48.4	82.4	90.8	45.6	79.0	89.2	47.6	80.9	90.8	30.3	65.1	79.8	35.8	71.0	85.8
InternVL2 Shanghai AI Lab	8B	2024/7/4	34.6	67.1	80.2	35.1	68.5	82.0	40.4	72.9	83.8	40.3	73.0	85.7	29.3	62.5	77.4	27.1	59.8	75.9
Qwen2-VL Alibaba	7B	2024/8/30	30.9	64.7	79.1	32.9	69.6	82.7	28.1	61.3	76.1	31.6	65.6	80.4	24.3	61.5	78.4	26.4	59.2	76.1
Tarsier ByteDance Research	7B	2024/7/4	26.8	64.6	83.5	32.3	68.0	84.4	40.5	74.0	88.1	41.9	75.0	87.4	26.8	64.6	83.5	32.3	68.0	84.4
LLaVA NeXT Video LLaVA NeXT Team	7B	2024/5/10	22.4	51.5	65.3	25.2	54.4	67.7	34.1	63.1	76.0	31.1	63.7	75.1	18.6	48.1	62.4	20.7	47.1	62.4
MiniCPM-V 2.6 OpenBMB	8B	2024/8/6	8.2	26.9	38.4	16.7	39.9	55.8	6.6	25.2	35.7	13.3	38.2	53.5	11.8	35.8	52.2	16.6	47.4	64.4

Date indicates the release date of open-source models.
* Contrastively trained MLLM.
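
To make the R1, R5, and R10 columns above concrete, here is a minimal sketch of how Recall@K is typically computed from a caption-video similarity matrix. It is an illustration under stated assumptions, not the benchmark's evaluation code: the function names and the toy matrix are hypothetical, and it assumes each caption has exactly one matching video sharing its index.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth item is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidates by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def transpose(matrix):
    """Swap queries and candidates (text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*matrix)]


# Toy similarity matrix (made up): rows are captions, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.5],
]

t2v_r1 = recall_at_k(sim, 1)             # text-to-video Recall@1
t2v_r2 = recall_at_k(sim, 2)             # text-to-video Recall@2
v2t_r1 = recall_at_k(transpose(sim), 1)  # video-to-text Recall@1
```

In this toy case only the first caption retrieves its own video at rank 1, so text-to-video Recall@1 is one in three, while every caption finds its video within the top 2.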

Video Captioning Leaderboard

F1: F1 Score, R: Recall, P: Precision, LLM Judge: DeepSeek-V3-1226

This leaderboard is sorted by Overall Action F1 in descending order.

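Model	Params	Date	Overall Action F1	Overall Action R	Overall Action P	Overall Object F1	Overall Object R	Overall Object P	Personal Care Action F1	Personal Care Action R	Personal Care Action P	Personal Care Object F1	Personal Care Object R	Personal Care Object P	Socializing & Relaxing Action F1	Socializing & Relaxing Action R	Socializing & Relaxing Action P	Socializing & Relaxing Object F1	Socializing & Relaxing Object R	Socializing & Relaxing Object P	Sports & Exercise Action F1	Sports & Exercise Action R	Sports & Exercise Action P	Sports & Exercise Object F1	Sports & Exercise Object R	Sports & Exercise Object P	Household Activities Action F1	Household Activities Action R	Household Activities Action P	Household Activities Object F1	Household Activities Object R	Household Activities Object P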
CaRe_stage-I Ours	7B	2025/3/15	35.3	26.9	51.3	32.4	22.9	55.7	33.9	25.4	50.8	32.1	22.6	55.3	32.4	24.0	49.8	31.3	22.2	53.1	42.8	33.7	58.5	33.2	23.2	58.4	31.5	24.4	44.7	33.6	23.8	57.1
CaRe Ours	7B	2025/3/15	35.1	26.6	51.4	31.7	21.8	57.8	34.4	25.6	52.6	30.9	21.1	57.2	32.2	24.0	48.8	31.5	21.9	55.6	42.3	33.3	58.1	31.8	21.3	62.6	30.9	23.4	45.3	32.6	23.0	55.8
MiniCPM-V 2.6 OpenBMB	7B	2024/8/6	31.1	22.3	51.2	30.5	21.9	50.5	30.2	21.3	52.0	28.9	19.7	53.6	26.9	18.6	48.8	29.4	21.0	48.8	38.1	29.7	53.1	32.0	23.7	49.3	28.5	20.0	49.5	32.2	23.3	52.1
Qwen2-VL Alibaba	72B	2024/8/30	30.5	22.6	47.1	24.2	15.8	51.9	29.6	22.1	45.0	24.5	16.3	49.4	28.1	20.6	44.2	22.5	14.7	47.8	37.3	28.5	53.9	24.6	15.8	56.3	26.4	18.6	45.4	26.5	17.4	55.7
Qwen2-VL Alibaba	7B	2024/8/30	28.8	22.9	39.0	24.0	15.9	49.1	28.4	23.9	34.9	23.7	15.8	47.7	27.5	20.8	40.3	23.0	15.1	47.8	33.0	26.6	43.6	24.9	16.2	53.1	25.7	20.2	35.1	24.8	16.8	47.2
InternVL2.5 Shanghai AI Lab	72B	2024/7/4	28.2	20.3	46.4	30.5	24.8	39.5	24.6	16.7	46.7	28.7	22.4	40.0	25.9	18.3	44.4	28.6	23.3	37.3	36.0	27.8	51.0	34.0	28.2	42.7	24.9	17.5	43.2	30.8	25.7	38.5
Tarsier ByteDance	7B	2024/3/26	27.1	18.4	51.1	31.1	23.4	46.5	25.4	16.5	55.0	30.0	22.2	45.9	26.5	18.0	50.4	30.0	22.6	44.4	32.0	22.8	53.3	33.4	24.9	50.7	22.8	15.3	44.7	31.2	23.9	45.1
LLaVA NV LLaVA	7B	2023/12/21	26.6	18.7	45.9	24.7	17.9	39.8	27.5	20.1	43.7	21.7	15.5	36.2	25.0	17.4	44.1	24.1	17.3	39.9	29.4	21.1	48.4	26.8	19.6	42.3	24.3	16.2	48.1	26.3	19.5	40.4
InternVL2.5 Shanghai AI Lab	7B	2024/7/4	26.0	18.6	43.2	29.1	23.5	38.2	22.0	15.1	41.1	26.4	20.4	37.2	24.0	16.8	41.6	28.4	22.7	37.9	34.0	26.1	48.8	31.6	26.4	39.4	22.3	15.3	40.6	29.6	24.4	37.7
InternVL2 Shanghai AI Lab	7B	2024/4/25	23.3	18.8	30.7	22.9	17.1	34.9	22.2	18.4	28.0	20.4	15.1	31.6	23.0	17.9	32.3	23.1	17.3	34.6	27.9	23.4	34.5	24.9	18.3	38.7	18.4	14.7	24.8	22.7	17.1	33.8

Date indicates the release date of open-source models
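
As a quick reading aid for the F1, R, and P columns above, the sketch below applies the standard F1 definition (the harmonic mean of precision and recall) to a few Overall Action values copied from the table. Two things are assumptions here: the helper name `f1_score` is hypothetical, and the page does not say whether the benchmark aggregates F1 this way or how the LLM judge matches actions and objects. The sketch only shows that these reported rows are consistent with the harmonic mean.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall Action (R, P) pairs copied from three rows of the table above.
rows = {
    "CaRe_stage-I": (26.9, 51.3),
    "CaRe": (26.6, 51.4),
    "MiniCPM-V 2.6": (22.3, 51.2),
}

# Recompute F1 from R and P, rounded to one decimal like the leaderboard.
f1 = {name: round(f1_score(p, r), 1) for name, (r, p) in rows.items()}
```

The recomputed values (35.3, 35.1, and 31.1) match the Overall Action F1 column for these three models. Because F1 is a harmonic mean, the lower of the two inputs dominates, which is why models with precision near 50 but recall in the 20s land at an F1 in the low 30s.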

BibTeX

@misc{xu2025carebenchfinegrainedbenchmarkvideo,
  title={CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval},
  author={Yifan Xu and Xinhao Li and Yichun Yang and Desen Meng and Rui Huang and Limin Wang},
  year={2025},
  eprint={2501.00513},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.00513},
}
