OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Leaderboard

Leaderboard on Private data

Performance of LMMs on English tasks

Rank Method Venue Open-source LLM Size Average Recognition Referring Spotting Extraction Parsing Calculation Understanding Reasoning
1 Gemini-2.5-Pro🥇 - No - 59.3 70.9 45.8 13.4 93.7 26.9 84.6 75.8 63.0
2 Llama-3.1-Nemotron-Nano-VL-8B-V1🥈 - Yes 8B 56.4 62.9 61.3 68.6 88.2 10.0 44.1 75.3 41.0
3 Gemini1.5-Pro🥉 Arxiv 2024 No - 51.6 59.1 41.2 6.6 89.5 22.4 54.7 78.8 60.3
4 GPT-4o Arxiv 2024 No - 47.6 58.6 23.4 0.0 87.4 23.1 51.6 74.4 62.3
5 Claude3.5-sonnet - No - 47.5 52.9 24.9 2.5 86.9 23.8 61.4 74.4 53.0
6 Step-1V - No - 46.8 56.7 27.4 2.6 86.3 33.3 42.6 76.6 48.7
7 InternVL3-14B - Yes 14B 46.8 55.8 24.5 2.1 89.3 21.0 59.5 72.0 50.0
8 Ovis2-8B - Yes 7B 46.1 54.2 20.9 0.0 83.6 24.2 54.7 74.1 57.3
9 InternVL3-8B - Yes 8B 45.3 49.7 22.3 0.2 86.8 22.4 57.0 70.7 53.0
10 GPT-4o-mini - No - 44.1 55.3 21.8 0.0 85.4 20.6 45.2 75.5 49.0
11 SAIL-VL-1.6-8B Arxiv 2025 Yes 8B 43.1 56.7 24.1 2.2 79.3 22.8 45.4 69.2 45.3
12 InternVL2.5-26B Arxiv 2024 Yes 20B 42.6 53.5 21.4 0 84.0 21.4 51.5 67.5 41.5
13 Qwen2-VL-7B Arxiv 2024 Yes 8B 42.3 47.0 42.0 1.5 90.2 13.7 36.4 71.1 36.6
14 Qwen2.5-VL-7B Arxiv 2025 Yes 8B 41.8 51.5 24.5 3.1 64.8 13.1 53.3 78.6 45.5
15 InternVL2-26B SCIS 2024 Yes 20B 41.8 56.0 21.2 0 80.5 23.9 40.3 72.1 40.7
16 MiniCPM-o-2.6 - Yes 8B 41.6 54.1 24.7 0.3 74.4 17.6 39.2 75.7 47.0
17 DeepSeek-VL2-Small Arxiv 2024 Yes 16B 41.0 56.6 23.7 0 86.4 18.9 30.6 72.2 39.5
18 InternVL2.5-8B Arxiv 2024 Yes 8B 40.5 48.9 21.2 0 82.1 20.3 41.2 67.8 42.3
19 Pixtral-12B Arxiv 2024 Yes 12B 38.4 45.1 21.8 0 71.6 21.7 30.4 77.3 39.5
20 Phi-4-MultiModal Arxiv 2025 Yes 5.6B 38.1 58.4 19.0 0 53.5 38.7 28.7 66.8 39.8
21 Ovis1.6-3B Arxiv 2024 Yes 3B 38.0 48.5 19.5 0 69.2 20.7 22.1 74.6 49.5
22 GLM-4v-9B Arxiv 2024 Yes 9B 37.1 52.7 20.6 0 79.4 15.9 21.5 74.7 32.0
23 InternVL2-8B SCIS 2024 Yes 8B 36.1 43.0 21.6 0 70.2 19.2 35.6 65.9 33.6
24 Molmo-7B CVPR 2025 Yes 8B 33.9 40.8 19.5 0 51.7 10.0 33.9 67.0 48.0
25 XComposer2-4KHD NeurIPS 2024 Yes 7B 33.9 39.5 12.0 0 69.7 26.0 20.2 68.2 35.8
26 LLaVA-OV-7B Arxiv 2024 Yes 8B 33.7 45.4 18.5 0 60.0 15.5 32.0 59.0 39.3
27 MiniCPM-V-2.6 Arxiv 2024 Yes 8B 33.0 52.2 18.6 0.3 45.8 19.6 20.9 68.9 37.3
28 Cambrian-1-8B NeurIPS 2024 Yes 8B 32.3 44.0 19.0 0 52.3 19.0 20.7 64.0 39.3
29 Kimi-VL-A3B-16B Arxiv 2025 Yes 16B 32.1 49.1 13.5 0 28.8 21.9 37.6 69.4 36.2
30 LLaVA-Next-8B - Yes 8B 28.5 41.4 17.0 0 49.0 12.9 16.1 60.9 30.5
31 Idefics3-8B NeurIPS 2024 Workshop Yes 8B 26.0 37.4 13.0 0 28.9 19.4 21.1 65.4 21.8
32 Eagle-X5-7B ICLR 2025 Yes 8B 25.7 34.6 18.5 0 9.7 18.5 24.0 63.1 37.0
33 Qwen-VL-chat Arxiv 2023 Yes 8B 25.7 34.1 12.6 0.1 42.6 19.5 18.4 58.3 20.3
34 Qwen-VL Arxiv 2023 Yes 8B 24.8 35.9 4.2 0 38.7 28.5 13.8 60.1 16.9
35 DeepSeek-VL-7B Arxiv 2024 Yes 7B 24.5 33.5 13.7 0 19.1 11.7 24.8 60.5 32.5
36 Monkey CVPR 2024 Yes 8B 24.2 31.5 0.1 0 34.4 26.3 17.7 61.4 22.4
37 DocOwl2 Arxiv 2024 Yes 7B 23.4 25.4 7.5 0 47.1 26.2 8.3 52.8 19.5
38 TextMonkey Arxiv 2024 Yes 8B 23.4 39.8 1.6 0 27.6 24.8 10.2 62.3 21.2
39 VILA1.5-8B CVPR 2024 Yes 8B 23.2 36.0 14.5 0 26.0 17.4 20.3 44.7 27.0
40 EMU2-chat CVPR 2024 Yes 37B 20.2 34.3 0 0 20.4 21.3 20.3 47.1 18.3
41 CogVLM-chat NeurIPS 2024 Yes 7B 19.9 40.8 0 0 1.6 18.6 10.9 60.2 26.8
42 Yi-VL-6B Arxiv 2024 Yes 6B 19.7 31.1 4.0 0 23.4 22.5 18.1 43.0 15.5
43 mPLUG-Owl3 Arxiv 2024 Yes 8B 16.5 34.9 17.0 0 12.0 14.9 24.1 50.7 25.5
44 Janus-1.3B CVPR 2025 Yes 1.3B 14.3 32.6 0 0 12.0 14.9 24.1 50.7 25.5
45 UReader EMNLP Findings 2023 Yes 7B 14.1 20.9 0 0 0 20.7 11.3 39.0 20.8
46 LLaVAR Arxiv 2023 Yes 13B 12.4 13.8 0 0 8.3 15.2 4.4 42.4 15.0

Performance of LMMs on Chinese tasks

Rank Method Venue Open-source LLM Size Average Recognition Extraction Parsing Understanding Reasoning
1 Gemini-2.5-Pro🥇 - No - 62.2 72.0 74.0 35.2 90.0 39.7
2 Ovis2-8B🥈 - Yes 7B 56.0 61.0 67.7 43.6 82.0 25.6
3 Gemini1.5-Pro🥉 Arxiv 2024 No - 55.5 71.4 63.8 30.5 82.0 29.9
4 Kimi-VL-A3B-16B Arxiv 2025 Yes 16B 54.1 54.0 71.1 32.5 84.0 28.7
5 Step-1V - No - 53.4 65.2 64.9 33.1 78.0 25.5
6 InternVL3-14B - Yes 14B 52.8 62.1 59.5 33.2 80.0 29.2
7 GLM-4v-9B Arxiv 2024 Yes 9B 51.7 60.6 65.2 32.4 82.0 18.2
8 Qwen2.5-VL-7B Arxiv 2025 Yes 8B 49.5 24.4 78.9 33.1 82.0 29.0
9 InternVL3-8B - Yes 8B 49.0 57.7 55.8 29.9 72.0 29.4
10 Claude3.5-sonnet - No - 48.4 34.2 62.5 35.2 78.0 32.2
11 DeepSeek-VL2-Small Arxiv 2024 Yes 16B 48.1 51.6 56.3 27.8 79.6 25.3
12 MiniCPM-V-2.6 Arxiv 2024 Yes 8B 47.7 53.1 53.2 32.8 76.0 23.4
13 MiniCPM-o-2.6 - Yes 8B 47.7 54.0 62.4 24.1 68.0 29.8
14 GPT-4o Arxiv 2024 No - 45.7 41.7 52.1 29.0 76.0 29.4
15 Qwen2-VL-7B Arxiv 2024 Yes 8B 44.7 23.7 63.5 27.9 80.0 28.5
16 InternVL2.5-8B Arxiv 2024 Yes 8B 42.8 42.8 47.9 27.3 80.0 23.5
17 SAIL-VL-1.6-8B Arxiv 2025 Yes 8B 42.6 35.8 41.5 35.7 76.0 23.9
18 InternVL2.5-26B Arxiv 2024 Yes 20B 41.9 40.2 42.7 25.6 74.0 27.0
19 InternVL2-8B SCIS 2024 Yes 8B 41.3 35.2 42.8 26.1 78.0 24.4
20 Llama-3.1-Nemotron-Nano-VL-8B-V1 - Yes 8B 40.1 38.2 54.9 26.6 66.0 14.8
21 InternVL2-26B SCIS 2024 Yes 20B 38.1 20.4 50.7 29.0 76.0 14.5
22 GPT-4o-mini - No - 37.4 20.0 53.6 27.9 66.0 19.6
23 Phi-4-MultiModal Arxiv 2025 Yes 5.6B 37.3 30.5 40.5 42.7 56.0 16.9
24 XComposer2-4KHD NeurIPS 2024 Yes 7B 32.4 12.9 38.6 37.5 60.0 13.1
25 Ovis1.6-3B Arxiv 2024 Yes 3B 31.7 22.5 33.3 31.5 54.0 17.0
26 Monkey CVPR 2024 Yes 8B 21.5 1.5 28.4 29.1 40.0 8.3
27 TextMonkey Arxiv 2024 Yes 8B 21.5 10.5 15.2 30.2 44.0 7.6
28 Cambrian-1-8B NeurIPS 2024 Yes 8B 18.5 2.4 19.8 26.7 36.0 7.6
29 LLaVA-OV-7B Arxiv 2024 Yes 8B 17.4 5.4 13.6 20.3 34.0 13.6
30 mPLUG-Owl3 Arxiv 2024 Yes 8B 16.5 1.6 27.4 27.3 16.0 10.0
31 Pixtral-12B Arxiv 2024 Yes 12B 16.0 6.2 22.3 11.4 26.0 14.0
32 Qwen-VL-chat Arxiv 2023 Yes 8B 16.5 9.1 3.6 18.9 44.0 7.1
33 Idefics3-8B NeurIPS 2024 Workshop Yes 8B 15.6 2.9 29.0 12.3 26.0 7.9
34 Qwen-VL Arxiv 2023 Yes 8B 15.6 4.3 0 30.6 38.0 5.1
35 Molmo-7B CVPR 2025 Yes 8B 15.0 3.4 29.8 6.6 24.0 11.1
36 DocOwl2 Arxiv 2024 Yes 7B 14.4 1.0 17.8 29.4 20.0 3.9
37 DeepSeek-VL-7B Arxiv 2024 Yes 7B 13.7 3.2 14.7 10.7 30.0 9.8
38 CogVLM-chat NeurIPS 2024 Yes 7B 12.8 2.4 16.2 22.5 20.0 3.1
39 Eagle-X5-7B ICLR 2025 Yes 8B 12.3 1.9 16.1 13.6 22.0 8.1
40 VILA1.5-8B CVPR 2024 Yes 8B 11.0 1.4 9.1 22.2 16.0 6.4
41 Yi-VL-6B Arxiv 2024 Yes 6B 10.4 1.6 6.4 28.8 10.0 5.3
42 LLaVA-Next-8B - Yes 8B 9.2 2.8 0.9 14.9 20.0 7.4
43 UReader EMNLP Findings 2023 Yes 7B 9.0 0.3 2.0 28.1 12.0 2.4
44 LLaVAR Arxiv 2023 Yes 13B 8.6 2.2 2.0 27.1 10.0 1.9
45 EMU2-chat CVPR 2024 Yes 37B 8.2 1.2 3.0 29.3 4.0 3.6
46 Janus-1.3B CVPR 2025 Yes 1.3B 7.5 4.1 2.2 10.4 14.0 6.7
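
The Average column in both tables appears to be the unweighted mean of the per-task scores in each row (eight task categories for English, five for Chinese). As a minimal sanity check, the top-ranked rows above can be reproduced with a few lines of Python; this is illustrative only and not part of the official evaluation code:

english_top = [70.9, 45.8, 13.4, 93.7, 26.9, 84.6, 75.8, 63.0]  # Gemini-2.5-Pro, English tasks
chinese_top = [72.0, 74.0, 35.2, 90.0, 39.7]                    # Gemini-2.5-Pro, Chinese tasks

# Unweighted mean over task categories, rounded to one decimal place.
print(round(sum(english_top) / len(english_top), 1))  # 59.3
print(round(sum(chinese_top) / len(chinese_top), 1))  # 62.2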

We aim to update this benchmark every quarter. We sincerely welcome community contributions. If you have open-source models on Hugging Face or accessible APIs, sharing them with us would greatly help improve and expand the leaderboard. You can contact us at: ling_fu@hust.edu.cn

We have observed that some models adopt their own absolute-coordinate encoding for prompt inputs when tackling specialized tasks. For example, Qwen2.5-VL uses a format like {"bbox_2d": [x1, y1, x2, y2], "text_content": "xxx"} for text spotting. After modifying the prompt accordingly, Qwen2.5-VL-7B achieved a text spotting score of 51.6 on the public data, a significant improvement over its score with the default prompt currently used in OCRBench v2. We encourage you to share evaluation results obtained with prompts adapted to your model's input format; this will help us further improve and refine the leaderboard.
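
For reference, the snippet below sketches how such a response could be parsed into (bounding box, text) pairs before scoring. It assumes the model returns a JSON list in the format shown above; the function name and fence-stripping logic are illustrative and not part of the official OCRBench v2 evaluation code.

import json

def parse_spotting_response(raw_response):
    # Parse a response following the
    # {"bbox_2d": [x1, y1, x2, y2], "text_content": "xxx"} convention
    # into a list of (bbox, text) pairs; malformed output yields [].
    cleaned = raw_response.strip()
    if cleaned.startswith("```"):
        # Strip an optional markdown code fence around the JSON payload.
        cleaned = cleaned.strip("`").removeprefix("json").strip()
    try:
        items = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if isinstance(items, dict):  # a single instance instead of a list
        items = [items]
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox, text = item.get("bbox_2d"), item.get("text_content")
        if isinstance(bbox, list) and len(bbox) == 4 and isinstance(text, str):
            pairs.append((tuple(bbox), text))
    return pairs

response = '[{"bbox_2d": [10, 20, 180, 60], "text_content": "OCRBench"}]'
print(parse_spotting_response(response))  # [((10, 20, 180, 60), 'OCRBench')]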

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu1, Zhebin Kuang1, Jiajun Song1, Mingxin Huang2, Biao Yang1, Yuzhe Li1, Linghao Zhu1, Qidi Luo1, Xinyu Wang3, Hao Lu1, Zhang Li1, Guozhi Tang4, Bin Shan4, Chunhui Lin4, Qi Liu4, Binghong Wu4, Hao Feng4, Hao Liu4, Can Huang4, Jingqun Tang4, Wei Chen1, Lianwen Jin2, Yuliang Liu1, Xiang Bai1

1Huazhong University of Science and Technology, 2South China University of Technology, 3University of Adelaide, 4ByteDance

[Figure: overview and example tasks of OCRBench v2]

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4X more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, comprising 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across the public and private test sets validate the reliability of OCRBench v2. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning.

BibTeX

@misc{fu2024ocrbenchv2improvedbenchmark,
    title={OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning}, 
    author={Ling Fu and Zhebin Kuang and Jiajun Song and Mingxin Huang and Biao Yang and Yuzhe Li and Linghao Zhu and Qidi Luo and Xinyu Wang and Hao Lu and Zhang Li and Guozhi Tang and Bin Shan and Chunhui Lin and Qi Liu and Binghong Wu and Hao Feng and Hao Liu and Can Huang and Jingqun Tang and Wei Chen and Lianwen Jin and Yuliang Liu and Xiang Bai},
    year={2024},
    eprint={2501.00321},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.00321}, 
}