OCRBench v2

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu1, Zhebin Kuang1, Jiajun Song1, Mingxin Huang2, Biao Yang1, Yuzhe Li1, Linghao Zhu1, Qidi Luo1, Xinyu Wang3, Hao Lu1, Zhang Li1, Guozhi Tang4, Bin Shan4, Chunhui Lin4, Qi Liu4, Binghong Wu4, Hao Feng4, Hao Liu4, Can Huang4, Jingqun Tang4, Wei Chen1, Lianwen Jin2, Yuliang Liu1, Xiang Bai1

1Huazhong University of Science and Technology, 2South China University of Technology, 3University of Adelaide, 4ByteDance

[Figures: benchmark overview and example tasks]

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual, text-centric benchmark with currently the most comprehensive set of tasks (4X more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, comprising 10,000 human-verified question-answering pairs with a high proportion of difficult samples. Moreover, we construct a private test set of 1,500 manually annotated images. The consistent evaluation trends observed across the public and private test sets validate the reliability of OCRBench v2. After carefully benchmarking state-of-the-art LMMs, we find that most score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning.
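
Several of the tasks reported below (e.g., Referring and Spotting) require localizing text with bounding boxes. As an illustration only, the sketch below shows a generic axis-aligned intersection-over-union (IoU) computation, a standard way to compare a predicted box with a ground-truth box; the official OCRBench v2 scoring scripts are the authoritative metric and may differ in detail.

```python
# Generic axis-aligned IoU between two boxes given as (x1, y1, x2, y2).
# Illustrative only: the official OCRBench v2 scoring scripts remain the
# reference for localization-style tasks such as Referring and Spotting.

def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    # A prediction that covers most of the ground-truth text region.
    print(round(iou((10, 10, 110, 40), (20, 10, 120, 40)), 3))  # 0.818
```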

Leaderboard

Performance of LMMs on English tasks of public data

| Rank | Method | LLM Size | Recognition | Referring | Spotting | Extraction | Parsing | Calculation | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Llama Nemotron Nano VL 8B🥇 | 8B | 70.2 | 69.1 | 61.8 | 81.4 | 39.2 | 31.9 | 73.1 | 54.7 | 60.2 |
| 2 | InternVL3-14B🥈 | 14B | 67.3 | 36.9 | 11.2 | 89.0 | 38.4 | 38.4 | 79.2 | 60.5 | 52.6 |
| 3 | Gemini-Pro🥉 | - | 61.2 | 39.5 | 13.5 | 79.3 | 39.2 | 47.7 | 75.5 | 59.3 | 51.9 |
| 4 | Qwen2-VL-7B | 7B | 72.1 | 47.9 | 17.5 | 82.5 | 25.5 | 25.4 | 78.4 | 61.5 | 51.4 |
| 5 | InternVL2.5-26B | 26B | 65.6 | 26.1 | 1.6 | 86.9 | 36.2 | 37.4 | 78.3 | 62.9 | 49.4 |
| 6 | InternVL3-8B | 8B | 68.6 | 30.4 | 8.8 | 85.3 | 34.0 | 27.1 | 77.5 | 60.3 | 49.0 |
| 7 | Ovis2-8B | 7B | 73.2 | 24.6 | 0.7 | 62.4 | 44.8 | 40.6 | 72.7 | 62.6 | 47.7 |
| 8 | InternVL2-26B | 26B | 63.4 | 26.1 | 0.0 | 76.8 | 37.8 | 32.3 | 79.4 | 58.9 | 46.8 |
| 9 | Step-1V | - | 67.8 | 31.3 | 7.2 | 73.6 | 37.2 | 27.8 | 69.8 | 58.6 | 46.7 |
| 9 | Qwen2.5-VL-7B | 7B | 68.8 | 25.7 | 1.2 | 80.2 | 30.4 | 38.2 | 73.2 | 56.2 | 46.7 |
| 10 | GPT-4o | - | 61.2 | 26.7 | 0.0 | 77.5 | 36.3 | 43.4 | 71.1 | 55.5 | 46.5 |
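
The Average column appears to be the unweighted mean of the per-category scores (e.g., the eight English scores of Llama Nemotron Nano VL 8B average to 60.2). The snippet below is a minimal sketch of that aggregation; the function and constant names are illustrative and not part of the official evaluation code, and the released averages may be computed from unrounded per-task scores.

```python
# Minimal sketch (assumption: the Average column is the unweighted mean of the
# per-category scores shown in the leaderboard tables; helper names are
# hypothetical and not part of the official OCRBench v2 evaluation code).

ENGLISH_CATEGORIES = [
    "Recognition", "Referring", "Spotting", "Extraction",
    "Parsing", "Calculation", "Understanding", "Reasoning",
]

def average_score(category_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-category scores, rounded to one decimal."""
    return round(sum(category_scores.values()) / len(category_scores), 1)

if __name__ == "__main__":
    # Scores of Llama Nemotron Nano VL 8B from the English public-data table above.
    scores = dict(zip(ENGLISH_CATEGORIES,
                      [70.2, 69.1, 61.8, 81.4, 39.2, 31.9, 73.1, 54.7]))
    print(average_score(scores))  # 60.2, matching the reported Average
```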

Performance of LMMs on Chinese tasks of public data

| Rank | Method | LLM Size | Recognition | Extraction | Parsing | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|
| 1 | InternVL3-14B🥇 | 14B | 66.2 | 64.8 | 33.5 | 63.4 | 50.6 | 55.7 |
| 2 | Qwen2.5-VL-7B🥈 | 8B | 75.3 | 61.4 | 41.8 | 59.3 | 40.4 | 55.6 |
| 3 | InternVL3-8B🥉 | 8B | 68.9 | 62.0 | 31.6 | 57.9 | 47.3 | 53.5 |
| 4 | Ovis2-8B | 7B | 72.2 | 50.8 | 37.7 | 47.9 | 37.4 | 49.2 |
| 5 | InternVL2.5-8B | 8B | 52.8 | 52.8 | 28.6 | 56.4 | 40.5 | 46.2 |
| 6 | Kimi-VL-A3B-16B | 16B | 57.2 | 52.5 | 31.5 | 52.5 | 31.4 | 45.5 |
| 7 | InternVL2.5-26B | 26B | 32.4 | 56.1 | 32.6 | 56.3 | 43.6 | 44.2 |
| 8 | Gemini-Pro | - | 52.5 | 47.3 | 30.9 | 51.5 | 33.4 | 43.1 |
| 9 | Qwen2-VL-7B | 7B | 51.3 | 51.4 | 21.6 | 52.5 | 37.5 | 42.9 |
| 10 | DeepSeek-VL2-Small | 16B | 60.9 | 50.6 | 28.3 | 53.0 | 20.5 | 42.7 |

Performance of LMMs on English tasks of private data

| Rank | Method | LLM Size | Recognition | Referring | Spotting | Extraction | Parsing | Calculation | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini1.5-Pro🥇 | - | 59.1 | 41.2 | 6.6 | 89.5 | 22.4 | 54.7 | 78.8 | 60.3 | 51.6 |
| 2 | GPT-4o🥈 | - | 58.6 | 23.4 | 0.0 | 87.4 | 23.1 | 51.6 | 74.4 | 62.3 | 47.6 |
| 3 | Claude3.5-sonnet🥉 | - | 52.9 | 24.9 | 2.5 | 86.9 | 23.8 | 61.4 | 74.4 | 53.0 | 47.5 |
| 4 | Step-1V | - | 56.7 | 27.4 | 2.6 | 86.3 | 33.3 | 42.6 | 76.6 | 48.7 | 46.8 |
| 4 | InternVL3-14B | 14B | 55.8 | 24.5 | 2.1 | 89.3 | 21.0 | 59.5 | 72.0 | 50.0 | 46.8 |
| 5 | Ovis2-8B | 7B | 54.2 | 20.9 | 0.0 | 83.6 | 24.2 | 54.7 | 74.1 | 57.3 | 46.1 |
| 6 | InternVL3-8B | 8B | 49.7 | 22.3 | 0.2 | 86.8 | 22.4 | 57.0 | 70.7 | 53.0 | 45.3 |
| 7 | GPT-4o-mini | - | 55.3 | 21.8 | 0.0 | 85.4 | 20.6 | 45.2 | 75.5 | 49.0 | 44.1 |
| 8 | SAIL-VL-1.6-8B | 8B | 56.7 | 24.1 | 2.2 | 79.3 | 22.8 | 45.4 | 69.2 | 45.3 | 43.1 |
| 9 | InternVL2.5-26B | 26B | 53.5 | 21.4 | 0.0 | 84.0 | 21.4 | 51.5 | 67.5 | 41.5 | 42.6 |
| 10 | Qwen2-VL-7B | 7B | 47.0 | 42.0 | 1.5 | 90.2 | 13.7 | 36.4 | 71.1 | 36.6 | 42.3 |

Performance of LMMs on Chinese tasks of private data

| Rank | Method | LLM Size | Recognition | Extraction | Parsing | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|
| 1 | Ovis2-8B🥇 | 7B | 61.0 | 67.7 | 43.6 | 82.0 | 25.6 | 56.0 |
| 2 | Gemini1.5-Pro🥈 | - | 71.4 | 63.8 | 30.5 | 82.0 | 29.9 | 55.5 |
| 3 | Kimi-VL-A3B-16B🥉 | 16B | 54.0 | 71.1 | 32.5 | 84.0 | 28.7 | 54.1 |
| 4 | Step-1V | - | 65.2 | 64.9 | 33.1 | 78.0 | 25.5 | 53.4 |
| 5 | InternVL3-14B | 14B | 62.1 | 59.5 | 33.2 | 80.0 | 29.2 | 52.8 |
| 6 | GLM-4v-9B | 9B | 60.6 | 65.2 | 32.4 | 82.0 | 18.2 | 51.7 |
| 7 | Qwen2.5-VL-7B | 8B | 24.4 | 78.9 | 33.1 | 82.0 | 29.0 | 49.5 |
| 8 | InternVL3-8B | 8B | 57.7 | 55.8 | 29.9 | 72.0 | 29.4 | 49.0 |
| 9 | Claude3.5-sonnet | - | 34.2 | 62.5 | 35.2 | 78.0 | 32.2 | 48.4 |
| 10 | DeepSeek-VL2-Small | 16B | 51.6 | 56.3 | 27.8 | 79.6 | 25.3 | 48.1 |

This benchmark includes both public and private data. To reduce potential contamination from internet-scale pretraining corpora, the private test set is not publicly released. We sincerely welcome community contributions: if you have an open-source model on Hugging Face or an accessible API, sharing it with us would greatly help us improve and expand the leaderboard.

BibTeX

@misc{fu2024ocrbenchv2improvedbenchmark,
    title={OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning}, 
    author={Ling Fu and Biao Yang and Zhebin Kuang and Jiajun Song and Yuzhe Li and Linghao Zhu and Qidi Luo and Xinyu Wang and Hao Lu and Mingxin Huang and Zhang Li and Guozhi Tang and Bin Shan and Chunhui Lin and Qi Liu and Binghong Wu and Hao Feng and Hao Liu and Can Huang and Jingqun Tang and Wei Chen and Lianwen Jin and Yuliang Liu and Xiang Bai},
    year={2024},
    eprint={2501.00321},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.00321}, 
}