OCRBench v2

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu1, Zhebin Kuang1, Jiajun Song1, Mingxin Huang2, Biao Yang1, Yuzhe Li1, Linghao Zhu1, Qidi Luo1, Xinyu Wang3, Hao Lu1, Zhang Li1, Guozhi Tang4, Bin Shan4, Chunhui Lin4, Qi Liu4, Binghong Wu4, Hao Feng4, Hao Liu4, Can Huang4, Jingqun Tang4, Wei Chen1, Lianwen Jin2, Yuliang Liu1, Xiang Bai1

1Huazhong University of Science and Technology, 2South China University of Technology, 3University of Adelaide, 4ByteDance

[Figures: benchmark overview and example tasks]

Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual, text-centric benchmark with currently the most comprehensive set of tasks (4X more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, comprising 10,000 human-verified question-answering pairs with a high proportion of difficult samples. Moreover, we construct a private test set of 1,500 manually annotated images. The consistent evaluation trends observed across the public and private test sets validate the reliability of OCRBench v2. After carefully benchmarking state-of-the-art LMMs, we find that most score below 50 (out of 100) and suffer from five types of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning.
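
Several of the tasks reported below (e.g., Referring and Spotting) require localizing text with bounding boxes. As an illustration only, the sketch below shows a generic axis-aligned intersection-over-union (IoU) computation, a standard way to compare a predicted box with a ground-truth box; the official OCRBench v2 scoring scripts are the authoritative metric and may differ in detail.

```python
# Generic axis-aligned IoU between two boxes given as (x1, y1, x2, y2).
# Illustrative only: the official OCRBench v2 scoring scripts remain the
# reference for localization-style tasks such as Referring and Spotting.

def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    # A prediction that covers most of the ground-truth text region.
    print(round(iou((10, 10, 110, 40), (20, 10, 120, 40)), 3))  # 0.818
```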

Leaderboard

Performance of LMMs on English tasks of public data

| Rank | Method | LLM Size | Recognition | Referring | Spotting | Extraction | Parsing | Calculation | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Llama Nemotron Nano VL 8B🥇 | 8B | 70.2 | 69.1 | 61.8 | 81.4 | 39.2 | 31.9 | 73.1 | 54.7 | 60.2 |
| 2 | InternVL3-14B🥈 | 14B | 67.3 | 36.9 | 11.2 | 89.0 | 38.4 | 38.4 | 79.2 | 60.5 | 52.6 |
| 3 | Gemini-Pro🥉 | - | 61.2 | 39.5 | 13.5 | 79.3 | 39.2 | 47.7 | 75.5 | 59.3 | 51.9 |
| 4 | Qwen2-VL-7B | 7B | 72.1 | 47.9 | 17.5 | 82.5 | 25.5 | 25.4 | 78.4 | 61.5 | 51.4 |
| 5 | InternVL2.5-26B | 26B | 65.6 | 26.1 | 1.6 | 86.9 | 36.2 | 37.4 | 78.3 | 62.9 | 49.4 |
| 6 | InternVL3-8B | 8B | 68.6 | 30.4 | 8.8 | 85.3 | 34.0 | 27.1 | 77.5 | 60.3 | 49.0 |
| 7 | Ovis2-8B | 7B | 73.2 | 24.6 | 0.7 | 62.4 | 44.8 | 40.6 | 72.7 | 62.6 | 47.7 |
| 8 | InternVL2-26B | 26B | 63.4 | 26.1 | 0.0 | 76.8 | 37.8 | 32.3 | 79.4 | 58.9 | 46.8 |
| 9 | Step-1V | - | 67.8 | 31.3 | 7.2 | 73.6 | 37.2 | 27.8 | 69.8 | 58.6 | 46.7 |
| 9 | Qwen2.5-VL-7B | 7B | 68.8 | 25.7 | 1.2 | 80.2 | 30.4 | 38.2 | 73.2 | 56.2 | 46.7 |
| 10 | GPT-4o | - | 61.2 | 26.7 | 0.0 | 77.5 | 36.3 | 43.4 | 71.1 | 55.5 | 46.5 |
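
The Average column appears to be the unweighted mean of the per-category scores (e.g., the eight English scores of Llama Nemotron Nano VL 8B average to 60.2). The snippet below is a minimal sketch of that aggregation; the function and constant names are illustrative and not part of the official evaluation code, and the released averages may be computed from unrounded per-task scores.

```python
# Minimal sketch (assumption: the Average column is the unweighted mean of the
# per-category scores shown in the leaderboard tables; helper names are
# hypothetical and not part of the official OCRBench v2 evaluation code).

ENGLISH_CATEGORIES = [
    "Recognition", "Referring", "Spotting", "Extraction",
    "Parsing", "Calculation", "Understanding", "Reasoning",
]

def average_score(category_scores: dict[str, float]) -> float:
    """Unweighted mean of the per-category scores, rounded to one decimal."""
    return round(sum(category_scores.values()) / len(category_scores), 1)

if __name__ == "__main__":
    # Scores of Llama Nemotron Nano VL 8B from the English public-data table above.
    scores = dict(zip(ENGLISH_CATEGORIES,
                      [70.2, 69.1, 61.8, 81.4, 39.2, 31.9, 73.1, 54.7]))
    print(average_score(scores))  # 60.2, matching the reported Average
```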

Performance of LMMs on Chinese tasks of public data

| Rank | Method | LLM Size | Recognition | Extraction | Parsing | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|
| 1 | InternVL3-14B🥇 | 14B | 66.2 | 64.8 | 33.5 | 63.4 | 50.6 | 55.7 |
| 2 | Qwen2.5-VL-7B🥈 | 8B | 75.3 | 61.4 | 41.8 | 59.3 | 40.4 | 55.6 |
| 3 | InternVL3-8B🥉 | 8B | 68.9 | 62.0 | 31.6 | 57.9 | 47.3 | 53.5 |
| 4 | Ovis2-8B | 7B | 72.2 | 50.8 | 37.7 | 47.9 | 37.4 | 49.2 |
| 5 | InternVL2.5-8B | 8B | 52.8 | 52.8 | 28.6 | 56.4 | 40.5 | 46.2 |
| 6 | Kimi-VL-A3B-16B | 16B | 57.2 | 52.5 | 31.5 | 52.5 | 31.4 | 45.5 |
| 7 | InternVL2.5-26B | 26B | 32.4 | 56.1 | 32.6 | 56.3 | 43.6 | 44.2 |
| 8 | Gemini-Pro | - | 52.5 | 47.3 | 30.9 | 51.5 | 33.4 | 43.1 |
| 9 | Qwen2-VL-7B | 7B | 51.3 | 51.4 | 21.6 | 52.5 | 37.5 | 42.9 |
| 10 | DeepSeek-VL2-Small | 16B | 60.9 | 50.6 | 28.3 | 53.0 | 20.5 | 42.7 |

Performance of LMMs on English tasks of private data

| Rank | Method | LLM Size | Recognition | Referring | Spotting | Extraction | Parsing | Calculation | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini1.5-Pro🥇 | - | 59.1 | 41.2 | 6.6 | 89.5 | 22.4 | 54.7 | 78.8 | 60.3 | 51.6 |
| 2 | GPT-4o🥈 | - | 58.6 | 23.4 | 0.0 | 87.4 | 23.1 | 51.6 | 74.4 | 62.3 | 47.6 |
| 3 | Claude3.5-sonnet🥉 | - | 52.9 | 24.9 | 2.5 | 86.9 | 23.8 | 61.4 | 74.4 | 53.0 | 47.5 |
| 4 | Step-1V | - | 56.7 | 27.4 | 2.6 | 86.3 | 33.3 | 42.6 | 76.6 | 48.7 | 46.8 |
| 4 | InternVL3-14B | 14B | 55.8 | 24.5 | 2.1 | 89.3 | 21.0 | 59.5 | 72.0 | 50.0 | 46.8 |
| 5 | Ovis2-8B | 7B | 54.2 | 20.9 | 0.0 | 83.6 | 24.2 | 54.7 | 74.1 | 57.3 | 46.1 |
| 6 | InternVL3-8B | 8B | 49.7 | 22.3 | 0.2 | 86.8 | 22.4 | 57.0 | 70.7 | 53.0 | 45.3 |
| 7 | GPT-4o-mini | - | 55.3 | 21.8 | 0.0 | 85.4 | 20.6 | 45.2 | 75.5 | 49.0 | 44.1 |
| 8 | SAIL-VL-1.6-8B | 8B | 56.7 | 24.1 | 2.2 | 79.3 | 22.8 | 45.4 | 69.2 | 45.3 | 43.1 |
| 9 | InternVL2.5-26B | 26B | 53.5 | 21.4 | 0.0 | 84.0 | 21.4 | 51.5 | 67.5 | 41.5 | 42.6 |
| 10 | Qwen2-VL-7B | 7B | 47.0 | 42.0 | 1.5 | 90.2 | 13.7 | 36.4 | 71.1 | 36.6 | 42.3 |

Performance of LMMs on Chinese tasks of private data

| Rank | Method | LLM Size | Recognition | Extraction | Parsing | Understanding | Reasoning | Average |
|---|---|---|---|---|---|---|---|---|
| 1 | Ovis2-8B🥇 | 7B | 61.0 | 67.7 | 43.6 | 82.0 | 25.6 | 56.0 |
| 2 | Gemini1.5-Pro🥈 | - | 71.4 | 63.8 | 30.5 | 82.0 | 29.9 | 55.5 |
| 3 | Kimi-VL-A3B-16B🥉 | 16B | 54.0 | 71.1 | 32.5 | 84.0 | 28.7 | 54.1 |
| 4 | Step-1V | - | 65.2 | 64.9 | 33.1 | 78.0 | 25.5 | 53.4 |
| 5 | InternVL3-14B | 14B | 62.1 | 59.5 | 33.2 | 80.0 | 29.2 | 52.8 |
| 6 | GLM-4v-9B | 9B | 60.6 | 65.2 | 32.4 | 82.0 | 18.2 | 51.7 |
| 7 | Qwen2.5-VL-7B | 8B | 24.4 | 78.9 | 33.1 | 82.0 | 29.0 | 49.5 |
| 8 | InternVL3-8B | 8B | 57.7 | 55.8 | 29.9 | 72.0 | 29.4 | 49.0 |
| 9 | Claude3.5-sonnet | - | 34.2 | 62.5 | 35.2 | 78.0 | 32.2 | 48.4 |
| 10 | DeepSeek-VL2-Small | 16B | 51.6 | 56.3 | 27.8 | 79.6 | 25.3 | 48.1 |

This benchmark includes both public and private data. To reduce potential contamination from internet-scale pretraining corpora, the private test set is not publicly released. We sincerely welcome community contributions: if you have an open-source model on Hugging Face or an accessible API, sharing it with us would greatly help us improve and expand the leaderboard.

BibTeX

@misc{fu2024ocrbenchv2improvedbenchmark,
    title={OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning}, 
    author={Ling Fu and Biao Yang and Zhebin Kuang and Jiajun Song and Yuzhe Li and Linghao Zhu and Qidi Luo and Xinyu Wang and Hao Lu and Mingxin Huang and Zhang Li and Guozhi Tang and Bin Shan and Chunhui Lin and Qi Liu and Binghong Wu and Hao Feng and Hao Liu and Can Huang and Jingqun Tang and Wei Chen and Lianwen Jin and Yuliang Liu and Xiang Bai},
    year={2024},
    eprint={2501.00321},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2501.00321}, 
}