This is the repository for the CC-OCR benchmark.
It contains the dataset and evaluation code for the paper "CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy".
🚀 GitHub | 🤗 Hugging Face | 🤖 ModelScope | 📑 Paper | 📗 Blog
- 2024-12-27 🚀 CC-OCR is in the PR stage of VLMEvalKit. Please refer to the document (zh & en) for usage.
- 2024-12-26 🔥 We release CC-OCR, including both data and evaluation script!
Model | Multi-Scene Text Reading | Multilingual Text Reading | Document Parsing | Visual Information Extraction | Total |
---|---|---|---|---|---|
Gemini-1.5-pro | 83.25 | 78.97 | 62.37 | 67.28 | 72.97 |
Qwen-VL-72B | 77.95 | 71.14 | 53.78 | 71.76 | 68.66 |
GPT-4o | 76.40 | 73.44 | 53.30 | 63.45 | 66.65 |
Claude3.5-sonnet | 72.87 | 65.68 | 47.79 | 64.58 | 62.73 |
InternVL2-76B | 76.92 | 46.57 | 35.33 | 61.60 | 55.11 |
GOT | 61.00 | 24.95 | 39.18 | 0.00 | 31.28 |
Florence | 49.24 | 49.70 | 0.00 | 0.00 | 24.74 |
KOSMOS2.5 | 47.55 | 36.23 | 0.00 | 0.00 | 20.95 |
TextMonkey | 56.88 | 0.00 | 0.00 | 0.00 | 14.22 |
- The API versions are GPT-4o-2024-08-06, Gemini-1.5-Pro-002, Claude-3.5-Sonnet-20241022, and Qwen-VL-Max-2024-08-09;
- We conducted all tests around November 20th, 2024. Please refer to our paper for more information.
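The Total column in the table above is numerically consistent with an unweighted mean of the four track scores, rounded half-up to two decimals. This is an observation from the published numbers, not a formula stated in this README, so treat the sketch below as an assumption. The helper name `total_score` is hypothetical and is not part of the evaluation code.

```python
from decimal import Decimal, ROUND_HALF_UP


def total_score(track_scores):
    """Average the per-track scores and round half-up to two decimals.

    Scores are passed as strings so that Decimal keeps them exact;
    binary floats would make half-way cases such as 72.9675 unreliable.
    """
    values = [Decimal(score) for score in track_scores]
    mean = sum(values) / Decimal(len(values))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Rows copied from the leaderboard table: four track scores per model.
rows = {
    "Gemini-1.5-pro": ["83.25", "78.97", "62.37", "67.28"],
    "GPT-4o": ["76.40", "73.44", "53.30", "63.45"],
    "TextMonkey": ["56.88", "0.00", "0.00", "0.00"],
}

for model, scores in rows.items():
    print(model, total_score(scores))
```

Under this assumption, a model that does not support a track (score 0.00) is still averaged over all four tracks, which is why models such as TextMonkey have a low total despite a reasonable score on one track.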
The CC-OCR benchmark is specifically designed for evaluating the OCR-centric capabilities of Large Multimodal Models. It covers a diverse range of scenarios, tasks, and challenges, and comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time.
The main features of our CC-OCR include:
- We focus on four OCR-centric tasks, namely Multi-Scene Text Reading, Multilingual Text Reading, Document Parsing, and Visual Information Extraction;
- CC-OCR covers fine-grained visual challenges (i.e., orientation-sensitivity, natural noise, and artistic text), decoding of various expressions, and structured inputs and outputs;
For a detailed introduction to the CC-OCR dataset, see the documents (zh & en) and our paper.
We publish the full data of CC-OCR, including images and annotation files. You can obtain the full data by following the instructions.
We officially recommend VLMEvalKit for evaluation. Please refer to the document (zh & en) for more information.
Evaluation within this repository is also supported, and we recommend that users first read the documentation (zh & en). For a quick start, please fill in the items marked "TODO" in our example.
Example evaluation scripts:
```bash
MODEL_NAME="qwen_vl_max"
OUTPUT_DIR="/your/path/to/output_dir"

# get your key from: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
export DASHBOARD_API_KEY="dashscope_api_key"

# multi_scene_ocr
SUB_OUTPUT_DIR=${OUTPUT_DIR}/multi_scene_ocr
python example.py ${MODEL_NAME} index/multi_scene_ocr.json ${SUB_OUTPUT_DIR}
python evaluation/main.py index/multi_scene_ocr.json ${SUB_OUTPUT_DIR}

# multi_lan_ocr
SUB_OUTPUT_DIR=${OUTPUT_DIR}/multi_lan_ocr
python example.py ${MODEL_NAME} index/multi_lan_ocr.json ${SUB_OUTPUT_DIR}
python evaluation/main.py index/multi_lan_ocr.json ${SUB_OUTPUT_DIR}

# doc_parsing
SUB_OUTPUT_DIR=${OUTPUT_DIR}/doc_parsing
python example.py ${MODEL_NAME} index/doc_parsing.json ${SUB_OUTPUT_DIR}
python evaluation/main.py index/doc_parsing.json ${SUB_OUTPUT_DIR}

# kie
SUB_OUTPUT_DIR=${OUTPUT_DIR}/kie
python example.py ${MODEL_NAME} index/kie.json ${SUB_OUTPUT_DIR}
python evaluation/main.py index/kie.json ${SUB_OUTPUT_DIR}
```
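The four stanzas in the script above differ only in the track name, so they can also be driven from a small loop. The following is a minimal sketch, assuming `example.py` and `evaluation/main.py` accept exactly the positional arguments shown above; `build_commands` and `run_all` are hypothetical helper names and are not part of this repository.

```python
import subprocess
import sys

# Track names taken from the index files used in the shell script.
TRACKS = ["multi_scene_ocr", "multi_lan_ocr", "doc_parsing", "kie"]


def build_commands(model_name, output_dir, tracks=TRACKS):
    """Return one (inference_cmd, evaluation_cmd) argv pair per track."""
    commands = []
    for track in tracks:
        index_path = f"index/{track}.json"
        sub_output_dir = f"{output_dir}/{track}"
        infer = [sys.executable, "example.py", model_name, index_path, sub_output_dir]
        evaluate = [sys.executable, "evaluation/main.py", index_path, sub_output_dir]
        commands.append((infer, evaluate))
    return commands


def run_all(model_name, output_dir):
    """Run inference, then evaluation, per track; stop on the first failure."""
    for infer, evaluate in build_commands(model_name, output_dir):
        subprocess.run(infer, check=True)
        subprocess.run(evaluate, check=True)


# Dry run: print the commands instead of executing them.
for infer, evaluate in build_commands("qwen_vl_max", "/your/path/to/output_dir"):
    print(" ".join(infer))
    print(" ".join(evaluate))
```

To actually run the evaluation, call `run_all(...)` from the repository root after exporting the API key as in the shell script. Passing argument lists rather than a shell string avoids quoting problems when the output path contains spaces.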
For common Q&A, please refer here. If you have any questions, feel free to open an issue for discussion.
If you find our work helpful, please consider citing our paper.
@misc{yang2024ccocr,
title={CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy},
author={Zhibo Yang and Jun Tang and Zhaohai Li and Pengfei Wang and Jianqiang Wan and Humen Zhong and Xuejing Liu and Mingkun Yang and Peng Wang and Shuai Bai and LianWen Jin and Junyang Lin},
year={2024},
eprint={2412.02210},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.02210},
}
The source code is licensed under the MIT License, which can be found in the root directory.
If you have any questions, feel free to send an email to: wpf272043@alibaba-inc.com or xixing.tj@alibaba-inc.com