Kosmos-2.5

Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in Markdown format. This unified multimodal literate capability is achieved through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted to any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

[Figure: (a) input image; (b) output using the ocr prompt; (c) output using the markdown prompt]

More model outputs can be found in CASES.md.

News

  • Aug 2024: 🔥 We have released Kosmos-2.5-CHAT, a model capable of handling Visual Question Answering (VQA) tasks. For more details, please refer to the model card and paper.
  • Aug 2024: 🔥 Kosmos-2.5 will soon be integrated into Hugging Face. Until the official integration, you can use this temporary repo; please refer to this link for more information (a usage sketch follows this list).
  • May 2024: We've open-sourced the checkpoint and inference code of Kosmos-2.5. This checkpoint has been trained for more steps than the one reported in the paper.
  • Sep 2023: We released the paper Kosmos-2.5: A Multimodal Literate Model. Check out the paper.
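
For convenience, here is a rough sketch of what usage through the temporary Hugging Face integration looks like. The class name Kosmos2_5ForConditionalGeneration, the <ocr>/<md> prompts, and the flattened_patches key follow the linked model card, but details may change in the official integration; treat this as an assumption, not a reference implementation.

import torch
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration  # assumed class name

repo = "microsoft/kosmos-2.5"
device = "cuda:0"
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map=device)
processor = AutoProcessor.from_pretrained(repo)

image = Image.open("example.png")  # hypothetical input image
prompt = "<ocr>"                   # "<md>" for markdown generation
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs.pop("height", None)  # size metadata returned by the processor,
inputs.pop("width", None)   # not model inputs
inputs = {k: v.to(device) if hasattr(v, "to") else v for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(torch.bfloat16)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])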

Checkpoints

The checkpoint can be downloaded via:

wget -O ckpt.pt https://huggingface.co/microsoft/kosmos-2.5/resolve/main/ckpt.pt?download=true
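
Alternatively, the same file can be fetched with huggingface_hub (a small sketch; this path is our suggestion, not documented by the repo):

from huggingface_hub import hf_hub_download

# Downloads to the local HF cache and returns the path; pass it as --ckpt.
ckpt_path = hf_hub_download(repo_id="microsoft/kosmos-2.5", filename="ckpt.pt")
print(ckpt_path)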

Results

Text Recognition

| Dataset     | F1   | IOU  | NED  |
|-------------|------|------|------|
| Handwritten | 71.6 | 94.1 | 90.6 |
| Design      | 61.7 | 80.2 | 79.6 |
| Receipt     | 89.4 | 80.1 | 83.3 |
| General     | 97.6 | 89.8 | 93.9 |
| Academic    | 98.8 | 93.3 | 99.1 |
| Web Image   | 57.0 | 72.1 | 69.6 |
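
For reference, if NED in these tables is the usual normalized edit-distance score, i.e. 1 minus the Levenshtein distance divided by the longer string's length, reported in percent (our assumption; the paper defines the exact metric), a minimal sketch is:

# Normalized edit-distance score, assuming NED = 100 * (1 - dist / max_len).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned_score(pred: str, ref: str) -> float:
    if not pred and not ref:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, ref) / max(len(pred), len(ref)))

print(ned_score("Kosmos-2.5", "Kosmos 2.5"))  # 90.0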

Image to Markdown

| Dataset       | NED  | NTED |
|---------------|------|------|
| Docx          | 91.6 | 82.1 |
| README        | 95.1 | 91.2 |
| Arxiv         | 90.8 | 86.4 |
| Tables        | 85.1 | 90.1 |
| Math Equation | 88.1 | 95.2 |
| CROHME Math   | 98.5 | 99.7 |

Document Reading

| Model           | DocVQA | InfoVQA | DeepForm | KLC  | WTQ  | TabFact | ChartQA | TextVQA | VisualMRC |
|-----------------|--------|---------|----------|------|------|---------|---------|---------|-----------|
| Kosmos-2.5-CHAT | 81.1   | 41.3    | 65.8     | 35.1 | 32.4 | 49.9    | 62.3    | 40.7    | 156.0     |

Installation

The code uses FlashAttention-2, so it runs only on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).

git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2.5
pip install -r requirements.txt
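
Before installing, you can confirm that your GPU meets the FlashAttention-2 requirement (compute capability 8.0 or higher, i.e. Ampere or newer) with a quick PyTorch check:

import torch

# FlashAttention-2 needs compute capability >= 8.0 (Ampere/Ada/Hopper).
assert torch.cuda.is_available(), "No CUDA device found"
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
if major < 8:
    print("This GPU predates Ampere; FlashAttention-2 will not run here.")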

Inference

python inference.py \
  --do_ocr \
  --image path/to/image \
  --ckpt path/to/checkpoint

Use --do_md instead of --do_ocr for the image-to-markdown task.

For images with extreme aspect ratios, we recommend resizing them to a more typical aspect ratio for better performance:

python inference.py \
  --do_ocr \
  --image path/to/image \
  --ckpt path/to/checkpoint \
  --use_preprocess \
  --hw_ratio_adj_upper_span "[1.5, 5]" \
  --hw_ratio_adj_lower_span "[0.5, 1.0]"

(As above, swap --do_ocr for --do_md for the image-to-markdown task.)

Please adjust the parameters based on your use cases. For example,

  • --hw_ratio_adj_upper_span "[1.5, 5]" indicates that if the image's aspect ratio is between 1.5 and 5, the image will be resized to an aspect ratio of 1.5.
  • --hw_ratio_adj_lower_span "[0.5, 1.0]" indicates that if the image's aspect ratio is between 0.5 and 1.0, the image will be resized to an aspect ratio of 1.0.
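
For intuition, the adjustment amounts to clamping the image's height/width ratio into the configured spans. The helper below is an illustrative sketch, not the repository's actual preprocessing code, and it assumes "aspect ratio" means height divided by width (as the hw_ratio flag names suggest):

from PIL import Image

def adjust_hw_ratio(img, upper_span=(1.5, 5.0), lower_span=(0.5, 1.0)):
    # Illustrative only: squash very tall images down to ratio 1.5 and
    # stretch wide-ish ones up to ratio 1.0, mirroring the flags above.
    w, h = img.size
    ratio = h / w
    if upper_span[0] <= ratio <= upper_span[1]:
        target = upper_span[0]
    elif lower_span[0] <= ratio <= lower_span[1]:
        target = lower_span[1]
    else:
        return img  # ratio is already acceptable
    return img.resize((w, int(w * target)))

img = adjust_hw_ratio(Image.open("page.png"))  # hypothetical input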

NOTE:

Since this is a generative model, there is a risk of hallucination during generation, and the accuracy of the OCR/Markdown output cannot be guaranteed.

Citation

If you find this repository useful, please consider citing our work:

@article{lv2023kosmos,
  title={Kosmos-2.5: A multimodal literate model},
  author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
  journal={arXiv preprint arXiv:2309.11419},
  year={2023}
}

License

The content of this project itself is licensed under the MIT License.

Microsoft Open Source Code of Conduct

Contact

For help or issues using Kosmos-2.5, please submit a GitHub issue.

For other communications related to Kosmos-2.5, please contact Lei Cui or Furu Wei.