---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---
# Model Card for Lougat

<!-- Provide a quick summary of what the model is/does. -->

This project aims to build a document scanner that converts images of papers into machine-readable formats (e.g., Markdown, JSON). It is the son of Nougat and, thus, the grandson of Donut.

The key idea is to combine the bounding-box modality with text, giving a pixel-scan behavior in which the model predicts not only the next token but also its position on the page.

![Example Image](https://raw.githubusercontent.com/veya2ztn/Lougat/main/images/image.png)
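As a rough illustration of the joint prediction, below is a minimal, hypothetical sketch of a decoder output head that emits both next-token logits and a normalised bounding box at each decoding step. The class name, layer sizes, and box parameterisation are assumptions for illustration, not the actual Lougat implementation.

```python
# Minimal sketch (NOT the actual Lougat architecture): a decoder output head
# that jointly predicts the next text token and that token's bounding box.
import torch
import torch.nn as nn

class TokenAndPositionHead(nn.Module):
    """Hypothetical head: one branch for next-token logits, one branch
    regressing the next token's box (x1, y1, x2, y2) in [0, 1] page coords."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.token_head = nn.Linear(hidden_size, vocab_size)
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 4),
            nn.Sigmoid(),  # keep coordinates normalised to [0, 1]
        )

    def forward(self, hidden_state: torch.Tensor):
        # hidden_state: (batch, hidden_size), the decoder state at this step
        next_token_logits = self.token_head(hidden_state)
        next_bbox = self.bbox_head(hidden_state)
        return next_token_logits, next_bbox

# Toy usage: one greedy decoding step on a random decoder state.
head = TokenAndPositionHead(hidden_size=1024, vocab_size=50000)
state = torch.randn(1, 1024)
logits, bbox = head(state)
next_token = logits.argmax(dim=-1)
print(next_token.shape, bbox.shape)  # torch.Size([1]), torch.Size([1, 4])
```

Regressing sigmoid-normalised corner coordinates is only one possible parameterisation; the LOCR paper describes its own location-prompt mechanism.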

The name "Lougat" is a combination of LLama and Nougat. The key idea is nature continues of this paper [LOCR: Location-Guided Transformer for Optical Character Recognition]([[2403.02127\] LOCR: Location-Guided Transformer for Optical Character Recognition (arxiv.org)](https://arxiv.org/abs/2403.02127))

Current Branch: The **LOCR** model

Other branches:
- Florence2 + LLama → Flougat
- Sam2 + LLama → Slougat
- Nougat + Relative Position Embedding LLama → Rlougat


# Inference and Training

Please see the project repository: [https://github.com/veya2ztn/Lougat](https://github.com/veya2ztn/Lougat)
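For orientation only, here is a minimal sketch of the generic Nougat-family inference loop in Hugging Face `transformers`, using the upstream `facebook/nougat-base` checkpoint with its `NougatProcessor` and `VisionEncoderDecoderModel` classes. This is not Lougat's own API; Lougat's loading and inference code, which additionally produces bounding boxes, lives in the repository above.

```python
# Sketch of a Nougat-family inference loop with Hugging Face transformers.
# Uses the upstream Nougat checkpoint purely as an illustration.
import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")
model.eval()

# Load a single page image of a paper (path is a placeholder).
page = Image.open("page.png").convert("RGB")
pixel_values = processor(images=page, return_tensors="pt").pixel_values

with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=1024)

markdown = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
markdown = processor.post_process_generation(markdown, fix_markdown=True)
print(markdown)
```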