vlaurent17
committed on
Create README.md
README.md
ADDED
@@ -0,0 +1,93 @@
---
tags:
- trocr
- image-to-text
- pytorch
widget:
- src: "example_1.png"
  example_title: Example 1
- src: "example_2.jpeg"
  example_title: Example 2
- src: "example_3.jpeg"
  example_title: Example 3
license: mit
language:
- es
- en
metrics:
- cer
- wer
library_name: transformers
pipeline_tag: image-to-text
#inference: false
---

# TrOCR (small sized model)

We present a new version of the small TrOCR model, fine-tuned on a self-generated Spanish dataset. The TrOCR architecture was introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al. and is available in the associated [repository](https://github.com/microsoft/unilm/tree/master/trocr).

This model is specialized for printed text and does not support handwriting recognition.

# Model presentation

The TrOCR architecture is based on the Transformer, with an image Transformer as the encoder and a text Transformer as the decoder.

<img src="background_encoder_decoder.png" width="800" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

The encoder splits the input image into patches and processes them to obtain visual features. The decoder then uses these features to generate wordpiece sequences autoregressively, conditioned on the visual information and its own previous predictions.

This design allows TrOCR to leverage pre-trained models for both image understanding and language modeling, an approach that the TrOCR paper reports as achieving state-of-the-art performance.
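
To make the patch decomposition concrete, the sketch below counts how many patches a ViT-style encoder would produce for a given input size. The 384x384 resolution and 16x16 patch size are assumptions taken from the setup described in the TrOCR paper, not values confirmed for this checkpoint, and `count_patches` is a hypothetical helper.

```python
def count_patches(height: int, width: int, patch_size: int) -> int:
    """Return the number of non-overlapping square patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Assumed values (TrOCR paper setup, not verified for this checkpoint).
num_patches = count_patches(384, 384, 16)
print(num_patches)  # 576
```

Each patch becomes one element of the sequence that the encoder processes, so the sequence length grows with the square of the image side.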

# Dataset

As there was no publicly available dataset for training and testing Spanish OCR, we created our own. We scraped 131,000 random Wikipedia pages and extracted 2,000,000 samples, ranging from single words to 10-word sentences.
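
The exact extraction procedure is not described here, so the following is only a minimal sketch of one way to draw samples of 1 to 10 words from page text. The function name `sample_word_spans`, its parameters, and the example sentence are all hypothetical.

```python
import random


def sample_word_spans(text, n_samples, max_words=10, seed=0):
    """Draw random contiguous spans of 1..max_words words from text."""
    words = text.split()
    if not words:
        return []
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for _ in range(n_samples):
        length = rng.randint(1, min(max_words, len(words)))
        start = rng.randint(0, len(words) - length)
        samples.append(" ".join(words[start:start + length]))
    return samples


# Hypothetical page text standing in for a scraped Wikipedia article.
page = "El ingenioso hidalgo don Quijote de la Mancha es una novela escrita por Miguel de Cervantes"
for sample in sample_word_spans(page, n_samples=3):
    print(sample)
```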

We then rendered these samples as images and applied various data augmentation techniques, producing images such as the following:

<img src="generated_dataset_samples.jpeg" width="1000" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

Note: We observed that generating images on the fly during training was faster than reading them from a pre-existing folder.
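
As an illustration of on-the-fly generation, the sketch below renders each text sample with Pillow only when the training loop asks for it. It is not the authors' generator: the helpers `render_text_line` and `generate_samples`, the canvas sizes, and the use of Pillow's default font are assumptions, and the data augmentation step is omitted.

```python
import random

from PIL import Image, ImageDraw, ImageFont


def render_text_line(text, height=64, char_width=12):
    """Render text as a black-on-white RGB image using Pillow's default font."""
    width = char_width * max(len(text), 1) + 20  # rough canvas width estimate
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    draw.text((10, height // 3), text, fill="black", font=ImageFont.load_default())
    return image


def generate_samples(texts, n, seed=0):
    """Yield (image, text) pairs, rendering each image only when requested."""
    rng = random.Random(seed)
    for _ in range(n):
        text = rng.choice(texts)
        yield render_text_line(text), text


# Hypothetical text samples; nothing is written to or read from disk.
texts = ["hola mundo", "texto impreso"]
for image, text in generate_samples(texts, n=2):
    print(image.size, text)
```

Because the generator is lazy, no image is stored on disk, which is consistent with the observation above about avoiding file reads during training.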

# Metrics

Unfortunately, there is currently no established benchmark dataset designed specifically for evaluating OCR performance in Spanish. To give an indication of our models' accuracy, we report the character error rate (CER) and word error rate (WER) on the Spanish subset of the [XFUND](https://github.com/doc-analysis/XFUND) dataset for the small, base, and large models, alongside EasyOCR for comparison. Lower is better for both metrics.

| Model | CER | WER |
| --- | --- | --- |
| EasyOCR | 0.114 | 0.288 |
| qantev/trocr-small-spanish | 0.1059 | 0.2545 |
| qantev/trocr-base-spanish | 0.0732 | 0.2028 |
| qantev/trocr-large-spanish | 0.0632 | 0.1817 |

Note: The XFUND dataset contains mislabeled data, which may affect the results of this benchmark.
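
For readers unfamiliar with the two metrics, here is a minimal sketch of their standard definitions: the edit distance between prediction and reference, divided by the reference length, measured in characters for CER and in words for WER. The evaluation script behind the table is not shown in this card, so details such as text normalization or how scores are aggregated across samples are not covered; the function names and example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "el gato negro"
hypothesis = "el gato negra"
print(cer(reference, hypothesis))  # 1 substitution over 13 characters
print(wer(reference, hypothesis))  # 1 substitution over 3 words
```

The example shows why WER is usually higher than CER, as in the table: a single wrong character counts as one full word error.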

# Intended uses & limitations

While this model is freely available for use, it has not been trained on handwritten text and therefore may not recognize handwritten content accurately. Its accuracy is also limited on images that contain two lines of text or vertical text.

This model must be used in conjunction with a text detection model.
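
A minimal sketch of that pairing is shown below, assuming a detector that returns axis-aligned boxes as `(left, upper, right, lower)` pixel coordinates. The helper `recognize_regions` is hypothetical, and the page, boxes, and recognizer are stand-ins: in practice the boxes would come from a text detection model of your choice and `recognize` would wrap the processor and `generate` calls from the "How to use" section.

```python
from PIL import Image


def recognize_regions(page, boxes, recognize):
    """Crop each detected box from the page and run the recognizer on it.

    boxes: iterable of (left, upper, right, lower) pixel coordinates.
    recognize: callable taking a PIL image and returning a string.
    """
    results = []
    for box in boxes:
        crop = page.crop(box).convert("RGB")
        results.append((box, recognize(crop)))
    return results


# Stand-ins for illustration only: a blank page, hand-written boxes, and a
# dummy recognizer that reports the crop size instead of running TrOCR.
page = Image.new("RGB", (200, 100), "white")
boxes = [(10, 10, 190, 40), (10, 50, 120, 80)]
results = recognize_regions(page, boxes, lambda crop: f"{crop.width}x{crop.height}")
print(results)
```

Passing one single-line crop at a time matches the limitation noted above regarding images with two lines of text.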

# How to use

Here is how to use this model in PyTorch:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# Load an example line image. This URL points to a handwritten sample from the
# IAM database; the model targets printed text, so use a printed-text image
# for meaningful results.
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model.
processor = TrOCRProcessor.from_pretrained('qantev/trocr-small-spanish')
model = VisionEncoderDecoderModel.from_pretrained('qantev/trocr-small-spanish')

# Convert the image into the pixel tensor expected by the encoder.
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate token ids autoregressively and decode them into text.
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

# Contact us

dl_engineers@qantev.com