---
tags:
- trocr
- image-to-text
- pytorch
widget:
- src: "example_1.png"
  example_title: Example 1
- src: "example_2.jpeg"
  example_title: Example 2
- src: "example_3.jpeg"
  example_title: Example 3
license: mit
language:
- es
- en
metrics:
- cer
- wer
library_name: transformers
pipeline_tag: image-to-text
---

# TrOCR (small-sized model)

We present a new version of the small TrOCR model, fine-tuned on a self-generated Spanish dataset. The TrOCR architecture was initially introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al. and is available in the associated [repository](https://github.com/microsoft/unilm/tree/master/trocr).

This model has been specialized for printed fonts and does not support handwritten recognition.

# Hiring

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Computer Vision, NLP, and Document AI, please send your resume to jobs@qantev.com or apply to one of the open positions on the [Qantev Job Board](https://qantev.welcomekit.co).

# Model presentation

The TrOCR model architecture is based on the Transformer framework, comprising an image Transformer as the encoder and a text Transformer as the decoder. 

<img src="background_encoder_decoder.png" width="800" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

The encoder takes input images, decomposes them into patches, and processes them to obtain visual features. These features are then used by the decoder to generate wordpiece sequences in an autoregressive way, guided by the visual information and previous predictions. 

This design allows TrOCR to leverage pre-trained models for image understanding and language modeling, resulting in state-of-the-art performances.
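For intuition on the encoder input, the patch decomposition is simple arithmetic. Assuming a 384×384 input resolution and 16×16 patches (typical defaults for TrOCR's DeiT-style encoders; verify against the actual model config), the visual sequence length works out as:

```python
# Number of patch tokens the image encoder sees, assuming a 384x384
# input and 16x16 patches (assumed defaults; check the model config).
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size   # 24
num_patches = patches_per_side ** 2           # 576 visual tokens

print(num_patches)  # 576
```

Each of these patch embeddings becomes one position in the encoder's output sequence, which the text decoder attends to at every generation step.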

# Dataset

As there was no publicly available dataset for training and testing Spanish OCR, we opted to create our own. This involved scraping 131,000 random Wikipedia pages and extracting 2,000,000 samples, ranging from single-word instances to 10-word sentences. 

Subsequently, we artificially generated images from these samples, applied various data augmentation techniques, and achieved results such as the following:

<img src="generated_dataset_samples.jpeg" width="1000" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

Note: We observed that generating images on-the-fly during training was faster than reading them from a pre-existing folder.

# Metrics

Regrettably, there is currently no established benchmark dataset for evaluating Optical Character Recognition (OCR) performance in Spanish. To give an indication of the efficacy of our OCR, we report results on the Spanish subset of the [XFUND](https://github.com/doc-analysis/XFUND) dataset for the small, base, and large models, alongside a comparison with EasyOCR.

| Model | CER | WER |
| --- | --- | --- |
| EasyOCR | 0.114 | 0.288 |
| qantev/trocr-small-spanish | 0.1059 | 0.2545 |
| qantev/trocr-base-spanish | 0.0732 | 0.2028 |
| qantev/trocr-large-spanish | 0.0632 | 0.1817 |

Note: the XFUND dataset contains some mislabeled samples, which may affect the results of this benchmark.
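For reference, CER is the character-level edit distance between prediction and ground truth divided by the reference length, and WER is the same computed over words. A minimal pure-Python sketch (libraries such as `jiwer` provide production implementations):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(prediction, reference):
    return edit_distance(prediction, reference) / len(reference)


def wer(prediction, reference):
    return edit_distance(prediction.split(), reference.split()) / len(reference.split())


print(cer("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

Lower is better for both metrics; a CER of 0.0 means the prediction matches the reference exactly.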

# Intended uses & limitations

While this model is freely available for use, note that it has not been trained on handwritten text and therefore may not accurately recognize handwritten content. It also struggles with multi-line and vertical text.

This model is intended to be used in conjunction with a text detection model.
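Concretely, a text detector produces boxes and the recognizer is run on each crop. A minimal sketch of that glue code (the `boxes` and `recognize` pieces are placeholders; any detector, such as EasyOCR's, can supply the boxes):

```python
from PIL import Image


def read_page(image, boxes, recognize):
    """Crop each detected text box and run a recognizer on the crop.

    image:     a PIL.Image of the full page
    boxes:     list of (left, top, right, bottom) tuples from a text detector
    recognize: callable mapping a PIL.Image crop to a string (e.g. TrOCR)
    """
    lines = []
    # Read in rough page order: top-to-bottom, then left-to-right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        crop = image.crop(box)
        lines.append(recognize(crop))
    return lines
```

With the model loaded as shown in the usage section, `recognize` would wrap the processor and `model.generate` call for each crop.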

# How to use

Here is how to use this model in PyTorch:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# load an image from the IAM database (note: this model is intended for printed text)
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('qantev/trocr-small-spanish')
model = VisionEncoderDecoderModel.from_pretrained('qantev/trocr-small-spanish')
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

# Contact us

research [at] qantev [dot] com