---
tags:
- trocr
- image-to-text
- pytorch
widget:
- src: "example_1.png"
  example_title: Example 1
- src: "example_2.jpeg"
  example_title: Example 2
- src: "example_3.jpeg"
  example_title: Example 3
license: mit
language:
- es
- en
metrics:
- cer
- wer
library_name: transformers
pipeline_tag: image-to-text
#inference: false
---

# TrOCR (small-sized model)

We present a new version of the small TrOCR model, fine-tuned on a self-generated Spanish dataset. The TrOCR architecture was introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al. and is available in the associated [repository](https://github.com/microsoft/unilm/tree/master/trocr).

This model is specialized for printed fonts and does not support handwritten text recognition.

# Model presentation

The TrOCR model architecture is based on the Transformer framework, comprising an image Transformer as the encoder and a text Transformer as the decoder.

<img src="background_encoder_decoder.png" width="800" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

The encoder takes an input image, decomposes it into patches, and processes them to obtain visual features. The decoder then generates wordpiece sequences autoregressively, guided by the visual features and its previous predictions.

This design allows TrOCR to leverage pre-trained models for image understanding and language modeling, resulting in state-of-the-art performance.
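
As a rough illustration of this encoder/decoder split, the minimal sketch below (assuming the `transformers` and `Pillow` packages and this repository's checkpoint; the blank image is only a placeholder) shows how the two halves are exposed on the loaded `VisionEncoderDecoderModel`:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("qantev/trocr-small-spanish")
model = VisionEncoderDecoderModel.from_pretrained("qantev/trocr-small-spanish")

# Placeholder image; in practice this would be a cropped text line.
image = Image.new("RGB", (384, 96), "white")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Encoder: image patches -> sequence of visual hidden states.
encoder_outputs = model.encoder(pixel_values=pixel_values)
print(encoder_outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)

# Decoder: wordpiece ids are generated autoregressively, conditioned on the visual features.
generated_ids = model.generate(pixel_values, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```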

# Dataset

As there was no publicly available dataset for training and testing Spanish OCR, we created our own. We scraped 131,000 random Wikipedia pages and extracted 2,000,000 text samples, ranging from single words to 10-word sentences.

We then rendered these samples into synthetic images and applied various data augmentation techniques, producing results such as the following:

<img src="generated_dataset_samples.jpeg" width="1000" style="display: block; margin: auto; max-width: 100%; max-height: 100%;">

Note: we found that generating the images on the fly during training was faster than reading pre-generated images from disk.
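
For reference, here is a minimal sketch of what on-the-fly generation can look like with a PyTorch `Dataset` (assuming PyTorch, Pillow and the TrOCR processor; the class name, canvas size, and blur augmentation are illustrative and not our exact training pipeline):

```python
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont
from torch.utils.data import Dataset

class OnTheFlyTextImages(Dataset):
    """Renders (text -> image) pairs at __getitem__ time instead of reading files from disk."""

    def __init__(self, samples, processor):
        self.samples = samples          # list of Spanish text snippets (1 to 10 words)
        self.processor = processor      # TrOCRProcessor: image -> pixel_values, text -> label ids
        self.font = ImageFont.load_default()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        text = self.samples[idx]
        # Render the text on a white canvas.
        image = Image.new("RGB", (384, 64), "white")
        ImageDraw.Draw(image).text((5, 20), text, fill="black", font=self.font)
        # Very light augmentation; the real pipeline used a richer set of transforms.
        if random.random() < 0.5:
            image = image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values.squeeze(0)
        labels = self.processor.tokenizer(text, return_tensors="pt").input_ids.squeeze(0)
        return {"pixel_values": pixel_values, "labels": labels}
```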

# Metrics

There is currently no established benchmark dataset for evaluating Optical Character Recognition (OCR) in Spanish. To give an indication of our models' performance, we benchmark the small, base, and large models on the Spanish portion of the [XFUND](https://github.com/doc-analysis/XFUND) dataset and include a comparison with EasyOCR.

| Model | CER | WER |
| --- | --- | --- |
| EasyOCR | 0.114 | 0.288 |
| qantev/trocr-small-spanish | 0.1059 | 0.2545 |
| qantev/trocr-base-spanish | 0.0732 | 0.2028 |
| qantev/trocr-large-spanish | 0.0632 | 0.1817 |

Note: the XFUND dataset contains some mislabeled data, which may affect the results of this benchmark.
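
Character Error Rate (CER) and Word Error Rate (WER) scores like the ones above can be computed with the `jiwer` package; the sketch below is a minimal example, and the reference/prediction strings are made up rather than taken from XFUND:

```python
import jiwer

# Ground-truth transcriptions and the corresponding model outputs.
references  = ["Factura emitida el 12 de mayo", "Dirección del asegurado"]
predictions = ["Factura emitida el 12 de maya", "Direccion del asegurado"]

# CER counts character-level edits; WER counts word-level edits.
print("CER:", jiwer.cer(references, predictions))
print("WER:", jiwer.wer(references, predictions))
```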

# Intended uses & limitations

This model is freely available for use, but it has not been trained on handwritten text and therefore may not recognize handwritten content accurately. Its accuracy also degrades on multi-line and vertical text.

It has to be used in conjunction with a text detection model that first localizes and crops the text regions, as in the sketch below.
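
Here is a minimal sketch of that two-stage setup; the detector itself is left abstract, and `recognize_lines` plus the example box coordinates are hypothetical names and values used only for illustration:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("qantev/trocr-small-spanish")
model = VisionEncoderDecoderModel.from_pretrained("qantev/trocr-small-spanish")

def recognize_lines(page: Image.Image, boxes):
    """Crop each detected text box from the page and run TrOCR on the crop.

    `boxes` is whatever your text detector returns, as (left, top, right, bottom) pixel tuples.
    """
    lines = []
    for box in boxes:
        crop = page.crop(box).convert("RGB")
        pixel_values = processor(images=crop, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        lines.append(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
    return lines

# Example usage with any text detector (coordinates here are made up):
# page = Image.open("document.png")
# print(recognize_lines(page, [(10, 10, 420, 60), (10, 70, 500, 120)]))
```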

# How to use

Here is how to use this model in PyTorch:

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# Load an example image from the IAM database (note: this model is meant for printed text).
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('qantev/trocr-small-spanish')
model = VisionEncoderDecoderModel.from_pretrained('qantev/trocr-small-spanish')
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

# Contact us

dl_engineers@qantev.com