---
license: mit
language:
- fr
pipeline_tag: image-to-text
tags:
- trocr
- vision-encoder-decoder
metrics:
- cer
- wer
widget:
- src: >-
    https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/0.jpg
  example_title: Example 1
- src: >-
    https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/1.jpg
  example_title: Example 2
- src: >-
    https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/2.jpg
  example_title: Example 3
---

# TrOCR for French

## Overview

TrOCR has not yet been released for French, so we trained a French model as a proof of concept. Starting from this model, we recommend collecting more data to continue the first-stage training, or fine-tuning it on a downstream task as a second stage.

It is the French counterpart of the [English TrOCR model](https://huggingface.co/microsoft/trocr-base-printed) introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al. and first released in [this repository](https://github.com/microsoft/unilm/tree/master/trocr).

This was made possible by [daekeun-ml](https://huggingface.co/daekeun-ml/ko-trocr-base-nsmc-news-chatbot) and [Niels Rogge](https://github.com/NielsRogge/), whose tutorials and code enabled us to publish this model.

## Collecting data

### Text data

We built a training set of ~723k examples by taking random samples from the following datasets:

- [MultiLegalPile](https://huggingface.co/datasets/joelito/Multi_Legal_Pile) - 90k
- [French book reviews](https://huggingface.co/datasets/Abirate/french_book_reviews) - 20k
- [WikiNeural](https://huggingface.co/datasets/Babelscape/wikineural) - 83k
- [Multilingual CC News](https://huggingface.co/datasets/intfloat/multilingual_cc_news) - 119k
- [Amazon Reviews Multi](https://huggingface.co/datasets/amazon_reviews_multi) - 153k
- [Opus Books](https://huggingface.co/datasets/opus_books) - 70k
- [BerlinText](https://huggingface.co/datasets/biglam/berlin_state_library_ocr) - 38k

We sampled part of each dataset and then cut the sentences at random positions to build the final training set.
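The random cutting step can be sketched as follows. This is an illustrative snippet of ours, not the authors' script; the word-count bounds are assumptions:

```python
import random

def random_cut(sentence: str, min_words: int = 1, max_words: int = 5) -> str:
    """Return a contiguous run of words taken at a random position,
    mimicking the short printed lines used for OCR training."""
    words = sentence.split()
    n = random.randint(min_words, min(max_words, len(words)))
    start = random.randint(0, len(words) - n)
    return " ".join(words[start:start + n])

random.seed(0)
print(random_cut("Le chat dort paisiblement sur le canapé du salon"))
```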

### Image data

Image data was generated with [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator), the synthetic-data tool referenced in the TrOCR paper. Below is the command used to generate the images:

```shell
python3 ./trdg/run.py -i ocr_dataset_poc.txt -w 5 -t {num_cores} -f 64 -l fr -c {num_samples} -na 2 --output_dir {dataset_dir}
```
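Here `ocr_dataset_poc.txt` is a plain-text file with one training sentence per line. A minimal sketch of producing it (the sample lines are ours; the real file holds the ~723k cut sentences):

```python
from pathlib import Path

# Hypothetical sample lines standing in for the real training sentences.
lines = ["Le chat dort", "Bonjour tout le monde", "Avis sur un livre"]
Path("ocr_dataset_poc.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```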

## Training

### Base model

The encoder was initialized from `facebook/deit-base-distilled-patch16-384` and the decoder from `camembert-base`. This is easier than starting from the `microsoft/trocr-base-stage1` weights.

### Parameters

We used heuristic hyperparameters, without a separate tuning run:

- learning_rate = 4e-5
- epochs = 25
- fp16 = True
- max_length = 32

### Results on the dev set

On the dev set we obtained the following results:

- size of the dev set: 72k examples
- CER: 0.13
- WER: 0.26
- validation loss: 0.424
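CER and WER are edit-distance rates over characters and words respectively. A minimal reference implementation (ours, not the evaluation code used in training) for readers who want to reproduce the metrics:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("bonjour le monde", "bonj0ur le mond"))  # 2 character edits / 16 chars
print(wer("bonjour le monde", "bonj0ur le mond"))  # 2 word errors / 3 words
```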

## Usage

### inference.py

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, AutoTokenizer
import requests
from io import BytesIO
from PIL import Image

# The image processor comes from the English TrOCR checkpoint;
# the model and tokenizer are the French fine-tuned ones.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("agomberto/trocr-base-printed-fr")
tokenizer = AutoTokenizer.from_pretrained("agomberto/trocr-base-printed-fr")

# Download a sample image
url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
response = requests.get(url)
img = Image.open(BytesIO(response.content)).convert("RGB")

# Preprocess, generate, and decode
pixel_values = processor(img, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=32)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

All the code required for data collection and model training is available on the author's GitHub:

- https://github.com/agombert/trocr-base-printed-fr/