---
language:
- ko
tags:
- trocr
- image-to-text
license: mit
metrics:
- wer
- cer
---

# TrOCR for Korean Language (PoC)

## Overview

A multilingual TrOCR model that covers Korean has not yet been released, so we trained a Korean model for PoC purposes. Starting from this model, it is recommended to collect more data and either continue the 1st-stage training or perform fine-tuning as the 2nd stage.

## Collecting data

### Text data

We created training data by processing three types of datasets.

- News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
- Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
- Chatbot dataset: https://github.com/songys/Chatbot_data

For efficient data collection, each document was split into sentences with a sentence-splitting library (the Kiwi Python wrapper, kiwipiepy; https://github.com/bab2min/kiwipiepy), which produced 637,401 samples in total.
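
For reference, the splitting step might look roughly like the sketch below. It assumes kiwipiepy's `split_into_sents` API; `corpus.txt` is a hypothetical placeholder for the merged raw text, while `ocr_dataset_poc.txt` is the sentence file fed to the image generator in the next section.

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()

# corpus.txt is a hypothetical file holding the raw text gathered from the
# three source datasets above, one document per line.
with open("corpus.txt", encoding="utf-8") as f_in, \
        open("ocr_dataset_poc.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        # split each document into sentences and write one sentence per line
        for sent in kiwi.split_into_sents(line):
            f_out.write(sent.text.strip() + "\n")
```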

### Image Data

Image data was generated with TextRecognitionDataGenerator (https://github.com/Belval/TextRecognitionDataGenerator), introduced in the TrOCR paper.
Below is the command used to generate the images.
```shell
python3 ./trdg/run.py -i ocr_dataset_poc.txt -w 5 -t {num_cores} -f 64 -l ko -c {num_samples} -na 2 --output_dir {dataset_dir}
```
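
Each generated image is rendered from one sentence in `ocr_dataset_poc.txt`, so a training example is an (image, transcription) pair. A minimal sketch of how such pairs could be wrapped for the model is shown below; the `OCRDataset` class and `max_length` value are illustrative, not taken from the author's training code.

```python
import torch
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):
    """(image path, transcription) pairs -> model inputs. Illustrative sketch only."""

    def __init__(self, samples, processor, tokenizer, max_length=64):
        self.samples = samples          # list of (image_path, text) tuples
        self.processor = processor      # TrOCRProcessor (image side)
        self.tokenizer = tokenizer      # Korean tokenizer (text side)
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, text = self.samples[idx]
        image = Image.open(image_path).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        labels = self.tokenizer(
            text, padding="max_length", max_length=self.max_length, truncation=True
        ).input_ids
        # ignore padding tokens when computing the loss
        labels = [l if l != self.tokenizer.pad_token_id else -100 for l in labels]
        return {
            "pixel_values": pixel_values.squeeze(0),
            "labels": torch.tensor(labels),
        }
```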

## Training

We used heuristic parameters without separate hyperparameter tuning; the sketch after this list shows how they map onto training arguments.
- learning_rate = 4e-5
- epochs = 25
- fp16 = True
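
In Hugging Face Trainer terms, these settings correspond roughly to the sketch below. `model` (the VisionEncoderDecoderModel being trained) and `train_dataset` (e.g., an instance of the `OCRDataset` sketch above) are assumed to be defined already; the batch size and output path are illustrative assumptions, not values from this card.

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    output_dir="./ko-trocr-poc",         # illustrative output path
    learning_rate=4e-5,                  # value from the list above
    num_train_epochs=25,                 # value from the list above
    fp16=True,                           # value from the list above
    per_device_train_batch_size=16,      # assumption; not stated in this card
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,                         # assumed to be defined elsewhere
    args=training_args,
    train_dataset=train_dataset,         # assumed to be defined elsewhere
    data_collator=default_data_collator,
)
trainer.train()
```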

## Usage

### inference.py

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, AutoTokenizer
import requests
from io import BytesIO
from PIL import Image

# image processor from the original TrOCR checkpoint; fine-tuned weights and
# Korean tokenizer from this repository
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")

# download a sample image containing Korean text
url = "https://raw.githubusercontent.com/aws-samples/aws-ai-ml-workshop-kr/master/sagemaker/sm-kornlp/trocr/sample_imgs/news_1.jpg"
response = requests.get(url)
img = Image.open(BytesIO(response.content))

# encode the image, generate token ids, and decode them into text
pixel_values = processor(img, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=64)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
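
Since the card lists WER and CER as metrics, predictions can be scored against reference transcriptions with the Hugging Face `evaluate` package (which relies on `jiwer`); the reference string below is a placeholder, not an actual ground-truth label.

```python
import evaluate

# hypothetical check: compare the prediction above with a reference transcription
predictions = [generated_text]
references = ["<ground-truth text of news_1.jpg>"]  # placeholder

cer = evaluate.load("cer")   # character error rate (requires `pip install jiwer`)
wer = evaluate.load("wer")   # word error rate

print("CER:", cer.compute(predictions=predictions, references=references))
print("WER:", wer.compute(predictions=predictions, references=references))
```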

All the code required for data collection and model training has been published on the author's GitHub.
- https://github.com/daekeun-ml/sm-kornlp-usecases/tree/main/trocr