Commit b8e0ab4 by daekeun-ml (1 parent: 0df683e)
Update README.md
README.md
CHANGED
@@ -9,21 +9,21 @@ metrics:
 - wer
 - cer
 widget:
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_2.jpg
   example_title: 랜덤 문장 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_6.jpg
   example_title: 랜덤 문장 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_3.jpg
   example_title: 챗봇 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_5.jpg
   example_title: 챗봇 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg
   example_title: 뉴스 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_3.jpg
   example_title: 뉴스 2
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_1.jpg
   example_title: 영화 리뷰 1
-- src: https://raw.githubusercontent.com/
+- src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_2.jpg
   example_title: 영화 리뷰 2
 ---
 
@@ -37,9 +37,11 @@ TrOCR has not yet released a multilingual model including Korean, so we trained
 
 ### Text data
 We created training data by processing three types of datasets.
+
 - News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
 - Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
 - Chatbot dataset: https://github.com/songys/Chatbot_data
+
 For efficient data collection, each sentence was separated by a sentence separator library (Kiwi Python wrapper; https://github.com/bab2min/kiwipiepy), and as a result, 637,401 samples were collected.
 
 ### Image Data
@@ -76,7 +78,7 @@ processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
 model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
 tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
 
-url = "https://raw.githubusercontent.com/aws-samples/
+url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
 response = requests.get(url)
 img = Image.open(BytesIO(response.content))
 
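
The README's "Text data" section says each source text was split into sentences with the Kiwi Python wrapper (kiwipiepy). The sketch below shows roughly what such a step might look like; it is not the author's actual preprocessing code. The helper names `kiwi_split` and `collect_samples` are hypothetical, and the whitespace stripping and de-duplication are assumptions that the README does not mention. The splitter is passed in as a parameter so the collection logic can be tried without kiwipiepy installed.

```python
from functools import lru_cache
from typing import Callable, Iterable, List


@lru_cache(maxsize=1)
def _kiwi():
    """Create the Kiwi analyzer once; imported lazily so kiwipiepy is optional."""
    from kiwipiepy import Kiwi  # assumes `pip install kiwipiepy`
    return Kiwi()


def kiwi_split(text: str) -> List[str]:
    """Split Korean text into sentences using Kiwi's sentence splitter."""
    return [sent.text for sent in _kiwi().split_into_sents(text)]


def collect_samples(
    documents: Iterable[str],
    split_fn: Callable[[str], List[str]] = kiwi_split,
) -> List[str]:
    """Flatten documents into a list of sentence-level samples.

    Assumption (not stated in the README): sentences are stripped of
    surrounding whitespace, empty strings are dropped, and exact
    duplicates are kept only once, in first-seen order.
    """
    seen = set()
    samples: List[str] = []
    for doc in documents:
        for sentence in split_fn(doc):
            sentence = sentence.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                samples.append(sentence)
    return samples
```

With kiwipiepy installed, calling `collect_samples(texts)` on the raw news, review, and chatbot texts would route each document through Kiwi; any other callable that maps a string to a list of sentences can be supplied as `split_fn` instead.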