daekeun-ml committed on
Commit
b8e0ab4
1 Parent(s): 0df683e

Update README.md

Files changed (1)
  1. README.md +11 -9
README.md CHANGED
@@ -9,21 +9,21 @@ metrics:
  - wer
  - cer
  widget:
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/random_2.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_2.jpg
  example_title: 랜덤 문장 1
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/random_6.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/random_6.jpg
  example_title: 랜덤 문장 2
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/chatbot_3.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_3.jpg
  example_title: 챗봇 1
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/chatbot_5.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/chatbot_5.jpg
  example_title: 챗봇 2
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/news_1.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg
  example_title: 뉴스 1
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/news_3.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_3.jpg
  example_title: 뉴스 2
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/nsmc_1.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_1.jpg
  example_title: 영화 리뷰 1
- - src: https://raw.githubusercontent.com/daekeun-ml/sm-kornlp-usecases/main/trocr/sample_imgs/nsmc_2.jpg
+ - src: https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/nsmc_2.jpg
  example_title: 영화 리뷰 2
  ---
 
@@ -37,9 +37,11 @@ TrOCR has not yet released a multilingual model including Korean, so we trained
 
  ### Text data
  We created training data by processing three types of datasets.
+
  - News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
  - Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
  - Chatbot dataset: https://github.com/songys/Chatbot_data
+
  For efficient data collection, each sentence was separated by a sentence separator library (Kiwi Python wrapper; https://github.com/bab2min/kiwipiepy), and as a result, 637,401 samples were collected.
 
  ### Image Data
@@ -76,7 +78,7 @@ processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
  model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
  tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
 
- url = "https://raw.githubusercontent.com/aws-samples/aws-ai-ml-workshop-kr/master/sagemaker/sm-kornlp/trocr/sample_imgs/news_1.jpg"
+ url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
  response = requests.get(url)
  img = Image.open(BytesIO(response.content))
 
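Reviewer note on the "Text data" paragraph: the README says each sentence was separated with the Kiwi Python wrapper, but the diff does not show the collection script. The sketch below is one way that step might look, not the author's code. `kiwi_splitter` and `collect_samples` are hypothetical helper names, and the whitespace stripping and empty-sentence filtering are assumptions. The only kiwipiepy calls used are `Kiwi()` and `Kiwi.split_into_sents`, imported lazily so the rest of the code runs without the package installed.

```python
from typing import Callable, Iterable, List


def kiwi_splitter() -> Callable[[str], List[str]]:
    """Return a sentence splitter backed by kiwipiepy.

    The import is deferred so this module loads without kiwipiepy;
    calling this function requires `pip install kiwipiepy`.
    """
    from kiwipiepy import Kiwi

    kiwi = Kiwi()
    # split_into_sents returns Sentence objects; keep only their text.
    return lambda text: [sent.text for sent in kiwi.split_into_sents(text)]


def collect_samples(
    documents: Iterable[str], split: Callable[[str], List[str]]
) -> List[str]:
    """Split each document into sentences and keep the non-empty ones, in order.

    Assumption: the README does not say whether samples were stripped or
    filtered; this sketch strips whitespace and drops empty strings only.
    """
    samples: List[str] = []
    for doc in documents:
        for sentence in split(doc):
            sentence = sentence.strip()
            if sentence:
                samples.append(sentence)
    return samples
```

With kiwipiepy installed, `collect_samples(texts, kiwi_splitter())` would yield one training sample per detected sentence. Passing the splitter as an argument keeps the collection logic checkable with any stand-in callable.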
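Reviewer note on the URL changes: this commit moves every sample-image link to the `aws-samples/sm-kornlp` repository. A small check like the following could confirm that the eight widget links share the new base path and optionally test whether they resolve. It is a sketch under assumptions: `sample_urls` and `is_reachable` are hypothetical helpers, the base path and file names are copied from the `+` lines of the diff, and nothing here asserts that the links are actually live.

```python
import urllib.request
from typing import List, Sequence

# Base path and file names taken from the "+" lines of this commit.
BASE = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs"
SAMPLES = (
    "random_2.jpg", "random_6.jpg",
    "chatbot_3.jpg", "chatbot_5.jpg",
    "news_1.jpg", "news_3.jpg",
    "nsmc_1.jpg", "nsmc_2.jpg",
)


def sample_urls(base: str = BASE, names: Sequence[str] = SAMPLES) -> List[str]:
    """Join the base path with each sample file name."""
    return [f"{base.rstrip('/')}/{name}" for name in names]


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server answers with 2xx.

    Network failures, timeouts, and HTTP errors (such as 404) all count
    as unreachable.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```

Running `[u for u in sample_urls() if not is_reachable(u)]` would list any link that fails to resolve, which is worth doing once after a repository move like this one.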