gogamza and nazneen committed on
Commit
f9f2ec3
•
1 Parent(s): d9a1f64

model documentation (#1)


- model documentation (9f62a16f664fd459acfa480cdcef14070a0c64ce)


Co-authored-by: Nazneen Rajani <nazneen@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +177 -9
README.md CHANGED
@@ -1,23 +1,191 @@
  ---
  language: ko
  tags:
  - bart
- license: mit
  ---

- ## KoBART-base-v2
-
- With the addition of chatting data, the model is trained to handle the semantics of sequences longer than KoBART.
-
- ```python
- from transformers import PreTrainedTokenizerFast, BartModel
-
- tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
- model = BartModel.from_pretrained('gogamza/kobart-base-v2')
- ```
-
- ### Performance
-
  NSMC
  - acc. : 0.901

  ---
  language: ko
+ license: mit
  tags:
  - bart
  ---

+ # Model Card for kobart-base-v2
+
+ # Model Details
+
+ ## Model Description
+
+ [**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to parts of the input text, and the model learns to restore the original. Korean BART (hereafter **KoBART**) is a Korean `encoder-decoder` language model trained on more than **40GB** of Korean text with the `Text Infilling` noise function used in the paper. The resulting `KoBART-base` is released here.
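+
+ As a rough, hypothetical sketch of the `Text Infilling` objective (the paper samples span lengths from a Poisson distribution and masks multiple spans; this single-span function is simplified for illustration only):
+
+ ```python
+ import random
+
+ def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3):
+     # Replace one contiguous span of tokens with a single mask token;
+     # the autoencoder is trained to reconstruct the original sequence.
+     span_len = max(1, int(len(tokens) * mask_ratio))
+     start = random.randrange(len(tokens) - span_len + 1)
+     return tokens[:start] + [mask_token] + tokens[start + span_len:]
+
+ # e.g. ["나는", "어제", "서울에", "갔다"] -> ["나는", "<mask>", "서울에", "갔다"]
+ ```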
+
+ - **Developed by:** More information needed
+ - **Shared by [Optional]:** Heewon(Haven) Jeon
+ - **Model type:** Feature Extraction
+ - **Language(s) (NLP):** Korean
+ - **License:** MIT
+ - **Parent Model:** BART
+ - **Resources for more information:**
+   - [GitHub Repo](https://github.com/haven-jeon/KoBART)
+   - [Model Demo Space](https://huggingface.co/spaces/gogamza/kobart-summarization)
+
+ # Uses
+
+ ## Direct Use
+
+ This model can be used for the task of Feature Extraction.
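+
+ A minimal sketch of feature extraction with this checkpoint (the input sentence is an arbitrary example):
+
+ ```python
+ import torch
+ from transformers import PreTrainedTokenizerFast, BartModel
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+ model = BartModel.from_pretrained('gogamza/kobart-base-v2')
+
+ # Encode a Korean sentence and take the final hidden states as features
+ inputs = tokenizer("한국어 문장 예시입니다.", return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ features = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
+ ```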
+
+ ## Downstream Use [Optional]
+
+ More information needed.
+
+ ## Out-of-Scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ # Training Details
+
+ ## Training Data
+
+ | Data | # of Sentences |
+ |-------|---------------:|
+ | Korean Wiki | 5M |
+ | Other corpus | 0.27B |
+
+ In addition to Korean Wikipedia, a variety of data such as news, books, [Modu Corpus v1.0 (dialogue, news, ...)](https://corpus.korean.go.kr/), and [Blue House National Petitions](https://github.com/akngs/petitions) were used to train the model.
+
+ The `vocab` size is 30,000, and emoticons and emoji that are frequently used in dialogue, such as those below, were added to improve the model's ability to recognize these tokens.
+ > 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...
+
+ ## Training Procedure
+
+ ### Tokenizer
+
+ The tokenizer was trained with the `Character BPE tokenizer` from the [`tokenizers`](https://github.com/huggingface/tokenizers) package.
+
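+ For illustration, the trained tokenizer can be inspected on its own; the sample sentence below is just an assumed example:
+
+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+
+ print(tokenizer.vocab_size)             # 30,000 per the model card
+ print(tokenizer.tokenize("반가워요 😀"))  # emoticons/emoji were added to the vocab
+ ```
+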
+ ### Speeds, Sizes, Times
+
+ | Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
+ |--------------|:----:|:-------:|--------:|--------:|--------:|--------------:|
+ | `KoBART-base` | 124M | Encoder | 6 | 16 | 3072 | 768 |
+ |               |      | Decoder | 6 | 16 | 3072 | 768 |
+
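+ These dimensions can be cross-checked against the released checkpoint's configuration, for example:
+
+ ```python
+ from transformers import BartConfig
+
+ config = BartConfig.from_pretrained('gogamza/kobart-base-v2')
+
+ print(config.encoder_layers, config.decoder_layers)            # number of layers per stack
+ print(config.encoder_attention_heads, config.encoder_ffn_dim)  # attention heads and FFN dim
+ print(config.d_model)                                          # hidden size
+ ```
+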
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ ### Testing Data
+
+ More information needed
+
+ ### Factors
+
+ More information needed
+
+ ### Metrics
+
+ More information needed
+
+ ## Results
+
  NSMC
  - acc. : 0.901

+ The model authors also note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
+
+ | | [NSMC](https://github.com/e9t/nsmc) (acc) | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair) (acc) |
+ |---|---|---|---|
+ | **KoBART-base** | 90.24 | 81.66 | 94.34 |
+
+ # Model Examination
+
+ More information needed
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** More information needed
+ - **Hours used:** More information needed
+ - **Cloud Provider:** More information needed
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications [optional]
+
+ ## Model Architecture and Objective
+
+ More information needed
+
+ ## Compute Infrastructure
+
+ More information needed
+
+ ### Hardware
+
+ More information needed
+
+ ### Software
+
+ More information needed.
+
+ # Citation
+
+ **BibTeX:**
+
+ More information needed.
+
+ # Glossary [optional]
+
+ More information needed
+
+ # More Information [optional]
+
+ More information needed
+
+ # Model Card Authors [optional]
+
+ Heewon(Haven) Jeon in collaboration with Ezi Ozoani and the Hugging Face team
+
+ # Model Card Contact
+
+ The model authors note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
+ Please post `KoBART`-related issues [here](https://github.com/SKT-AI/KoBART/issues).
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ from transformers import PreTrainedTokenizerFast, BartModel
+
+ # Load the KoBART tokenizer and base model from the Hugging Face Hub
+ tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
+ model = BartModel.from_pretrained('gogamza/kobart-base-v2')
+ ```
+ </details>