model documentation

#1
by nazneen - opened
Files changed (1)
  1. README.md +223 -58
README.md CHANGED
@@ -1,58 +1,223 @@
- ---
- language:
- - zh
- - ja
- - en
-
- tags:
- - translation
-
- widget:
- - text: "ja2zh: 吾輩は猫である。名前はまだ無い。"
-
- license: cc-by-nc-sa-4.0
- ---
-
- This model is finetuned from [mt5-base](https://huggingface.co/google/mt5-base).
-
- The model vocabulary is trimmed to ~1/3 by selecting top 85000 tokens in the training data. The code to trim the vocabulary can be found [here](https://gist.github.com/K024/4a100a0f4f4b07208958e0f3244da6ad).
-
- Usage:
- ```python
- from transformers import (
- T5Tokenizer,
- MT5ForConditionalGeneration,
- Text2TextGenerationPipeline,
- )
-
- path = "K024/mt5-zh-ja-en-trimmed"
- pipe = Text2TextGenerationPipeline(
- model=MT5ForConditionalGeneration.from_pretrained(path),
- tokenizer=T5Tokenizer.from_pretrained(path),
- )
-
- sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
- res = pipe(sentence, max_length=100, num_beams=4)
- res[0]['generated_text']
- ```
-
- Training data:
- ```
- wikimedia-en-ja
- wikimedia-en-zh
- wikimedia-ja-zh
- wikititles-ja-en
- wikititles-zh-en
- wikimatrix-ja-zh
- news-commentary-en-ja
- news-commentary-en-zh
- news-commentary-ja-zh
- ted2020-en-ja
- ted2020-en-zh
- ted2020-ja-zh
- ```
-
- License: [![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]
-
- [cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
- [cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
+ ---
+ license: cc-by-nc-sa-4.0
+ language:
+ - zh
+ - ja
+ - en
+
+ tags:
+ - translation
+
+ widget:
+ - text: "ja2zh: 吾輩は猫である。名前はまだ無い。"
+
+ ---
+
+ # Model Card for mt5-zh-ja-en-trimmed
+
+ # Model Details
+
+ ## Model Description
+
+ mt5-zh-ja-en-trimmed is a translation model for Chinese, Japanese, and English, fine-tuned from mt5-base with a trimmed vocabulary to reduce model size.
+
+ - **Developed by:** K024
+ - **Shared by [Optional]:** K024
+ - **Model type:** Translation
+ - **Language(s) (NLP):** Japanese, Chinese, English
+ - **License:** [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
+ - **Parent Model:** [mt5-base](https://huggingface.co/google/mt5-base)
+ - **Resources for more information:**
+   - [mT5 GitHub Repo](https://github.com/google-research/multilingual-t5)
+   - [Associated Paper](https://arxiv.org/abs/2010.11934)
+
+ # Uses
+
+ ## Direct Use
+
+ This model can be used for translation between Chinese, Japanese, and English.
+
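+ The translation direction is selected by a text prefix. Only `ja2zh:` is shown in this card; the other directions presumably follow the same `<src>2<tgt>:` pattern. A minimal sketch under that assumption:
+
+ ```python
+ # Hypothetical helper: only the "ja2zh" prefix is confirmed by this card;
+ # the "<src>2<tgt>: " pattern for the other directions is an assumption.
+ def make_input(text: str, src: str, tgt: str) -> str:
+     return f"{src}2{tgt}: {text}"
+
+ print(make_input("吾輩は猫である。名前はまだ無い。", "ja", "zh"))
+ # ja2zh: 吾輩は猫である。名前はまだ無い。
+ ```
+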
+ ## Downstream Use [Optional]
+
+ More information needed.
+
+ ## Out-of-Scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+
+ # Training Details
+
+ ## Training Data
+
+ The model vocabulary is trimmed to roughly one third of its original size by keeping the 85,000 tokens that occur most frequently in the training data. The code used to trim the vocabulary can be found [here](https://gist.github.com/K024/4a100a0f4f4b07208958e0f3244da6ad).
+
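+ In outline (a minimal sketch of the idea, not the gist's exact code), trimming means counting token frequencies over the corpus, keeping the most frequent ids, and slicing the embedding matrix down to those rows:
+
+ ```python
+ from collections import Counter
+
+ import torch
+
+ def trim_vocab_sketch(model, tokenizer, corpus_lines, keep=85_000):
+     """Illustrative sketch only. A real implementation must also retain
+     special tokens, rebuild the SentencePiece vocabulary, and remap
+     token ids in the tokenizer and the model config."""
+     counts = Counter()
+     for line in corpus_lines:  # corpus_lines: iterable of raw sentences
+         counts.update(tokenizer(line).input_ids)
+     keep_ids = sorted(i for i, _ in counts.most_common(keep))
+     emb = model.get_input_embeddings().weight.data
+     trimmed = emb[torch.tensor(keep_ids)]  # rows for kept ids, (keep, d_model)
+     return keep_ids, trimmed
+ ```
+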
+ Training data:
+
+ ```
+ wikimedia-en-ja
+ wikimedia-en-zh
+ wikimedia-ja-zh
+ wikititles-ja-en
+ wikititles-zh-en
+ wikimatrix-ja-zh
+ news-commentary-en-ja
+ news-commentary-en-zh
+ news-commentary-ja-zh
+ ted2020-en-ja
+ ted2020-en-zh
+ ted2020-ja-zh
+ ```
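+
+ How these parallel corpora were turned into training examples is not documented here; presumably each sentence pair yields one example per direction, with the matching prefix. A hedged sketch:
+
+ ```python
+ # Hypothetical preprocessing: the actual pipeline is not documented in this
+ # card, and the "<src>2<tgt>: " prefix pattern is an assumption.
+ def make_examples(src_lang, tgt_lang, pairs):
+     """pairs: iterable of (src_sentence, tgt_sentence) tuples."""
+     for src, tgt in pairs:
+         yield f"{src_lang}2{tgt_lang}: {src}", tgt  # forward direction
+         yield f"{tgt_lang}2{src_lang}: {tgt}", src  # reverse direction
+
+ examples = list(make_examples("ja", "zh", [("吾輩は猫である。", "我是猫。")]))
+ ```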
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ More information needed
+
+ ### Speeds, Sizes, Times
+
+ This model is fine-tuned from [mt5-base](https://huggingface.co/google/mt5-base).
+
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ ### Testing Data
+
+ More information needed
+
+ ### Factors
+
+ More information needed
+
+ ### Metrics
+
+ More information needed
+
+ ## Results
+
+ More information needed
+
+ # Model Examination
+
+ More information needed
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** More information needed
+ - **Hours used:** More information needed
+ - **Cloud Provider:** More information needed
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications [optional]
+
+ ## Model Architecture and Objective
+
+ The model inherits the mT5-base encoder-decoder architecture, with the vocabulary (and thus the embedding and output layers) trimmed to roughly 85,000 tokens.
+
+ ## Compute Infrastructure
+
+ More information needed
+
+ ### Hardware
+
+ More information needed
+
+ ### Software
+
+ More information needed
+
+ # Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{https://doi.org/10.48550/arxiv.2010.11934,
+   doi = {10.48550/ARXIV.2010.11934},
+   url = {https://arxiv.org/abs/2010.11934},
+   author = {Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin},
+   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
+   title = {mT5: A massively multilingual pre-trained text-to-text transformer},
+   publisher = {arXiv},
+   year = {2020},
+   copyright = {arXiv.org perpetual, non-exclusive license}
+ }
+ ```
+
+ # Glossary [optional]
+
+ More information needed
+
+ # More Information [optional]
+
+ More information needed
+
+ # Model Card Authors [optional]
+
+ K024 in collaboration with Ezi Ozoani and the Hugging Face team
+
+ # Model Card Contact
+
+ More information needed
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ from transformers import (
+     T5Tokenizer,
+     MT5ForConditionalGeneration,
+     Text2TextGenerationPipeline,
+ )
+
+ path = "K024/mt5-zh-ja-en-trimmed"
+ pipe = Text2TextGenerationPipeline(
+     model=MT5ForConditionalGeneration.from_pretrained(path),
+     tokenizer=T5Tokenizer.from_pretrained(path),
+ )
+
+ # The "ja2zh:" prefix selects the Japanese-to-Chinese direction.
+ sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
+ res = pipe(sentence, max_length=100, num_beams=4)
+ print(res[0]["generated_text"])
+ ```
+ </details>
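+
+ A short follow-up sketch, reusing the `pipe` object from above: each input carries its own direction prefix, so one pipeline instance serves every translation direction.
+
+ ```python
+ # Usage sketch: translate several sentences, one call per input.
+ sentences = [
+     "ja2zh: 吾輩は猫である。名前はまだ無い。",
+     "ja2zh: どこで生れたかとんと見当がつかぬ。",
+ ]
+ for s in sentences:
+     out = pipe(s, max_length=100, num_beams=4)
+     print(out[0]["generated_text"])
+ ```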