ctoraman committed on
Commit
630bc8b
1 Parent(s): 9148154

Update README.md

  1. README.md +50 -41
README.md CHANGED
@@ -1,41 +1,50 @@
- ---
- language:
- - tr
- tags:
- - roberta
- license: cc-by-nc-sa-4.0
- datasets:
- - oscar
- ---
-
- # RoBERTa Turkish medium Character-level (uncased)
-
- Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
- The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
-
- Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is Character-level, which means that text is split by individual characters. Vocabulary size is 384.
-
- ## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for model loading and tokenization, example max length(1024) can be changed:
- ```
- model = AutoModel.from_pretrained([model_path])
- #for sequence classification:
- #model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
-
- tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
- tokenizer.mask_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][0]
- tokenizer.cls_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
- tokenizer.bos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
- tokenizer.sep_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
- tokenizer.eos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
- tokenizer.pad_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
- tokenizer.unk_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
- tokenizer.model_max_length = 1024
- ```
-
- The details can be found at this paper:
- https://arxiv.org/...
-
- ### BibTeX entry and citation info
- ```bibtex
- @article{}
- ```
+ ---
+ language:
+ - tr
+ tags:
+ - roberta
+ license: cc-by-nc-sa-4.0
+ datasets:
+ - oscar
+ ---
+
+ # RoBERTa Turkish medium Character-level (uncased)
+
+ Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
+ The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
+
+ Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is Character-level, which means that text is split by individual characters. Vocabulary size is 384.
+
+ The details and performance comparisons can be found at this paper:
+ https://arxiv.org/abs/2204.08832
+
+ ## Note that this model does not include a tokenizer file, because it uses ByT5Tokenizer. The following code can be used for model loading and tokenization, example max length(1024) can be changed:
+ ```
+ model = AutoModel.from_pretrained([model_path])
+ #for sequence classification:
+ #model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
+
+ tokenizer = ByT5Tokenizer.from_pretrained("google/byt5-small")
+ tokenizer.mask_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][0]
+ tokenizer.cls_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
+ tokenizer.bos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][1]
+ tokenizer.sep_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
+ tokenizer.eos_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][2]
+ tokenizer.pad_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
+ tokenizer.unk_token = tokenizer.special_tokens_map_extended['additional_special_tokens'][3]
+ tokenizer.model_max_length = 1024
+ ```
+
+ ### BibTeX entry and citation info
+ ```bibtex
+ @misc{https://doi.org/10.48550/arxiv.2204.08832,
+ doi = {10.48550/ARXIV.2204.08832},
+ url = {https://arxiv.org/abs/2204.08832},
+ author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
+ publisher = {arXiv},
+ year = {2022},
+ copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
+ }
+ ```
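
The README in this diff states that text is split into individual characters, that the tokenizer is ByT5Tokenizer, and that the vocabulary size is 384. The sketch below is one way to see how those numbers could fit together. It is pure Python and does not call the `transformers` library. It assumes, as is the usual ByT5 layout and not something the README states, that ids are UTF-8 byte values shifted by 3 reserved ids, with 125 sentinel tokens making up the rest of the vocabulary. The constant and function names are illustrative only.

```python
from typing import List

# Assumed layout of a ByT5-style vocabulary (not stated in the README):
NUM_RESERVED = 3     # assumed reserved ids, e.g. pad, eos, unk
NUM_BYTES = 256      # one id per possible UTF-8 byte value
NUM_SENTINELS = 125  # assumed sentinel tokens, reused in the README as mask/cls/sep/pad


def vocab_size() -> int:
    """Total number of ids under the assumed layout."""
    return NUM_RESERVED + NUM_BYTES + NUM_SENTINELS


def encode(text: str) -> List[int]:
    """Map each UTF-8 byte of the text to an id by shifting past the reserved ids."""
    return [b + NUM_RESERVED for b in text.encode("utf-8")]


def decode(ids: List[int]) -> str:
    """Invert encode(), skipping any id outside the byte range."""
    lo, hi = NUM_RESERVED, NUM_RESERVED + NUM_BYTES
    data = bytes(i - NUM_RESERVED for i in ids if lo <= i < hi)
    return data.decode("utf-8", errors="ignore")


ids = encode("şeker")
print(vocab_size())  # 3 + 256 + 125
print(len(ids))      # "ş" takes two UTF-8 bytes, the other four letters one each
print(decode(ids))
```

Under these assumptions a non-ASCII Turkish letter such as "ş" occupies two positions, so sequences can be longer than the character count, which is worth keeping in mind when choosing `model_max_length`.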