---
language: si
tags:
- oscar
- Sinhala
- roberta
datasets:
- oscar
---
### Overview

This is a slightly smaller model trained on the [OSCAR](https://oscar-corpus.com/) Sinhala deduplicated dataset. Since Sinhala is a low-resource language, only a handful of models have been trained for it, so this model is a good starting point for fine-tuning on downstream tasks.
13
+
14
+ ## Model Specification
15
+
16
+
17
+ The model chosen for training is [Roberta](https://arxiv.org/abs/1907.11692) with the following specifications:
18
+ 1. vocab_size=50265
19
+ 2. max_position_embeddings=514
20
+ 3. num_attention_heads=12
21
+ 4. num_hidden_layers=12
22
+ 5. type_vocab_size=1
23
+
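The hyperparameters above map directly onto a `RobertaConfig` from the `transformers` library. As a minimal sketch (not the exact training script), the architecture could be instantiated like this, with all unlisted hyperparameters left at their defaults:

```py
from transformers import RobertaConfig

# Sketch: build a config with the specifications listed above.
# Any hyperparameter not set here keeps the RobertaConfig default.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
```

A fresh model for pretraining could then be created with `RobertaForMaskedLM(config)`.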
## How to Use

You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model = AutoModelForMaskedLM.from_pretrained("keshan/sinhala-roberta-oscar")
tokenizer = AutoTokenizer.from_pretrained("keshan/sinhala-roberta-oscar")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

fill_mask("මම ගෙදර <mask>.")
```