keshan committed on
Commit
655873a
2 Parent(s): d59c039 ad85a29

Merge branch 'main' of https://huggingface.co/keshan/sinhala-roberta-oscar into main

Files changed (1)
  1. README.md +40 -0
README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ language: si
+ tags:
+ - oscar
+ - Sinhala
+ - roberta
+ - fill-mask
+ widget:
+ - text: "මම සිංහල භාෂාව <mask>"
+ datasets:
+ - oscar
+ ---
+ ## Overview
+
+ This is a slightly smaller RoBERTa model trained on the deduplicated Sinhala subset of the [OSCAR](https://oscar-corpus.com/) corpus. Sinhala is a low-resource language, and only a handful of models have been trained for it, so this model is a good starting point for fine-tuning on downstream tasks.
+
+ ## Model Specification
+
+ The model chosen for training is [RoBERTa](https://arxiv.org/abs/1907.11692) with the following specifications:
+
+ 1. vocab_size=50265
+ 2. max_position_embeddings=514
+ 3. num_attention_heads=12
+ 4. num_hidden_layers=12
+ 5. type_vocab_size=1
+
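The `max_position_embeddings=514` value listed above may look odd next to the usual 512-token limit. In RoBERTa-style models, position ids for real tokens start at `padding_idx + 1`, so two slots of the position table are not available to ordinary tokens. The sketch below illustrates that arithmetic in plain Python. It assumes the padding token id is 1 (the RoBERTa default, not stated in this README), and the helper names `position_ids` and `max_sequence_length` are hypothetical, written only for illustration.

```python
# Assumption: pad token id is 1, the RoBERTa default (not confirmed by this README).
PAD_TOKEN_ID = 1
MAX_POSITION_EMBEDDINGS = 514  # value from the specification list


def position_ids(input_ids, padding_idx=PAD_TOKEN_ID):
    """Assign RoBERTa-style position ids: padding keeps padding_idx,
    real tokens count upward from padding_idx + 1."""
    ids = []
    seen = 0  # number of non-padding tokens seen so far
    for token in input_ids:
        if token == padding_idx:
            ids.append(padding_idx)
        else:
            seen += 1
            ids.append(padding_idx + seen)
    return ids


def max_sequence_length(max_position_embeddings, padding_idx=PAD_TOKEN_ID):
    """Longest sequence whose largest position id still fits in the table."""
    return max_position_embeddings - padding_idx - 1


# Hypothetical token ids, with two padding tokens at the end.
print(position_ids([0, 5, 6, 2, 1, 1]))
print(max_sequence_length(MAX_POSITION_EMBEDDINGS))
```

Under that assumption, inputs should be truncated to 512 tokens (including the special start and end tokens) before being passed to the model.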
+ ## How to Use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```py
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
+
+ # AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead.
+ model = AutoModelForMaskedLM.from_pretrained("keshan/sinhala-roberta-oscar")
+ tokenizer = AutoTokenizer.from_pretrained("keshan/sinhala-roberta-oscar")
+
+ fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ # Predict the most likely tokens for the <mask> position.
+ fill_mask("මම ගෙදර <mask>.")
+ ```
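The `fill_mask` call returns a list of dictionaries, one per candidate, each carrying `score`, `token`, `token_str`, and `sequence` keys. A small helper can reduce that to ranked `(token, score)` pairs. This is a minimal sketch: `summarize_predictions` is a hypothetical name, and the `dummy` entries are placeholders shaped like pipeline output, not real predictions from this model.

```python
def summarize_predictions(predictions, min_score=0.0):
    """Return (token_str, score) pairs with score >= min_score,
    sorted from most to least likely."""
    kept = [
        (p["token_str"].strip(), float(p["score"]))
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Placeholder entries only; these are NOT outputs of the model.
dummy = [
    {"score": 0.10, "token": 11, "token_str": " B", "sequence": "x B."},
    {"score": 0.60, "token": 10, "token_str": " A", "sequence": "x A."},
    {"score": 0.01, "token": 12, "token_str": " C", "sequence": "x C."},
]

for token, score in summarize_predictions(dummy, min_score=0.05):
    print(f"{token}\t{score:.2f}")
```

In practice, the result of `fill_mask("මම ගෙදර <mask>.")` would be passed in place of `dummy`.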