---
license: unknown
language:
- si
metrics:
- perplexity
library_name: transformers
tags:
- AshenBerto
- Sinhala
- Roberta
---



### 🌟 Overview  

This is a compact RoBERTa model trained on half of the Sinhala data from the [FastText](https://fasttext.cc/docs/en/crawl-vectors.html) dataset. Since Sinhala is a low-resource language, there's a noticeable shortage of pre-trained models for it. 😕 That gap makes it harder to represent the language properly in the world of NLP.

But hey, that's where this model comes in! 🚀 It opens up exciting opportunities for tasks like sentiment analysis, machine translation, named entity recognition, and even question answering, all tailored for Sinhala. 🇱🇰✨

---

### 🛠 Model Specs  

Here’s what powers this model (we went with [RoBERTa](https://arxiv.org/abs/1907.11692)):  

1️⃣ **vocab_size** = 25,000  
2️⃣ **max_position_embeddings** = 514  
3️⃣ **num_attention_heads** = 12  
4️⃣ **num_hidden_layers** = 6  
5️⃣ **type_vocab_size** = 1  
🎯 **Perplexity Value**: 3.5  
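
If you'd like to spin up the same architecture yourself, here's a minimal sketch using 🤗 Transformers' `RobertaConfig` (any value not listed above, such as the hidden size, is assumed to follow the standard RoBERTa defaults):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration matching the specs above; unlisted values
# fall back to the standard RoBERTa defaults.
config = RobertaConfig(
    vocab_size=25_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")
```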

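For context, perplexity is just the exponential of the masked-LM cross-entropy loss, so a perplexity of 3.5 corresponds to an evaluation loss of roughly 1.25. A quick sketch of the conversion (the loss value here is hypothetical):

```python
import math

# Perplexity = exp(cross-entropy loss).
eval_loss = 1.25  # hypothetical masked-LM evaluation loss
print(f"Perplexity: {math.exp(eval_loss):.2f}")  # ≈ 3.49
```
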
---

### 🚀 How to Use  

You can jump right in and use this model for masked language modeling! 🧩  

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the pre-trained model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ashenR/AshenBERTo")
tokenizer = AutoTokenizer.from_pretrained("ashenR/AshenBERTo")

# Create a fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Try it out with a Sinhala sentence! 🇱🇰
fill_mask("මම ගෙදර <mask>.")
```
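
The pipeline returns the top candidate completions as a list of dictionaries, each containing the filled-in sentence, the predicted token, and its score. A quick way to inspect them (these are the standard `fill-mask` pipeline output fields):

```python
# Each prediction has 'sequence', 'token', 'token_str', and 'score' fields
for prediction in fill_mask("මම ගෙදර <mask>."):
    print(f"{prediction['token_str']!r}  (score: {prediction['score']:.3f})")
```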