---
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር <mask> ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ <mask> ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል <mask> እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---
# roberta-medium-amharic
This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, for a total of **290 million tokens**. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.
The model was trained for **15 hours** on an **A100 40GB GPU**.
It achieves the following results on the evaluation set:
- `Loss: 2.446`
- `Perplexity: 11.59`
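
The two figures above are related: perplexity is conventionally the exponential of the mean cross-entropy loss. A minimal sketch of that relationship, assuming the loss is a natural-log cross-entropy (the card does not state how the reported perplexity was computed, and `perplexity_from_loss` is a hypothetical helper name):

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity.

    Assumption: the loss uses the natural logarithm, which is the
    default for PyTorch-style cross-entropy.
    """
    return math.exp(loss)


# Rounded evaluation loss reported above.
print(round(perplexity_from_loss(2.446), 2))
```

With the rounded loss of 2.446 this gives roughly 11.54, slightly below the reported 11.59. The difference may come from rounding or from how the evaluation loss was averaged, which the card does not specify.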
Although this model has only **42 million parameters**, it outperforms the roughly 7x larger, `279 million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on the Amharic Sentiment Classification and Named Entity Recognition tasks.
# How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-medium-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")
[{'score': 0.7755730152130127,
'token': 137,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
{'score': 0.09340856224298477,
'token': 346,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
{'score': 0.08586721867322922,
'token': 217,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
{'score': 0.011987944133579731,
'token': 733,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
{'score': 0.010042797774076462,
'token': 1392,
'token_str': 'ዓመቱ',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመቱ ተቆጥሯል።'}]
```
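
The pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys, as in the output above. The following is a minimal sketch of post-processing such a list so that only candidates above a score threshold are kept. The names `confident_fills` and `min_score` and the default threshold are hypothetical, and the sample data is copied from the output above so the snippet runs without downloading the model:

```python
def confident_fills(predictions, min_score=0.05):
    """Return (token_str, score) pairs with score >= min_score, best first."""
    kept = [
        (p["token_str"], p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Sample predictions copied from the pipeline output shown above
# ('sequence' omitted for brevity).
sample = [
    {"score": 0.7755730152130127, "token": 137, "token_str": "ዓመት"},
    {"score": 0.09340856224298477, "token": 346, "token_str": "አመት"},
    {"score": 0.08586721867322922, "token": 217, "token_str": "ዓመታት"},
    {"score": 0.011987944133579731, "token": 733, "token_str": "አመታት"},
    {"score": 0.010042797774076462, "token": 1392, "token_str": "ዓመቱ"},
]

for token_str, score in confident_fills(sample):
    print(f"{token_str}\t{score:.3f}")
```

In practice, `sample` would be replaced by the return value of `unmasker(...)`; the threshold is an arbitrary illustration, not a recommended value.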
# Finetuning
This model was finetuned and evaluated on the following Amharic NLP tasks:
- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
### Finetuned Model Performance
The reported F1 scores are macro averages.
|Model|Size (# params)| Perplexity|Sentiment (F1)| Named Entity Recognition (F1)|
|-----|---------------|-----------|--------------|------------------------------|
|roberta-base-amharic|110M|8.08|0.88|0.78|
|**roberta-medium-amharic**|**42.2M**|**11.59**|**0.84**|**0.75**|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
|xlm-roberta-base|279M||0.83|0.73|
|afro-xlmr-base|278M||0.83|0.75|
|afro-xlmr-large|560M||0.86|0.76|
|am-roberta|443M||0.82|0.69|
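
The F1 columns in the table are macro averages, meaning the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. A minimal pure-Python sketch of that definition for a classification task follows. It is not the evaluation code used for the table (see the linked repositories for that); `macro_f1` is a hypothetical helper, and the toy labels are illustrative rather than taken from the datasets:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)

    # Every class counts equally, regardless of how many examples it has.
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because each class contributes equally, a macro average penalizes poor performance on rare classes more than an accuracy-style or micro-averaged score would. For Named Entity Recognition, F1 is usually computed over entity spans rather than individual labels, so this sketch only illustrates the averaging step.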