---
library_name: transformers
datasets:
- oscar
- mc4
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example
---

# bert-mini-amharic

This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, on a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.

It achieves the following results on the evaluation set:

- `Loss: 3.57`
- `Perplexity: 35.52` (perplexity is `exp(loss)`)

Even though this model has only `9.7 Million` parameters, its performance is only slightly behind that of the 28x larger, `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.

# How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.4713546335697174,
  'token': 9308,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.25726795196533203,
  'token': 9540,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.07067586481571198,
  'token': 10354,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.07064681500196457,
  'token': 11212,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.012558948248624802,
  'token': 10588,
  'token_str': 'ወራት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ወራት ተቆጥሯል ።'}]
```

# Fine-tuning

The following GitHub repository contains a [notebook](https://github.com/rasyosef/amharic-news-category-classification/blob/main/%5Bbert-mini-amharic%5D%20Amharic%20News%20Category%20Classification.ipynb) that fine-tunes this model for an Amharic text classification task; a minimal fine-tuning sketch is also shown after the performance table below.

https://github.com/rasyosef/amharic-news-category-classification

#### Fine-tuned Model Performance

Since this is a multi-class classification task, the reported precision, recall, and F1 metrics are macro averages.

|Model|Size (# params)|Accuracy|Precision|Recall|F1|
|-----|---------------|--------|---------|------|--|
|bert-mini-amharic|9.67M|0.87|0.83|0.83|0.83|
|bert-small-amharic|25.7M|0.89|0.86|0.87|0.86|
|xlm-roberta-base|279M|0.90|0.88|0.88|0.88|
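
The sketch below outlines how such a fine-tune can be set up with the Hugging Face `Trainer`. It is a rough illustration, not the notebook's exact code: the data files, label count, and hyperparameters are all placeholder assumptions, so substitute your own dataset and settings.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

model_id = "rasyosef/bert-mini-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical CSV files with "text" and "label" columns; swap in your own data.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

# num_labels must match the number of classes in your data (assumed 4 here)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

args = TrainingArguments(
    output_dir="bert-mini-amharic-classifier",
    learning_rate=2e-5,              # illustrative hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,             # enables dynamic padding via the default collator
)
trainer.train()
```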
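
For lower-level control than the `fill-mask` pipeline shown earlier, the model can also be called directly through `AutoModelForMaskedLM`. The following sketch reproduces the top-5 mask predictions manually; it assumes a single `[MASK]` token in the input.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "rasyosef/bert-mini-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# find the position of the single [MASK] token
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

# top-5 candidate tokens for the masked position, with their probabilities
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(token_id)}  {score:.3f}")
```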