---
library_name: transformers
datasets:
  - oscar
  - mc4
  - rasyosef/amharic-sentences-corpus
language:
  - am
metrics:
  - perplexity
pipeline_tag: fill-mask
widget:
  - text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።
    example_title: Example 1
  - text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር <mask> ግዢ በእጅጉ ጨምሯል።
    example_title: Example 2
  - text: >-
      ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ
      ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ <mask> ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ
      ነው።
    example_title: Example 3
  - text: >-
      ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል <mask> እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት
      አንዱ ነው።
    example_title: Example 4
---

# roberta-medium-amharic

This model has the same architecture as xlm-roberta-base and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, for a total of 290 million tokens. The tokenizer was also trained from scratch on the same corpus and has a vocabulary size of 32k.
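
As a quick sanity check, the tokenizer can be loaded on its own; this is a minimal sketch using the standard transformers API, and the exact `vocab_size` reported may differ slightly from the nominal 32k.

```python
from transformers import AutoTokenizer

# Load the Amharic tokenizer that was trained from scratch alongside the model
tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")

# Nominal vocabulary size is 32k; the exact number may differ slightly
print(tokenizer.vocab_size)
print(tokenizer.tokenize("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።"))
```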

The model was trained for 15 hours on an A100 40GB GPU.

It achieves the following results on the evaluation set:

- Loss: 2.446
- Perplexity: 11.59
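
The two numbers are consistent, since perplexity is the exponential of the evaluation cross-entropy loss; a quick check (the small discrepancy comes from rounding the reported loss):

```python
import math

eval_loss = 2.446  # reported evaluation loss
# exp(2.446) ≈ 11.54, matching the reported perplexity of 11.59 up to rounding
print(math.exp(eval_loss))
```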

Even though this model has only 42 million parameters, it outperforms the 7x larger xlm-roberta-base multilingual model (279 million parameters) on Amharic sentiment classification and named entity recognition tasks.

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-medium-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")

[{'score': 0.7755730152130127,
  'token': 137,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
 {'score': 0.09340856224298477,
  'token': 346,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
 {'score': 0.08586721867322922,
  'token': 217,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
 {'score': 0.011987944133579731,
  'token': 733,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
 {'score': 0.010042797774076462,
  'token': 1392,
  'token_str': 'ዓመቱ',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመቱ ተቆጥሯል።'}]
```
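
The pipeline's predictions can also be reproduced with a direct model call; below is a minimal sketch using AutoModelForMaskedLM (the readout of the top-5 tokens is illustrative, not the only way to decode predictions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/roberta-medium-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the <mask> position and take the 5 highest-scoring tokens for it
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5_ids = logits[0, mask_positions[0]].topk(5).indices
print([tokenizer.decode(token_id).strip() for token_id in top5_ids.tolist()])
# expected to start with 'ዓመት', as in the pipeline output above
```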

## Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- Sentiment Classification
- Named Entity Recognition
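
For reference, here is a minimal finetuning sketch for the sentiment classification task; the toy two-example dataset and `num_labels=3` are placeholder assumptions, not the actual dataset or label scheme behind the reported results.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = "rasyosef/roberta-medium-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=3 is an assumption (e.g. positive/negative/neutral); match your label set
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Toy placeholder data; substitute a real labeled Amharic sentiment dataset
train_dataset = Dataset.from_dict({
    "text": ["ጥሩ ፊልም ነው።", "በጣም መጥፎ አገልግሎት።"],
    "label": [0, 1],
})
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-medium-amharic-sentiment", num_train_epochs=3),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables padded batching via the default data collator
)
trainer.train()
```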

### Finetuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|-------|-----------------|------------|----------------|-------------------------------|
| roberta-base-amharic | 110M | 8.08 | 0.88 | 0.78 |
| roberta-medium-amharic | 42.2M | 11.59 | 0.84 | 0.75 |
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| afro-xlmr-base | 278M | | 0.83 | 0.75 |
| afro-xlmr-large | 560M | | 0.86 | 0.76 |
| am-roberta | 443M | | 0.82 | 0.69 |