---
license: mit
datasets:
  - oscar
language:
  - am
library_name: transformers
---

# Amharic BPE Tokenizer

This repo contains a Byte-Pair Encoding (BPE) tokenizer trained on the Amharic subset of the OSCAR dataset. It uses the same byte-level BPE scheme as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a vocabulary size of 24,000.
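For illustration, a tokenizer of this kind can be trained from scratch with the `tokenizers` library. The sketch below is not the exact training script used for this repo; the tiny in-memory corpus and the vocabulary size of 1,000 are placeholder assumptions (the real tokenizer was trained on the full OSCAR Amharic subset with a vocabulary of 24,000).

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Placeholder corpus standing in for the OSCAR Amharic subset.
corpus = [
    "αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’",
    "ሰላም αˆˆαŠ₯αŠ“αŠ•α‰° α‹­αˆαŠ•α’",
]

# GPT-2 style byte-level BPE: operating on bytes means any input
# string is representable, with no unknown-token fallback needed.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# vocab_size here is a toy value; the repo's tokenizer uses 24,000.
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

ids = tokenizer.encode(corpus[0]).ids
```

Byte-level decoding reverses the byte-to-unicode mapping, so `tokenizer.decode(ids)` reconstructs the original Amharic string exactly.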

## How to use

You can load the tokenizer from the Hugging Face Hub as follows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")
tokenizer("αŠ α‰£α‹­αŠ• α‹«αˆ‹α‹¨ α‹¨α•αˆŒαŠ• α‰²αŠ¬α‰΅ αŠ₯α‰½αˆˆα‹‹αˆˆα‹α’")
```