rasyosef/gpt2-oscar-amharic-tokenizer

rasyosef committed on Jan 31

Commit 6bbb3d0 • 1 Parent(s): 068ddf1

Create README.md

Files changed (1)

README.md +18 -0

README.md ADDED

	@@ -0,0 +1,18 @@

+---
+license: mit
+datasets:
+- oscar
+language:
+- am
+library_name: transformers
+---
+# Amharic BPE Tokenizer
+This repo contains a **Byte-Pair Encoding** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It has the same design as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a **vocabulary size** of `24000`.
+# How to use
+You can load the tokenizer from the Hugging Face Hub as follows.
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")
+```
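
As a follow-up to the loading snippet in the README, here is a minimal sketch of a round trip through the loaded tokenizer: text to token ids, ids to tokens, and ids back to text. It is not part of the commit. The helper names `round_trip` and `demo` and the sample string are illustrative assumptions; the calls `encode`, `convert_ids_to_tokens`, `decode`, and `vocab_size` are standard `transformers` tokenizer methods. Running `demo()` assumes `transformers` is installed and the Hub is reachable.

```python
from typing import List, Tuple


def round_trip(tokenizer, text: str) -> Tuple[List[int], List[str], str]:
    """Encode `text`, then map the ids back to tokens and to a decoded string."""
    ids = tokenizer.encode(text)                   # text -> token ids
    tokens = tokenizer.convert_ids_to_tokens(ids)  # ids -> vocabulary entries
    decoded = tokenizer.decode(ids)                # ids -> text
    return ids, tokens, decoded


def demo() -> None:
    """Download the tokenizer from the Hub and print one round trip.

    Hypothetical helper: needs the `transformers` package and network access.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")
    text = "ሰላም ዓለም"  # illustrative sample ("hello world"), not taken from the README
    ids, tokens, decoded = round_trip(tokenizer, text)

    print("vocab size:", tokenizer.vocab_size)  # the README states 24000
    print("ids:", ids)
    # If the tokenizer is byte-level like GPT-2's, these tokens may not be
    # human-readable for Ethiopic script even though decoding works.
    print("tokens:", tokens)
    print("decoded:", decoded)
```

Calling `demo()` prints the ids and tokens for the sample sentence. `round_trip` accepts any object exposing the three tokenizer methods, so it can also be used to compare this tokenizer against another one on the same text.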