---
license: mit
datasets:
- oscar
language:
- am
library_name: transformers
---

# Amharic BPE Tokenizer

This repo contains a **Byte-Pair Encoding (BPE)** tokenizer trained on the **Amharic** subset of the [oscar](https://huggingface.co/datasets/oscar) dataset. It uses the same byte-level BPE scheme as the GPT-2 tokenizer, but was trained from scratch on Amharic text with a vocabulary size of `24000`.

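A tokenizer of this kind can be reproduced with the `train_new_from_iterator` method of the fast GPT-2 tokenizer. This is a minimal sketch under stated assumptions, not the exact training script used for this repo; the tiny `corpus` list is a placeholder standing in for the Amharic OSCAR subset.

```python
from transformers import AutoTokenizer

# placeholder corpus; the real tokenizer was trained on the Amharic OSCAR subset
corpus = ["አማርኛ ሰዋስው", "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"]

# start from GPT-2's byte-level BPE and retrain its merges and vocabulary
base = AutoTokenizer.from_pretrained("gpt2")
amharic_tokenizer = base.train_new_from_iterator(corpus, vocab_size=24000)

# the retrained tokenizer is used like any other fast tokenizer
ids = amharic_tokenizer("አዲስ አበባ")["input_ids"]
```

With such a small placeholder corpus the trainer will stop well short of the `24000` vocabulary target; on the full OSCAR subset it fills the vocabulary with Amharic subwords.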
# How to use

You can load the tokenizer from the Hugging Face Hub as follows.

```python
from transformers import AutoTokenizer

# download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# tokenize a sample Amharic sentence
tokenizer("አማርኛ ቆንጆ ቋንቋ ነው።")
```
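To see what the tokenizer actually produces, you can inspect the subword pieces and decode the ids back to text. The sentence below is just an illustrative sample input, not taken from the training data.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/gpt2-oscar-amharic-tokenizer")

# encode a sample sentence and look at the subword pieces
enc = tokenizer("አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።")
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(tokens)

# decoding the ids restores the original text
print(tokenizer.decode(enc["input_ids"]))
```

Because this is a byte-level BPE, the printed token strings are byte-pair representations rather than readable Amharic; `decode` maps them back to the original characters.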