byt5-dv / README.md
monsoon-nlp's picture
ByT5 setup
fdba4f9
metadata
language: dv

byt5-dv

Pretrained from scratch on Dhivei (language of the Maldives) with ByT5, Google's new byte-level tokenizer strategy.

Corpus: dv.wikipedia.org as of March 2020 (TFDS)

Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH

Demo

Todos

The Wikipedia corpus is too small for this language. In the future I would add OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those as one TFDS dataset.