---
language: dv
---
# byt5-dv
Pretrained from scratch on Dhivehi (the language of the Maldives)
with ByT5, Google's new byte-level tokenization strategy.
Corpus: dv.wikipedia.org as of March 2020, loaded through TensorFlow Datasets (TFDS)

Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
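
To make "byte-level" concrete, here is a minimal pure-Python sketch of how text maps to input ids under this scheme. It is not the code from the notebook. It assumes the usual ByT5 convention that ids 0, 1, and 2 are reserved for padding, end-of-sequence, and unknown, so each UTF-8 byte is shifted up by 3. The helper names `encode` and `decode` are made up for this illustration.

```python
from typing import List

# Assumed ByT5 convention: ids 0, 1, 2 are reserved (pad, EOS, unk),
# so every UTF-8 byte value is shifted up by 3.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3


def encode(text: str) -> List[int]:
    """Map text to byte-level ids and append EOS (hypothetical helper)."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: List[int]) -> str:
    """Drop reserved and out-of-range ids, then decode the remaining bytes."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


# "Dhivehi" written in Thaana script, spelled out as escapes for clarity.
word = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"
ids = encode(word)

# Each Thaana code point takes 2 bytes in UTF-8:
# 6 code points become 12 byte ids, plus 1 EOS id.
print(len(word), len(ids))
print(decode(ids) == word)
```

Because every Thaana character costs two ids, sequences are roughly twice as long as the character count, which is worth keeping in mind when choosing a maximum input length.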
## Demo
## Todos
The Wikipedia corpus is too small for this language. In the future I would like to add
OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those
as one TFDS dataset.