language: dv
# byt5-dv
Pretrained from scratch on Dhivehi (the language of the Maldives)
with ByT5, Google's byte-level, tokenizer-free variant of T5.

Corpus: dv.wikipedia.org as of March 2020, loaded through TensorFlow Datasets (TFDS)

Notebook: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH
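
As a rough illustration of what "byte-level" means for Dhivehi text, the sketch below mimics the ByT5 id scheme in plain Python: each UTF-8 byte is shifted by 3, because ids 0, 1, and 2 are reserved for padding, end-of-sequence, and unknown. The helper names `byt5_encode` and `byt5_decode` are hypothetical and do not come from this repo or the notebook; real usage would go through the tokenizer shipped with the model.

```python
from typing import List

# ByT5 convention: ids 0, 1, 2 are reserved for pad, EOS, and unk,
# so raw byte values are shifted up by 3.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
NUM_SPECIAL = 3


def byt5_encode(text: str) -> List[int]:
    """Hypothetical helper: UTF-8 bytes shifted by 3, followed by EOS."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def byt5_decode(ids: List[int]) -> str:
    """Hypothetical helper: invert byt5_encode, skipping ids outside the byte range."""
    raw = bytes(i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < NUM_SPECIAL + 256)
    return raw.decode("utf-8", errors="ignore")


# "Dhivehi" written in Thaana script, spelled with escapes to keep the
# code points explicit (6 code points in the U+0780 to U+07BF block).
sample = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"
ids = byt5_encode(sample)
print(len(sample), "characters ->", len(ids), "ids")  # 6 characters -> 13 ids
```

Every Thaana code point takes two bytes in UTF-8, so a Dhivehi input is roughly twice as long in ids as it is in characters. That is the main cost of skipping a learned vocabulary.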
## Demo
## Todos
The Wikipedia corpus is too small for this language. In the future I would like to add
the OSCAR corpus and Sofwath's Maldivian corpus, if I can rewrite the training script to
accept them as a single TFDS dataset.