language:
  - kbd
  - ru
  - multilingual
license: unknown
tags:
  - circassian
  - kabardian
datasets:
  - anzorq/kbd_lat-835k_ru-3M

t5-v1_1-small pretrained with an MLM (masked language modeling) task on:

- kbd (custom Latin script), 835K lines: a pile of scraped text from news sites, books, etc.
- ru, 3M lines: wiki corpus from OPUS

tokenizer: SentencePiece unigram, 8K shared vocabulary
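
The card does not say which masking scheme the MLM task used. As an illustration only, here is a minimal sketch of T5-style span corruption, assuming `<extra_id_N>` sentinel tokens and hand-picked, non-overlapping spans. The function name `span_corrupt` and the toy English sentence are hypothetical and not taken from the actual training code.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token.

    Assumes spans are non-overlapping and lie inside `tokens`.
    Returns (inputs, targets) in the T5 span-corruption layout.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        # Keep the unmasked tokens before this span, then mark the gap.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


# Toy example: mask "quick brown" and the second "the".
tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = span_corrupt(tokens, [(1, 2), (6, 1)])
print(" ".join(inputs))
print(" ".join(targets))
```

In real pretraining the spans would be sampled at random over SentencePiece token IDs rather than chosen by hand over whitespace-split words.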