metadata
language: ja
tags:
  - t5
  - text2text-generation
  - seq2seq
license: apache-2.0
datasets:
  - mc4
  - wiki40b

t5-base-japanese-web (with Byte-fallback, 32K)

Description

megagonlabs/t5-base-japanese-web is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
The training code is available on GitHub.

This model has a vocabulary size of 32K. An 8K version is also available.
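As a rough sketch of how the checkpoint might be loaded, the snippet below uses the Hugging Face `transformers` library. The model ID comes from this card; the choice of `AutoTokenizer` and `T5ForConditionalGeneration`, the PyTorch backend, and the helper names `load_model` and `generate` are assumptions for illustration, not something this card specifies.

```python
MODEL_NAME = "megagonlabs/t5-base-japanese-web"  # model ID taken from this card


def load_model(model_name: str = MODEL_NAME):
    """Load the tokenizer and model (downloads on first use).

    Assumes `transformers`, `sentencepiece`, and `torch` are installed.
    """
    # Imported lazily so that defining this helper has no heavy dependencies.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    return tokenizer, model


def generate(text: str, max_new_tokens: int = 32) -> str:
    """Run text-to-text generation on a single input string (hypothetical helper)."""
    tokenizer, model = load_model()
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because this is a pre-trained checkpoint, it will most likely need fine-tuning on a downstream task before `generate` produces useful output.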

Corpora

We used the following corpora for pre-training.

  • mC4
  • wiki40b

Tokenizer

We trained the SentencePiece tokenizer on Japanese Wikipedia.
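The model name mentions byte-fallback, a SentencePiece option in which text not covered by the vocabulary is split into UTF-8 byte pieces such as `<0xE3>` instead of being mapped to an unknown token. The toy sketch below illustrates the idea with a character-level vocabulary. It is not the SentencePiece implementation, and the function names and the tiny vocabulary are made up for illustration.

```python
def tokenize_with_byte_fallback(text, vocab):
    """Split text into characters; fall back to byte pieces for unknown ones."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Unknown character: emit one piece per UTF-8 byte, e.g. "<0xF0>".
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def detokenize(pieces):
    """Rebuild the original string from character and byte pieces."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))  # byte piece back to a raw byte
        else:
            buf.extend(p.encode("utf-8"))
    return buf.decode("utf-8")


toy_vocab = {"日", "本", "語"}  # hypothetical vocabulary
pieces = tokenize_with_byte_fallback("日本語🍣", toy_vocab)
print(pieces)
print(detokenize(pieces))
```

In this sketch the emoji is not in the toy vocabulary, so it is encoded as four byte pieces, and detokenizing restores the original string without loss.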

Parameters

Pre-training took about 126 hours on a TPU v3-8.

Related models

License

Apache License 2.0

Citations

  • mC4

Contains information from mC4 which is made available under the ODC Attribution License.

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}
  • wiki40b
@inproceedings{49029,
    title = {Wiki-40B: Multilingual Language Model Dataset},
    author = {Mandy Guo and Zihang Dai and Denny Vrandecic and Rami Al-Rfou},
    year = {2020},
    booktitle = {LREC 2020},
}