---
license: apache-2.0
datasets:
  - bigscience-data/roots_vi_binhvq_news_corpus
  - wikipedia
language:
  - vi
  - en
  - zh
library_name: transformers
tags:
  - t5
  - flant5
  - summarization
  - translation
  - question-answering
pipeline_tag: fill-mask
---

## Extend vocabulary and Pretrain

We used [SentencePiece](https://github.com/google/sentencepiece) to retrain a tokenizer for Vietnamese, English, and Chinese. The vocabulary of this newly trained tokenizer was then merged with Flan-T5's original vocabulary, with duplicate tokens removed. The resulting merged vocabulary contains 106,611 tokens (a minimal sketch of such a merge is included at the end of this card).

For single-epoch continual pretraining, also referred to as incremental pretraining, we started from the Flan-T5-Large model. The pretraining was conducted on a diverse dataset of more than 100 GB, drawn from the following sources:

- [NewsCorpus](https://github.com/binhvq/news-corpus)
- Vietnamese Wikipedia
- Vietnamese books
- Vietnamese legal documents
- Vietnamese legal text
- English Wikipedia
- Chinese text

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model, then move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
model.cuda()
```

An illustrative generation call is shown at the end of this card.

## Finetune and Benchmark

- Wikilingua
- Vietnews
- Pho_NER
- .....

## Citation

- Hatto
- Ipcoms
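The following is a minimal, hypothetical sketch of the vocabulary merge described above, using `sentencepiece` and `transformers`. The path `vi_en_zh_spm.model` is a placeholder for the retrained tokenizer, and the deduplication logic is an assumption; this is not the exact script used to build this checkpoint.

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Base Flan-T5 tokenizer and a newly trained SentencePiece model.
# "vi_en_zh_spm.model" is a hypothetical path to the Vietnamese/English/Chinese tokenizer.
base_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
sp = spm.SentencePieceProcessor(model_file="vi_en_zh_spm.model")

# Keep only pieces that are not already in Flan-T5's vocabulary (deduplication).
existing = set(base_tokenizer.get_vocab())
new_pieces = [
    sp.id_to_piece(i)
    for i in range(sp.get_piece_size())
    if sp.id_to_piece(i) not in existing
]

# Register the new pieces and resize the embeddings to the merged vocabulary size.
base_tokenizer.add_tokens(new_pieces)
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.resize_token_embeddings(len(base_tokenizer))
```

Note that `add_tokens` registers the new pieces on top of the original SentencePiece model rather than rebuilding it, so the resulting token count and ordering may differ from the released 106,611-token vocabulary.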
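Below is an illustrative generation call with the checkpoint loaded in the "How to use" section. The prompt follows Flan-T5-style instruction formatting, and the generation settings (`max_new_tokens`, `num_beams`) are arbitrary example values rather than recommended defaults.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
model.cuda()

# Example instruction-style prompt; the phrasing is illustrative only.
text = "Translate to Vietnamese: How are you today?"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate with example decoding settings.
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```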