metadata

language: vi
tags:
  - vi
  - vietnamese
  - gpt2
  - text-generation
  - lm
  - nlp
datasets:
  - oscar
widget:
  - text: Việt Nam là quốc gia có

GPT-2

A GPT-2 model pretrained on Vietnamese text using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page.

How to use the model

from transformers import GPT2Tokenizer, AutoModelForCausalLM

# Load the tokenizer and the pretrained weights from the Hugging Face Hub
tokenizer = GPT2Tokenizer.from_pretrained("NlpHUST/gpt2-vietnamese")

model = AutoModelForCausalLM.from_pretrained("NlpHUST/gpt2-vietnamese")
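
Once the tokenizer and model are loaded, text can be generated from a prompt. The following is a minimal sketch rather than a recommended setup: the helper name generate_text and the sampling settings (top_k=50, top_p=0.95, 40 new tokens) are illustrative assumptions, not values taken from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, GPT2Tokenizer

MODEL_ID = "NlpHUST/gpt2-vietnamese"


def generate_text(prompt, max_new_tokens=40, model_id=MODEL_ID):
    """Load the checkpoint and return one sampled continuation of `prompt`.

    Hypothetical helper: the sampling parameters below are illustrative
    assumptions, not settings published with the model.
    """
    tokenizer = GPT2Tokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()  # disable dropout for inference

    # Encode the prompt as PyTorch tensors (input_ids and attention_mask)
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # sample instead of greedy decoding
            top_k=50,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
        )

    # The returned sequence contains the prompt followed by the continuation
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, generate_text("Việt Nam là quốc gia có") continues the widget prompt listed in the metadata. Because sampling is enabled, the output differs between calls.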

Model architecture

A 12-layer, 768-hidden-size transformer-based language model.
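
To give a sense of scale, the sketch below counts the non-embedding parameters implied by 12 layers and a hidden size of 768. It assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, two layer norms per block, one final layer norm), which this card does not state explicitly. Embedding parameters are left out because the vocabulary size and context length are not given here.

```python
def gpt2_block_params(d_model, mlp_ratio=4):
    """Parameters in one transformer block, assuming the standard GPT-2 layout."""
    attn = d_model * 3 * d_model + 3 * d_model  # fused query/key/value projection
    attn += d_model * d_model + d_model         # attention output projection
    d_ff = mlp_ratio * d_model                  # assumed 4x MLP expansion
    mlp = d_model * d_ff + d_ff                 # MLP up-projection
    mlp += d_ff * d_model + d_model             # MLP down-projection
    layer_norms = 2 * (2 * d_model)             # two layer norms (weight + bias each)
    return attn + mlp + layer_norms


def non_embedding_params(n_layer, d_model):
    """All block parameters plus the final layer norm, excluding embeddings."""
    final_layer_norm = 2 * d_model
    return n_layer * gpt2_block_params(d_model) + final_layer_norm


total = non_embedding_params(n_layer=12, d_model=768)
print(f"{total:,}")  # prints 85,056,000
```

Under these assumptions the transformer stack holds roughly 85 million parameters. The token and position embeddings add more, depending on the vocabulary size.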

Training

The model was trained on the Vietnamese portion of the OSCAR dataset (32 GB) with a standard causal language modeling objective on a TPU v3-8 for around 6 days. It reaches a perplexity of around 13.4 on a validation set drawn from OSCAR.
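
Perplexity is the exponential of the mean per-token cross-entropy (negative log-likelihood, in nats), so a perplexity of around 13.4 corresponds to a loss of about 2.6 nats per token. The sketch below shows the conversion in both directions; the function names are hypothetical and are not part of the training code.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


def loss_from_perplexity(ppl):
    """Mean per-token cross-entropy (nats) implied by a perplexity value."""
    return math.log(ppl)


# Mean loss implied by the validation perplexity reported above
reported_loss = loss_from_perplexity(13.4)
print(f"{reported_loss:.3f} nats per token")
```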

GPT-2 Finetuning

The following example fine-tunes this GPT-2 model on WikiText-2. It uses the raw version of WikiText-2 (no tokens were replaced before tokenization). The loss is the causal language modeling loss.

The script is available here.

python run_clm.py \
    --model_name_or_path NlpHUST/gpt2-vietnamese \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm