Model Card for GPT-NeoX-2.7B-Vietnamese-finetune

This model is pretrained and fine-tuned on Vietnamese-language data. It is based on GPT-NeoX, a large language model developed by EleutherAI.

Model Details

Training Data

Pre-training: CulturaX Vietnamese Dataset (450 GB) + AI-Hub Vietnamese Dataset (1.3 GB) + Crawled Vietnamese Wikipedia Dataset (630 MB) + viwik18 Dataset (1.27 GB)
Fine-tuning: 12 MB Vietnamese Question & Answer dataset:
Vietnamese Alpaca (16,412 rows) + Vietnamese QA Dataset based on viwik18 (14,293 rows)
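
To give a sense of scale, the sizes listed above can be totaled with a few lines of arithmetic. This is only a rough sketch: the variable names are illustrative, and it assumes decimal units (1 GB = 1000 MB), which the card does not specify.

```python
# Pre-training corpus sizes in GB, as stated in the card.
# Assumption: 630 MB is converted as 0.63 GB (1 GB = 1000 MB).
pretrain_gb = {
    "CulturaX Vietnamese": 450.0,
    "AI-Hub Vietnamese": 1.3,
    "Crawled Vietnamese Wikipedia": 0.63,
    "viwik18": 1.27,
}

# Fine-tuning row counts, as stated in the card.
finetune_rows = {
    "Vietnamese Alpaca": 16412,
    "Vietnamese QA (viwik18)": 14293,
}

total_pretrain_gb = sum(pretrain_gb.values())
total_finetune_rows = sum(finetune_rows.values())
culturax_share = pretrain_gb["CulturaX Vietnamese"] / total_pretrain_gb

print(f"Pre-training corpus: {total_pretrain_gb:.2f} GB")
print(f"CulturaX share of pre-training data: {culturax_share:.1%}")
print(f"Fine-tuning rows: {total_finetune_rows}")
```

Under these assumptions, the pre-training data comes to roughly 453 GB, almost all of it from CulturaX, and the fine-tuning set to 30,705 rows.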

Training Hardware

Trained on an A100 40 GB GPU and a 48-core CPU. Training took 18 hours to reach 10 epochs.

Hyperparameters

Hyperparameter	Value
num_train_epochs	2670182400
train_batch_size	2
learning_rate	0.0001
warmup_steps	1000
weight_decay	0

How to use

The model can be loaded using the AutoModelForCausalLM class:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
model = AutoModelForCausalLM.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
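
Once loaded, the model can be used for text generation. The following is a minimal sketch rather than the author's recommended setup: `load_model` and `generate_text` are hypothetical helper names, the prompt is a made-up example, and `max_new_tokens=50` with greedy decoding is an illustrative choice, not a setting taken from this card.

```python
MODEL_ID = "eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune"


def load_model(model_id=MODEL_ID):
    """Load the tokenizer and model (downloads the weights on first use)."""
    # Imported lazily so generate_text below does not itself require transformers.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def generate_text(tokenizer, model, prompt, max_new_tokens=50):
    """Return the prompt followed by the model's continuation."""
    # Tokenize the prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,  # illustrative value (assumption)
        do_sample=False,                # greedy decoding (assumption)
    )
    # Decode the first (and only) sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (uncomment to run; requires torch and downloads the model weights):
# tokenizer, model = load_model()
# print(generate_text(tokenizer, model, "Việt Nam là"))
```

Greedy decoding is used here only to keep the output deterministic. Sampling options such as `do_sample=True` with `top_p` or `temperature` may give more varied text.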