
training_bilingual

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-bilingual dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training used the run_clm.py script from the transformers library, distributed across two NVIDIA Quadro RTX 6000 GPUs:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
--nproc_per_node=2 run_clm.py --output_dir="./training_bilingual" \
--model_type="gpt2" \
--config_name="./training" \
--tokenizer_name="./training" \
--dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-bilingual" \
--do_train \
--per_device_train_batch_size 8 \
--block_size="1024" \
--learning_rate="5e-3" --warmup_steps="1000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="1" \
--logging_steps="500" \
--save_steps="5000" --preprocessing_num_workers="16" \
--gradient_accumulation_steps="4" --report_to="tensorboard" \
--logging_dir="./log_bilingual"  > command_bilingual_log.log 2>&1 &
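Note that torch.distributed.launch is deprecated in newer PyTorch releases; on a recent install the same run can be started with torchrun instead (an equivalent sketch, not the command actually used for this model):

```shell
# torchrun replaces "python -m torch.distributed.launch" and sets up the
# process group itself; all run_clm.py flags stay exactly as above.
torchrun --nproc_per_node=2 run_clm.py \
  --output_dir="./training_bilingual" \
  --model_type="gpt2" \
  # ... remaining flags unchanged ...
```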

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.005
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 1.0
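The two "total" batch sizes above follow directly from the launch flags; a quick sanity check in plain Python:

```python
# Reproduce the effective batch sizes reported above from the launch flags.
per_device_train_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4

# Each optimizer step consumes per-device batch * devices * accumulation steps.
total_train_batch_size = (per_device_train_batch_size
                          * num_devices
                          * gradient_accumulation_steps)

# Evaluation performs no gradient accumulation, so only the devices multiply.
per_device_eval_batch_size = 8
total_eval_batch_size = per_device_eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)  # 64 16
```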

Training results

Evaluation results

Perplexity was computed on 2000 random examples from the target language's Wikipedia dataset, using the code provided in the perplexity docs with a stride of 512 tokens. The baseline is the result of evaluating OpenAI's GPT-2 on the same examples.

Target language    PPL                   Baseline PPL
en                 40.30453872680664     26.562532424926758
de                 24.30541229248047     56.907039642333984
es                 22.53978729248047     55.592445373535156
fr                 26.614990234375       49.69472885131836
it                 28.24549674987793     75.95120239257812
pt                 19.720951080322266
nl                 33.292930603027344

The following script was used for evaluation:

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random

# Set the seeds for reproducibility (np.random is what actually draws the
# example indices below, so it must be seeded as well)
random.seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-bilingual"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de" # change here for other languages

dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
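In the script above, perplexity is the exponential of the mean per-window negative log-likelihood; a toy illustration with made-up losses:

```python
import math

# Hypothetical per-window NLLs (the real script collects one of these per
# 512-token stride over the concatenated Wikipedia examples).
window_nlls = [3.2, 3.0, 3.4]

# Perplexity is exp(mean NLL) across all windows.
ppl = math.exp(sum(window_nlls) / len(window_nlls))
print(round(ppl, 2))  # 24.53
```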

Framework versions

  • Transformers 4.37.0.dev0
  • Pytorch 1.13.0
  • Datasets 2.16.0
  • Tokenizers 0.15.0