Vocabulary contains hole for index 51959

#4
by tomashoufek - opened

When trying to further pre-train the model on a specific domain, I ran into the following problem:

When tokenizing with the robeczech-base tokenizer, the following warning occurs:
The OrderedVocab you are attempting to save contains a hole for index 51959, your vocabulary could be corrupted !
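
The warning appears to come from the tokenizers library whenever the vocabulary is serialized (datasets.map with num_proc, for instance, pickles the tokenizer for its worker processes). A minimal sketch that seems to trigger the same message, using a throwaway output directory:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
# Saving serializes the vocabulary, which is where the warning above gets emitted.
tokenizer.save_pretrained("robeczech-tokenizer-copy")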

When I start training the model, PyTorch throws the following error:

../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [94,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
# This error is repeated for various block and thread values.

Traceback (most recent call last):
  File "/home/jovyan/tomas/medical-lm/PyTorch/TrainLM-masked-pytorch.py", line 98, in <module>
    result = trainer.train()
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 1100, in forward
    outputs = self.roberta(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
    embedding_output = self.embeddings(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 123, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
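
The failing call at the bottom of the traceback is the embedding lookup, so some input ID must be at least as large as the number of rows in the embedding matrix. A quick size check (just a sketch; the sample sentence is arbitrary) makes the mismatch visible:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

print("embedding rows   :", model.get_input_embeddings().num_embeddings)
print("config.vocab_size:", model.config.vocab_size)
print("len(tokenizer)   :", len(tokenizer))  # number of subwords the tokenizer knows

# DataCollatorForLanguageModeling also replaces a fraction of masked positions
# with random IDs sampled from range(len(tokenizer)), so any gap between
# len(tokenizer) and the embedding size can produce out-of-range indices.
enc = tokenizer("Testovací věta.", return_tensors="pt")
print("max ID in a sample sentence:", int(enc["input_ids"].max()))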

The code I use to pre-train the model:

import torch

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import math
import evaluate

from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

model_name = "ufal/robeczech-base" 

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset('text', data_dir="data/")
datasets = dataset["train"].train_test_split(test_size = 0.1)

print(tokenizer.model_max_length)

print_gpu_utilization()

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=16, remove_columns=["text"])

block_size=512
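# Concatenate all tokenized texts and re-split them into fixed-size blocks of block_size tokens.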
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=16,
)

model = AutoModelForMaskedLM.from_pretrained(model_name)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    f"{model_name}-pre-trained-med",
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=2000,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
    logging_dir=f"{model_name}-pre-trained-med",
    logging_strategy="steps",
    num_train_epochs=10,
    logging_steps=100,
    save_strategy="epoch",
    save_total_limit=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)

result = trainer.train()
print_summary(result)

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.save_model(f"py{model_name}-med-pretrained")

I tried downloading the model and tokenizer locally, "filling" the vocabulary hole, and reloading the tokenizer, but I was not able to reload the fixed tokenizer.

Is there any way to fix this?

Institute of Formal and Applied Linguistics, Charles University, Prague

Hi,

yes, the tokenizer of the RobeCzech model is unfortunately a bit non-standard. Notably, there are multiple subwords that share the ID 3 (originally the ID of the <unk> token).
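
A quick way to see both issues from Python (a sketch, assuming get_vocab() reflects the stored mapping) is to look for IDs shared by several subwords and for indices that no subword uses:

from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
vocab = tokenizer.get_vocab()  # subword -> ID

# IDs shared by more than one subword (ID 3 is expected to show up here)
id_counts = Counter(vocab.values())
print({i: c for i, c in id_counts.items() if c > 1})

# Indices in the ID range that no subword maps to (the warning above reports 51959)
used = set(vocab.values())
print([i for i in range(max(used) + 1) if i not in used])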

  • The problem was caused by the following. We first created a ByteBPE tokenizer, remapped the inputs, and then trained the model with FairSeq, which renumbered the subwords again and created embeddings only for the subwords that actually appeared in the training data. After training, we "composed" the two mappings to arrive at the final tokenizer. The ByteBPE tokenizer, however, requires 256 special tokens representing the byte values 0-255, and some of them were not present in the training data, so they did not get an embedding; yet without these byte-value subwords the ByteBPE tokenizer does not even load.

    Unfortunately, we "solved" the issue by mapping the missing subwords to index 3 (<unk>).

  • We provide both the "fast" and the "slow" tokenizer in this repo, and both map multiple subwords to ID 3. However, they cannot be saved (tokenizers are expected to be injective), so you must refrain from saving them. Furthermore, the number of embeddings does not match the number of subwords in the tokenizer.

  • Other than that, the model works fine, and we have finetuned it successfully both in PyTorch and in TensorFlow.

  • In retrospect, a much better fix would have been to actually append embeddings for the missing tokens (and initialize them to the value of <unk>); the tokenizer would then be injective and standard. A rough local sketch of this idea is included after this list.

    However, changing ufal/robeczech-base this way would not be backward compatible: if people fine-tuned the original version and then tried to load their checkpoints into the updated model, the load would fail (the number of embeddings would differ), which is why we have not done it. We could release a model under a name like ufal/robeczech-standardtokenizer-base with a "normal" tokenizer, but we currently do not think it is worth it.
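
For anyone experimenting locally, that direction would look roughly like the sketch below. It only grows the embedding matrix and fills the new rows with the <unk> embedding; making the tokenizer itself injective would additionally require remapping the duplicated subwords to the newly added IDs, which is not shown here.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

old_size = model.get_input_embeddings().num_embeddings
new_size = max(old_size, len(tokenizer))

# Append rows so that every possible input ID has an embedding ...
model.resize_token_embeddings(new_size)

# ... and initialize the appended rows to the <unk> embedding, as suggested above.
with torch.no_grad():
    embeddings = model.get_input_embeddings().weight
    embeddings[old_size:] = embeddings[tokenizer.unk_token_id]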

Sorry for the trouble and cheers!

foxik changed discussion status to closed
