Unable to set the max number of input tokens higher than 1024

#3
by traopia - opened

Hey!
I am fine-tuning this model with my own data, but if I set the max number of tokens higher than 1024 I get this error:
'IndexError: index out of range in self'
which I would not expect, since I chose this model precisely because it should handle longer input sequences.
Does anyone know why that could be the case?
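For concreteness, a minimal sketch of the kind of call that fails (the 2048 and the dummy text are just placeholders; any length above 1024 triggers it):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Tokenize a dummy document to more than 1024 tokens
inputs = tokenizer("word " * 2000, truncation=True, max_length=2048, return_tensors="pt")

# This forward pass raises: IndexError: index out of range in self
outputs = model(**inputs)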

Can you paste your code here?

I get the exact same error, 'IndexError: index out of range in self', when I set the max number of tokens higher than 1024. Is there a solution to this problem?

Please paste the code here.

from datasets import load_dataset
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding

data_files = {
    "train": "/content/drive/MyDrive/THESIS/train_baseline_documents.csv",
    "test": "/content/drive/MyDrive/THESIS/test_baseline_documents.csv",
    "validation": "/content/drive/MyDrive/THESIS/val_baseline_documents.csv",
}
splitted_dataset = load_dataset("csv", data_files=data_files)

Load the tokenizer of the LED model and tokenize the dataset

checkpoint = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["Document"], truncation=True, padding=True)

tokenized_dataset = splitted_dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.rename_column("Credibility", "labels")
print(tokenized_dataset)
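A variant of the tokenization that reproduces the failure, assuming the token limit is raised through the tokenizer's max_length (the 4096 below is an arbitrary value; anything above 1024 fails):

def tokenize_function_long(example):
    # Any max_length above 1024 leads to the IndexError at training time
    return tokenizer(example["Document"], truncation=True, padding=True, max_length=4096)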

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Training arguments

from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="/home/theotsio", per_device_train_batch_size=6, seed=42)
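As an aside, with inputs longer than 1024 tokens a per-device batch size of 6 may not fit in GPU memory; a variant with commonly used TrainingArguments options (all values illustrative):

training_args = TrainingArguments(
    output_dir="/home/theotsio",
    per_device_train_batch_size=1,   # long sequences need far more memory per example
    gradient_accumulation_steps=6,   # keeps the effective batch size at 6
    fp16=True,                       # mixed precision; requires a CUDA GPU
    seed=42,
)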

Loading the model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
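It may be worth inspecting the checkpoint's configuration at this point: LED has separate position-embedding limits for the encoder and the decoder, and the decoder's limit is exactly the 1024 at which the error appears.

# allenai/led-base-16384 ships with a 16384-position encoder
# but only a 1024-position decoder
print(model.config.max_encoder_position_embeddings)  # 16384
print(model.config.max_decoder_position_embeddings)  # 1024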

Train the pretrained model on the specific task

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

from codecarbon import EmissionsTracker

# Track the energy use and carbon emissions of the training run
tracker = EmissionsTracker()
tracker.start()

Run the training

trainer.train()

tracker.stop()

Get predictions on the validation and test sets

predictions_val = trainer.predict(tokenized_dataset["validation"])
predictions_test = trainer.predict(tokenized_dataset["test"])

import numpy as np

Convert the logits to predicted labels

preds_val = np.argmax(predictions_val.predictions, axis=-1)
preds_test = np.argmax(predictions_test.predictions, axis=-1)

import datasets
metric = datasets.load_metric("accuracy")

print("The validation accuracy", metric.compute(predictions=preds_val, references=predictions_val.label_ids))
print("The test set accuracy", metric.compute(predictions=preds_test, references=predictions_test.label_ids))

This is my code. Thank you for your time.
