Using non-fast tokenizers

#4 · opened by mysil (AI Sweden Model Hub org)

Hi, I am trying to run a few tasks from NorBench on gpt-sw3-126m, specifically a sentiment analysis task. I have loaded the model and tokenizer using the suggested code:

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_name = "AI-Sweden-Models/gpt-sw3-126m"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)

But when running the script for sentiment analysis, I get the following error:
ValueError: word_ids() is not available when using non-fast tokenizers (e.g. instance of a XxxTokenizerFast class).

Is it correct that the instantiated tokenizer is a slow tokenizer, and therefore does not support word_ids()?
Here is a link to the script: https://github.com/ltgoslo/norbench/blob/main/evaluation_scripts/tsa_finetuning.py

AI Sweden Model Hub org

@mysil As part of the ScandEval framework I had to deal with this issue too. I ended up coding a "manual" version of word_ids, and you can find that implementation here.
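For reference, a minimal sketch of what such a "manual" word_ids can look like for a SentencePiece-style slow tokenizer like the one GPT-SW3 uses. This is an illustrative approximation, not the ScandEval implementation: it assumes word starts are marked with the SentencePiece "▁" prefix, and maps special tokens to None to mirror the behavior of fast tokenizers.

```python
def manual_word_ids(tokens, special_tokens=()):
    """Approximate word_ids() for a SentencePiece-style slow tokenizer.

    Assumption: each new word begins with a token carrying the "▁" marker;
    continuation pieces lack it. Special tokens (e.g. "<s>", "</s>") map to
    None, as they do with fast tokenizers.
    """
    word_ids = []
    current = -1
    for tok in tokens:
        if tok in special_tokens:
            word_ids.append(None)
        elif tok.startswith("▁") or current == -1:
            # "▁" marks a word start (or this is the first real token)
            current += 1
            word_ids.append(current)
        else:
            # continuation piece of the current word
            word_ids.append(current)
    return word_ids
```

In practice you would feed it the output of tokenizer.convert_ids_to_tokens(input_ids). Note that this equates "word" with "whitespace-delimited word", which may differ from how a fast tokenizer segments punctuation, so treat it as a starting point rather than a drop-in replacement.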
