openai/whisper · OSError: Can't load tokenizer for 'SameerMahajan/whisper-tiny-retrained'

SameerMahajan

Mar 24, 2023

I retrained a model and saved / uploaded it to https://huggingface.co/SameerMahajan/whisper-tiny-retrained

However, when I try to load it and use it for predictions like this:

import os
import os.path
import torch
from transformers import pipeline

model_id = "SameerMahajan/whisper-tiny-retrained"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=device,
)

for number in range(1, 21):
    for i in range(50):
        audio_file = './samples/' + str(number) + '/' + str(number) + '_' + str(i) + '.wav'
        if os.path.isfile(audio_file):
            out = pipe(audio_file)
            print(audio_file, out)

I get this error:

OSError: Can't load tokenizer for 'SameerMahajan/whisper-tiny-retrained'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'SameerMahajan/whisper-tiny-retrained' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.

I have made sure that I don't have a local directory with that name.

Any ideas?

SameerMahajan

Mar 24, 2023

I was able to fix this problem by passing my custom tokenizer to the pipeline explicitly.
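
For anyone hitting the same error, here is a minimal sketch of that workaround. It assumes the tokenizer is the unmodified one from 'openai/whisper-tiny'; the helper name 'build_asr_pipeline' and its 'tokenizer_id' parameter are made up for illustration, so point it at wherever your tokenizer actually lives.

```python
def build_asr_pipeline(model_id, tokenizer_id="openai/whisper-tiny"):
    """Build an ASR pipeline whose tokenizer comes from a different repo than the model."""
    # Imported inside the function so that defining the helper stays cheap.
    import torch
    from transformers import AutoTokenizer, pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Assumption: fine-tuning left the tokenizer unchanged, so the base one is valid.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        tokenizer=tokenizer,  # passed explicitly instead of being looked up in the model repo
        device=device,
    )


# Usage (downloads the model and tokenizer, so it needs network access):
# pipe = build_asr_pipeline("SameerMahajan/whisper-tiny-retrained")
# print(pipe("./samples/1/1_0.wav"))
```

This only works around the missing files at load time; the model repo itself still has no tokenizer.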

sanchit-gandhi

Mar 24, 2023

edited Mar 24, 2023

It looks like the tokenizer wasn't saved during training. Since the tokenizer isn't changed during fine-tuning, you can simply load the pre-trained tokenizer and push it to your fine-tuned repo:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-tiny")

tokenizer.push_to_hub("SameerMahajan/whisper-tiny-retrained")

Then you'll be able to load the pipeline without the tokenizer argument:

import torch
from transformers import pipeline

model_id = "SameerMahajan/whisper-tiny-retrained"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    device=device,
)
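
If the load still fails, it can help to check which tokenizer files actually reached the repo. The sketch below is a rough check under assumptions: the file names are the ones I'd expect a Whisper tokenizer to save (tokenizer.json for the fast tokenizer, vocab.json plus merges.txt for the slow one), and 'tokenizer_files_present' is a hypothetical helper, not part of transformers.

```python
def tokenizer_files_present(repo_files):
    """Return True if the file list looks sufficient to load a Whisper tokenizer."""
    names = set(repo_files)
    # Assumed file names: tokenizer.json (fast) or vocab.json + merges.txt (slow).
    has_fast = "tokenizer.json" in names
    has_slow = {"vocab.json", "merges.txt"} <= names
    return has_fast or has_slow


# Usage against the Hub (needs network access and the huggingface_hub package):
# from huggingface_hub import list_repo_files
# print(tokenizer_files_present(list_repo_files("SameerMahajan/whisper-tiny-retrained")))

# Offline illustration with hand-written file lists:
print(tokenizer_files_present(["config.json", "pytorch_model.bin"]))           # False
print(tokenizer_files_present(["config.json", "vocab.json", "merges.txt"]))    # True
```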

Hope that answers your question!

sanchit-gandhi changed discussion status to closed Mar 24, 2023