Why do these models perform so poorly in practice, even compared to stock OpenAI Whisper?
Write a comparison script: load any of your models and, in the same script, load its OpenAI counterpart, then transcribe a sentence taken from common_voice_16 with both.
What's the deal?
Can I take a peek at your train code?
Appreciate your very kind feedback! First of all, this is a Japanese ASR model, so you can't expect it to work well on non-Japanese speech (we believe you're aware of this, but asking just in case). Then, assuming you are using the Japanese subset of common_voice_16: how did you load the model, transcribe the audio, and compute the metrics on the dataset?

The following snippet shows how to transcribe the Japanese subset of common_voice_8, the benchmark most commonly used to evaluate Japanese ASR models, and compute CER and WER. As we show in the repo, kotoba-whisper-v1.0 achieves scores on common_voice_8 that are very competitive with openai/whisper-large-v3, so if there really is a huge gap between kotoba-whisper and openai/whisper on the dataset you tested, that's a pretty interesting finding and we would definitely work on it. So please double-check your code first (it might be easiest to adapt our code to your dataset by replacing the japanese-asr/ja_asr.common_voice_8_0 dataset with yours). If you still see a huge drop in the metrics, let us know with the code to reproduce the score (a few failed examples would be helpful too; see the short sketch after the snippet below!).
The training code should be ready soon, after we release a few more models ([kotoba-tech/kotoba-whisper-v1.1](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.1) is ready btw, enjoy!).
Cheers
```python
import torch
from transformers import pipeline
from datasets import load_dataset
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# model config
normalizer = BasicTextNormalizer()
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
pipe = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16
)

# load the dataset and get predictions
dataset = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")
prediction = pipe(dataset["audio"], generate_kwargs=generate_kwargs)
prediction = [i["text"].replace(" ", "") for i in prediction]
references = [i.replace(" ", "") for i in dataset["transcription"]]
normalization = True  # in our eval script this is a CLI flag; hard-coded here
if normalization:
    prediction = [normalizer(i).replace(" ", "") for i in prediction]
    references = [normalizer(i).replace(" ", "") for i in references]
audio_id = [i["path"] for i in dataset["audio"]]  # keep paths to inspect failures

# compute metrics
cer_metric = load("cer")
wer_metric = load("wer")
print("CER:", cer_metric.compute(predictions=prediction, references=references))
print("WER:", wer_metric.compute(predictions=prediction, references=references))
```
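And to pull out a few failed examples, a quick sketch (just one option, reusing `prediction`, `references`, `audio_id`, and `cer_metric` from the snippet above) could rank samples by per-sample CER:

```python
# Sketch: print the worst samples by per-sample CER
# (assumes the variables from the evaluation snippet above)
per_sample = [
    (cer_metric.compute(predictions=[p], references=[r]), a, p, r)
    for p, r, a in zip(prediction, references, audio_id)
]
for cer, path, pred, ref in sorted(per_sample, key=lambda x: x[0], reverse=True)[:5]:
    print(f"CER={cer:.2f} {path}\n  pred: {pred}\n  ref:  {ref}")
```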
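As for the side-by-side script you described, a minimal sketch could look like this (as an illustration we take one sentence from the common_voice_8 eval set above as the audio source; swap in your own common_voice_16 sample instead):

```python
# Sketch: transcribe the same Japanese sentence with kotoba-whisper-v1.0
# and openai/whisper-large-v3, and print both transcriptions side by side.
import torch
from datasets import load_dataset
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# one sentence from the Japanese eval set used in the repo (replace with yours)
sample = load_dataset("japanese-asr/ja_asr.common_voice_8_0", split="test")[0]["audio"]

for model_id in ["kotoba-tech/kotoba-whisper-v1.0", "openai/whisper-large-v3"]:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch_dtype,
        device=device,
        chunk_length_s=15,
    )
    print(model_id, "->", pipe(sample, generate_kwargs=generate_kwargs)["text"])
```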