Issues working with Marathi numbers

#66
by SameerMahajan - opened

We are trying to interpret spoken numbers using the Whisper model. When we give it audio files with recordings of numbers in English, the model gives consistent results, e.g. 30-40 files of the English number 1 consistently generated "One". However, when we tried the same with the Marathi number 1, there were wide variations (20+), like ek, ik, Ik, 1, etc. This will make building our application difficult, as we need to correctly identify the numbers 1 through 300 in Marathi.

We can share our audio files (in both English and Marathi), test code, model, etc. if that will help with the analysis. Please let us know.

Hey @SameerMahajan ! To clarify, are you simply performing audio classification? i.e. you have an audio input where someone says a number, and you want to predict the number that they said. Or is this in the context of speech recognition? i.e. someone speaks a full sentence, one word of which is a number that you want to transcribe correctly.

Is the performance of the model otherwise good on Marathi? Am wondering whether you can fine-tune it for audio classification or speech recognition to boost Marathi performance as required (https://huggingface.co/blog/fine-tune-whisper)

@sanchit-gandhi our case is the simpler, first one: audio classification. This is the only thing we are doing with the model, so we don't know how it performs on Marathi otherwise. Thanks for sharing your blog post on fine-tuning; we will take a look to see whether it helps in our case.

Hey @SameerMahajan ! If you have a couple of hours of labelled audio-text data then you should definitely be able to fine-tune Whisper for this task!

One other thing you can try is using a Marathi model that has already been fine-tuned and see if it's any better: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=mr&split=test

The number one model has a demo built for it: https://huggingface.co/spaces/DrishtiSharma/Whisper-Marathi-Transcriber

IMO it's worth trying this out first and seeing how it performs.

You can also use the Python API to directly use the model in a Python script:

import torch
from transformers import pipeline

model_id = "DrishtiSharma/whisper-large-v2-marathi"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
   "automatic-speech-recognition",
   model=model_id,
   device=device,
)

audio = "PATH/TO/YOUR/AUDIO"  # replace with the path to your audio file

out = pipe(audio)

Thanks @sanchit-gandhi we will take a look. Do you have any example of retraining / tuning model with our own custom data (rather than your datasets which have somewhat complex structure)? The reason I am asking is that we have only numbers and some pre recorded audio files (30-40 samples per number). We can just label them as "1", "2", ... etc., try retraining / tuning and see what we get.

@sanchit-gandhi one problem with this tuned marathi model is that it is very large (6.17 GB as I see it during download). Our use case (https://youtu.be/L3L4mEszzTs) requires us to build an offline android app which typically cannot exceed a couple of hundred MBs...

Hey @SameerMahajan ! I've looked deeper into fine-tuning for audio classification and got it working very well! With language identification, I get state of the art results: https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id

See related PR here: https://github.com/huggingface/transformers/pull/21754

And expect a blog post shortly!

Regarding model size, you can try the small or tiny checkpoints on the leaderboard? These (as the names suggest) are smaller than the large checkpoint!

Leaderboard: https://huggingface.co/spaces/whisper-event/leaderboard?dataset=mozilla-foundation%2Fcommon_voice_11_0&config=mr&split=test
Small: https://huggingface.co/steja/whisper-small-marathi
Tiny: https://huggingface.co/shripadbhat/whisper-tiny-mr

The tiny one is only about 39 million parameters (vs large-v2, which is about 1.55 billion), so it's very fast and lightweight
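To put the size concern in numbers, here is a rough back-of-the-envelope sketch. It assumes the approximate parameter counts published in the Whisper README (they are not stated in this thread) and that a checkpoint is roughly parameters times bytes per parameter, ignoring file overhead. Under those assumptions, large-v2 in fp32 comes out close to the 6.17 GB download mentioned above, while tiny lands under the "couple of hundred MBs" budget.

```python
# Approximate parameter counts from the Whisper README
# (assumption: these figures are not given in this thread).
PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large-v2": 1550e6,
}


def checkpoint_size_mb(num_params, bytes_per_param=4):
    """Rough weight size in megabytes (1 MB = 1e6 bytes); ignores file overhead."""
    return num_params * bytes_per_param / 1e6


for name, n in PARAMS.items():
    fp32 = checkpoint_size_mb(n)      # 4 bytes per parameter
    fp16 = checkpoint_size_mb(n, 2)   # 2 bytes per parameter
    print(f"{name:9s} fp32 ~{fp32:6.0f} MB   fp16 ~{fp16:6.0f} MB")
```

Half precision or quantisation would shrink these further, but whether accuracy survives that on Marathi numbers would need to be measured.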

These are the different ways you can get your dataset into the right format: https://huggingface.co/docs/datasets/audio_dataset#local-files

@sanchit-gandhi Trying to use the tiny model I am getting an error of:

requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/shripadbhat/whisper-tiny-mr/resolve/main/config.json

I am trying to use the model as:

model_id = "shripadbhat/whisper-tiny-mr"

in your above code. Here is the complete stack:

Traceback (most recent call last):
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_errors.py", line 259, in hf_raise_for_status
response.raise_for_status()
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 960, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/shripadbhat/whisper-tiny-mr/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\file_download.py", line 1305, in hf_hub_download
http_get(
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\file_download.py", line 514, in http_get
hf_raise_for_status(r)
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_errors.py", line 301, in hf_raise_for_status
raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/shripadbhat/whisper-tiny-mr/resolve/main/config.json (Request ID: Root=1-640c53a1-5ee17d5232f5e2717c822557)

Internal Error - We're working hard to fix that as soon as possible!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\ML\Tables\whisper\try.py", line 8, in <module>
pipe = pipeline(
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\pipelines\__init__.py", line 675, in pipeline
config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\auto\configuration_auto.py", line 852, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\configuration_utils.py", line 565, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\configuration_utils.py", line 620, in _get_config_dict
resolved_config_file = cached_file(
File "C:...\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\hub.py", line 466, in cached_file
raise EnvironmentError(f"There was a specific connection error when trying to load {path_or_repo_id}:\n{err}")
OSError: There was a specific connection error when trying to load shripadbhat/whisper-tiny-mr:
500 Server Error: Internal Server Error for url: https://huggingface.co/shripadbhat/whisper-tiny-mr/resolve/main/config.json (Request ID: Root=1-640c53a1-5ee17d5232f5e2717c822557)

Internal Error - We're working hard to fix that as soon as possible!

Hey @SameerMahajan ! Looks like there was a network outage on the HuggingFace Hub - should be fixed now and the code should run!

Thanks @sanchit-gandhi I could use the tiny model now.

I ran the tiny and large models over my recordings of the number 1 in Marathi. The large model did fairly well (though it could have been better), but the tiny model's results are not acceptable for our use case, and the large model is not acceptable because of its size :( Here are the details:

with tiny model:
1_0.wav {'text': ' E.'}
1_1.wav {'text': ' In.'}
1_2.wav {'text': ' [MUSIC PLAYING]'}
1_3.wav {'text': ' A.'}
1_4.wav {'text': ' In.'}
1_5.wav {'text': ' [INAUDIBLE]'}
1_6.wav {'text': ' A'}
1_7.wav {'text': ' [MUSIC PLAYING]'}
1_8.wav {'text': ' [MUSIC PLAYING]'}
1_9.wav {'text': ' [MUSIC PLAYING]'}
1_10.wav {'text': ' [MUSIC PLAYING]'}
1_11.wav {'text': ' [MUSIC PLAYING]'}
1_12.wav {'text': ' [INAUDIBLE]'}
1_13.wav {'text': ' Egg?'}
1_14.wav {'text': ' Egg?'}
1_15.wav {'text': ' [MUSIC PLAYING]'}
1_16.wav {'text': ' Egg.'}
1_17.wav {'text': ' [MUSIC PLAYING]'}
1_18.wav {'text': ' [MUSIC PLAYING]'}
1_19.wav {'text': ' [MUSIC PLAYING]'}
1_20.wav {'text': ' Egg.'}
1_21.wav {'text': ' Em.'}
1_22.wav {'text': ' [MUSIC PLAYING]'}
1_23.wav {'text': ' [BLANK_AUDIO]'}
1_24.wav {'text': ' Inc.'}
1_25.wav {'text': ' A.'}
1_26.wav {'text': ' [MUSIC PLAYING]'}

with large model:
1_0.wav {'text': 'एक.'}
1_1.wav {'text': 'एक.'}
1_2.wav {'text': 'एक'}
1_3.wav {'text': ' E.'}
1_4.wav {'text': ' एक.'}
1_5.wav {'text': ' Egg.'}
1_6.wav {'text': 'एक.'}
1_7.wav {'text': ' EK.'}
1_8.wav {'text': 'एक.'}
1_9.wav {'text': ' एक.'}
1_10.wav {'text': ' E.'}
1_11.wav {'text': 'एक.'}
1_12.wav {'text': ' E.'}
1_13.wav {'text': 'एक.'}
1_14.wav {'text': ''}
1_15.wav {'text': 'एक.'}
1_16.wav {'text': ' Egg.'}
1_17.wav {'text': ' 1.'}
1_18.wav {'text': 'एक.'}
1_19.wav {'text': ''}
1_20.wav {'text': ''}
1_21.wav {'text': ' E.'}
1_22.wav {'text': 'एक.'}
1_23.wav {'text': ' Egg.'}
1_24.wav {'text': 'एक.'}
1_25.wav {'text': ''}
1_26.wav {'text': ''}
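To make the comparison less anecdotal, the outputs can be tallied after a light normalisation. The sketch below copies the 27 large-model transcriptions listed above; the `ACCEPTED` set is a hypothetical choice of spellings that an application might treat as the number 1, not something defined in this thread.

```python
from collections import Counter

# Transcriptions of the 27 recordings of Marathi "1", copied from the
# large-model output listed above (files 1_0.wav to 1_26.wav, in order).
large_outputs = [
    'एक.', 'एक.', 'एक', ' E.', ' एक.', ' Egg.', 'एक.', ' EK.', 'एक.',
    ' एक.', ' E.', 'एक.', ' E.', 'एक.', '', 'एक.', ' Egg.', ' 1.',
    'एक.', '', '', ' E.', 'एक.', ' Egg.', 'एक.', '', '',
]

# Hypothetical set of spellings we choose to accept as the number 1.
ACCEPTED = {"एक", "ek", "1"}


def normalise(text):
    """Strip surrounding whitespace and trailing punctuation, then lowercase."""
    return text.strip().rstrip(".?!").lower()


counts = Counter(normalise(t) for t in large_outputs)
correct = sum(n for form, n in counts.items() if form in ACCEPTED)

print(counts.most_common())
print(f"{correct}/{len(large_outputs)} accepted as 1")  # 15/27 accepted as 1
```

With this (assumed) acceptance rule the large model gets 15 of 27 recordings right, and 5 come back empty, which is a concrete baseline to compare a fine-tuned model against.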

@sanchit-gandhi I tried to follow the steps given in your pointer to push my data as a dataset. However when I do

audio_dataset.push_to_hub("SameerMahajan/my_trial_dataset")

I get an error of:

Repository Not Found for url: https://huggingface.co/api/datasets/SameerMahajan/my_trial_dataset.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.

I had already run

huggingface-cli login

with my token to get this far....

Also @sanchit-gandhi from your link https://huggingface.co/docs/datasets/audio_dataset#local-files to get my data into the right format, I could not figure out how I can provide my labels to my individual data points in my dataset...

Based on some samples I tried putting my recordings in different folders e.g. all number 1 recordings in folder "1" etc. and specified these paths to my dataset creation. However it did not create the 'label' column as mentioned in the link...

I guess I could work with hugging face Datasets using add_column, concatenate_datasets etc. to get my data in the right format. Now let me see whether I can fit it into your training pipeline...
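For readers stuck at the same point, here is one possible way to attach labels when the recordings are sorted into per-number folders (e.g. all recordings of 1 in a folder named "1"). It is a sketch under assumptions: the folder layout, the column name `label`, the 16 kHz sampling rate, and the helper names `collect_labelled_files` and `build_dataset` are all illustrative and not taken from the thread. The file-walking part is plain Python; only `build_dataset` needs the `datasets` library.

```python
import tempfile
from pathlib import Path


def collect_labelled_files(root):
    """Return (path, label) pairs for every .wav file under root/<number>/."""
    pairs = []
    for folder in sorted(Path(root).iterdir()):
        # Only folders whose name is a number are treated as classes.
        if folder.is_dir() and folder.name.isdigit():
            for wav in sorted(folder.glob("*.wav")):
                pairs.append((str(wav), int(folder.name)))
    return pairs


def build_dataset(root, sampling_rate=16000):
    """Build a datasets.Dataset with an 'audio' column and an integer 'label' column."""
    # Imported lazily so collect_labelled_files works without `datasets` installed.
    from datasets import Audio, Dataset

    pairs = collect_labelled_files(root)
    data = {
        "audio": [path for path, _ in pairs],
        "label": [label for _, label in pairs],
    }
    return Dataset.from_dict(data).cast_column("audio", Audio(sampling_rate=sampling_rate))


# Tiny self-check on a throwaway folder layout (empty placeholder files, no real audio).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1/1_0.wav", "1/1_1.wav", "2/2_0.wav", "notes/readme.wav"):
        path = Path(tmp, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    demo_labels = [label for _, label in collect_labelled_files(tmp)]

print(demo_labels)  # [1, 1, 2]
```

On real data, something like `build_dataset("./samples")` would then give a dataset that can be pushed to the Hub or passed to a training script.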

When I try to create a repo I get an error of:

You don't have the rights to create a model under this namespace

How do I request creation rights? Alternatively can I work without using hub for my retrained models etc.? If so, how?

@sanchit-gandhi ,

I could train with something like:

https://github.com/sameermahajan/whisper/blob/main/Retrain.py

however when I set

push_to_hub=True,

I get

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create

due to

huggingface_hub.utils._errors.HfHubHTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6418833b-15996d6344b9c59339f63e1b)

and

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/SameerMahajan/whisper-tiny-retrained

huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-6418833c-430d7ab9489fbeb14ac0d04e)

When I try

from huggingface_hub import create_repo
create_repo("SameerMahajan/whisper-tiny-retrained")

I get

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create

and

huggingface_hub.utils._errors.HfHubHTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6418836f-21598dc73265a63342bee7a2)

In the pipeline I don't see a way to point to my in memory model which I suppose should be retrained after

trainer.train()

I could create a 'WRITE' token and then proceed with pushing to hub. I guess the default token had only READ access.

Yep! You need a token with write access for this to work - great find!

Hey @SameerMahajan ! I've seen that you've successfully got your dataset up on the Hub! https://huggingface.co/datasets/SameerMahajan/marathi_numbers-1-20

Nice work!

Could you tweak your dataset in one way:

from datasets import load_dataset
from datasets import ClassLabel, Sequence

raw_datasets = load_dataset("SameerMahajan/marathi_numbers-1-20")

def preprocess(batch):
    batch["number"] = batch["labels"]
    batch["labels"] =  [batch["labels"][0] - 1]
    return batch

raw_datasets = raw_datasets.map(preprocess)

# get labels from the dataset
label_names = sorted(set(label for labels in raw_datasets["train"]["labels"] for label in labels))

# Cast to ClassLabel
raw_datasets = raw_datasets.cast_column("labels", Sequence(ClassLabel(names=label_names)))

# push to hub
raw_datasets.push_to_hub("SameerMahajan/marathi_numbers-1-20")

Could you try this for fine-tuning your model?

  1. Update your transformers version: pip install -U transformers
  2. Copy the script run_audio_classification.py from here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py
  3. Change the lines:
    labels = raw_datasets["train"].features[data_args.label_column_name].names
    label2id, id2label = {}, {}
    for i, label in enumerate(labels):
        label2id[label] = str(i)
        id2label[str(i)] = label

to:

    labels = raw_datasets["train"].features[data_args.label_column_name].feature.names
    label2id, id2label = {}, {}
    for i, label in enumerate(labels):
        label2id[label] = str(i + 1)
        id2label[str(i + 1)] = label
  4. Create a bash file called run.sh (just copy, paste and execute what I've put below):
echo 'python run_audio_classification.py \
    --model_name_or_path openai/whisper-tiny \
    --dataset_name SameerMahajan/marathi_numbers-1-20 \
    --train_split_name train \
    --eval_split_name train \
    --output_dir whisper-tiny-ft-marathi-numbers \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --label_column_name labels \
    --do_train \
    --do_eval \
    --fp16 \
    --learning_rate 3e-5 \
    --max_length_seconds 3 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --per_device_eval_batch_size 8 \
    --dataloader_num_workers 4 \
    --logging_strategy steps \
    --logging_steps 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model accuracy \
    --save_total_limit 3 \
    --seed 0 \
    --freeze_feature_encoder False \
    --push_to_hub'  >> run.sh
  5. Call run.sh to start training!

I did all of the above steps and am training the model here: https://huggingface.co/sanchit-gandhi/whisper-tiny-ft-marathi-numbers

Let me know if that helps! Those are all the steps you need to fine-tune the model for number identification! You might need to increase the regularisation since your dataset is small: https://huggingface.co/openai/whisper-tiny/blob/a8d76517e6d65d92771752dbbf5e9c0a1a5b3a0d/config.json#L20

Note that we just train and evaluate on the same dataset (this is why we get 100% accuracy) - if you had more data we would want to evaluate on some held-out dataset to check your model performance
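As an illustration of what a held-out evaluation could look like once there are enough recordings, here is a minimal stratified split in plain Python. It assumes a hypothetical layout of 20 numbers with 30 recordings each; the function name `stratified_split` and the 20% test share are illustrative choices, not part of the training script above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), holding out a share of every class."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep at least one example per class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical layout: 20 numbers with 30 recordings each.
labels = [number for number in range(1, 21) for _ in range(30)]
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))  # 480 120
```

The two index lists could then be used to build separate train and test splits, with `--eval_split_name` pointing at the held-out one instead of `train`.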

You can also try training for fewer epochs to reduce overfitting.

@sanchit-gandhi thanks for your pointers however

  • when I try inference APIs on your model I keep getting "unknown error"

  • for training / audio classification, updating transformers did not work; I had to install it from source. Even then, when I run run_audio_classification.py I get the error
    AttributeError: Can't pickle local object 'main.<locals>.train_transforms'
    followed by
    EOFError: Ran out of input

and then it looks like it just hangs.

I guess I will try testing your model in my program https://github.com/sameermahajan/whisper/blob/main/MyMarathiModel.py next.

Trying your model in my https://github.com/sameermahajan/whisper/blob/main/LiveDemo.py inference program also gives an error of

KeyError: 0

in "transformers\pipelines\audio_classification.py", line 175, in <listcomp>
labels = [{"score": score, "label": self.model.config.id2label[_id]} for score, _id in zip(scores, ids)]
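One plausible explanation for this `KeyError: 0` (an assumption, not confirmed in the thread) is that the pipeline looks up `id2label` with class indices starting at 0, while the modified training script wrote keys starting at 1. The sketch below reproduces that mismatch with plain dictionaries and shows a zero-based mapping that still displays the numbers 1 to 20; the `class_<i>` names are placeholders.

```python
num_classes = 20

# Mapping as written by the modified script: string keys "1" .. "20".
shifted = {str(i + 1): f"class_{i}" for i in range(num_classes)}

# Assumption: the keys end up as integers once the model config is loaded.
shifted_int = {int(key): value for key, value in shifted.items()}
print(0 in shifted_int)  # False, so looking up class index 0 fails

# Zero-based keys, with the spoken number (index + 1) as the label text.
id2label = {i: str(i + 1) for i in range(num_classes)}
label2id = {label: i for i, label in id2label.items()}
print(id2label[0], id2label[19])  # 1 20
```

Keeping the keys aligned with the model's output indices and putting the human-readable number in the value avoids the off-by-one lookup.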

@sanchit-gandhi when I try your sanchit-gandhi/whisper-tiny-ft-marathi-numbers model in https://github.com/sameermahajan/whisper/blob/main/MyMarathiModel.py I think it is working. I get only 3 or 4 variations for each number (e.g. only the 2 variations below for number 1). How do I change that really long text to something manageably small?

./samples/1/1_33.wav {'text': ' мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty 
salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty salty'}
./samples/1/1_34.wav {'text': ' мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод мод'}

Better yet would be if I could make the pipeline predict the 'labels' or 'number' that we have specified. I couldn't figure out how from the parameters that I can specify on pipe / pipeline while predicting... But I might not know where to look for such possibilities...

Hey @SameerMahajan - the model works for me (with a few hacks):

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
from datasets import Audio, load_dataset
import torch

# load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("sanchit-gandhi/whisper-tiny-ft-marathi-numbers")
feature_extractor = AutoFeatureExtractor.from_pretrained("sanchit-gandhi/whisper-tiny-ft-marathi-numbers")

# load dataset and resample
dataset = load_dataset("SameerMahajan/marathi_numbers-1-20")
dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))

# load first audio sample 
sample = dataset["train"][0]
audio = sample["audio"]

# get log mel features
input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features

# move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# forward pass to get logits
with torch.no_grad():
    logits = model(input_features.to(device)).logits

# take argmax to get pred label
pred_label = torch.argmax(logits)
pred_id = model.config.id2label[int(pred_label)]

print("Prediction: ", pred_id)
print("Target: ", sample["number"])

Print Output:

Prediction:  1
Target:  [1]

I've fixed it so that pipeline works now :)

from transformers import pipeline
from datasets import Audio, load_dataset

# load pipeline
pipe = pipeline("audio-classification", model="sanchit-gandhi/whisper-tiny-ft-marathi-numbers")

# load dataset and resample
dataset = load_dataset("SameerMahajan/marathi_numbers-1-20")
dataset = dataset.cast_column("audio", Audio(sampling_rate=pipe.feature_extractor.sampling_rate))

# load first audio sample 
sample = dataset["train"][0]
audio = sample["audio"]["array"]

out = pipe(audio)
print(out)

label = out[0]["label"]
print(label)

The inference API also works at https://huggingface.co/sanchit-gandhi/whisper-tiny-ft-marathi-numbers (screenshot taken 2023-04-04 at 17:26).

Seems like we're all good here right @SameerMahajan ?

@sanchit-gandhi how about predicting from a file using your model? Say I have some other test .wav recording for a number...

Then you would just do:

from transformers import pipeline

# load pipeline
pipe = pipeline("audio-classification", model="sanchit-gandhi/whisper-tiny-ft-marathi-numbers")

# inference
output = pipe("test.wav")
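Building on this, a folder of test recordings could be classified in one go. This is only a sketch: `top_label` and `classify_folder` are hypothetical helper names, and it assumes the pipeline returns a list of dictionaries with `score` and `label` keys, as the earlier `out[0]["label"]` snippet suggests. The example scores are made up purely to show the shape of the output.

```python
from pathlib import Path


def top_label(predictions):
    """Pick the label with the highest score from a list of {"score", "label"} dicts."""
    return max(predictions, key=lambda p: p["score"])["label"]


def classify_folder(folder, model_id="sanchit-gandhi/whisper-tiny-ft-marathi-numbers"):
    """Classify every .wav file in `folder`; returns {filename: predicted label}."""
    # Imported here so top_label can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("audio-classification", model=model_id)
    return {
        wav.name: top_label(pipe(str(wav)))
        for wav in sorted(Path(folder).glob("*.wav"))
    }


# Made-up scores, only to illustrate the expected output structure.
example = [
    {"score": 0.12, "label": "7"},
    {"score": 0.81, "label": "1"},
    {"score": 0.07, "label": "11"},
]
print(top_label(example))  # 1
```

Calling `classify_folder("./samples/1")` on a folder of recordings of a single number gives a quick way to see how often the model agrees with the expected label.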

@sanchit-gandhi the hosted inference API works now but very inaccurate :(

We probably need more data in that case to get the model to work - have you tried training using regularisation (like dropout)? This would also help given the small dataset size.

Otherwise what you can try is prompting. The issue here explains how this works: https://github.com/huggingface/transformers/issues/22395. The PR here will add it to transformers shortly: https://github.com/huggingface/transformers/pull/22496

You can then supply the twenty Marathi numbers as prompts to the Whisper model. This should boost the probability of predicting these words.
