---
language:
  - en
library_name: transformers
pipeline_tag: automatic-speech-recognition
---

Whisper model fine-tuned in int8 with LoRA for child speech recognition (see reference below).
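
If you want to load the checkpoint by hand rather than through `prepare_pipeline` (see Usage below), the following is a minimal sketch. It assumes this repo stores a PEFT LoRA adapter over a Whisper base checkpoint; the base model name used here is a guess, so check `hparams.yaml` for the actual value.

```python
# Manual-loading sketch (assumption: this repo contains a PEFT LoRA adapter;
# the base checkpoint name below is a guess -- verify it in hparams.yaml).
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # assumed base model
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8, as trained
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ".")  # LoRA weights from this repo
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
```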

Usage:

Prepare the pipeline, passing any custom `generate_kwargs` supported by [`GenerationConfig`](https://huggingface.co/docs/transformers/v4.40.0/en/main_classes/text_generation#transformers.GenerationConfig):

```python
asr_model = prepare_pipeline(
    model_dir=".",  # wherever you saved the model
    generate_kwargs={
        "max_new_tokens": 112,
        "num_beams": 1,
        "repetition_penalty": 1.0,
        "do_sample": False,
    },
)
```

Run ASR on a single file:

```python
asr_model(audio_path)
```
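
If `prepare_pipeline` returns a standard `transformers` automatic-speech-recognition pipeline (an assumption based on the `pipeline_tag` above), the result is a dict with a `"text"` key:

```python
# Assumption: standard transformers ASR pipeline output format.
result = asr_model("example.wav")  # "example.wav" is a placeholder path
print(result["text"])
```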

Run ASR on every file in `audio_dir`. If `generate_kwargs` is not specified, you get deterministic greedy decoding with up to 112 generated tokens and no repetition penalty:

```python
ASRdirWhisat(
    audio_dir,
    out_dir="../whisat_results/",
    model_dir=".",
)
```
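
The note above implies `ASRdirWhisat` also accepts `generate_kwargs`; here is a hedged example of overriding the defaults (the exact forwarding behaviour is an assumption, so mirror `prepare_pipeline` above if it differs):

```python
# Assumption: ASRdirWhisat forwards generate_kwargs to the underlying pipeline.
ASRdirWhisat(
    audio_dir,
    out_dir="../whisat_results/",
    model_dir=".",
    generate_kwargs={"num_beams": 4},  # e.g. switch greedy decoding to beam search
)
```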

Training information:

- Training script: `tune_hf_whisper.py`
- Training hyperparameters: `hparams.yaml`
- Training data manifest: `PUBLIC_KIDS_TRAIN_v4_deduped.csv`

Note: to recreate this training you will need to acquire the following public datasets and ensure they are stored at paths consistent with those in the data manifest above:

- MyST (myst-v0.4.2)
- CuKids
- CSLU

Reference:

```bibtex
@inproceedings{southwell2024,
  title     = {Automatic speech recognition tuned for child speech in the classroom},
  author    = {Southwell, Rosy and Ward, Wayne and Trinh, Viet Anh and Clevenger, Charis and Clevenger, Clay and Watts, Emily and Reitman, Jason and D'Mello, Sidney and Whitehill, Jacob},
  booktitle = {{IEEE} International Conference on Acoustics, Speech and Signal Processing, {ICASSP} 2024, Seoul, South Korea, April 14-19, 2024},
  year      = {2024}
}
```