
Whisper model fine-tuned in int8 with LoRA for child speech recognition (see reference below).

Usage:

Prepare the pipeline, providing any custom generate_kwargs supported by https://huggingface.co/docs/transformers/v4.40.0/en/main_classes/text_generation#transformers.GenerationConfig:

asr_model = prepare_pipeline(
    model_dir='.',  # wherever you saved the model
    generate_kwargs={
        'max_new_tokens': 112,
        'num_beams': 1,
        'repetition_penalty': 1,
        'do_sample': False,
    },
)

Run ASR on a single file:

asr_model(audio_path)
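
For example, assuming asr_model behaves like a transformers automatic-speech-recognition pipeline and returns a dict with a 'text' key (an assumption; check prepare_pipeline's actual return type):

result = asr_model('path/to/recording.wav')  # hypothetical path
print(result['text'])  # assumed transformers-style return format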

Run ASR on every audio file in audio_dir. If generate_kwargs is not specified, you get deterministic greedy decoding with up to 112 new tokens and no repetition penalty:

ASRdirWhisat(
    audio_dir,
    out_dir='../whisat_results/',
    model_dir='.',
)
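
Alternatively, you can drive the single-file pipeline over a directory yourself. A minimal sketch, assuming .wav input and the transformers-style return format noted above (both assumptions):

from pathlib import Path

audio_dir = Path('my_audio/')  # hypothetical input directory
out_dir = Path('../whisat_results/')
out_dir.mkdir(parents=True, exist_ok=True)

for wav in sorted(audio_dir.glob('*.wav')):
    result = asr_model(str(wav))  # asr_model from prepare_pipeline above
    # write one transcript per audio file
    (out_dir / f'{wav.stem}.txt').write_text(result['text'])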

Training information:

  • Training script: tune_hf_whisper.py
  • Training hyperparameters: hparams.yaml
  • Training data manifest: PUBLIC_KIDS_TRAIN_v4_deduped.csv
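
For orientation, a minimal sketch of int8 + LoRA fine-tuning of a Whisper checkpoint with transformers, bitsandbytes, and peft. The base checkpoint and LoRA hyperparameters below are illustrative assumptions; the actual configuration is in tune_hf_whisper.py and hparams.yaml:

from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in int8 (the base checkpoint is an assumption here).
model = WhisperForConditionalGeneration.from_pretrained(
    'openai/whisper-large-v2',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; r, alpha, and target_modules are illustrative values.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable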

Note: to recreate this training you will need to acquire the following public datasets:

  • MyST (myst-v0.4.2)
  • CuKids
  • CSLU

and ensure they are stored at paths consistent with those in the data manifest above.

Reference:

@inproceedings{southwell2024,
  title={Automatic speech recognition tuned for child speech in the classroom},
  author={Southwell, Rosy and Ward, Wayne and Trinh, Viet Anh and Clevenger, Charis and Clevenger, Clay and Watts, Emily and Reitman, Jason and D’Mello, Sidney and Whitehill, Jacob},
  booktitle={{IEEE} International Conference on Acoustics, Speech and Signal Processing, {ICASSP} 2024, Seoul, South Korea, April 14-19, 2024},
  year={2024},
}