---
library_name: transformers, peft, torch
tags:
- asr
- whisper
- finetune
- atc
- aircraft
- communications
- english
---

# Model Card for whisper-distil-large-v3-atc-english

A version of Whisper distil-large-v3 fine-tuned with PEFT adapters to transcribe English air traffic control (ATC) radio communications.

## Model Details

### Model Description

- **Developed by:** Jesse Arzate
- **Model type:** Sequence-to-sequence (seq2seq) Transformer-based ASR model
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
- **Finetuned from model:** Whisper ASR: distil-large-v3

### Model Sources

- **Repository:** https://github.com/Vaibhavs10/fast-whisper-finetuning

## Uses

### Direct Use

[More Information Needed]

### Downstream Use

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import pandas as pd
from tqdm import tqdm
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    WhisperProcessor,
)
from peft import PeftModel, PeftConfig

peft_model_id = "baileyarzate/whisper-distil-large-v3-atc-english"  # Hugging Face model path
language = "en"
task = "transcribe"
device = "cuda"

# Load the base Whisper model and wrap it with the fine-tuned PEFT adapter
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, device_map=device
)
model = PeftModel.from_pretrained(model, peft_model_id)
model.config.use_cache = True

tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)

def transcribe(audio):
    with torch.cuda.amp.autocast():
        text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
    return text

# df_subset is a DataFrame whose 'array' column holds 16 kHz audio arrays
transcriptions_finetuned = []
for i in tqdm(range(len(df_subset))):
    # When you only have an audio file path:
    # transcriptions_finetuned.append(transcribe(librosa.load(df["path"][i], sr=16000, offset=df["start"][i], duration=df["stop"][i] - df["start"][i])[0]))
    # When you already have the audio array (saves time):
    transcriptions_finetuned.append(transcribe(df_subset["array"].iloc[i]))

transcriptions_finetuned = pd.DataFrame(transcriptions_finetuned, columns=["transcription_finetuned"])
df_subset = df_subset.reset_index(drop=True)
df_subset = pd.concat([df_subset, transcriptions_finetuned], axis=1)
```

## Training Details

### Training Data

Dataset: ATC audio recordings from actual flight operations.

Size: ~250 hours of annotated data.

### Training Procedure

The procedure is modeled after https://github.com/Vaibhavs10/fast-whisper-finetuning.

#### Preprocessing

- Stripped leading and trailing whitespace from transcript sentences.
- Removed any sentences containing the phrase "UNINTELLIGIBLE" to filter out unclear or garbled speech.
- Removed filler words such as "ah" or "uh".
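For illustration, a minimal sketch of these preprocessing steps applied to a transcript sentence (the exact filler-word list and implementation details are assumptions, not taken from the original training code):

```python
import re

# Assumed filler-word list; the exact set used during training is not documented here.
FILLER_WORDS = {"ah", "uh", "um", "er"}

def clean_transcript(sentence: str) -> str | None:
    """Clean one transcript sentence; return None if it should be dropped."""
    sentence = sentence.strip()                      # strip leading/trailing whitespace
    if "UNINTELLIGIBLE" in sentence:                 # drop unclear or garbled speech
        return None
    words = [w for w in sentence.split() if w.lower().strip(".,") not in FILLER_WORDS]
    return re.sub(r"\s+", " ", " ".join(words))

# Example
raw_sentences = [
    "  american two forty one uh descend and maintain one zero thousand  ",
    "UNINTELLIGIBLE transmission",
]
cleaned = [s for s in (clean_transcript(r) for r in raw_sentences) if s]
# -> ["american two forty one descend and maintain one zero thousand"]
```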
#### Training Hyperparameters

- **Training regime:** fp16 mixed precision (see `fp16=True` below)

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=5e-4,
    warmup_steps=100,
    num_train_epochs=3,
    fp16=True,
    per_device_eval_batch_size=4,
    generation_max_length=128,
    logging_steps=100,
    save_steps=500,
    save_total_limit=3,
    remove_unused_columns=False,  # required because the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
)
```

#### Speeds, Sizes, Times

Inference runs at about 2 samples per second on an RTX A2000.

## Evaluation

Final training loss: 0.103

### Testing Data, Factors & Metrics

#### Testing Data

Dataset: ATC audio recordings from actual flight operations (~250 hours of annotated data). The test set is a random 20% sample of the data, drawn with seed = 42.

#### Factors

[More Information Needed]

#### Metrics

Word Error Rate (WER) and Normalized Word Error Rate.

### Results

Mean WER over 500 test samples: 0.145, with a 95% confidence interval of (0.123, 0.167). A sketch of how such an interval can be computed appears at the end of this card.

#### Summary

[IN PROGRESS]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** RTX A2000
- **Hours used:** 24
- **Cloud Provider:** Private infrastructure
- **Compute Region:** Southern California
- **Carbon Emitted:** 1.57 kg CO2eq

## Technical Specifications

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

#### Hardware

- **CPU:** AMD EPYC 7313P 16-core processor, 3.00 GHz
- **GPU:** NVIDIA RTX A2000
- **VRAM:** 6 GB
- **RAM:** 128 GB

#### Software

- **OS:** Windows 11 Enterprise, 21H2
- **Python:** 3.10.14

## Citation

[IN PROGRESS]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Contact

Jesse Arzate: baileyarzate@gmail.com
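For reference, here is a minimal sketch of how a mean WER and a normal-approximation 95% confidence interval over per-sample scores could be computed. It assumes the `jiwer` package and a hypothetical `transcript` reference column alongside the `transcription_finetuned` column produced by the inference snippet above; neither detail is taken from the original evaluation code.

```python
import numpy as np
import jiwer

def mean_wer_with_ci(references, hypotheses, z=1.96):
    """Per-sample WER, its mean, and a normal-approximation 95% confidence interval."""
    scores = np.array([jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)])
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)

# Hypothetical usage with the DataFrame from the inference snippet:
# mean_wer, ci = mean_wer_with_ci(df_subset["transcript"].tolist(),
#                                 df_subset["transcription_finetuned"].tolist())
```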