hajekad committed
Commit 9c19a17
1 Parent(s): b6ff432

Update README.md

Files changed (1)
  1. README.md +29 -18
README.md CHANGED
@@ -4,30 +4,33 @@ datasets:
 - AudioCaps
 - Clotho-v2.1
 metrics:
-- TODO
+- SPICE
+- CIDEr
+- SPIDEr
+- METEOR
+- SacreBLEU
+
 model-index:
-- name: whisper-TODO-audio-captioning
+- name: whisper-tiny-audio-captioning
   results:
   - task:
       type: audio-captioning
       name: Audio Captioning
     dataset:
-      type: TODO
-      name: TODO
-      split: TODO
+      type: clotho-v2.1
+      name: Clotho
+      split: evaluation
     metrics:
-    - type: Spider
-      value: TODO
     - type: SPICE
-      value: TODO
+      value: 0.1077
     - type: CIDEr
-      value: TODO
+      value: 0.3404
     - type: SPIDEr
-      value: TODO
+      value: 0.2240
     - type: METEOR
-      value: TODO
+      value: 0.3452
     - type: SacreBLEU
-      value: TODO
+      value: 13.77
 license: cc-by-nc-4.0
 language:
 - en
@@ -41,7 +44,7 @@ A transformer encoder-decoder model for automatic audio captioning. As opposed t
 - **Model type:** Whisper encoder-decoder transformer
 - **Language(s) (NLP):** en
 - **License:** cc-by-4.0
-- **Parent Model:** openai/whisper-TODO
+- **Parent Model:** openai/whisper-tiny
 - **Resources for more information:**
   - [GitHub Repo](https://github.com/prompteus/audio-captioning)
   - [Technical Report](TODO)
@@ -55,14 +58,14 @@ Minimal example:
 
 ```python3
 # Load model
-architecture = "openai/whisper-TODO"
-checkpoint = "TODO"
+architecture = "openai/whisper-tiny"
+checkpoint = "MU-NLPC/whisper-tiny-audio-captioning"
 model = audiocap.WhisperForAudioCaptioning.from_pretrained(checkpoint)
 tokenizer = transformers.WhisperTokenizer.from_pretrained(checkpoint, language="en", task="transcribe")
 feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
 
 # Load and preprocess audio
-input_file = "TODO"
+input_file = "..."
 audio, sampling_rate = librosa.load(input_file, sr=feature_extractor.sampling_rate)
 features = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features
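The hunk above ends at feature extraction. For context, decoding could continue roughly as follows — a sketch using the standard Hugging Face `generate` and `batch_decode` interfaces inherited from Whisper; the `max_length` value is an assumption, and the checkpoint may additionally expect the caption-style prefix described under Training details:

```python3
# Hypothetical continuation of the minimal example above: decode a caption
# from the extracted log-mel features with beam search, mirroring the
# 5-beam setup mentioned in the Evaluation section.
model.eval()
outputs = model.generate(
    inputs=features,  # log-mel features from the feature extractor above
    num_beams=5,      # assumption: same beam width as reported in Evaluation
    max_length=100,   # assumption: generous cap on caption length
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```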
 
@@ -93,9 +96,9 @@ Our model class `WhisperForAudioCaptioning` can be found in our git repository o
 
 ## Training details
 
-The model was initialized by original speech-to-text `openai/whisper-TODO` weights. Then, it was pretrained on a mix of (1) subset of AudioSet with synthetic labels, (2) AudioCaps captioning dataset and (3) Clotho v2.1 captioning dataset. Finally, it was finetuned on Clotho v2.1 to focus the model on the specific style of the captions. For each traning input, the model was informed about the source of the data, so it can mimic the caption style in all 3 styles.
+The model was initialized from the original speech-to-text `openai/whisper-tiny` weights. It was then pretrained on a mix of (1) a subset of AudioSet with synthetic labels, (2) the AudioCaps captioning dataset, and (3) the Clotho v2.1 captioning dataset. Finally, it was finetuned on Clotho v2.1 to focus the model on the specific style of its captions. Each training input was tagged with the source of the data, so the model can mimic any of the three captioning styles.
 
-During pretraining, the ratio of samples in each batch was approximately 12:3:1 (AudioSet:AudioCaps:Clotho). The pretraining took TODO steps with batch size 32 and learning rate 2e-5. Finetuning was done on Clotho only, and the model was trained for TODO steps with batch size 32 and learning rate 4e-6. All layers except *fc1* layers were frozen during finetuning.
+During pretraining, the ratio of samples in each batch was approximately 12:3:1 (AudioSet:AudioCaps:Clotho). Pretraining ran for 36000 steps with batch size 32 and learning rate 2e-5. Finetuning was done on Clotho only, for 3900 steps with batch size 32 and learning rate 4e-6. All layers except the *fc1* layers were frozen during finetuning.
 
 For more information about the training regime, see the [technical report](TODO).
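The finetuning freeze described above (everything frozen except the *fc1* feed-forward layers) could be expressed roughly like this — a sketch assuming the standard Hugging Face Whisper module naming, where each encoder and decoder layer has an `fc1` up-projection:

```python3
# Sketch: leave only the fc1 (feed-forward up-projection) parameters
# trainable and freeze everything else. Assumes Hugging Face Whisper
# parameter names such as "model.encoder.layers.0.fc1.weight".
for name, param in model.named_parameters():
    param.requires_grad = "fc1" in name
```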
 
@@ -104,6 +107,14 @@ For more information about the training regime, see the [technical report](TODO)
 
 Metrics reported in the metadata were computed on Clotho v2.1 test split with captions generated using a beam search with 5 beams.
 
+| Metric    | whisper-tiny | whisper-small | whisper-large-v2 |
+|-----------|--------------|---------------|------------------|
+| SacreBLEU | 13.77        | 15.76         | 16.50            |
+| METEOR    | 0.3452       | 0.3781        | 0.3782           |
+| CIDEr     | 0.3404       | 0.4142        | 0.4331           |
+| SPICE     | 0.1077       | 0.1234        | 0.1257           |
+| SPIDEr    | 0.2240       | 0.2687        | 0.2794           |
+
 
 ## Limitations
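As a hedged illustration of how a corpus-level score like the SacreBLEU numbers above is computed, using the `sacrebleu` package's standard API (the caption strings here are invented placeholders, not dataset examples):

```python3
import sacrebleu

# Placeholder captions, only to show the call shape: one hypothesis per
# audio clip, and one stream of reference captions aligned with it.
hypotheses = ["a dog barks while cars pass in the distance"]
references = [["a dog is barking as traffic passes by"]]

print(sacrebleu.corpus_bleu(hypotheses, references).score)
```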
 
 