---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 10.47
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 24.47
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 60.61
      name: Test Age Accuracy
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 61.56
      name: Test Accent Accuracy
---

# SpeechLLM

[The model is still training; we will release the latest checkpoints soon.]

SpeechLLM is a multi-modal LLM trained to predict the metadata of a speaker's turn in a conversation. It combines a HubertX acoustic encoder with a TinyLlama LLM and predicts the following:

1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: the ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model generation:
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
```

## Model Details

- Model size: 2.1B parameters
- Checkpoint: 2000k steps (batch size = 1)
- Adapters: r=4, alpha=8
- Learning rate: 1e-4
- Gradient accumulation steps: 8

## Checkpoint Results

| **Dataset**            | **Word Error Rate (%)** | **Gender Acc. (%)** | **Age Acc. (%)** | **Accent Acc. (%)** |
|:----------------------:|:-----------------------:|:-------------------:|:----------------:|:-------------------:|
| librispeech-test-clean | 7.36                    | 94.90               |                  |                     |
| librispeech-test-other | 10.47                   | 90.99               |                  |                     |
| CommonVoice test       | 24.47                   | 86.80               | 60.61            | 61.56               |
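
The Word Error Rate figures above follow the standard WER definition. Below is a small illustrative sketch of computing WER with the `jiwer` library; the reference and hypothesis transcripts are placeholders, not the actual LibriSpeech or Common Voice test data.

```python
# pip install jiwer
from jiwer import wer

# Placeholder data: reference transcripts paired with the model's predicted "Transcript" outputs.
references = ["yes i got it i'll make the payment now"]
hypotheses = ["yes i got it i will make the payment now"]

# WER = (substitutions + insertions + deletions) / number of reference words
print(f"WER: {wer(references, hypotheses) * 100:.2f}%")
```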
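
The adapter hyperparameters listed under Model Details (r=4, alpha=8) correspond to a typical LoRA-style setup. The sketch below shows such a configuration with the `peft` library as an illustration only; the target modules and dropout value are assumptions, not taken from this card.

```python
from peft import LoraConfig

# Illustrative only: r and lora_alpha come from the Model Details section above;
# target_modules and lora_dropout are assumptions, not documented in this card.
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],  # hypothetical target modules
    task_type="CAUSAL_LM",
)
print(lora_config)
```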
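
The `generate_meta` call shown under Usage returns the prediction as a JSON-formatted string. A minimal sketch of parsing it for downstream use, assuming the returned string is valid JSON as in the sample output above (the audio path is a placeholder):

```python
import json

from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

raw = model.generate_meta(
    audio_path="path-to-audio.wav",  # placeholder path
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Assumption: the model returns valid JSON, as in the sample output in the Usage section.
meta = json.loads(raw)
print(meta["Transcript"])
print(meta["Gender"], meta["Age"], meta["Accent"])
```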