---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
language:
- en
metrics:
- wer
library_name: transformers
model-index:
- name: SpeechLLM
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 12.3
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 18.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 25.01
---

# SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The model is based on the HubertX acoustic encoder and the TinyLlama LLM. It predicts the following:

1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage

```python
# Load the model directly from the Hugging Face Hub
from transformers import AutoModel

# trust_remote_code=True is required because the model class ships with the checkpoint
model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

# Request all six metadata fields for a single audio file
model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```

## Model Details

- Model Size: 2.1B
- Checkpoint: 2000k steps

## Checkpoint Result

| Dataset                | Word Error Rate (%) | Gender Accuracy (%) |
|:----------------------:|:-------------------:|:-------------------:|
| librispeech-test-clean | 12.30               | 87.78               |
| librispeech-test-other | 18.90               | 89.08               |
| CommonVoice test       | 25.01               | 87.53               |
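
## Parsing the Output

The sample generation in the Usage section is JSON-like but ends with a trailing comma, which `json.loads` rejects. The sketch below shows one way to turn such a string into a Python dictionary. It assumes that `generate_meta` returns the generation as a plain string in the format shown in Usage, which this card does not state explicitly; `parse_meta` is a hypothetical helper and not part of the SpeechLLM code.

```python
import json
import re


def parse_meta(raw: str) -> dict:
    """Parse a JSON-like metadata string into a dict (hypothetical helper)."""
    # Keep only the outermost {...} in case the model emits surrounding text.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    body = raw[start:end + 1]

    # Drop a trailing comma before the final closing brace, if present.
    body = re.sub(r",\s*}$", "}", body)
    meta = json.loads(body)

    # The sample output encodes SpeechActivity as the string "True"/"False";
    # convert it to a real boolean for convenience.
    if isinstance(meta.get("SpeechActivity"), str):
        meta["SpeechActivity"] = meta["SpeechActivity"].strip().lower() == "true"
    return meta


# Sample string copied from the Usage section (assumed output format).
example = """
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
"""

meta = parse_meta(example)
print(meta["Transcript"])
```

Because the output comes from a generative model, it may not always be well formed; in practice it is worth wrapping the call in a `try`/`except` for `ValueError` (which also covers `json.JSONDecodeError`) and falling back to the raw string.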