---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 6.73
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 9.13
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 25.66
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 60.41
      name: Test Age Accuracy
    - type: accuracy
      value: 69.59
      name: Test Accent Accuracy
---

# SpeechLLM

[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/skit-ai/SpeechLLM.git)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/skit-ai/SpeechLLM/blob/main/LICENSE)
[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-F9AB00?logo=googlecolab&color=blue)](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing)

![](./speechllm.png)

SpeechLLM is a multi-modal LLM trained to predict metadata about the speaker's turn in a conversation. The speechllm-2B model is based on the HubertX audio encoder and the TinyLlama LLM. The model predicts the following:

1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

## Usage

```python
# Load the model directly from the Hugging Face Hub
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or audio_tensor directly
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
```

Try the model in the [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing). Also, check out our [blog](https://tech.skit.ai/speech-conversational-llms/) on SpeechLLM for end-to-end conversational agents (User Speech -> Response).
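The model expects 16 kHz mono audio. If your recording has a different sample rate or channel count, you can convert it before calling `generate_meta`. Below is a minimal sketch using `torchaudio`; the file name is a placeholder, and it assumes `model` is loaded as in the usage example above and that passing only `audio_tensor` is sufficient, as the example's comment indicates.

```python
import torchaudio
import torchaudio.functional as F

# Load an arbitrary recording (any sample rate, possibly multi-channel)
waveform, sample_rate = torchaudio.load("any-audio.wav")

# Downmix to mono and resample to the 16 kHz the model expects
waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16_000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16_000)

output = model.generate_meta(
    audio_tensor=waveform,
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
)
```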
## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [HubertX](https://huggingface.co/facebook/hubert-xlarge-ll60k), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 2.1B parameters
- **Checkpoint:** 2000k steps (batch size = 1)
- **Adapters:** r=4, alpha=8
- **lr:** 1e-4
- **gradient accumulation steps:** 8

## Checkpoint Result

| **Dataset**                | **Type**            | **Word Error Rate** | **Gender Acc** | **Age Acc** | **Accent Acc** |
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:|
| **librispeech-test-clean** | Read Speech         | 6.73                | 0.9496         |             |                |
| **librispeech-test-other** | Read Speech         | 9.13                | 0.9217         |             |                |
| **CommonVoice test**       | Diverse Accent, Age | 25.66               | 0.8680         | 0.6041      | 0.6959         |

## Cite

```
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url = {https://github.com/skit-ai/SpeechLLM}
}
```
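## Evaluation Sketch

The WER figures in the Checkpoint Result table come from scoring the model's `Transcript` field against reference transcripts. The snippet below is a rough sketch of such a check using the `datasets` and `jiwer` packages, not the official evaluation script; it assumes `model` is loaded as in the Usage section and that `generate_meta` returns a dict-like string matching the example output above.

```python
import ast
import torch
from datasets import load_dataset
from jiwer import wer

# Stream LibriSpeech test-clean (already 16 kHz mono)
ds = load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)

references, hypotheses = [], []
for sample in ds.take(100):  # small subset for a quick check; use the full split for the table's number
    waveform = torch.tensor(sample["audio"]["array"]).unsqueeze(0).float()
    output = model.generate_meta(
        audio_tensor=waveform,
        instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
        max_new_tokens=500,
    )
    meta = ast.literal_eval(output)  # assumes a dict-like string as in the usage example
    references.append(sample["text"].lower())
    hypotheses.append(meta["Transcript"].lower())

print("WER:", wer(references, hypotheses))
```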