|
--- |
|
language: |
|
- en |
|
license: apache-2.0 |
|
library_name: transformers |
|
tags: |
|
- multi-modal |
|
- speech-language |
|
datasets: |
|
- mozilla-foundation/common_voice_16_1 |
|
- openslr/librispeech_asr |
|
- MLCommons/ml_spoken_words |
|
- Ar4ikov/iemocap_audio_text_splitted |
|
metrics: |
|
- wer |
|
- accuracy |
|
model-index: |
|
- name: SpeechLLM |
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Automatic Speech Recognition |
|
dataset: |
|
name: LibriSpeech (clean) |
|
type: librispeech_asr |
|
config: clean |
|
split: test |
|
args: |
|
language: en |
|
metrics: |
|
- type: wer |
|
value: 11.51 |
|
name: Test WER |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Automatic Speech Recognition |
|
dataset: |
|
name: LibriSpeech (other) |
|
type: librispeech_asr |
|
config: other |
|
split: test |
|
args: |
|
language: en |
|
metrics: |
|
- type: wer |
|
value: 16.68 |
|
name: Test WER |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Automatic Speech Recognition |
|
dataset: |
|
name: Common Voice 16.1 |
|
type: common_voice_16_1 |
|
split: test |
|
args: |
|
language: en |
|
metrics: |
|
- type: wer |
|
value: 26.02 |
|
name: Test WER |
|
- task: |
|
type: audio-classification |
|
name: Audio Classification |
|
dataset: |
|
name: Common Voice 16.1 |
|
type: common_voice_16_1 |
|
split: test |
|
args: |
|
language: en |
|
metrics: |
|
- type: accuracy |
|
value: 64.98 |
|
name: Test Age Accuracy |
|
- type: accuracy |
|
value: 81.21 |
|
name: Test Accent Accuracy |
|
--- |
|
|
|
# SpeechLLM |
|
|
|
![](./speechllm.png) |
|
|
|
SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The speechllm-1.5B model is based on a WavLM audio encoder and the TinyLlama LLM. The model predicts the following: |
|
1. **SpeechActivity** : if the audio signal contains speech (True/False) |
|
2. **Transcript** : ASR transcript of the audio |
|
3. **Gender** of the speaker (Female/Male) |
|
4. **Age** of the speaker (Young/Middle-Age/Senior) |
|
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia) |
|
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated) |
|
|
|
## Usage |
|
```python
# Load the model directly from the Hugging Face Hub
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or audio_tensor
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America"
}
'''
```
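
The generated metadata is returned as a JSON-like string, as in the commented example above. Below is a minimal sketch, reusing the `model` object from the snippet above, of parsing that string into a Python dictionary and sanity-checking the categorical fields against the label sets listed earlier; the `parse_meta` and `check_labels` helpers are illustrative, not part of the library.

```python
import ast
import json

# Documented label sets for the categorical fields.
ALLOWED = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {"Africa", "America", "Celtic", "Europe", "Oceania", "South-Asia", "South-East-Asia"},
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}

def parse_meta(raw: str) -> dict:
    """Parse the model's JSON-like output, falling back to a Python-literal parse."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)

def check_labels(meta: dict) -> None:
    """Flag any value outside the documented label sets."""
    for key, allowed in ALLOWED.items():
        if key in meta and meta[key] not in allowed:
            print(f"Unexpected value for {key}: {meta[key]!r}")

raw_output = model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
)
meta = parse_meta(raw_output)
check_labels(meta)
print(meta["Transcript"])
```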
|
|
|
Try the model in this [Google Colab notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing). |
|
|
|
## Model Details |
|
|
|
- **Developed by:** Skit AI |
|
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma) |
|
- **Language:** English |
|
- **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) |
|
- **Model Size:** 1.5B parameters |

- **Checkpoint:** 2000k steps (batch size = 1) |

- **Adapters:** r=8, alpha=16 (an illustrative adapter configuration is sketched after this list) |

- **Learning rate:** 1e-4 |

- **Gradient accumulation steps:** 8 |
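
The card lists only the adapter rank and alpha, not the adapter framework. Assuming LoRA-style adapters trained with the Hugging Face `peft` library, a configuration consistent with the hyperparameters above might look like the sketch below; the target modules, dropout value, and output directory are hypothetical placeholders, not values taken from this card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# r and lora_alpha come from the model card; the rest is assumed for illustration.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # hypothetical target modules
    lora_dropout=0.05,                    # hypothetical
    task_type="CAUSAL_LM",
)

base_llm = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
peft_llm = get_peft_model(base_llm, lora_config)
peft_llm.print_trainable_parameters()

# Optimization settings reported in the card (lr, gradient accumulation, bs=1).
training_args = TrainingArguments(
    output_dir="speechllm-adapter",       # hypothetical
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```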
|
|
|
|
|
## Checkpoint Results |
|
|
|
| **Dataset** | **Type** | **Word Error Rate (%)** | **Gender Acc** | **Age Acc** | **Accent Acc** | |
|
|:--------------------------:|:-------------------:|:-------------------:|:--------------:|:-----------:|:--------------:| |
|
| **librispeech-test-clean** | Read Speech | 11.51 | 0.9594 | | | |
|
| **librispeech-test-other** | Read Speech | 16.68 | 0.9297 | | | |
|
| **CommonVoice test** | Diverse Accent, Age | 26.02 | 0.9476 | 0.6498 | 0.8121 | |
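
Word error rate compares the model's predicted `Transcript` against a reference transcript; the exact text normalization behind the numbers above is not specified here. A minimal sketch of such a comparison using the `jiwer` package (the reference/hypothesis strings below are placeholders, not data from the evaluation sets):

```python
# pip install jiwer
from jiwer import wer

# Hypothetical reference/hypothesis pairs; in practice the references come from
# the LibriSpeech / Common Voice test splits and the hypotheses from the
# "Transcript" field returned by model.generate_meta(...).
references = ["yes i got it i'll make the payment now"]
hypotheses = ["yes i got it i will make the payment now"]

print(f"WER: {wer(references, hypotheses):.4f}")
```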