metadata
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- type: wer
value: 7.3
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- type: wer
value: 10.47
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: wer
value: 24.47
name: Test WER
- task:
type: audio-classification
name: Audio Classification
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: accuracy
value: 60.61
name: Test Age Accuracy
- type: accuracy
value: 61.56
name: Test Accent Accuracy
SpeechLLM
[The model is still training, we will be releasing the latest checkpoints soon...]
SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. speechllm-2B model is based on HubertX audio encoder and TinyLlama LLM. The model predicts the following:
- SpeechActivity : if the audio signal contains speech (True/False)
- Transcript : ASR transcript of the audio
- Gender of the speaker (Female/Male)
- Age of the speaker (Young/Middle-Age/Senior)
- Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
- Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
Usage
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)
model.generate_meta(
audio_path="path-to-audio.wav", #16k Hz, mono
instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
max_new_tokens=500,
return_special_tokens=False
)
# Model Generation
'''
{
"SpeechActivity" : "True",
"Transcript": "Yes, I got it. I'll make the payment now.",
"Gender": "Female",
"Emotion": "Neutral",
"Age": "Young",
"Accent" : "America",
}
'''
Try the model in Google Colab Notebook.
Model Details
- Developed by: Skit AI
- Authors: Shangeth Rajaa, Abhinav Tushar
- Language: English
- Finetuned from model: HubertX, TinyLlama
- Model Size: 2.1 B
- Checkpoint: 2000 k steps (bs=1)
- Adapters: r=4, alpha=8
- lr : 1e-4
- gradient accumulation steps: 8
Checkpoint Result
Dataset | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
---|---|---|---|---|
librispeech-test-clean | 7.36 | 0.9490 | ||
librispeech-test-other | 10.47 | 0.9099 | ||
CommonVoice test | 24.47 | 0.8680 | 0.6061 | 0.6156 |