---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
language:
- en
metrics:
- wer
library_name: transformers
model-index:
- name: SpeechLLM
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 12.3
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 18.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 25.01
---
# SpeechLLM
SpeechLLM is a multi-modal LLM trained to predict metadata about a speaker's turn in a conversation. It is built on the HubertX acoustic encoder and the TinyLlama LLM. The model predicts the following:
1. Speech Activity
2. ASR Transcript
3. Gender of the speaker
4. Age of the speaker
5. Accent of the speaker
6. Emotion of the speaker
## Usage
```python
# Load the model directly from the Hugging Face Hub.
# trust_remote_code=True is required because the model class ships with the checkpoint.
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Example model generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```
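The generation shown above is JSON-like text rather than strict JSON (note the trailing comma before the closing brace). If you want the fields as a Python dict, a minimal sketch along these lines may help. It assumes the output is a string shaped like the example; `parse_meta` is a hypothetical helper and not part of the model's API.

```python
import json
import re


def parse_meta(raw: str) -> dict:
    """Parse a JSON-like metadata string into a dict (hypothetical helper).

    Assumes the text resembles the example generation, including a possible
    trailing comma before the closing brace, which strict JSON rejects.
    """
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    body = raw[start : end + 1]
    # Drop a trailing comma that directly precedes a closing brace.
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


# The example generation from this README, used as a stand-in for real output.
example = """
{ "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
"""

meta = parse_meta(example)
print(meta["Transcript"])
```

All values come back as strings (for example `"True"` for `SpeechActivity`), so convert them yourself if you need booleans or enums.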
## Checkpoint Result
| Dataset                | Word Error Rate (%) | Gender Accuracy (%) |
|:----------------------:|:-------------------:|:-------------------:|
| librispeech-test-clean | 12.30               | 87.78               |
| librispeech-test-other | 18.90               | 89.08               |
| CommonVoice test       | 25.01               | 87.53               |
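
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the reference and the predicted transcript, divided by the number of reference words. The sketch below illustrates the calculation only. The lowercasing and whitespace tokenization are assumptions, and it is not the evaluation script used to produce the numbers above, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") out of six reference words.
wer = word_error_rate("I will make the payment now", "I will make payment now")
print(f"{wer * 100:.2f}%")  # prints 16.67%
```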