File size: 3,637 Bytes
49babd2 7eacacf 49babd2 7eacacf 49babd2 d9290bd 49babd2 8036924 d9290bd 7eacacf 8036924 7eacacf 8036924 7eacacf 8036924 7eacacf 8036924 1b27420 64a7de7 183a43c 60d6884 183a43c 1b27420 477312d 1b27420 477312d 1b27420 74cd139 9c1f1a8 21c6ff5 9c1f1a8 74cd139 1b27420 8386474 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 |
---
language:
- en
license: apache-2.0
library_name: transformers
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- type: wer
value: 7.3
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- type: wer
value: 10.47
name: Test WER
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: wer
value: 24.47
name: Test WER
- task:
type: audio-classification
name: Audio Classification
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: accuracy
value: 60.61
name: Test Age Accuracy
- task:
type: audio-classification
name: Audio Classification
dataset:
name: Common Voice 16.1
type: common_voice_16_1
split: test
args:
language: en
metrics:
- type: accuracy
value: 61.56
name: Test Accent Accuracy
---
# SpeechLLM
[The model is still training, we will be releasing the latest checkpoints soon...]
SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. SpeechLLM model is based on HubertX acoustic encoder and TinyLlama LLM. The model predicts the following:
1. **SpeechActivity** : if the audio signal contains speech (True/False)
2. **Transcript** : ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Usage
```python
# Load model directly from huggingface
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/SpeechLLM", trust_remote_code=True)
model.generate_meta(
audio_path="path-to-audio.wav",
instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
max_new_tokens=500,
return_special_tokens=False
)
# Model Generation
'''
{
"SpeechActivity" : "True",
"Transcript": "Yes, I got it. I'll make the payment now.",
"Gender": "Female",
"Emotion": "Neutral",
"Age": "Young",
"Accent" : "America",
}
'''
```
## Model Details
- Model Size : 2.1 B
- Checkpoint : 2000 k steps (bs=1)
- Adapters : r=4, alpha=8
- lr = 1e-4
- gradient accumulation steps : 8
## Checkpoint Result
| **Dataset** | **Word Error Rate(%)** | **Gender(%)** | **Age(%)** | **Accent(%)** |
|:----------------------:|:----------------------:|:-------------:|:----------:|:-------------:|
| librispeech-test-clean | 0.0736 | 0.9490 | | |
| librispeech-test-other | 0.1047 | 0.9099 | | |
| CommonVoice test | 0.2447 | 0.8680 | 0.6061 | 0.6156 | |