metadata
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - multi-modal
  - speech-language
datasets:
  - mozilla-foundation/common_voice_16_1
  - openslr/librispeech_asr
  - MLCommons/ml_spoken_words
  - Ar4ikov/iemocap_audio_text_splitted
metrics:
  - wer
  - accuracy
model-index:
  - name: SpeechLLM
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 6.73
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 9.13
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 16.1
          type: common_voice_16_1
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 24.47
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: ML Spoken Words
          type: MLCommons/ml_spoken_words
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 36.12
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: IEMOCAP
          type: Ar4ikov/iemocap_audio_text_splitted
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 44.15
            name: Test WER
      - task:
          type: audio-classification
          name: Audio Classification
        dataset:
          name: Common Voice 16.1
          type: common_voice_16_1
          split: test
          args:
            language: en
        metrics:
          - type: accuracy
            value: 62.51
            name: Test Age Accuracy
          - type: accuracy
            value: 64.57
            name: Test Accent Accuracy

SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict metadata about a speaker's turn in a conversation. The speechllm-2B model is based on the HubertX audio encoder and the TinyLlama LLM. The model predicts the following:

  1. SpeechActivity: whether the audio signal contains speech (True/False)
  2. Transcript: ASR transcript of the audio
  3. Gender of the speaker (Female/Male)
  4. Age of the speaker (Young/Middle-Age/Senior)
  5. Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
  6. Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
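
The label sets listed above can be written down as a small lookup table, which is handy for sanity-checking predictions. This is only a sketch: `ALLOWED_VALUES` and `validate_meta` are hypothetical helper names, not part of the released model code, and the field names and string-valued labels are assumed to match the sample output in the Usage section.

```python
# Hypothetical helpers, not part of the released model code.
# Label sets are copied from the list of predicted fields above.
ALLOWED_VALUES = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {
        "Africa", "America", "Celtic", "Europe",
        "Oceania", "South-Asia", "South-East-Asia",
    },
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


def validate_meta(meta: dict) -> list:
    """Return a list of human-readable problems found in a metadata dict."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        # Missing fields are tolerated: the instruction may request a subset.
        if field in meta and meta[field] not in allowed:
            problems.append(f"{field}: unexpected value {meta[field]!r}")
    # Transcript is free text, so only its type is checked.
    if "Transcript" in meta and not isinstance(meta["Transcript"], str):
        problems.append("Transcript: expected a string")
    return problems
```

An empty list means every field that is present carries one of the documented labels.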

Usage

# Load the model directly from Hugging Face
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    # [Optional] pass either audio_path or audio_tensor; torchaudio.load returns (waveform, sample_rate)
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
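
To work with the generated metadata programmatically, the text can be parsed into a dictionary. The sketch below assumes the model returns a JSON-like string shaped like the sample above (the return type of `generate_meta` is not documented here), and that the text may not be strictly valid JSON, for instance because of a trailing comma. `parse_meta` is a hypothetical helper, not part of the model's API.

```python
import json
import re


def parse_meta(text: str) -> dict:
    """Parse a JSON-like metadata string into a dict (hypothetical helper)."""
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    body = text[start:end + 1]
    # Drop a trailing comma before a closing brace, which strict JSON rejects.
    # Limitation: this would also alter a transcript that contains ", }".
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


example = '''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
meta = parse_meta(example)
print(meta["Transcript"])
```

A generation that is truncated or malformed in other ways will still raise an exception from `json.loads`, so callers should be prepared to handle that case.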

Try the model in Google Colab Notebook.
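
The usage example expects 16 kHz mono audio. As a rough pre-flight check, the standard-library `wave` module can confirm a file's sample rate and channel count before it is passed to the model. This is a sketch under the assumption that the input is an uncompressed PCM WAV file; `is_expected_wav` is a hypothetical helper name, and the demo file is generated only for illustration.

```python
import os
import tempfile
import wave


def is_expected_wav(path: str, expected_rate: int = 16000,
                    expected_channels: int = 1) -> bool:
    """Return True if the WAV file has the expected sample rate and channels."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)


# Demo: write 0.1 s of 16-bit silence at 16 kHz mono, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 16-bit samples
    wav.setframerate(16000)  # 16 kHz
    wav.writeframes(b"\x00\x00" * 1600)

print(is_expected_wav(demo_path))
```

Files that fail the check would need to be resampled or downmixed first; how the model behaves on other formats is not stated in this card.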

Model Details

  • Developed by: Skit AI
  • Authors: Shangeth Rajaa, Abhinav Tushar
  • Language: English
  • Finetuned from model: HubertX, TinyLlama
  • Model Size: 2.1B parameters
  • Checkpoint: 2000k steps (batch size 1)
  • Adapters: r=4, alpha=8
  • Learning rate: 1e-4
  • gradient accumulation steps: 8
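
The training settings above imply a couple of derived quantities, worked out in the sketch below. Two assumptions are made that the card does not confirm: that the adapters are LoRA-style (where the low-rank update is conventionally scaled by alpha / r), and that the batch size and gradient accumulation steps combine in the usual way. `TrainingSetup` is a hypothetical container, not the authors' configuration class.

```python
from dataclasses import dataclass


@dataclass
class TrainingSetup:
    """Hypothetical container for the hyperparameters listed above."""
    micro_batch_size: int = 1    # batch size per step, as listed above
    grad_accum_steps: int = 8    # gradient accumulation steps
    learning_rate: float = 1e-4
    adapter_rank: int = 4        # r
    adapter_alpha: int = 8       # alpha

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer update.
        return self.micro_batch_size * self.grad_accum_steps

    @property
    def adapter_scaling(self) -> float:
        # ASSUMPTION: LoRA-style adapters, where the update is scaled by alpha / r.
        return self.adapter_alpha / self.adapter_rank


setup = TrainingSetup()
print(setup.effective_batch_size)  # 8
print(setup.adapter_scaling)       # 2.0
```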

Checkpoint Result

| Dataset | Type | Word Error Rate (%) | Gender Acc | Age Acc | Accent Acc |
|---|---|---|---|---|---|
| librispeech-test-clean | Read Speech | 6.73 | 0.9536 | | |
| librispeech-test-other | Read Speech | 9.13 | 0.9099 | | |
| CommonVoice test | Diverse Accent, Age | 24.27 | 0.8680 | 0.6251 | 0.6457 |
| ML Spoken Words test | Short Utterance | 36.12 | 0.6587 | | |
| IEMOCAP test | Emotional Speech | 44.15 | 0.7557 | | |
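
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition. It is not the evaluation script used for the results above, and any text normalization (casing, punctuation) applied in the reported numbers is unknown; `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Multiplying the returned fraction by 100 gives a percentage comparable in scale to the WER values reported here.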