[The model is still training; we will release the latest checkpoints soon.]

SpeechLLM is a multi-modal LLM trained to predict metadata for a speaker's turn in a conversation. The speechllm-2B model is based on the HubertX audio encoder and the TinyLlama LLM. It predicts the following:

  1. SpeechActivity: whether the audio signal contains speech (True/False)
  2. Transcript: ASR transcript of the audio
  3. Gender of the speaker (Female/Male)
  4. Age of the speaker (Young/Middle-Age/Senior)
  5. Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
  6. Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)


# Load the model directly from Hugging Face
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

# Instruction passed to the model together with the audio
instruction = "Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]"

# Model Generation
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
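
The generation above is a set of key/value pairs, so downstream code will usually want it as a dictionary. The sketch below assumes the model's response arrives as a JSON-formatted string, which the source does not confirm. It parses that string and flags any label that falls outside the value sets listed earlier. The helper name `parse_metadata` and the `ALLOWED` table are hypothetical and are not part of the model's API.

```python
import json

# Label sets copied from the list of predicted fields in this model card.
ALLOWED = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {"Africa", "America", "Celtic", "Europe", "Oceania",
               "South-Asia", "South-East-Asia"},
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


def parse_metadata(raw: str) -> dict:
    """Parse a response (assumed to be JSON) and collect out-of-set labels."""
    fields = json.loads(raw)
    unexpected = {
        key: value
        for key, value in fields.items()
        if key in ALLOWED and value not in ALLOWED[key]
    }
    return {"fields": fields, "unexpected": unexpected}


# Example response, reusing the values shown in this model card.
example = """{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}"""

result = parse_metadata(example)
print(result["fields"]["Transcript"])
print(result["unexpected"])
```

Free-text fields such as Transcript are left unchecked. Only the categorical fields are compared against the documented label sets.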

Model Details

  • Developed by: Skit AI
  • Authors: Shangeth Rajaa, Abhinav Tushar
  • Language: English
  • Finetuned from model: HubertX, TinyLlama
  • Model Size: 2.1B parameters
  • Checkpoint: 2000k steps (batch size 1)
  • Adapters: r=4, alpha=8
  • Learning rate: 1e-4
  • Gradient accumulation steps: 8
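
As a rough illustration of how the training settings above combine, the following sketch computes the effective batch size per optimizer update. It also computes an adapter scaling factor, under the assumption that the adapters are LoRA-style, where updates are commonly scaled by alpha / r. The source does not name the adapter type, so treat that part as an assumption.

```python
# Values taken from the model details in this model card.
batch_size = 1
grad_accum_steps = 8
adapter_r = 4
adapter_alpha = 8

# With gradient accumulation, each optimizer update sees this many samples.
effective_batch_size = batch_size * grad_accum_steps

# ASSUMPTION: LoRA-style adapters, which scale the low-rank update by alpha / r.
adapter_scaling = adapter_alpha / adapter_r

print(effective_batch_size)
print(adapter_scaling)
```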

Checkpoint Result

| Dataset | Word Error Rate (%) | Gender Acc | Age Acc | Accent Acc |
|---|---|---|---|---|
| librispeech-test-clean | 7.36 | 0.9490 | - | - |
| librispeech-test-other | 10.47 | 0.9099 | - | - |
| CommonVoice test | 24.47 | 0.8680 | 0.6061 | 0.6156 |

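
For readers unfamiliar with the Word Error Rate metric reported above, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is not the authors' evaluation script. The function name `word_error_rate` is hypothetical, and real evaluations usually normalize case and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only, not taken from any evaluation set.
reference = "yes i got it i'll make the payment now"
hypothesis = "yes i got it i make payment now"

wer = word_error_rate(reference, hypothesis)
print(round(100 * wer, 2))  # one substitution + one deletion over 9 words: 22.22
```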