Feature Extraction
Transformers
Safetensors
diva
custom_code

Model Card for Diva Llama 3

This is an end-to-end Voice Assistant Model which can handle speech and text as inputs. It is trained using distillation loss. More details in the pre-print here.

See the model in action at diva-audio.github.io or look at the full training logs on Weights&Biases.

Citation

BibTeX:

@misc{DiVA,
      title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata}, 
      author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
      year={2024},
      eprint={2410.02678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.02678}, 
}
    

Inference Example

from transformers import AutoModel
import librosa
import wget

filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav"
)

speech_data, _ = librosa.load(filename, sr=16_000)

model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

print(model.generate([speech_data]))
print(model.generate([speech_data], ["Reply Briefly Like A Pirate"]))

filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-2426554427049983479.wav"
)

speech_data2, _ = librosa.load(filename, sr=16_000)

print(
    model.generate(
        [speech_data, speech_data2],
        ["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"],
    )
)

Table of Contents

Training Details

Training Data

This model was trained on the CommonVoice corpus.

Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 Recordings and a linearly decaying learning rate from 5e-5 to zero, with a linear warmup of 70 steps.

Environmental Impact

  • Hardware Type: V4-256 TPU
  • Hours used: 11 Hours
  • Cloud Provider: Google Cloud.
  • Compute Region: US Central C

Hardware

This model was trained on at V4-256 TPU on Google Cloud.

Software

This model was trained with Levanter

Model Card Authors [optional]

Will Held

Model Card Contact

held@stanford.edu

Downloads last month
516
Safetensors
Model size
2.49B params
Tensor type
F32
ยท
Inference Examples
Inference API (serverless) does not yet support model repos that contain custom code.

Model tree for WillHeld/DiVA-llama-3-v0-8b

Finetuned
(595)
this model

Dataset used to train WillHeld/DiVA-llama-3-v0-8b

Spaces using WillHeld/DiVA-llama-3-v0-8b 2