metadata

license: mpl-2.0
datasets:
  - mozilla-foundation/common_voice_17_0
base_model:
  - meta-llama/Llama-3.1-8B-Instruct

Model Card for Diva Llama 3

This is an end-to-end Voice Assistant Model which can handle speech and text as inputs. It is trained using distillation loss. More details in the pre-print here.

See the model in action at diva-audio.github.io or look at the full training logs on Weights&Biases.

Citation

BibTeX:

@misc{DiVA,
      title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata}, 
      author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
      year={2024},
      eprint={2410.02678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.02678}, 
}

Inference Example

from transformers import AutoModel
import librosa
import wget
from modeling_diva import DiVAModel

filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav"
)

speech_data, _ = librosa.load(filename, sr=16_000)

model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

print(model.generate([speech_data]))
print(model.generate([speech_data], ["Reply Briefly Like A Pirate"]))

filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-2426554427049983479.wav"
)

speech_data2, _ = librosa.load(filename, sr=16_000)

print(
    model.generate(
        [speech_data, speech_data2],
        ["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"],
    )
)

Model Card for DiVA Llama 3
Citation
Table of Contents
Training Details
- Training Data
- Training Procedure
Environmental Impact
Technical Specifications [optional]
- Model Architecture and Objective
- Compute Infrastructure
  - Hardware
  - Software
Model Card Contact

Training Details

Training Data

This model was trained on the CommonVoice corpus.

Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 Recordings and a linearly decaying learning rate from 5e-5 to zero, with a linear warmup of 70 steps.

Environmental Impact

Hardware Type: V4-256 TPU
Hours used: 11 Hours
Cloud Provider: Google Cloud.
Compute Region: US Central C

Hardware

This model was trained on at V4-256 TPU on Google Cloud.

Software

This model was trained with Levanter

Model Card Authors [optional]

Will Held

Model Card Contact

held@stanford.edu