datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
license: apache-2.0
tags:
- sound language model
pipeline_tag: audio-text-to-text
Model Details
We have developed and released the llama3s family of models. This family natively understands both audio and text input.
We perform continual pretraining of the expanded-vocabulary checkpoint homebrewltd/llama3.1-s-whispervq-init on 900M tokens from the homebrewltd/raw-speech-whispervq-v1 dataset.
Model developers: Homebrew Research.
Input: Text and sound.
Output: Text.
Model Architecture: Llama-3.
Language(s): English.
Intended Use
Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Training process
Training Metrics Image: A snapshot of the training loss curve.
MMLU:
Model | MMLU Score |
---|---|
llama3.5-instruct-8b | 69.40 |
ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
ichigo-llama3.1-s-base-v0.3 | 42.11 |
llama3.5-instruct-v0.2 | 50.27 |
Hardware
GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.
GPU Usage:
- Continual Training: 30 hours.
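As a rough sanity check on the hardware figures, the numbers above can be turned into GPU-hours and an approximate token throughput. This is a back-of-the-envelope sketch only: it assumes the 30 hours are wall-clock time for the full 900M-token run and that all 10 GPUs were in use throughout, neither of which is stated explicitly here.

```python
# Back-of-the-envelope compute estimate from the figures in this card.
NUM_GPUS = 10                 # "Cluster of 10x NVIDIA A6000-48GB"
WALL_CLOCK_HOURS = 30         # assumption: 30 hours is wall-clock time
TOTAL_TOKENS = 900_000_000    # 900M tokens used for continual pretraining

# Total GPU time consumed by the run.
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS

# Average throughput over the whole run (cluster-wide and per GPU).
cluster_tokens_per_sec = TOTAL_TOKENS / (WALL_CLOCK_HOURS * 3600)
per_gpu_tokens_per_sec = cluster_tokens_per_sec / NUM_GPUS

print(f"GPU-hours: {gpu_hours}")
print(f"Cluster throughput: {cluster_tokens_per_sec:.0f} tokens/s")
print(f"Per-GPU throughput: {per_gpu_tokens_per_sec:.0f} tokens/s")
```

Under these assumptions the run costs about 300 GPU-hours at an average of roughly 8.3k tokens per second across the cluster.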
Training Arguments
We use the torchtune library for its up-to-date FSDP2 training implementation.
Parameter | Continual Training |
---|---|
Epoch | 1 |
Global batch size | 480 |
Learning Rate | 2e-4 |
Learning Rate Scheduler | Cosine with warmup |
Optimizer | AdamW fused |
Warmup Steps | 50 |
Weight Decay | 0.01 |
Max Sequence Length | 512 |
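To make the table above concrete, the following sketch estimates the number of optimizer steps and evaluates a cosine-with-warmup learning rate at a given step. It is written in plain Python and is not the torchtune scheduler itself. The step count assumes every sequence is filled to the maximum length of 512 tokens, so it is a lower bound, and the linear warmup from zero and decay to zero are assumptions about the schedule shape. The names `lr_at`, `TOKENS_PER_STEP`, and `TOTAL_STEPS` are illustrative only.

```python
import math

PEAK_LR = 2e-4               # "Learning Rate" from the table
WARMUP_STEPS = 50            # "Warmup Steps" from the table
GLOBAL_BATCH_SIZE = 480      # sequences per optimizer step
MAX_SEQ_LEN = 512            # maximum tokens per sequence
TOTAL_TOKENS = 900_000_000   # 900M tokens, one epoch

# Assumption: every sequence is packed to MAX_SEQ_LEN, so this is the
# largest possible token count per step and the smallest step count.
TOKENS_PER_STEP = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN
TOTAL_STEPS = math.ceil(TOTAL_TOKENS / TOKENS_PER_STEP)


def lr_at(step, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS, peak_lr=PEAK_LR):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine decay from peak_lr down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step (max): {TOKENS_PER_STEP}")
print(f"estimated optimizer steps: {TOTAL_STEPS}")
for s in (0, 25, 50, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {s:>5}: lr = {lr_at(s):.2e}")
```

With these settings the warmup covers only a small fraction of the run (50 of a few thousand steps), so the learning rate spends nearly all of training on the cosine decay.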
Citation Information
BibTeX:
@article{Llama3-S: Sound Instruction Language Model 2024,
title={Llama3-S},
author={Homebrew Research},
year=2024,
month=August,
url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}