homebrewltd/Ichigo-llama3.1-s-base-v0.3

metadata

datasets:
  - homebrewltd/instruction-speech-whispervq-v2
language:
  - en
license: apache-2.0
tags:
  - sound language model
pipeline_tag: audio-text-to-text

Model Details

We have developed and released the llama3s model family, which natively understands both audio and text input.

We continually pretrain the expanded-vocabulary checkpoint homebrewltd/llama3.1-s-whispervq-init on 900M tokens from the homebrewltd/raw-speech-whispervq-v1 dataset.
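The idea of an expanded vocabulary can be illustrated with a small sketch: discrete sound codes are given their own token ids, appended after the existing text tokens, so that audio and text share one token sequence. This is only an illustration. The token format (`<|sound_start|>`, `<|sound_end|>`, `<|sound_0000|>`), the codebook size, the toy text vocabulary, and the helper names `expand_vocab` and `encode_sound` are all hypothetical and are not taken from the checkpoint named above.

```python
def expand_vocab(base_vocab, num_sound_codes):
    """Return a copy of base_vocab with sound tokens appended after the text tokens.

    Token names and layout are hypothetical, not the checkpoint's actual vocabulary.
    """
    vocab = dict(base_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    new_tokens = ["<|sound_start|>", "<|sound_end|>"]
    new_tokens += [f"<|sound_{i:04d}|>" for i in range(num_sound_codes)]
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = next_id
            next_id += 1
    return vocab


def encode_sound(codes, vocab):
    """Map quantizer code indices to token ids, wrapped in start/end markers."""
    ids = [vocab["<|sound_start|>"]]
    ids += [vocab[f"<|sound_{c:04d}|>"] for c in codes]
    ids.append(vocab["<|sound_end|>"])
    return ids


# Toy text vocabulary; a real tokenizer is far larger.
base_vocab = {"hello": 0, "world": 1}
expanded = expand_vocab(base_vocab, num_sound_codes=4)
sound_ids = encode_sound([2, 0], expanded)
print(len(expanded), sound_ids)
```

In a setup like this, the embedding rows for the newly added ids carry no learned meaning at first, which is one reason a continual pretraining pass on speech tokens is useful before instruction tuning.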

Model developers Homebrew Research.

Input Text and sound.

Output Text.

Model Architecture Llama-3.

Language(s): English.

Intended Use

Intended Use Cases This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.

Out-of-scope The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

Training process

Training Metrics Image: Below is a snapshot of the training loss curve.

MMLU:

Model	MMLU Score
llama3.5-instruct-8b	69.40
ichigo-llama3.1-s-v0.3: phase 3	63.79
ichigo-llama3.1-s-v0.3: phase 2	63.08
ichigo-llama3.1-s-base-v0.3	42.11
llama3.5-instruct-v0.2	50.27

Hardware

GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.

GPU Usage:

Continual Training: 30 hours.
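As a rough sanity check, the hardware figures can be combined with the 900M-token budget from Model Details to estimate total GPU-hours and average throughput. The sketch below assumes that the 30 hours is wall-clock time for the whole 10-GPU cluster, which the card does not state explicitly; if it were already a GPU-hour figure, the numbers would differ.

```python
NUM_GPUS = 10                 # from GPU Configuration
WALL_CLOCK_HOURS = 30         # assumption: wall-clock time, not GPU-hours
TOTAL_TOKENS = 900_000_000    # from Model Details

# Total compute budget in GPU-hours under the wall-clock assumption.
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS

# Average throughput over the whole run, cluster-wide and per GPU.
tokens_per_second = TOTAL_TOKENS / (WALL_CLOCK_HOURS * 3600)
tokens_per_gpu_second = tokens_per_second / NUM_GPUS

print(f"GPU-hours: {gpu_hours}")
print(f"Cluster throughput: {tokens_per_second:.0f} tokens/s")
print(f"Per-GPU throughput: {tokens_per_gpu_second:.0f} tokens/s")
```

These are averages over the full run and include any time spent on checkpointing or data loading, so peak throughput during training would be somewhat higher.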

Training Arguments

We use the torchtune library, which provides an up-to-date FSDP2 training implementation.

Parameter	Continual Training
Epoch	1
Global batch size	480
Learning Rate	2e-4
Learning Rate Scheduler	Cosine with warmup
Optimizer	AdamW fused
Warmup Steps	50
Weight Decay	0.01
Max Sequence Length	512
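To make the table concrete, the sketch below estimates the number of optimizer steps in one epoch and evaluates a generic cosine schedule with linear warmup at a few points. It assumes every sequence is filled to the maximum length, so the step count is a lower bound, and it assumes the schedule decays to zero. The function `lr_at` is a hypothetical helper written for illustration; it is not torchtune's scheduler and may differ from it in details.

```python
import math

GLOBAL_BATCH_SIZE = 480       # sequences per optimizer step (from the table)
MAX_SEQ_LEN = 512             # tokens per sequence (from the table)
PEAK_LR = 2e-4                # from the table
WARMUP_STEPS = 50             # from the table
TOTAL_TOKENS = 900_000_000    # from Model Details

# Upper bound on tokens consumed per step (every sequence fully packed).
tokens_per_step = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN

# Lower bound on optimizer steps for one epoch over the token budget.
est_total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)


def lr_at(step, total_steps=est_total_steps, peak_lr=PEAK_LR,
          warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed final value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step (upper bound): {tokens_per_step}")
print(f"estimated steps (lower bound): {est_total_steps}")
for step in (0, 25, 50, est_total_steps // 2, est_total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

With only 50 warmup steps out of a few thousand in total, the learning rate spends nearly the whole run on the cosine decay.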

Citation Information

BibTeX:

@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}

Acknowledgement

WhisperSpeech
Meta-Llama-3.1-8B-Instruct