	---
	datasets:
	- homebrewltd/instruction-speech-whispervq-v2
	language:
	- en
	license: apache-2.0
	tags:
	- sound language model
	pipeline_tag: audio-text-to-text
	---

	## Model Details

	We have developed and released the [llama3s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) model family, which natively understands audio and text input.

	We continually pretrain the expanded-vocabulary model [homebrewltd/llama3.1-s-whispervq-init](https://huggingface.co/homebrewltd/llama3.1-s-whispervq-init) on 900M tokens from the [homebrewltd/raw-speech-whispervq-v1](https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v1) dataset.

	Model developers: Homebrew Research.

	Input: Text and sound.

	Output: Text.

	Model Architecture: Llama-3.

	Language(s): English.

	## Intended Use

	Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.

	Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

	## Training process
	Training Metrics Image: Below is a snapshot of the training loss curve.

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/gtpDSs750SkMPJO0-UtFq.png)

	MMLU:

	| Model | MMLU Score |
	| --- | --- |
	| llama3.5-instruct-8b | 69.40 |
	| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
	| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
	| ichigo-llama3.1-s-base-v0.3 | 42.11 |
	| llama3.5-instruct-v0.2 | 50.27 |

	### Hardware

	GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.

	GPU Usage:
	- Continual Training: 30 hours.
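
As a rough sanity check on the figures above, the following sketch derives the GPU-hours and the implied token throughput. It assumes that the 30 hours of wall-clock time cover exactly the 900M-token continual pretraining run on all 10 GPUs; the card does not state this explicitly, so treat the throughput as an order-of-magnitude estimate.

```python
# Figures taken from this card.
NUM_GPUS = 10                 # cluster of 10x NVIDIA A6000-48GB
TRAINING_HOURS = 30           # continual training wall-clock time
TOTAL_TOKENS = 900_000_000    # tokens used for continual pretraining

# Assumption: all 10 GPUs were busy for the full 30 hours on this run.
gpu_hours = NUM_GPUS * TRAINING_HOURS
tokens_per_second = TOTAL_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

if __name__ == "__main__":
    print(f"GPU-hours: {gpu_hours}")
    print(f"cluster throughput: {tokens_per_second:.0f} tokens/s")
    print(f"per-GPU throughput: {tokens_per_second_per_gpu:.0f} tokens/s")
```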

	### Training Arguments

	We use the [torchtune](https://github.com/pytorch/torchtune) library for its up-to-date FSDP2 training implementation.

	| Parameter | Continual Training |
	| --- | --- |
	| Epoch | 1 |
	| Global batch size | 480 |
	| Learning Rate | 2e-4 |
	| Learning Scheduler | Cosine with warmup |
	| Optimizer | AdamW fused |
	| Warmup Steps | 50 |
	| Weight Decay | 0.01 |
	| Max Sequence Length | 512 |
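
To make the training arguments more concrete, here is a minimal sketch of what they imply for the number of optimizer steps and the learning-rate curve. It is not the torchtune implementation. It assumes that every sequence is packed to the full 512 tokens, that warmup is linear from zero, and that the cosine phase decays to zero at the final step; none of these details are stated in the card. The name `lr_at` is hypothetical.

```python
import math

# Values taken from the table above.
GLOBAL_BATCH_SIZE = 480
MAX_SEQ_LEN = 512
PEAK_LR = 2e-4
WARMUP_STEPS = 50

# Taken from the Model Details section; assumed to be consumed in one epoch.
TOTAL_TOKENS = 900_000_000

# Assumption: every sequence in a batch is packed to the full 512 tokens.
tokens_per_step = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    print(f"tokens per optimizer step: {tokens_per_step}")
    print(f"estimated optimizer steps: {total_steps}")
    for step in (0, 25, 50, total_steps // 2, total_steps):
        print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 50 warmup steps cover only a small fraction of the run, so almost all of training happens on the cosine decay. If sequences are shorter than 512 tokens on average, the real step count would be higher than this estimate.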


	## Citation Information

	BibTeX:

	```
	@article{llama3s2024,
	title={Llama3-S: Sound Instruction Language Model},
	author={Homebrew Research},
	year={2024},
	month={August},
	url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
	}
	```

	## Acknowledgement

	- [WhisperSpeech](https://github.com/collabora/WhisperSpeech)

	- [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
