Update README.md
---
datasets:
- jan-hq/instruction-speech-v1.5
language:
- en
license: apache-2.0
tags:
- sound language model
---

## Model Details

We have developed and released the [Jan-Llama3s](https://huggingface.co/collections/jan-hq/jan-llama3-668e4dad446c8736208dca4f) model family, which natively understands both audio and text input.

We continue training from our latest checkpoint, [jan-hq/Jan-Llama3s-0708](https://huggingface.co/jan-hq/Jan-Llama3s-0708), on 1.3B tokens from the [Instruction Speech v1.5](https://huggingface.co/datasets/jan-hq/instruction-speech-v1) dataset.

**Model developers:** Homebrew Research.

**Input:** Text and sound.

**Output:** Text.

**Model Architecture:** Llama-3.

**Language(s):** English.

## Intended Use

**Intended Use Cases:** This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

**Out-of-scope:** Use of Jan-Llama3s in any manner that violates applicable laws or regulations is strictly prohibited.

## How to Get Started with the Model

First, convert the audio file to sound tokens:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device="cuda"):
    # Initialize Encodec
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(target_bandwidth)
    model.to(device)

    # Load and preprocess audio
    wav, sr = torchaudio.load(audio_path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    wav = wav.unsqueeze(0).to(device)

    # Encode audio
    with torch.no_grad():
        encoded_frames = model.encode(wav)
    codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)

    # Flatten codes by interleaving the two codebooks
    audio_code1, audio_code2 = codes[0][0], codes[0][1]
    flatten_tokens = torch.stack((audio_code1, audio_code2), dim=1).flatten().tolist()

    # Convert to sound tokens
    result = ''.join(f'<|sound_{num:04d}|>' for num in flatten_tokens)
    return f'<|sound_start|>{result}<|sound_end|>'

# Usage
sound_tokens = audio_to_sound_tokens("/path/to/your/audio/file")
```
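
The returned string is a flat interleaving of the two Encodec codebooks wrapped in start/end markers, e.g. `<|sound_start|><|sound_0042|><|sound_0817|>…<|sound_end|>` (the token values shown here are illustrative only).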

Then, we can run inference on the model just like any other LLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

def setup_pipeline(model_path, use_4bit=True):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "jan-hq/Jan-Llama3s-0719"
pipe = setup_pipeline(llm_path, use_4bit=True)
messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)
print(generated_text)
```
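
With `use_4bit=True`, the weights are loaded in 4-bit NF4 with bfloat16 compute, trading some output quality for a much smaller memory footprint; pass `use_4bit=False` to load the full bf16 weights instead if you have the VRAM.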

## Training Process

**Training Metrics Image:** Below is a snapshot of the training loss curve.

![train_loss_curve](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/JYUOqGs1c-7vXtJt06KH5.png)

### Hardware

**GPU Configuration:** Cluster of 8x NVIDIA H100-SXM-80GB.

**GPU Usage:**
- **Continual Training:** 14 hours.

### Training Arguments

| Parameter              | Continual Training |
|------------------------|--------------------|
| **Epoch**              | 1                  |
| **Global batch size**  | 128                |
| **Learning rate**      | 1.5e-4             |
| **Learning scheduler** | Cosine with warmup |
| **Optimizer**          | Adam torch fused   |
| **Warmup ratio**       | 0.05               |
| **Weight decay**       | 0.01               |
| **beta1**              | 0.9                |
| **beta2**              | 0.98               |
| **epsilon**            | 1e-6               |
| **Gradient clipping**  | 1.0                |
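
For orientation, the table maps onto Hugging Face `TrainingArguments` roughly as sketched below. This is a minimal sketch, not the released training script: `output_dir` and the per-device batch size / gradient-accumulation split (chosen so that 4 × 8 GPUs × 4 accumulation steps = 128 global) are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table above onto TrainingArguments;
# output_dir and the batch-size split are illustrative assumptions.
training_args = TrainingArguments(
    output_dir="./jan-llama3s-continual",  # assumed
    num_train_epochs=1,
    per_device_train_batch_size=4,         # 4 * 8 GPUs * 4 accum steps = 128 global
    gradient_accumulation_steps=4,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_grad_norm=1.0,                     # gradient clipping
    optim="adamw_torch_fused",
    bf16=True,                             # matches mixed_precision: bf16 below
)
```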

### Accelerate FSDP Config

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
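
Assuming the block above is saved to a YAML file, training can be launched with `accelerate launch --config_file fsdp_config.yaml train.py`, where `fsdp_config.yaml` and `train.py` are placeholder names; Accelerate then spawns 8 processes with full-shard FSDP and activation checkpointing as configured.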

## Examples

1. Good example:

<details>
<summary>Click to toggle Example 1</summary>

```
# Question:
Is the Paris Climate Change agreement going to help slow pollution?

# Expected Output:
Yes, the Paris Climate Agreement is designed to help slow pollution and combat climate change on a global scale. The main goal of the agreement is to limit global warming to well below 2 degrees Celsius, preferably to 1.5 degrees Celsius, compared to pre-industrial levels. To achieve this, countries that have signed the agreement commit to reducing their greenhouse gas emissions.

# Model's Output:
Yes, the Paris Climate Agreement is a global commitment to limit global warming to well below 2 degrees Celsius above pre-industrial levels and pursue efforts to limit it to 1.5 degrees Celsius. The agreement was signed by 195 countries and entered into force on November 4, 2016.
```

</details>

2. Misunderstanding example:

```
# Question:
Why is gasoline a liquid?

# Expected Output:
Gasoline is a liquid because of the nature of its molecules and the forces acting between them. It is made up of hydrocarbons, which are molecules composed of hydrogen and carbon atoms. These molecules are held together by covalent bonds, which are quite strong.

The forces acting between the molecules of gasoline are known as Van der Waals forces, which are relatively weak intermolecular forces. These forces are strong enough to keep the molecules together in a liquid form under normal conditions (room temperature and pressure), but not strong enough to keep them in a solid form.

When the temperature drops significantly, gasoline can become a solid, and when it's heated, it becomes a gas. But under most conditions we encounter in daily life, these forces make gasoline a liquid.

# Model's Output:
В случае, когда вы хотите перевести фразу "The first step is to identify the language of the text." на русский язык, вы можете использовать следующую фразу: "Первый шаг - определить язык текста."
(English translation: "If you want to translate the phrase 'The first step is to identify the language of the text.' into Russian, you can use: 'Первый шаг - определить язык текста.'")
```

3. Off-track example:

```
# Question:
Answer the following question: - The roots of the growing plants start to break up the rock - The plant acids dissolve the rock - The rock is broken down into smaller pieces - Erosion begins. What might be the first step of the process?
Answer:

# Expected Output:
The first step of the process might be: The roots of the growing plants start to break up the rock.

# Model's Output:
Step 1: Identify the key elements in the premise and hypothesis.

Premise: "A man in a blue shirt and a red hat is holding a sign."
Key elements: man, blue shirt, red hat, holding, sign

Hypothesis: "A man is holding a sign."
Key elements: man, holding, sign

Step 2: Compare the key elements in the premise and hypothesis.

In the premise, we have a man wearing a blue shirt and a red hat, while in the hypothesis, we only have a man. The premise also mentions that the man is holding a sign, which is also mentioned in the hypothesis.

Step 3: Determine if the hypothesis is entailed by the premise.

Since the hypothesis states that a man is holding a sign, and the premise confirms that a man is holding a sign, we can conclude that the hypothesis is entailed by the premise. The additional information about the man's clothing in the premise does not contradict or negate the
```

Despite being undertrained, the model demonstrates an emerging grasp of sound-text semantics.

## Citation Information

**BibTeX:**

```bibtex
@article{llama3-sound-2024,
  title={Llama-3-Sound: Sound Instruction LLM},
  author={Homebrew Research},
  year={2024},
  month={July},
  url={https://huggingface.co/jan-hq/Jan-Llama3-0708}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Encodec](https://github.com/facebookresearch/encodec)**
- **[Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)**
|