Update README.md
README.md CHANGED
@@ -29,14 +29,12 @@ margin. Our work reveals that LLMs can be an excellent compressor for music, but
 
 ## Training Data
 
-ChatMusician is pretrained on the 🤗 [MusicPile](https://huggingface.co/datasets/m-a-p/MusicPile), which is the first pretraining corpus for **developing musical abilities** in large language models. Check out the dataset card for more details.
-And supervised fine-tuned on 1.1M samples (a 2:1 ratio between music scores
-and music knowledge & music summary data) from MusicPile. Check our [paper](http://arxiv.org/abs/2402.16153) for more details.
+ChatMusician-Base is pretrained on the 🤗 [MusicPile](https://huggingface.co/datasets/m-a-p/MusicPile), which is the first pretraining corpus for **developing musical abilities** in large language models. Check out the dataset card for more details.
 
 ## Training Procedure
 
 We initialized an fp16-precision ChatMusician-Base from the LLaMA2-7B-Base weights and applied a continual pre-training plus fine-tuning pipeline. LoRA adapters were integrated into the attention and MLP layers, with additional training on embeddings and all linear layers. The maximum sequence length
-was 2048. We utilized 16 80GB-A800 GPUs for one epoch of pre-training
+was 2048. We utilized 16 80GB-A800 GPUs for one epoch of pre-training. DeepSpeed was employed for memory efficiency, and the AdamW optimizer was used with a 1e-4 learning rate and a 5% warmup cosine scheduler. Gradient clipping was set at 1.0. The LoRA dimension, alpha, and dropout were set to 64, 16, and 0.1, respectively, with a batch size of 8.
 
 ## Intended Uses
 These models are trained for research purposes. They are designed to solve general math problems. They can be used in educational software, tutoring systems, or any application where a solution to a math problem is needed. The models can generate both a chain of thought (CoT) rationale and a program of thought (PoT) rationale, providing a comprehensive solution to a given math problem.
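As a quick way to inspect what the updated Training Data section points at, here is a minimal sketch for pulling MusicPile with the Hugging Face `datasets` library. The `"train"` split name is an assumption on my part; the dataset card is the authority on actual split names and column schema.

```python
# Minimal sketch: inspect the MusicPile pretraining corpus.
# Assumption: the Hub repo exposes a "train" split; check the dataset
# card for the real split names and fields.
from datasets import load_dataset

musicpile = load_dataset("m-a-p/MusicPile", split="train")
print(musicpile)     # row count and column names
print(musicpile[0])  # one raw pretraining sample
```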
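The expanded Training Procedure paragraph pins down most hyperparameters (LoRA dimension 64 / alpha 16 / dropout 0.1, AdamW at 1e-4 with 5% cosine warmup, gradient clipping at 1.0, batch size 8, fp16, one epoch, DeepSpeed). Below is a hedged sketch of that configuration using the PEFT library and Hugging Face `Trainer` arguments; the target-module selection, the `modules_to_save` entries, reading "batch size of 8" as per-device, and the DeepSpeed config path are all my assumptions, not confirmed details of the authors' code.

```python
# Hedged sketch of the training recipe described above, using PEFT and
# Hugging Face Trainer arguments. Items marked "assumption" are my
# reading of the README, not the authors' code.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Gated repo: requires an accepted license and an auth token.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                         # LoRA dimension, per the README
    lora_alpha=16,                # LoRA alpha, per the README
    lora_dropout=0.1,             # LoRA dropout, per the README
    target_modules="all-linear",  # attention + MLP projections (assumption;
                                  # needs peft >= 0.8)
    modules_to_save=["embed_tokens", "lm_head"],  # "training on embeddings":
                                                  # assumption about which
                                                  # modules that means
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Trainer defaults to AdamW, matching the README's optimizer choice.
training_args = TrainingArguments(
    output_dir="chatmusician-base-lora",  # hypothetical path
    num_train_epochs=1,                   # one epoch of pre-training
    per_device_train_batch_size=8,        # assumes batch size 8 is per device
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                    # 5% warmup
    max_grad_norm=1.0,                    # gradient clipping at 1.0
    fp16=True,                            # fp16 precision, per the README
    deepspeed="ds_config.json",           # hypothetical DeepSpeed config
)
# Packing/truncating sequences to the 2048-token maximum would happen in
# the dataset tokenization step, which is omitted here.
```

One note on the design choice: with `target_modules="all-linear"`, PEFT attaches adapters to every linear layer (attention and MLP projections alike), which tracks the README's "attention and MLP layers ... and all linear layers" wording more closely than the common q/v-projection-only LoRA setup.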