zaydzuhri committed
Commit
123b76a
1 Parent(s): b2223b3

Model save

Files changed (2)
  1. README.md +54 -165
  2. generation_config.json +6 -0
README.md CHANGED
@@ -1,165 +1,54 @@
- <div align="center">
-
- # 🔥 Flame
-
- </div>
-
- A minimal framework for training FLA models, whether from scratch or through finetuning.
-
- Built on the robust infrastructure of 🤗, `flame` enables you to train large language models with just a few lines of code:
- we use `datasets` for data processing, `transformers` for model definitions, and `accelerate`[^1] for seamless distributed training.
-
- In this README, we will guide you through the process of using `flame` to train GLA models.
-
- ## Setup
-
- To get started, you'll need to install the required packages.
- Both `fla` and `flame` have minimal dependencies.
- Clone the `fla` repository and install the necessary packages as follows:
-
- ```bash
- git clone https://github.com/sustcsonglin/flash-linear-attention.git
- pip install .
- pip install accelerate wandb
- pip3 install deepspeed
- ```
-
- > [!CAUTION]
- > The 🤗 `tokenizers` have some [memory leak issues](https://github.com/huggingface/tokenizers/issues/1539) when processing very long documents.
- > To address this, please ensure you install `tokenizers>=0.20.4`.
-
- ## Preprocessing
-
- Before training, you need to download and pre-tokenize your dataset.
- We provide a straightforward script for this.
- For instance, to tokenize a 10B sample of the `fineweb-edu` dataset, run:
-
- ```bash
- python preprocess.py \
-   --dataset HuggingFaceFW/fineweb-edu \
-   --name sample-10BT \
-   --split train \
-   --context_length 2048
- ```
- or an even smaller example, just for testing:
- ```bash
- python preprocess.py \
-   --dataset alturing/gutenberg-texts \
-   --split train \
-   --context_length 2048
- ```
-
- This will cache the processed dataset at `data/HuggingFaceFW/fineweb-edu/sample-10BT/train`.
-
- GLA utilizes a subset of SlimPajama for pretraining [in the paper](https://proceedings.mlr.press/v235/yang24ab.html).
- Given the size of the dataset, the fastest way to download it is using `git lfs` (refer to [this issue](https://huggingface.co/datasets/cerebras/SlimPajama-627B/discussions/2)).
- ```bash
- git lfs install
- git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
- python preprocess.py \
-   --dataset SlimPajama-627B \
-   --split train \
-   --context_length 2048
- ```
-
- ## Training from scratch
-
- To train your 340M model from scratch, execute the following command:
-
- ```bash
- bash train.sh \
-   type=gla \
-   lr=3e-4 \
-   steps=20480 \
-   batch=8 \
-   update=1 \
-   warmup=1024 \
-   context=2048 \
-   path=exp/gla-340M-10B \
-   project=fla \
-   model=configs/gla_340M.json \
-   data=HuggingFaceFW/fineweb-edu \
-   name=sample-10BT \
-   cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train
- ```
- or for testing SCAN:
- ```bash
- bash train.sh \
-   type=scan \
-   lr=3e-4 \
-   steps=1000 \
-   batch=8 \
-   update=1 \
-   warmup=100 \
-   context=2048 \
-   path=exp/scan-340M-test \
-   project=fla \
-   model=configs/scan_340M.json \
-   data=alturing/gutenberg-texts \
-   name=sample-10BT \
-   cache=data/alturing/gutenberg-texts/train
- ```
-
- `flame` also supports resuming interrupted training by specifying the checkpoint path.
- Simply use the following command to resume training:
-
- ```bash
- bash train.sh \
-   type=gla \
-   lr=3e-4 \
-   steps=20480 \
-   batch=8 \
-   update=1 \
-   warmup=1024 \
-   context=2048 \
-   path=exp/gla-340M-10B \
-   project=fla \
-   model=configs/gla_340M.json \
-   data=HuggingFaceFW/fineweb-edu \
-   name=sample-10BT \
-   cache=data/HuggingFaceFW/fineweb-edu/sample-10BT/train \
-   checkpoint=exp/gla-340M-10B/checkpoint-8192
- ```
-
- You can also use `wandb` to monitor your training process effectively.
-
- ![wandb](https://github.com/user-attachments/assets/05ca031c-1cae-41c9-bfcb-5b6b6d0df729)
-
- ## Continual Pretraining
-
- `flame` supports continual training from a pretrained checkpoint.
- Below, we provide an example of how to finetune Mistral-7B to GLA.
- You can follow similar steps to reproduce the results in the [GSA paper](https://arxiv.org/abs/2409.07146):
-
- 1. Initialize a brand-new GLA-7B model from the config and copy the matched pretrained weights from Mistral-7B:
- ```bash
- cd ../utils
- python convert_from_llama.py \
-   --model mistralai/Mistral-7B-v0.1 \
-   --config ../training/configs/gla_7B.json \
-   --output ../training/converted/gla-7B
- cd -
- ```
-
- 2. Directly launch training from the converted checkpoint:
- ```bash
- bash train.sh \
-   type=gla \
-   lr=3e-5 \
-   steps=10240 \
-   batch=4 \
-   update=8 \
-   warmup=512 \
-   context=2048 \
-   path=exp/gla-7B-20B \
-   project=fla \
-   model=converted/gla-7B \
-   data=SlimPajama-627B \
-   cache=data/SlimPajama-627B/train
- ```
-
- Please be aware that finetuning on a single node may not be the most efficient approach.
- If available, consider leveraging multi-node GPUs for optimal performance.
- You can find guidance on how to launch a multi-node job in the [accelerate tutorial](https://github.com/huggingface/accelerate/blob/main/examples/slurm/submit_multinode.sh).
-
- [^1]: The `accelerate` library supports various distributed frameworks, like `deepspeed` and `megatron` for large-scale training. We use `deepspeed` in our case.
 
+ ---
+ library_name: transformers
+ base_model: configs/gsa_16M.json
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: gsa-16M-test
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # gsa-16M-test
+
+ This model is a fine-tuned version of [configs/gsa_16M.json](https://huggingface.co/configs/gsa_16M.json) on an unknown dataset.
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0003
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - distributed_type: multi-GPU
+ - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: cosine_with_min_lr
+ - lr_scheduler_warmup_steps: 200
+ - training_steps: 5000
+
+ ### Training results
+
+
+
+ ### Framework versions
+
+ - Transformers 4.47.0
+ - Pytorch 2.5.1+cu124
+ - Datasets 3.2.0
+ - Tokenizers 0.21.0
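
For quick reference, below is a minimal sketch, assuming the Hugging Face `Trainer` stack implied by the `generated_from_trainer` tag, of how the hyperparameters listed in the new model card could be expressed as `transformers.TrainingArguments`. The output directory and the minimum-LR rate are illustrative assumptions; neither is stated in the card, and this is not the repository's actual launch script.

```python
from transformers import TrainingArguments

# Minimal sketch: maps the card's reported hyperparameters onto
# transformers.TrainingArguments. output_dir and min_lr_rate are
# assumptions, as they are not reported in the model card.
training_args = TrainingArguments(
    output_dir="exp/gsa-16M-test",             # hypothetical path
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch_fused",                 # OptimizerNames.ADAMW_TORCH_FUSED
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumed; the card omits the minimum LR
    warmup_steps=200,
    max_steps=5000,
)
```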
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.47.0"
+ }
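
The added `generation_config.json` simply pins the BOS/EOS token ids used by `generate`. As a usage illustration, here is a minimal sketch of loading this config alongside the checkpoint; the repository id `zaydzuhri/gsa-16M-test` is an assumption based on the model name in the card, and `trust_remote_code=True` is assumed to be needed for `fla`-based architectures.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Sketch only: the repo id below is assumed from the model name in the card.
repo_id = "zaydzuhri/gsa-16M-test"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Loads the generation_config.json shown above (bos_token_id=1, eos_token_id=2).
gen_config = GenerationConfig.from_pretrained(repo_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=gen_config, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```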