readme: update information about final xLSTM model (one epoch over corpus)
README.md
````diff
@@ -16,6 +16,7 @@ Initially, we integrated xLSTM model training into Flair - for more information
 
 # Changelog
 
+- 29.08.2024: Uploaded re-trained model for 1 epoch over the complete German Wikipedia corpus. Training was done with gradient clipping (0.25).
 - 28.08.2024: Model training is now done with the [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
 - 10.06.2024: Initial version. xLSTM was trained with the Flair library, see this [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
 
````
````diff
@@ -49,6 +50,8 @@ training:
   log_every_step: 10
   generate_every_step: 5000
   wandb_project: "xlstm"
+  max_grad_norm: 0.25
+  # wandb_project: "lovecraftxlstm"
 
 model:
   num_blocks: 24
````
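The `max_grad_norm: 0.25` entry added here is the gradient clipping value mentioned in the 29.08.2024 changelog entry. As a rough illustration only - a minimal sketch with a toy model and toy data, not the actual Helibrunna training loop - gradient clipping with 🤗 Accelerate's `Accelerator.clip_grad_norm_` looks like this:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()

# Toy stand-ins for the xLSTM stack and the Wikipedia dataloader.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

max_grad_norm = 0.25  # same value as `max_grad_norm` in the config excerpt above

for inputs, targets in dataloader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)
    # Clip the global gradient norm before the optimizer step - the piece that
    # was missing in the earlier run that diverged to NaN losses.
    accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
```

Clipping the global gradient norm to a small value like 0.25 keeps individual noisy batches from producing huge parameter updates, which is the likely cause of the NaN losses in the earlier unclipped run.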
````diff
@@ -73,6 +76,12 @@ tokenizer:
   pretrained_id: "meta-llama/Llama-2-7b-hf"
 ```
 
+The training loss curve can be seen here:
+
+![Training Loss](training-loss.png)
+
+The uploaded model checkpoint is from 458,431 steps (1 epoch over corpus). Training took 1d 3h 17m 58s on a single RTX 4090.
+
 # Usage
 
 It is possible to use the model to generate some text:
````
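For a rough sense of throughput: 458,431 steps in 1d 3h 17m 58s (about 27.3 hours) works out to roughly 4.7 optimizer steps per second on the RTX 4090.

The tokenizer section simply reuses the Llama-2 tokenizer, so it can be loaded with plain `transformers`; a minimal sketch (note that the `meta-llama/Llama-2-7b-hf` repository is gated, so authenticated Hub access is required):

```python
from transformers import AutoTokenizer

# Reuses the Llama-2 tokenizer referenced by `pretrained_id` in the config above.
# The meta-llama/Llama-2-7b-hf repo is gated: run `huggingface-cli login` with an
# account that has been granted access before this call will succeed.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tokenizer("Heute ist ein schöner Tag.")["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```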
````diff
@@ -96,11 +105,3 @@ print(generated_text)
 
 Notice: this model integration is heavily under development, and we are still in the process of finding good hyper-parameters.
 Also, downstream experiments are coming very soon.
-
-Unfortunately, there are NaNs occurring in the training (after 7h 33m 14s of training on a single RTX 4090):
-
-![Training Loss](training-loss.png)
-
-This is very likely due to the missing grad norm clipping - which will be added soon with `Accelerator.clip_grad_norm_`.
-
-The uploaded model checkpoint is from 80k steps.
````