stefan-it committed
Commit: 48b8ed3
Parent: aed4ef8

readme: update information about final xLSTM model (one epoch over corpus)

Files changed (1): README.md (+9, -8)
README.md CHANGED
@@ -16,6 +16,7 @@ Initially, we integrated xLSTM model training into Flair - for more information
 
 # Changelog
 
+- 29.08.2024: Uploaded re-trained model for 1 epoch over complete German Wikipedia corpus. Training was done with gradient clipping (0.25).
 - 28.08.2024: Model training is now done with [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
 - 10.06.2024: Initial version. xLSTM was trained with Flair library, see this [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
 
@@ -49,6 +50,8 @@ training:
   log_every_step: 10
   generate_every_step: 5000
   wandb_project: "xlstm"
+  max_grad_norm: 0.25
+  # wandb_project: "lovecraftxlstm"
 
 model:
   num_blocks: 24
@@ -73,6 +76,12 @@ tokenizer:
   pretrained_id: "meta-llama/Llama-2-7b-hf"
 ```
 
+The training loss curve can be seen here:
+
+![Training Loss](training-loss.png)
+
+The uploaded model checkpoint is from 458,431 steps (1 epoch over corpus). Training took 1d 3h 17m 58s on a single RTX 4090.
+
 # Usage
 
 It is possible to use the model to generate some text:
@@ -96,11 +105,3 @@ print(generated_text)
 
 Notice: this model integration is heavily under development. And in the process of finding good hyper-parameters.
 Also downstream experiments are coming very soon.
-
-Unfortunately, there are nan's occuring in the training (after 7h 33m 14s of training on a single RTX 4090):
-
-![Training Loss](training-loss.png)
-
-This is very likely due to missing grad norm - which will be added soon with `Accelerator.clip_grad_norm_`.
-
-The uploaded model checkpoint is from 80k steps.
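For context on the `max_grad_norm: 0.25` setting added above: the removed README note named `Accelerator.clip_grad_norm_` from Hugging Face Accelerate as the planned fix for the earlier NaN losses. The sketch below shows how such clipping typically fits into an Accelerate training loop; it is a minimal, hypothetical example, and the model, optimizer, and data are placeholders rather than the actual Helibrunna training code.

```python
# Hypothetical sketch (not the Helibrunna trainer): gradient clipping with
# Accelerate's clip_grad_norm_, using the max_grad_norm value from the config.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, only to make the sketch self-contained.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

max_grad_norm = 0.25  # same value as `max_grad_norm` in the training config above

for (batch,) in dataloader:
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)
    # Clip the global gradient norm before the optimizer step; only meaningful
    # once gradients are synchronized across processes.
    if accelerator.sync_gradients:
        accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
```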