stefan-it committed
Commit: 48b8ed3
Parent: aed4ef8

readme: update information about final xLSTM model (one epoch over corpus)

Files changed (1): README.md (+9, -8)
README.md CHANGED
@@ -16,6 +16,7 @@ Initially, we integrated xLSTM model training into Flair - for more information
 
 # Changelog
 
+- 29.08.2024: Uploaded re-trained model for 1 epoch over complete German Wikipedia corpus. Training was done with gradient clipping (0.25).
 - 28.08.2024: Model training is now done with [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
 - 10.06.2024: Initial version. xLSTM was trained with Flair library, see this [old](https://huggingface.co/stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
 
@@ -49,6 +50,8 @@ training:
   log_every_step: 10
   generate_every_step: 5000
   wandb_project: "xlstm"
+  max_grad_norm: 0.25
+  # wandb_project: "lovecraftxlstm"
 
 model:
   num_blocks: 24
@@ -73,6 +76,12 @@ tokenizer:
   pretrained_id: "meta-llama/Llama-2-7b-hf"
 ```
 
+The training loss curve can be seen here:
+
+![Training Loss](training-loss.png)
+
+The uploaded model checkpoint is from 458,431 steps (1 epoch over corpus). Training took 1d 3h 17m 58s on a single RTX 4090.
+
 # Usage
 
 It is possible to use the model to generate some text:
@@ -96,11 +105,3 @@ print(generated_text)
 
 Notice: this model integration is heavily under development. And in the process of finding good hyper-parameters.
 Also downstream experiments are coming very soon.
-
-Unfortunately, there are nan's occuring in the training (after 7h 33m 14s of training on a single RTX 4090):
-
-![Training Loss](training-loss.png)
-
-This is very likely due to missing grad norm - which will be added soon with `Accelerator.clip_grad_norm_`.
-
-The uploaded model checkpoint is from 80k steps.
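For context on the `max_grad_norm: 0.25` setting added above: the removed README note named `Accelerator.clip_grad_norm_` from Hugging Face Accelerate as the planned fix for the earlier NaN losses. The sketch below shows how such clipping typically fits into an Accelerate training loop; it is a minimal, hypothetical example, and the model, optimizer, and data are placeholders rather than the actual Helibrunna training code.

```python
# Hypothetical sketch (not the Helibrunna trainer): gradient clipping with
# Accelerate's clip_grad_norm_, using the max_grad_norm value from the config.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, only to make the sketch self-contained.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16)), batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

max_grad_norm = 0.25  # same value as `max_grad_norm` in the training config above

for (batch,) in dataloader:
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)
    # Clip the global gradient norm before the optimizer step; only meaningful
    # once gradients are synchronized across processes.
    if accelerator.sync_gradients:
        accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
```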