---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
base_model: Sao10K/L3-8B-Stheno-v3.3-32K
---

# QuantFactory/L3-8B-Stheno-v3.3-32K-GGUF
This is a quantized version of [Sao10K/L3-8B-Stheno-v3.3-32K](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K), created using llama.cpp.
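
For readers reproducing the quantization: the usual llama.cpp workflow converts the HF checkpoint to a GGUF file and then quantizes it. This is an illustrative sketch, not this repo's exact recipe — script names vary across llama.cpp versions, and the paths and quantization type are placeholders:

```shell
# Convert the HF checkpoint to GGUF (f16), then quantize.
# Paths and the Q4_K_M choice below are placeholders.
python convert_hf_to_gguf.py ./L3-8B-Stheno-v3.3-32K --outfile stheno-f16.gguf
./llama-quantize stheno-f16.gguf stheno-Q4_K_M.gguf Q4_K_M
```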

# Model Description

Trained with compute from [Backyard.ai](https://backyard.ai/)

Training Details:
- Trained at 8K context, then expanded to 32K context with PoSE training.

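PoSE reaches the longer target context by manipulating position ids during training rather than attending over full-length sequences: the short training sequence is split into chunks, and each chunk's position ids are shifted by a random offset so the model sees positions up to the target length. A minimal sketch of the idea (simplified and illustrative — the chunk count and offset sampling here are assumptions, not this model's exact recipe):

```python
import random

def pose_position_ids(seq_len: int, target_len: int, n_chunks: int = 2):
    """Sketch of PoSE-style position-id manipulation: split a seq_len-token
    training sequence into chunks and shift each chunk's position ids by a
    random skip, so positions span up to target_len while attention still
    only covers seq_len tokens."""
    # Random chunk boundaries inside the sequence.
    bounds = sorted(random.sample(range(1, seq_len), n_chunks - 1))
    bounds = [0] + bounds + [seq_len]
    slack = target_len - seq_len  # total extra positions we may skip over
    ids, offset = [], 0
    for i in range(n_chunks):
        start, end = bounds[i], bounds[i + 1]
        skip = random.randint(0, slack)  # consume part of the slack
        slack -= skip
        offset += skip
        ids.extend(range(start + offset, end + offset))
    return ids

# e.g. an 8K sequence whose position ids cover the 32K target range
ids = pose_position_ids(8192, 32768)
```

The ids stay strictly increasing and never exceed `target_len - 1`, so the model is exposed to the full 32K position range at 8K training cost.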
Dataset Modifications:
- Further cleaned up roleplaying samples with a quality check.
- Removed low-quality samples found during a manual check, raising the baseline quality floor.
- Doubled the number of creative writing samples.
- Remade and refined the detailed instruct data.

Notes:
- The training run is much less aggressive than previous Stheno versions.
- The model works when tested in bf16 with the same configs as in the file.
- I do not know what effect quantisation has on it.
- It roleplays pretty well and feels nice, in my opinion.
- It has some issues with long-context understanding and reasoning, but it is much better than plain RoPE scaling, so that is a plus.
- Reminder: this isn't a native 32K model. It has its issues, but it's coherent and works well.

Sanity Check // Needle in a Haystack Results:
- This is not as complex as RULER or NIAN, but it is a basic evaluator. Some improper training runs had Haystack scores ranging from red to orange for most of the extended contexts.

![Results](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/resolve/main/haystack.png)

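A needle-in-a-haystack probe of the kind described above can be sketched in a few lines (illustrative only — this is not the evaluator that produced the plot): bury a unique fact at some relative depth inside filler text, ask the model to retrieve it, and check the answer.

```python
def make_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Build a long prompt: n_filler copies of filler text, with the needle
    inserted at a relative depth in [0, 1]."""
    pos = int(n_filler * depth)
    parts = [filler] * n_filler
    parts.insert(pos, needle)
    return " ".join(parts)

def passed(model_answer: str, expected: str) -> bool:
    # A pass means the expected fact appears verbatim (case-insensitive).
    return expected.lower() in model_answer.lower()

# Hypothetical usage: sweep depths and context lengths, color-grade the results.
hay = make_haystack("The magic number is 7.", "The grass is green.", 100, 0.5)
```

Real evaluators like RULER add multiple needles, distractors, and reasoning hops; this only tests verbatim retrieval.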
Wandb Run:

![Wandb](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/resolve/main/wandb.png)

---

Relevant Axolotl Configurations:
- Taken from [winglian/Llama-3-8b-64k-PoSE](https://huggingface.co/winglian/Llama-3-8b-64k-PoSE).
- I spent hours tinkering with my own configs, but the ones he used worked best, so I stuck with them.
- A RoPE theta of 2M gave the best loss during training compared to other values.
- Leaving it at 500K RoPE theta wasn't much worse, but 4M and 8M made the grad_norm values worsen even though loss dropped fast.
- Mixing in pretraining data was a pain; it made formatting a lot worse.
- Pretraining / noise data made the Haystack results worse too: not all green, mainly oranges.
- An improper / bad RoPE theta shows up as grad_norm exploding into the thousands. Loss will still drop to low values, but it is a scarily fast drop even with gradient clipping.

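For intuition on why raising the RoPE base helps at longer contexts: the base theta sets the per-dimension rotation frequencies of the rotary embedding, and a larger base slows rotation in the higher dimensions, so positions far apart stay distinguishable. A small pure-Python sketch comparing Llama-3's default 500K base against the 2M used here (a head dimension of 128, as in Llama-3, is assumed):

```python
def rope_inv_freq(dim: int, theta: float) -> list[float]:
    """Per-pair inverse frequencies of rotary embeddings:
    inv_freq[i] = theta ** (-2i / dim), for i in [0, dim/2)."""
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

base_500k = rope_inv_freq(128, 500_000.0)  # Llama-3 default base
base_2m = rope_inv_freq(128, 2_000_000.0)  # base used for this model

# Larger theta -> smaller frequencies in every dimension past the first
# -> slower rotation -> longer effective wavelengths at long range.
assert all(b2 <= b5 for b2, b5 in zip(base_2m, base_500k))
```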
```
sequence_len: 8192
use_pose: true
pose_max_context_len: 32768

overrides_of_model_config:
  rope_theta: 2000000.0
  max_position_embeddings: 32768

# peft_use_dora: true
adapter: lora
peft_use_rslora: true
lora_model_dir:
lora_r: 256
lora_alpha: 256
lora_dropout: 0.1
lora_target_linear: true
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

warmup_steps: 80
gradient_accumulation_steps: 6
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine_with_min_lr
learning_rate: 0.00004
lr_scheduler_kwargs:
  min_lr: 0.000004
```
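
For reference, a `cosine_with_min_lr` scheduler warms up linearly for `warmup_steps` and then decays the learning rate along a cosine from `learning_rate` down to `min_lr`, rather than to zero. A sketch with this config's values (`total_steps` is a placeholder; the real value depends on dataset size and batch settings):

```python
import math

def cosine_with_min_lr(step: int, total_steps: int, warmup: int = 80,
                       lr: float = 4e-5, min_lr: float = 4e-6) -> float:
    """Sketch of the schedule named above: linear warmup to lr over
    `warmup` steps, then cosine decay from lr down to min_lr."""
    if step < warmup:
        return lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (lr - min_lr) * cosine

# e.g. lr at the end of warmup vs. at the end of training
peak = cosine_with_min_lr(80, 1000)   # == learning_rate (4e-5)
floor = cosine_with_min_lr(1000, 1000)  # == min_lr (4e-6)
```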