Update README.md
README.md CHANGED
@@ -83,7 +83,7 @@ With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended
After applying the [linear scaling rule](https://arxiv.org/pdf/1706.02677), I settled on a batch size of 8 and found that a starting learning rate of 10^-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.

Regarding the nodes, training on only attention nodes performed very poorly on both training and evaluation data. The results improved slightly with the addition of MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, when including the embedding layer—despite the significant increase in the number of training parameters—the model began to generalize well. I assume this is due to the introduction of new terminology, requiring the model to adjust its embeddings slightly to capture the new semantics. I did not modify the LM head, as no significant performance improvements were observed.

-DORA training introduced the concept of training a magnitude parameter, which can help guide or vectorize the LLM model in a new direction, but the training was up to 4x longer, making it too costly for this purpose
+DoRA training introduced a trainable magnitude parameter, which can help steer the model in a new direction, but training was up to 4x longer, making it too costly for this purpose while yielding the same accuracy as LoRA+.

For ReFT, the nodes in the last 8 layers were unfrozen with attention to allow the model to retain its general knowledge while incorporating more specific domain knowledge about quantum research. Although the results were close to those obtained with LoRA, they were consistently slightly worse.
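For reference, a minimal sketch of how the module selection above could be expressed with Hugging Face PEFT. The module names assume a Llama-style base model, the model id is a placeholder, and the alpha value is only illustrative:

```python
# Sketch only: module names assume a Llama-style architecture; adjust for the actual base model.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model id

peft_config = LoraConfig(
    r=8,                      # rank 8, as used in these experiments
    lora_alpha=16,            # illustrative value, not necessarily the exact setting used here
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    modules_to_save=["embed_tokens"],  # also train the embedding layer; the LM head stays frozen
    use_dora=False,           # set True for DoRA (magnitude + direction), which trained up to 4x slower here
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, peft_config)
model.print_trainable_parameters()
```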
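And a rough sketch of the corresponding training setup with the hyperparameters above (batch size 8, starting learning rate of 1e-4, AdamW with cosine or linear decay); the output directory, epoch count, and `train_ds`/`eval_ds` datasets are placeholders:

```python
# Sketch only: dataset variables, output path, and epoch count are placeholders.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=8,     # batch size chosen via the linear scaling rule
    learning_rate=1e-4,                # starting learning rate that worked best
    lr_scheduler_type="cosine",        # "linear" performed about the same
    optim="adamw_torch",               # AdamW optimizer
    num_train_epochs=3,                # illustrative
    logging_steps=50,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```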