HaileyStorm committed fc6a1f1 (parent: b193bf7): Update README.md

README.md CHANGED
@@ -99,7 +99,7 @@ The detailed results of the d_state ablation experiments are summarized in the f
 
 Based on the corrected evaluations and insights from the ablation experiments, the final version of the Mamba 50M model (v3) was configured with 29 layers, a d_model of 512, a d_state of 32, and an auto dt_rank of 32. This version showed significant improvements over the previous versions, with better performance in win rate, legal moves, and other key metrics.
 
-The final Mamba 50M model was selected based on a combination of empirical performance data and theoretical considerations
+The final Mamba 50M model was selected based on a combination of empirical performance data from the ablation experiments, my previous Mamba models, and the [Mamba Othello work by alxndrTL](https://github.com/alxndrTL/othello_mamba), along with theoretical considerations and empirical data from the Mamba paper.
 
 ![Mamba v1 v2 v3 Training Comparison](mamba-v1-v2-v3-training-comparison.png)
 
 *Figure 2: Evaluation results for the three versions of the Mamba 50M model. Version 3 clearly outperforms versions 1 and 2, demonstrating the effectiveness of the final configuration.*
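The v3 configuration in the hunk above can be sketched as a small config object with a rough parameter estimate. This is an illustrative sketch, not the repository's actual code: the class and function names are hypothetical, the ~6·d_model² parameters-per-block figure is the approximation the Mamba paper gives for an expand factor of 2, and `d_model // 16` is the usual resolution of an "auto" dt_rank.

```python
from dataclasses import dataclass


@dataclass
class MambaV3Config:
    """Hypothetical holder for the final Mamba 50M (v3) settings from the README."""
    n_layers: int = 29
    d_model: int = 512
    d_state: int = 32
    dt_rank: int = 512 // 16  # "auto" typically resolves to d_model // 16 = 32


def estimate_block_params(cfg: MambaV3Config) -> int:
    # Rough count: ~6 * d_model^2 parameters per Mamba block (expand = 2),
    # ignoring embeddings, norms, and the SSM's comparatively small d_state terms.
    return cfg.n_layers * 6 * cfg.d_model ** 2


cfg = MambaV3Config()
print(estimate_block_params(cfg))  # 45613056 — ~45.6M in the blocks alone,
# which lands the full model in the ~50M class once embeddings are added.
```

This back-of-the-envelope check shows why 29 layers at d_model 512 is a natural fit for the 50M budget.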
@@ -247,7 +247,7 @@ This bug was particularly problematic because after seeming to point at issues w
 
 Once the bug was fixed, all evaluations were redone (where checkpoints were still available), revealing the true performance of the models. The third and final Mamba 50M model, configured largely based on proper evaluation results, showed significant improvement over the previous versions.
 
 ### 7.2 Lessons for Mamba Model Configuration
 
-The experiments conducted provide strong evidence that the Mamba architecture is capable of learning and generalizing from raw game data, largely surpassing Transformer models in this domain. One key lesson from this work, supported by both my ablations and the findings from
+The experiments conducted provide strong evidence that the Mamba architecture is capable of learning and generalizing from raw game data, largely surpassing Transformer models in this domain. One key lesson from this work, supported by both my ablations and the findings from the Mamba Othello project, is the importance of adjusting the number of layers based on the architecture. A general rule derived from these experiments is to start with the number of layers a Transformer would use and multiply by 2, then reduce slightly to accommodate a higher d_state and maintain or exceed the d_model of the Transformer model.
 
 This approach ensures that the Mamba models have sufficient width and dimensionality to develop a robust internal state and sufficient depth to track variables across long sequences, while also balancing the computational cost of increased d_state and d_model parameters. The evidence from the linear probes further supports the conclusion that Mamba's internal state is well-developed and capable of tracking key aspects of the game, such as position strength and board state.
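The layer-count rule added in this hunk can be written as a one-line heuristic. A minimal sketch, with assumptions flagged: the `trim` amount is my own illustrative choice (the text only says "reduce slightly"), and the 16-layer Transformer baseline in the example is hypothetical, chosen because it happens to map onto the 29 layers used in v3.

```python
def mamba_depth_from_transformer(transformer_layers: int, trim: int = 3) -> int:
    """Heuristic from the ablations: double the Transformer's layer count,
    then trim a few layers so the parameter budget can go toward a higher
    d_state while keeping d_model at or above the Transformer's."""
    return 2 * transformer_layers - trim


# A hypothetical 16-layer Transformer baseline yields the 29 layers of v3:
print(mamba_depth_from_transformer(16))  # 29
```

The trim step matters because each extra unit of d_state and d_model carries a real compute cost, so the doubled depth is traded off slightly against state width.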