HaileyStorm committed
Commit fc6a1f1
1 Parent(s): b193bf7

Update README.md

Files changed (1):
  1. README.md +2 -2
@@ -99,7 +99,7 @@ The detailed results of the d_state ablation experiments are summarized in the f
99
 
100
  Based on the corrected evaluations and insights from the ablation experiments, the final version of the Mamba 50M model (v3) was configured with 29 layers, a d_model of 512, a d_state of 32, and an auto dt_rank of 32. This version showed significant improvements over the previous versions, with better performance in win rate, legal moves, and other key metrics.
101
 
102
- The final Mamba 50M model was selected based on a combination of empirical performance data and theoretical considerations derived from the ablation experiments.
103
 
104
  ![Mamba v1 v2 v3 Training Comparison](mamba-v1-v2-v3-training-comparison.png)
105
  *Figure 2: Evaluation results for the three versions of the Mamba 50M model. Version 3 clearly outperforms versions 1 and 2, demonstrating the effectiveness of the final configuration.*
@@ -247,7 +247,7 @@ This bug was particularly problematic because after seeming to point at issues w
247
  Once the bug was fixed, all evaluations were redone (where checkpoints were still available), revealing the true performance of the models. The third and final Mamba 50M model, configured largely based on proper evaluation results, showed significant improvement over the previous versions.
248
 
249
  ### 7.2 Lessons for Mamba Model Configuration
250
- The experiments conducted provide strong evidence that the Mamba architecture is capable of learning and generalizing from raw game data, largely surpassing Transformer models in this domain. One key lesson from this work, supported by both my ablations and the findings from the Mamba Othello work by alxndrTL (https://github.com/alxndrTL/othello_mamba), is the importance of adjusting the number of layers based on the architecture. A general rule derived from these experiments is to start with the number of layers a Transformer would use and multiply by 2, then reduce slightly to accommodate a higher d_state and maintain or exceed the d_model of the Transformer model.
251
 
252
  This approach ensures that the Mamba models have a sufficient width and dimensionality to develop a robust internal state and sufficient depth to track variables across long sequences, while also balancing the computational cost of increased d_state and d_model parameters. The evidence from the linear probes further supports the conclusion that Mamba's internal state is well-developed and capable of tracking key aspects of the game, such as position strength and board state.
253
 
 
99
 
100
  Based on the corrected evaluations and insights from the ablation experiments, the final version of the Mamba 50M model (v3) was configured with 29 layers, a d_model of 512, a d_state of 32, and an auto dt_rank of 32. This version showed significant improvements over the previous versions, with better performance in win rate, legal moves, and other key metrics.
101
 
102
+ The final Mamba 50M model was selected based on a combination of empirical performance data from the ablation experiments, my previous Mamba models, and the the [Mamba Othello work by alxndrTL](https://github.com/alxndrTL/othello_mamba), along with theoretical considerations and imperical data from the Mamba paper.
103
 
104
  ![Mamba v1 v2 v3 Training Comparison](mamba-v1-v2-v3-training-comparison.png)
105
  *Figure 2: Evaluation results for the three versions of the Mamba 50M model. Version 3 clearly outperforms versions 1 and 2, demonstrating the effectiveness of the final configuration.*
 
247
  Once the bug was fixed, all evaluations were redone (where checkpoints were still available), revealing the true performance of the models. The third and final Mamba 50M model, configured largely based on proper evaluation results, showed significant improvement over the previous versions.
248
 
249
  ### 7.2 Lessons for Mamba Model Configuration
250
+ The experiments conducted provide strong evidence that the Mamba architecture is capable of learning and generalizing from raw game data, largely surpassing Transformer models in this domain. One key lesson from this work, supported by both my ablations and the findings from Mamba Othello project, is the importance of adjusting the number of layers based on the architecture. A general rule derived from these experiments is to start with the number of layers a Transformer would use and multiply by 2, then reduce slightly to accommodate a higher d_state and maintain or exceed the d_model of the Transformer model.
251
 
252
  This approach ensures that the Mamba models have a sufficient width and dimensionality to develop a robust internal state and sufficient depth to track variables across long sequences, while also balancing the computational cost of increased d_state and d_model parameters. The evidence from the linear probes further supports the conclusion that Mamba's internal state is well-developed and capable of tracking key aspects of the game, such as position strength and board state.
253
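
The layer-count rule from section 7.2 can be sketched as a small helper. The ~0.9 trim factor and the 16-layer Transformer baseline below are illustrative assumptions (the README only says "multiply by 2, then reduce slightly"); the dt_rank formula follows the common "auto" convention of d_model / 16:

```python
def suggest_mamba_config(transformer_n_layer: int, transformer_d_model: int,
                         d_state: int = 32) -> dict:
    """Heuristic sketch: start from ~2x the Transformer's layer count,
    then trim slightly (assumed factor ~0.9) to pay for a larger d_state,
    while keeping d_model at or above the Transformer's."""
    n_layer = max(1, round(2 * transformer_n_layer * 0.9))
    d_model = transformer_d_model        # maintain or exceed the baseline
    dt_rank = max(1, d_model // 16)      # "auto" dt_rank convention
    return {"n_layer": n_layer, "d_model": d_model,
            "d_state": d_state, "dt_rank": dt_rank}

# With a hypothetical 16-layer, 512-wide Transformer baseline, this lands
# on roughly the final v3 settings (29 layers, d_model 512, d_state 32,
# dt_rank 32):
cfg = suggest_mamba_config(16, 512)
```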