HaileyStorm committed on
Commit
78bc24e
1 Parent(s): 46797f2

Update Report/REPORT.md

Files changed (1)
  1. Report/REPORT.md +1 -1
Report/REPORT.md CHANGED
@@ -261,7 +261,7 @@ The experiments conducted provide strong evidence that the Mamba architecture is
261
  This approach ensures that the Mamba models have a sufficient width and dimensionality to develop a robust internal state and sufficient depth to track variables across long sequences, while also balancing the computational cost of increased d_model and especially d_state parameters. The evidence from the linear probes further supports the conclusion that Mamba's internal state is well-developed and capable of tracking key aspects of the game, such as position strength and board state.
262
 
263
  ### 7.3 Hyperparameter Schedule Insights
264
- As detailed in Section 4.2, reviewing results and logs indicated some of the.
264
+ As detailed in Section 4.2, reviewing results and logs indicated some of the hyperparameter adjustments were sub-optimal.
265
 
266
  Despite these potential refinements, the training evaluations indicate that use of AutoClip and a version of the Warmup-Stable-Decay (WSD) learning rate schedule contributed to achieving models of similar strength to Adam's original model while training on significantly less data (34.8B tokens vs. 61.4B). This provides evidence supporting the effectiveness of these methods, including their use with the Mamba architecture, though future work should focus on refining hyperparameter schedules to further enhance training performance.
267
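
For readers unfamiliar with the two techniques named in the paragraph above, the following is a minimal sketch of AutoClip-style percentile gradient clipping and a Warmup-Stable-Decay (WSD) learning-rate schedule. It is illustrative only: the class and function names, the 10th-percentile default, and the linear decay shape are assumptions and are not taken from this repository's training code.

```python
# Illustrative sketch only (not the report's training code): AutoClip-style
# percentile gradient clipping and a Warmup-Stable-Decay (WSD) learning-rate
# schedule. Names, the 10th-percentile default, and the linear decay shape
# are assumptions for demonstration.
import numpy as np
import torch


class AutoClip:
    """Clip gradients to a running percentile of observed gradient norms."""

    def __init__(self, percentile: float = 10.0):
        self.percentile = percentile
        self.norm_history: list[float] = []

    def __call__(self, model: torch.nn.Module) -> None:
        # Passing inf leaves gradients untouched but returns the total grad norm.
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
        self.norm_history.append(float(total_norm))
        # Clip to the chosen percentile of all norms seen so far.
        clip_value = float(np.percentile(self.norm_history, self.percentile))
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)


def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to zero."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        return max_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    return max_lr * (1.0 - progress)
```

In a typical training loop, `AutoClip` would be called between `loss.backward()` and `optimizer.step()`, and `wsd_lr` would set the optimizer's learning rate at each step.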