datasysdev committed
Commit 8b19534 · verified · 1 Parent(s): 9040a9d

Update model card with all32 status

Files changed (1)
  1. README.md +16 -0
README.md CHANGED
@@ -131,6 +131,14 @@ Per-layer step-500 mass@K at K=128:
 
 The next run reserves `[0, 1, 2, 35]` and trains layers `3..34`.
 
+First diagnostic from the active all32 run:
+
+| Step | Recall@K eval | PPL gap | Read |
+|---:|---:|---:|---|
+| 250 | 0.812 | +2.28% | already better than all36 best training eval |
+
+This is not a final result; the run is continuing toward step 1000.
+
 ## Positioning against related methods
 
 The paper frames this method as closest in asymptotic shape to Reformer and
@@ -152,6 +160,14 @@ superiority. The clean result proves the approach for the six-layer pilot; the
 active all32 reserved-layer run tests whether broad near-whole-model
 substitution can preserve that quality.
 
+This method targets a different deployment scenario than native
+sliding-window/state-space/hybrid architectures such as Mistral-style sliding
+window, Mamba, or Qwen3.6 Gated DeltaNet hybrids. Those models are trained from
+scratch with their sparse or hybrid mechanism in place. This work is post-hoc:
+train a base model with full attention for maximum expressivity, then add
+lightweight retrieval projections afterward to make inference sub-linear without
+changing base weights.
+
 ## Checkpoints
 
 Important checkpoint paths in this HF repo:
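
Assuming `PPL gap` in the diagnostic table above denotes the relative perplexity increase of the retrieval-projection model over the full-attention baseline on the same eval data (the card does not spell the definition out), the +2.28% reading corresponds to a computation like the sketch below; the perplexity values in the example are made up.

```python
# Assumed definition (hypothetical, not stated in the card): PPL gap is the
# relative perplexity increase of the retrieval-projection model over the
# full-attention baseline, expressed as a percentage.
def ppl_gap_percent(ppl_projected: float, ppl_baseline: float) -> float:
    return (ppl_projected / ppl_baseline - 1.0) * 100.0

# Illustrative numbers only: 14.33 vs. 14.01 gives roughly +2.28%.
print(f"{ppl_gap_percent(14.33, 14.01):+.2f}%")  # -> +2.28%
```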
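
The added positioning paragraph describes the core recipe: keep the base model's full-attention weights frozen, leave a few reserved layers dense, and train small retrieval projections for the remaining layers so inference attends over only a top-K subset of keys. A minimal sketch of that idea, assuming PyTorch and purely hypothetical names (`RetrievalProjection`, `proj_dim`, the scoring and top-K details), is shown below; causal masking and batching are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LAYERS = 36
RESERVED = {0, 1, 2, 35}  # layers kept as full attention in the all32 run
TRAINABLE = [i for i in range(NUM_LAYERS) if i not in RESERVED]  # layers 3..34


class RetrievalProjection(nn.Module):
    """Hypothetical lightweight projection trained post-hoc; base weights stay frozen."""

    def __init__(self, head_dim: int, proj_dim: int = 32):
        super().__init__()
        # Small trainable maps from the frozen model's query/key space
        # into a cheap low-dimensional retrieval space.
        self.q_map = nn.Linear(head_dim, proj_dim, bias=False)
        self.k_map = nn.Linear(head_dim, proj_dim, bias=False)

    def top_k_keys(self, q: torch.Tensor, k: torch.Tensor, top_k: int = 128) -> torch.Tensor:
        # q, k: [seq, head_dim] for one head; scores are computed in proj_dim space.
        scores = self.q_map(q) @ self.k_map(k).T        # [seq, seq]
        top_k = min(top_k, k.shape[0])
        return scores.topk(top_k, dim=-1).indices       # [seq, top_k]


def sparse_attention(q, k, v, idx):
    """Attend only over the keys selected by the retrieval projection."""
    k_sel, v_sel = k[idx], v[idx]                       # [seq, top_k, head_dim]
    att = (q.unsqueeze(1) @ k_sel.transpose(1, 2)) / k.shape[-1] ** 0.5
    return (F.softmax(att, dim=-1) @ v_sel).squeeze(1)  # [seq, head_dim]
```

In this sketch the reserved layers would simply run the model's original dense attention, and only the `q_map`/`k_map` parameters of the non-reserved layers would receive gradients, matching the "no change to base weights" framing in the card.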