Question about causal safety in DeepSeek-V4 CSA prefill retrieval

#172
by Prachi01 - opened

I have a question about a possible causality ambiguity during the prefill phase.

From the public implementation, the Lightning Indexer computes top-k retrieval over all compressed memories:

scores = q @ compressed_kv.transpose(-1, -2)
scores = F.relu(scores)
index_scores = (scores * weights.unsqueeze(-1)).sum(dim=2)
topk = index_scores.topk(k, dim=-1).indices

However, I could not find an explicit causal masking step before topk selection.

During prefill (seq_len > 1), this seems to imply that an earlier query token could retrieve compressed memories that summarize tokens appearing later in the prompt. For example:

q0 could retrieve M3
where M3 summarizes future tokens later in the prompt
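To make the concern concrete, here is a framework-free toy sketch of the kind of mask I was expecting to see before topk. It is not taken from the DeepSeek-V4 code. It assumes a hypothetical fixed compression ratio, with block j summarizing tokens j*ratio through (j+1)*ratio - 1 and no overlap between blocks, and the names visible_blocks and causal_topk are mine. The real compression scheme may differ.

```python
def visible_blocks(t, ratio):
    """Number of compressed blocks whose tokens all sit at positions <= t.

    Assumption: block j covers tokens [j*ratio, (j+1)*ratio - 1], no overlap.
    """
    return (t + 1) // ratio


def causal_topk(index_scores, ratio, k):
    """Pick up to k block indices per query, using only causally visible blocks.

    index_scores[t][j] is the indexer score of query token t against block j.
    """
    selected = []
    for t, row in enumerate(index_scores):
        n_vis = min(visible_blocks(t, ratio), len(row))
        # Rank only the visible blocks, highest score first.
        ranked = sorted(range(n_vis), key=lambda j: row[j], reverse=True)
        selected.append(ranked[:k])
    return selected


# Toy prefill: 8 prompt tokens, ratio 2, so 4 compressed blocks M0..M3.
# Every query scores M3 highest, which is the problematic case for early queries.
scores = [[0.1, 0.2, 0.3, 0.9] for _ in range(8)]

# Unmasked top-k, mirroring the snippet above: q0 picks M3 and M2.
unmasked = [sorted(range(4), key=lambda j: row[j], reverse=True)[:2] for row in scores]

# Causally masked top-k: q0 has no completed block to retrieve yet.
masked = causal_topk(scores, ratio=2, k=2)

print("unmasked q0:", unmasked[0])
print("masked   q0:", masked[0])
print("masked   q7:", masked[7])
```

In this toy setup the unmasked selection lets q0 pick M3, while the masked version only lets a query see blocks that are fully in its past. Whether the actual implementation enforces something equivalent, and where, is what I am trying to find out.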

I understand that later attention masking may block some retrieved entries, but I could not find discussion in the paper/blog about:

1. whether top-k retrieval itself is causally constrained
2. whether future compressed blocks are masked before retrieval
3. how causal safety is guaranteed during prefill for CSA sparse retrieval

Am I missing something, or have I misunderstood the implementation?
I would appreciate clarification from anyone familiar with the implementation details.