Fix short-conv padding masking on transformers >=5

#1
by Satyen - opened
Liquid AI org
edited 6 days ago

On transformers >=5, Lfm2Model.forward passes the raw 2D padding mask (not the
4D additive mask) to the short-conv layers. The shipped
_noncausal_shortconv_forward then calls apply_mask_to_padding_states, which is
a no-op on a 4D mask (the 4.56 path the checkpoint was trained with) but zeroes
padding/query-expansion states on a 2D mask. That shifts the per-token
embeddings that ColBERT scores with MaxSim.
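
A minimal sketch of the mechanism described above, in plain Python with nested
lists so it runs without any dependencies. mask_padding_states is a
hypothetical stand-in, not the transformers function: its shape guard (act only
when mask dims 0 and 1 are both > 1) is an assumption chosen to reproduce the
behavior stated here, and the toy vectors are made up for illustration.

```python
def shape_of(x):
    """Shape of a rectangular nested list, e.g. [[1, 0]] -> (1, 2)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def mask_padding_states(hidden, mask):
    """Hypothetical stand-in for the padding-state masking step.

    hidden: [batch][seq][dim]; mask: 2D [batch][seq] or 4D [batch][1][seq][seq].
    Assumed guard: only act when mask dim 0 > 1 and mask dim 1 > 1.
    """
    if mask is None:
        return hidden
    shape = shape_of(mask)
    if shape[0] <= 1 or shape[1] <= 1:
        return hidden  # a 4D additive mask has shape[1] == 1, so this is a no-op
    # 2D mask: multiply each token state by its 0/1 mask entry
    return [
        [[v * mask[b][t] for v in tok] for t, tok in enumerate(seq)]
        for b, seq in enumerate(hidden)
    ]


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the max dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


NEG_INF = float("-inf")
# Two queries, two tokens each; token 1 of query 0 is a query-expansion slot.
hidden = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
mask_2d = [[1, 0], [1, 1]]
mask_4d = [[[[0.0, NEG_INF], [0.0, NEG_INF]]], [[[0.0, 0.0], [0.0, 0.0]]]]
doc = [[1.0, 0.0], [0.0, 2.0]]

kept = mask_padding_states(hidden, mask_4d)    # unchanged
masked = mask_padding_states(hidden, mask_2d)  # expansion token zeroed

score_4d = maxsim(kept[0], doc)
score_2d = maxsim(masked[0], doc)
```

In this toy case the zeroed expansion token no longer contributes to MaxSim, so
the score for query 0 differs between the two mask shapes.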

Fix: apply the masking only on the flash_attention_2 path, so eager/sdpa match
training behavior on every transformers version.
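
The gating can be sketched roughly as follows. This is an illustration under
assumptions, not the merged patch: prepare_shortconv_states and zero_padding
are hypothetical names, and the attention implementation is passed in as a
plain string rather than read from the model config.

```python
def zero_padding(hidden, mask_2d):
    """Zero every token state whose 2D mask entry is 0."""
    return [
        [[v * keep for v in tok] for tok, keep in zip(seq, row)]
        for seq, row in zip(hidden, mask_2d)
    ]


def prepare_shortconv_states(hidden, mask_2d, attn_implementation):
    """Hypothetical gate: mask padding states only for flash_attention_2."""
    if attn_implementation == "flash_attention_2":
        return zero_padding(hidden, mask_2d)
    # eager / sdpa: leave states untouched, as on the training path
    return hidden


hidden = [[[1.0, 2.0], [3.0, 4.0]]]
mask_2d = [[1, 0]]

eager_out = prepare_shortconv_states(hidden, mask_2d, "eager")
sdpa_out = prepare_shortconv_states(hidden, mask_2d, "sdpa")
fa2_out = prepare_shortconv_states(hidden, mask_2d, "flash_attention_2")
```

Keying the decision on the attention implementation instead of the mask's shape
means the result no longer depends on which mask layout a given transformers
version hands to the short-conv layers.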

en NanoBEIR NDCG@10 (identical eval stack):
transformers 5.3.0 fp32 unfixed 0.6506 -> fixed 0.6863 (= card 0.687)
transformers 5.3.0 bf16 unfixed 0.6412 -> fixed 0.6771

EdoardoMosca changed pull request status to merged
