kv cache

by FrankWu - opened

In the paper, I see that the model caches the compressed KV (the latent), but in the model file it seems to still cache the legacy, fully up-projected KV:

key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim :] = k_pe
if past_key_value is not None:
    cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
    key_states, value_states = past_key_value.update(
        key_states, value_states, self.layer_idx, cache_kwargs
    )

Do I misunderstand something?
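To illustrate the difference being asked about, here is a minimal numpy sketch (made-up sizes, RoPE omitted, not the real DeepSeek-V2 config or API): both caching strategies see the same keys/values at attention time, but the legacy cache stores the up-projected per-head tensors while the compressed cache only needs the small latent per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not the real DeepSeek-V2 config
d_model, kv_lora_rank, n_heads, head_dim, seq = 64, 16, 4, 8, 5

# Down-projection to the latent c_KV, and per-head up-projections
W_dkv = rng.standard_normal((d_model, kv_lora_rank))
W_uk = rng.standard_normal((kv_lora_rank, n_heads * head_dim))
W_uv = rng.standard_normal((kv_lora_rank, n_heads * head_dim))

h = rng.standard_normal((seq, d_model))  # hidden states

c_kv = h @ W_dkv  # the compressed latent, (seq, kv_lora_rank)

# Legacy cache (what the modeling code does): store full keys/values
legacy_k = c_kv @ W_uk  # (seq, n_heads * head_dim)
legacy_v = c_kv @ W_uv

# Compressed cache (what the paper describes): store only c_kv,
# and up-project at attention time
recomputed_k = c_kv @ W_uk
recomputed_v = c_kv @ W_uv

# Identical results, very different cache footprints per token
assert np.allclose(legacy_k, recomputed_k)
assert np.allclose(legacy_v, recomputed_v)
print("legacy cache floats/token:    ", 2 * n_heads * head_dim)
print("compressed cache floats/token:", kv_lora_rank)
```

With these toy numbers the legacy cache stores 64 floats per token versus 16 for the latent, which is the memory saving the paper claims; the released modeling file trades that saving away for a simpler code path that reuses the standard `Cache.update` interface.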

I also ran into this. In addition, the up-projection matrices of K and V in the code are not absorbed into the projection matrices of Q and O, which the paper describes as the way to attend directly over the compressed latent.
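The absorption is just associativity of matrix products: for a single head, the score q·k with k = W_uk·c equals (W_ukᵀ·q)·c, so W_uk can be folded into the query side and attention can run over the cached latent directly. A tiny numpy check of that identity (hypothetical sizes, single head, no RoPE):

```python
import numpy as np

rng = np.random.default_rng(1)
kv_lora_rank, head_dim = 16, 8

W_uk = rng.standard_normal((kv_lora_rank, head_dim))  # key up-projection
q = rng.standard_normal(head_dim)                     # query for one head
c = rng.standard_normal(kv_lora_rank)                 # cached latent c_KV

# Explicit path: up-project the key, then dot with the query
k = c @ W_uk                 # (head_dim,)
score_explicit = q @ k

# Absorbed path: fold W_uk into the query, dot directly with the latent
q_absorbed = W_uk @ q        # (kv_lora_rank,)
score_absorbed = q_absorbed @ c

assert np.isclose(score_explicit, score_absorbed)
print("scores match:", score_explicit)
```

The same trick folds W_uv into the output projection, so neither K nor V ever needs to be materialized per head at decode time; the released code skips this and keeps the separate matrices.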
