How did you train your LatentAttentionLayer?

#36
by juneonetwothree - opened

Hello.
I am wondering how you trained your latent attention layer.
In your technical report, you mention that you used LoRA with rank 16.
This makes sense for the layers and weights that come from the initial model Mistral-7B-v0.1, but I am confused about whether you also used LoRA in the latent attention layer.
Did you train LoRA for the latent attention layer? If so, are the initial weights of the latent attention layer in the base model frozen at 0?
Did you use the same learning rate for the decoder layers and the latent attention layer?
Could you explain how you trained your model after adding the latent attention layer?

Thank you.

NVIDIA org

Hi, @juneonetwothree. Thanks for asking the question. We did not use the LoRA technique for the latent attention layer; only the decoder-only LLM is trained with LoRA. The decoder-only LLM and the latent attention layer are trained in an end-to-end manner (with the same learning rate).
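
To make the setup concrete, here is a minimal sketch of that training configuration: LoRA (rank 16) applied to the frozen decoder-only LLM, a newly initialized latent attention pooling layer trained in full (no LoRA), and a single optimizer with one learning rate over both sets of trainable parameters. This is not the authors' code; the latent attention module, `target_modules`, `num_latents`, and the learning rate value are illustrative assumptions.

```python
# Sketch only: LoRA on the decoder LLM + fully trainable latent attention pooler,
# optimized end-to-end with one learning rate. Hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel
from peft import LoraConfig, get_peft_model


class LatentAttentionPooling(nn.Module):
    """Trainable latent vectors cross-attend over the LLM's token hidden states."""

    def __init__(self, hidden_size: int, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_size) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, token_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = real token
        queries = self.latents.unsqueeze(0).expand(token_states.size(0), -1, -1)
        attended, _ = self.cross_attn(
            queries, token_states, token_states,
            key_padding_mask=~attention_mask.bool(),
        )
        attended = attended + self.mlp(attended)
        return attended.mean(dim=1)  # pooled sentence embedding: (batch, hidden)


# 1) Decoder-only LLM wrapped with LoRA rank 16 (per the technical report);
#    the target_modules list is a common choice, not confirmed by the authors.
base = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
llm = get_peft_model(base, lora_cfg)  # base weights frozen, LoRA adapters trainable

# 2) Latent attention layer: no LoRA, all of its freshly initialized weights train.
pooler = LatentAttentionPooling(hidden_size=base.config.hidden_size)

# 3) One optimizer, one learning rate, end-to-end over LoRA params + pooler params.
trainable = [p for p in llm.parameters() if p.requires_grad] + list(pooler.parameters())
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # lr value is illustrative
```

In this arrangement there is nothing to "freeze at 0" in the latent attention layer: it has no pretrained counterpart in Mistral-7B-v0.1, so it is randomly initialized and all of its parameters receive gradients directly, while the LLM itself only updates its LoRA adapters.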
