Confusing ArcticDecoderLayer::forward() implementation

#11
by sszymczyk - opened

I'm a bit confused about the ArcticDecoderLayer::forward() method implementation in the model:

  1. Does the model work correctly with parallel_attn_mlp_res set to false?
  2. There is a normalization layer called post_attention_layernorm. Do I understand correctly that, if parallel_attn_mlp_res is set to true, it actually normalizes the layer input rather than the attention output?
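
To make question 2 concrete, here is a minimal sketch (not the actual Arctic code; the helper names and the control flow are my assumptions about what the layer does) of how the parallel_attn_mlp_res flag could change which tensor post_attention_layernorm sees:

```python
# Hypothetical simplified decoder-layer forward pass, using plain callables
# in place of real modules. This only illustrates the control flow being
# asked about; it is NOT the ArcticDecoderLayer implementation.
def decoder_layer_forward(hidden, attn, mlp, residual_mlp,
                          input_layernorm, post_attention_layernorm,
                          parallel_attn_mlp_res):
    residual_input = hidden                       # layer input, saved for residuals
    hidden = attn(input_layernorm(hidden))
    hidden = residual_input + hidden              # attention residual connection

    if parallel_attn_mlp_res:
        # Here post_attention_layernorm normalizes the LAYER INPUT
        # (residual_input), not the attention output, and the residual-MLP
        # branch runs in parallel with attention.
        hidden = hidden + residual_mlp(post_attention_layernorm(residual_input))
    else:
        # Sequential variant: the norm is applied to the attention output.
        residual_attn = hidden
        hidden = residual_attn + residual_mlp(post_attention_layernorm(hidden))

    return hidden + mlp(hidden)                   # final MLP with its own residual
```

With scalar stand-ins for the modules, the two settings produce different outputs, which is exactly why the naming of post_attention_layernorm is confusing in the parallel case:

```python
out_parallel = decoder_layer_forward(
    1.0, lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x,
    lambda x: x, lambda x: x, parallel_attn_mlp_res=True)
out_sequential = decoder_layer_forward(
    1.0, lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x,
    lambda x: x, lambda x: x, parallel_attn_mlp_res=False)
```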
