Is the CausalLM model from HuggingFace truly causal?

#46
by Cyrile - opened

Hello, I have a technical question about training for causal text generation. I noticed that the training objective is cross-entropy computed on a simple shift of the input_ids. The attention mechanism is causal thanks to the mask, but the feed-forward part is non-causal, am I correct? If so, isn't the way the model is trained in the HuggingFace library incorrect? Shouldn't we apply cross-entropy only to the prediction of the last token, or also put a causal mask on the MLP part?
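For reference, the "simple shift of the input_ids" mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the HuggingFace implementation; the function name and shapes are hypothetical. Position t's logits are scored against token t+1, and the loss is averaged over all positions, not just the last one.

```python
import numpy as np

def shifted_cross_entropy(logits, input_ids):
    """Causal LM loss sketch: position t predicts token t+1.

    logits: (seq_len, vocab) array of unnormalized scores.
    input_ids: (seq_len,) array of token ids.
    Hypothetical helper mirroring the label shift used in causal LM training.
    """
    # Shift: logits[:-1] are scored against input_ids[1:].
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:]
    # Log-softmax, computed stably.
    z = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood averaged over every predicted position.
    nll = -log_probs[np.arange(len(shift_labels)), shift_labels]
    return nll.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))     # toy sequence of 5 tokens, vocab 10
input_ids = np.array([3, 1, 4, 1, 5])
loss = shifted_cross_entropy(logits, input_ids)
print(loss)
```

Because the attention mask is causal, the logits at position t only depend on tokens up to t, so supervising every position is valid.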

Cyrile changed discussion status to closed

Excuse me, I made a mistake; the reasoning above is wrong.
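For readers landing on this thread: the resolution is that the feed-forward (MLP) block in a transformer is applied position-wise, so it cannot mix information across tokens and needs no causal mask. A minimal NumPy sketch (illustrative weights, not any real model's) makes this checkable: perturbing the last position leaves the MLP output at all earlier positions bit-for-bit unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def mlp(x):
    # Applied independently to each position (row): the matmuls act on
    # the feature axis only, so no mixing happens along the sequence axis.
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.normal(size=(5, d))   # toy sequence of 5 positions
y = mlp(x)

# Perturb ONLY the last position.
x2 = x.copy()
x2[-1] += 1.0
y2 = mlp(x2)

# Earlier positions are untouched: the MLP leaks no future information.
print(np.allclose(y[:-1], y2[:-1]))  # True
```

This is why the shifted cross-entropy objective is sound: only the attention layers mix positions, and those are the ones masked.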
