Generated text has issues
Hi, thanks for the work. I have fine-tuned this model on my own dataset, the same way we fine-tune other causal language models such as EleutherAI/gpt-j-6B, EleutherAI/gpt-neo-2.7B, etc. But the generated texts are only numbers like 0, 1, etc.
Those models show no such issue.
I would highly appreciate any suggestion/help in this regard.
Some of the results are:
Generated Text:
"................................................................................................................................"

Input Text: "Bitcoin is "
Generated Text: "Bitcoin is 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3"
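For reference, outputs like the above are presumably produced with something along the lines of the standard Hugging Face generation API; here is a minimal sketch (the checkpoint path and generation settings are placeholders, not the actual ones used above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the fine-tuned checkpoint; swap in your own.
model_path = "path/to/finetuned-model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

inputs = tokenizer("Bitcoin is ", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# With the mask issue discussed below, this prints degenerate text such as "3 3 3 ...".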
@asifhugs Thank you for reaching out! I believe this is caused by the causal-mask issue during training.
To achieve bidirectional attention at inference time, we zero out the causal mask via layer.bias[:] = 0. This is fine for inference, since during autoregressive generation the model naturally cannot see future tokens, so removing the causal mask causes no problem.
For training / fine-tuning, however, we should revert this and control the causal mask manually for each sequence: the prompt part should be fully bidirectional, and the generation part should keep the causal mask. Otherwise there will be information leakage during training (each token can see the entire sequence), and the model won't learn anything meaningful.
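To make the intended mask pattern concrete, here is a small illustrative sketch (a toy example of mine, not code from this repo) that builds a PrefixLM-style attention mask for a single sequence: the first prefix_len tokens attend bidirectionally, the rest attend causally (1 = may attend, 0 = masked):

import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.uint8))
    # Let every position attend to all prefix (prompt) tokens.
    mask[:, :prefix_len] = 1
    return mask

print(prefix_lm_mask(seq_len=6, prefix_len=3))
# tensor([[1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]], dtype=torch.uint8)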
Hi @asifhugs, a quick fix is to reset the causal mask after loading the trained model, e.g.:
import torch

model = ...  # the fine-tuned model
max_positions = 2048

# Restore a standard lower-triangular (causal) mask in every attention layer.
for i in range(len(model.transformer.h)):
    model.transformer.h[i].attn.bias[:] = torch.tril(
        torch.ones((max_positions, max_positions), dtype=torch.uint8)
    ).view(1, 1, max_positions, max_positions)
After doing this, the model becomes a pure causal language model.
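As a quick sanity check (my own sketch, assuming the model and max_positions from the snippet above and the bias buffer layout used there), you can verify that a layer's mask is lower-triangular again after the reset:

import torch

expected = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
    1, 1, max_positions, max_positions
)
# Compare one layer's bias buffer against the expected causal mask.
assert torch.equal(model.transformer.h[0].attn.bias.to(torch.uint8).cpu(), expected)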
If you want to keep PrefixLM-style training, you should pass a prefix_mask argument to tell the model which part is the prefix/prompt, and write your own model class so the model can attend to the whole prompt context. For example, we can insert the following code after here:
# `prefix_mask` is passed as an argument with a shape of (bsz, seqlen)
if prefix_mask is not None:
    bsz = query.size(0)
    causal_mask = causal_mask.repeat(bsz, 1, 1, 1)   # (bsz, 1, src_len, tgt_len)
    causal_mask = causal_mask.permute(0, 3, 1, 2)    # (bsz, tgt_len, 1, src_len)
    causal_mask[prefix_mask.bool()] = 1              # prefix tokens become visible to all positions
    causal_mask = causal_mask.permute(0, 2, 3, 1)    # (bsz, 1, src_len, tgt_len)
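For illustration, such a prefix_mask could be built from the per-example prompt lengths, e.g. (a hedged sketch with made-up numbers; the forward signature that accepts prefix_mask is the custom change described above, not an existing argument):

import torch

# Hypothetical example: prompt lengths per example in a padded batch of length 8.
prompt_lengths = torch.tensor([3, 5])
seq_len = 8

positions = torch.arange(seq_len).unsqueeze(0)         # (1, seq_len)
prefix_mask = positions < prompt_lengths.unsqueeze(1)  # (bsz, seq_len), True for prompt tokens

# Passed to the customized forward, e.g.:
# outputs = model(input_ids, prefix_mask=prefix_mask, labels=labels)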
Thanks a lot @juewang for the detailed comment. I will try these and will let you know. Thanks again!
Would it be possible to just pass the attention_mask in the forward pass during training, so that one could train prefix-style without having to change the underlying code?
Thank you!
@JacopoBandoni I'm afraid not. attention_mask is used to indicate padding tokens, which should be masked, while prefix_mask is used to indicate the bidirectional context.
You might want to have a look at this as a reference for fine-tuning :)
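To illustrate the distinction (a toy sketch with made-up tensors, not repo code): for a right-padded batch, attention_mask marks real vs. padding tokens, while prefix_mask marks which of the real tokens belong to the bidirectional prompt:

import torch

# Batch of 2, sequence length 6; 0s in attention_mask are padding positions.
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0],
                               [1, 1, 1, 0, 0, 0]])

# prefix_mask flags the prompt tokens that should get bidirectional attention.
prefix_mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                            [1, 1, 0, 0, 0, 0]])

# Padding is always masked; prompt positions are a subset of the real tokens.
assert torch.all(prefix_mask <= attention_mask)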
Hi @asifhugs, did you succeed in training?
Hi, no, not yet!