
Generated text has issues

#22
by asifahmed - opened

Hi, thanks for the work. I have fine-tuned this model the same way I fine-tune other causal language models such as EleutherAI/gpt-j-6B and EleutherAI/gpt-neo-2.7B, using my own dataset. But the generated texts are only numbers like 0, 1, etc.

The other models, by contrast, have no such issues.

I would highly appreciate any suggestion/help in this regard.

Some of the results are:

Generated text:

................................................................................................................................

InputText = "Bitcoin is "
{'Generated Text:', 'Bitcoin is 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 '}

Together org

@asifhugs Thank you for reaching out! I believe this is the causal-mask issue for training.

To achieve bidirectional attention for inference, we zero out the causal mask via layer.bias[:] = 0. This is fine for inference, since future tokens do not exist yet during generation, so removing the causal mask causes no problem there.

To do training / fine-tuning, however, we should revert this change and manually control the mask for each sequence: the prompt part should be fully attendable (bidirectional) and the generation part should use the causal mask. Otherwise there will be information leakage during training (each token can see the entire sequence), and the model won't learn anything meaningful.
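
To make the masking concrete, here is a small, self-contained PyTorch sketch (illustration only, not code from this repository) comparing a causal mask with the PrefixLM-style mask described above, for a toy 6-token sequence whose first 3 tokens are the prompt:

import torch

seq_len, prompt_len = 6, 3

# Standard causal mask: query position i may only attend to key positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.uint8))

# PrefixLM-style mask: every position may additionally attend to the prompt
# tokens (the first `prompt_len` key positions); the rest stays causal.
prefix_lm = causal.clone()
prefix_lm[:, :prompt_len] = 1

# With no mask at all, every token attends to the whole sequence --
# the information leakage described above.
print(causal)
print(prefix_lm)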

Hi @juewang,
Thank you so much for the response and for identifying the issue.
I am currently using this script for the task; could you please take a look and point out the changes needed to avoid the issue?
I would be very thankful!

Thanks a lot again,
Asif

Hi @asifhugs, a quick fix is to reset the causal mask after loading the trained model, e.g.:

import torch

model = ...  # your fine-tuned model
max_positions = 2048

# Restore the lower-triangular causal mask buffer in every attention layer.
for i in range(len(model.transformer.h)):
    model.transformer.h[i].attn.bias[:] = torch.tril(
        torch.ones((max_positions, max_positions), dtype=torch.uint8)
    ).view(1, 1, max_positions, max_positions)

After doing this, the model becomes a pure causal language model.
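
For reference, a minimal end-to-end sketch of this quick fix (the checkpoint path is a placeholder, and it assumes a transformers version whose GPT-J attention blocks still expose the bias causal-mask buffer; the buffer's dtype differs across versions, so we cast to match):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/your-finetuned-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Restore the lower-triangular causal mask in every attention block.
max_positions = 2048
causal = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
    1, 1, max_positions, max_positions
)
for block in model.transformer.h:
    block.attn.bias[:] = causal.to(block.attn.bias.dtype)

# Sanity check: generation should no longer collapse to a single repeated token.
inputs = tokenizer("Bitcoin is ", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))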

If you want to keep PrefixLM-style training, you should pass a prefix_mask as an argument to tell the model which part is the prefix/prompt, and write your own custom model class so that the model can attend to the whole prompt context. For example, we can insert the following code after here:

# `prefix_mask` is passed as an argument with a shape of (bsz, seqlen)
if prefix_mask is not None:
    bsz = query.size(0)
    causal_mask = causal_mask.repeat(bsz, 1, 1, 1) # (bsz, 1, src_len, tgt_len)
    causal_mask = causal_mask.permute(0, 3, 1, 2) # (bsz, tgt_len, 1, src_len)
    causal_mask[prefix_mask.bool()] = 1
    causal_mask = causal_mask.permute(0, 2, 3, 1) # (bsz, 1, src_len, tgt_len)
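
To hook this into a training loop, the prefix_mask itself is just a (bsz, seqlen) tensor that marks the prompt tokens. Below is a small sketch of how it could be built from per-example prompt lengths; the helper name and the final forward call are hypothetical, since prefix_mask only exists in the custom model class described above:

import torch

def build_prefix_mask(prompt_lengths, seq_len):
    """Return a (bsz, seq_len) mask that is 1 on prompt tokens and 0 elsewhere."""
    positions = torch.arange(seq_len).unsqueeze(0)          # (1, seq_len)
    lengths = torch.as_tensor(prompt_lengths).unsqueeze(1)  # (bsz, 1)
    return (positions < lengths).long()                     # (bsz, seq_len)

# Example: two sequences of length 8 whose prompts are 3 and 5 tokens long.
prefix_mask = build_prefix_mask([3, 5], seq_len=8)
print(prefix_mask)
# tensor([[1, 1, 1, 0, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 0, 0, 0]])

# outputs = model(input_ids, labels=labels, prefix_mask=prefix_mask)  # custom forward only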

Thanks a lot @juewang for the detailed comment. I will try these and will let you know.
Thanks again!

Hi @asifhugs, did you succeed in training?

@juewang

Would it be possible to just pass the attention_mask in the forward pass during training, so that one could train prefix-style without having to change the underlying code?
Thank you!

Together org

@JacopoBandoni I am afraid not: attention_mask is used to indicate padding tokens, which should be masked out, whereas prefix_mask is used to indicate the bidirectional prefix context (see the toy illustration below).
You might want to have a look at this as a reference for fine-tuning :)
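
To make the distinction concrete, here is a toy sketch (the gpt2 tokenizer is just a stand-in) of what attention_mask actually encodes, compared with what a prefix_mask would need to encode:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

batch = tokenizer(
    ["Bitcoin is", "The quick brown fox jumps over the lazy dog"],
    padding=True,
    return_tensors="pt",
)
# attention_mask: 1 = real token, 0 = padding; it says nothing about prompts.
print(batch["attention_mask"])

# A prefix_mask for the same batch would instead be 1 exactly on the prompt
# tokens of each sequence (see the build_prefix_mask sketch earlier in the thread).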

Hi @asifhugs, did you succeed in training?

Hi, no, not yet!
