
Generated text has issues

#22
by asifahmed - opened

Hi, thanks for the work. I have fine-tuned this model the same way I fine-tune other causal language models such as EleutherAI/gpt-j-6B and EleutherAI/gpt-neo-2.7B, using my own dataset. But the generated texts are only numbers like 0, 1, etc.

The other models, by contrast, have no such issues.

I would highly appreciate any suggestion/help in this regard.

Some of the results are:

Generated text:

................................................................................................................................

InputText = "Bitcoin is "
{'Generated Text:', 'Bitcoin is 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 '}

Together org

@asifhugs Thank you for reaching out! I believe this is the causal-mask issue for training.

To achieve bidirectional attention for inference, we zero out the causal mask via layer.bias[:] = 0. This is fine for inference, since future tokens do not exist yet during generation, so removing the causal mask causes no problem there.

To do training / fine-tuning, however, we should revert this change and manually control the mask for each sequence: the prompt part should be fully attendable (bidirectional) and the generation part should use the causal mask. Otherwise there will be information leakage during training (each token can see the entire sequence), and the model won't learn anything meaningful.
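
To make the masking concrete, here is a small, self-contained PyTorch sketch (illustration only, not code from this repository) comparing a causal mask with the PrefixLM-style mask described above, for a toy 6-token sequence whose first 3 tokens are the prompt:

import torch

seq_len, prompt_len = 6, 3

# Standard causal mask: query position i may only attend to key positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.uint8))

# PrefixLM-style mask: every position may additionally attend to the prompt
# tokens (the first `prompt_len` key positions); the rest stays causal.
prefix_lm = causal.clone()
prefix_lm[:, :prompt_len] = 1

# With no mask at all, every token attends to the whole sequence --
# the information leakage described above.
print(causal)
print(prefix_lm)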

Hi @juewang,
Thank you so much for the response and for identifying the issue.
I am currently using this script for the task; could you please take a look and point out the changes needed to avoid the issue?
I would be very thankful!

Thanks a lot again,
Asif

Hi @asifhugs, a quick fix is to reset the causal mask after loading the trained model, e.g.:

import torch

model = ...  # your fine-tuned model
max_positions = 2048

# Restore the lower-triangular causal mask buffer in every attention layer.
for i in range(len(model.transformer.h)):
    model.transformer.h[i].attn.bias[:] = torch.tril(
        torch.ones((max_positions, max_positions), dtype=torch.uint8)
    ).view(1, 1, max_positions, max_positions)

After doing this, the model becomes a pure causal language model.
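
For reference, a minimal end-to-end sketch of this quick fix (the checkpoint path is a placeholder, and it assumes a transformers version whose GPT-J attention blocks still expose the bias causal-mask buffer; the buffer's dtype differs across versions, so we cast to match):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/your-finetuned-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Restore the lower-triangular causal mask in every attention block.
max_positions = 2048
causal = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
    1, 1, max_positions, max_positions
)
for block in model.transformer.h:
    block.attn.bias[:] = causal.to(block.attn.bias.dtype)

# Sanity check: generation should no longer collapse to a single repeated token.
inputs = tokenizer("Bitcoin is ", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))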

If you want to keep PrefixLM-style training, you should pass a prefix_mask as an argument to tell the model which part is the prefix/prompt, and write your own custom model class so that the model can attend to the whole prompt context. For example, we can insert the following code after here:

# `prefix_mask` is passed as an argument with a shape of (bsz, seqlen)
if prefix_mask is not None:
    bsz = query.size(0)
    causal_mask = causal_mask.repeat(bsz, 1, 1, 1) # (bsz, 1, src_len, tgt_len)
    causal_mask = causal_mask.permute(0, 3, 1, 2) # (bsz, tgt_len, 1, src_len)
    causal_mask[prefix_mask.bool()] = 1
    causal_mask = causal_mask.permute(0, 2, 3, 1) # (bsz, 1, src_len, tgt_len)
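
To hook this into a training loop, the prefix_mask itself is just a (bsz, seqlen) tensor that marks the prompt tokens. Below is a small sketch of how it could be built from per-example prompt lengths; the helper name and the final forward call are hypothetical, since prefix_mask only exists in the custom model class described above:

import torch

def build_prefix_mask(prompt_lengths, seq_len):
    """Return a (bsz, seq_len) mask that is 1 on prompt tokens and 0 elsewhere."""
    positions = torch.arange(seq_len).unsqueeze(0)          # (1, seq_len)
    lengths = torch.as_tensor(prompt_lengths).unsqueeze(1)  # (bsz, 1)
    return (positions < lengths).long()                     # (bsz, seq_len)

# Example: two sequences of length 8 whose prompts are 3 and 5 tokens long.
prefix_mask = build_prefix_mask([3, 5], seq_len=8)
print(prefix_mask)
# tensor([[1, 1, 1, 0, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 0, 0, 0]])

# outputs = model(input_ids, labels=labels, prefix_mask=prefix_mask)  # custom forward only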

Thanks a lot @juewang for the detailed comment. I will try these and will let you know.
Thanks again!

Hi @asifhugs, did you succeed in training?

@juewang

Would it be possible to just pass the attention_mask in the forward pass during training, so that one could train prefix-style without having to change the underlying code?
Thank you!

Together org

@JacopoBandoni I am afraid not: attention_mask is used to indicate padding tokens, which should be masked out, whereas prefix_mask is used to indicate the bidirectional prefix context (see the toy illustration below).
You might want to have a look at this as a reference for fine-tuning :)
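
To make the distinction concrete, here is a toy sketch (the gpt2 tokenizer is just a stand-in) of what attention_mask actually encodes, compared with what a prefix_mask would need to encode:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

batch = tokenizer(
    ["Bitcoin is", "The quick brown fox jumps over the lazy dog"],
    padding=True,
    return_tensors="pt",
)
# attention_mask: 1 = real token, 0 = padding; it says nothing about prompts.
print(batch["attention_mask"])

# A prefix_mask for the same batch would instead be 1 exactly on the prompt
# tokens of each sequence (see the build_prefix_mask sketch earlier in the thread).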

Hi @asifhugs, did you succeed in training?

Hi, no, not yet!
