Attention mask for generation function in the future?

#7
by rchan26 - opened

In the card it states:

In the generation function, our model currently does not support beam search (`num_beams` > 1) and `attention_mask` parameters. Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings (instead of the model's).

I was just wondering whether there is any intention to support the `attention_mask` parameter in the future, or has this simply not been implemented yet?
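For context, this is roughly the kind of batched generation call I'd like to be able to run once attention_mask is supported (just a sketch; the pad-token choice and left padding are my own assumptions, not anything from the card):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
# Assumptions for batching a decoder-only model: reuse EOS as the pad token
# and pad on the left so generation continues from the real tokens.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

inputs = tokenizer(["def add(a, b):", "print('hello')"], padding=True, return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # the parameter the card says is not yet supported
    max_new_tokens=32,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))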

Microsoft org

Hello @rchan26 ! I hope everything is going well with you.

This was our first deployment of a model to HF, so we wanted to be sure everything was running smoothly. We already have attention_mask working locally, and our plan is to update both Phi-1 and Phi-1.5 over the next few days.

Regards,
Gustavo.


Thanks! Looking forward to testing it out! 😄

Hey @gugarosa
I assume the attention_mask parameter isn't set up yet because you're using a torch.nn.Sequential wrapper.
Has the team created a custom torch class to square this away yet? I'm looking to fine-tune and run inference in batches. If it's not something in the pipeline, I'll probably just write the custom torch class myself. But if it's something that's going to get squared away soon, I won't waste my time. Let me know, and thanks.
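To illustrate what I mean about the wrapper (a toy sketch, not the actual Phi code): nn.Sequential only threads a single positional input through its children, so there's no clean way to route a keyword argument like attention_mask down to the blocks.

import torch

blocks = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
x = torch.randn(2, 8)
print(blocks(x).shape)  # works: a single positional tensor is passed along
# blocks(x, attention_mask=torch.ones(2, 8))  # TypeError: unexpected keyword argument 'attention_mask'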

Microsoft org

Hello @rchan26 and @bennicholl !

I just updated the model files and added attention_mask support. Sorry for taking so much time. This should serve as a proxy until Phi gets fully implemented in transformers.

However, please note that we still do not support attention_mask during training/fine-tuning, only inference. This should be added in the coming days.

Thanks for working on this! When I try to perform inference with the code below,

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
# tokened_words / attention_mask are the tokenizer's input_ids / attention_mask lists
model(input_ids=torch.tensor(tokened_words), attention_mask=torch.tensor(attention_mask))

I get a ValueError:
ValueError: not enough values to unpack (expected 3, got 2)

It seems the error comes from the code below
541 kv = update_kv_cache(qkv[:, :, 1:], past_key_values, self.layer_idx)
543 if attention_mask is not None:
--> 544 attention_mask, cu_seqlens, max_seqlen = attention_mask
545 attention_mask = attention_mask.to(qkv.device)
547 attention_kwargs = {"attention_mask": attention_mask}

But I'm not sure what's going on in that code.

Microsoft org

My bad @bennicholl !

Just fixed this. We use the flash-attn style for performing cached inference and the attention layer was not aware that attention_mask could be passed as a single tensor.

It should be working now. I tested inference the way you posted it, and it worked!
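For anyone who hits the same unpack error on an older snapshot, the fix is conceptually just a guard like the following (a sketch of the idea with a hypothetical helper name, not the actual modeling code):

import torch
from typing import Optional, Tuple, Union

def unpack_attention_mask(
    attention_mask: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor, int], None],
    device: torch.device,
) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[int]]:
    # Accept either a plain mask tensor or the (mask, cu_seqlens, max_seqlen)
    # triple used by the flash-attn style cached-inference path.
    if attention_mask is None:
        return None, None, None
    if isinstance(attention_mask, tuple):
        mask, cu_seqlens, max_seqlen = attention_mask
    else:
        mask, cu_seqlens, max_seqlen = attention_mask, None, None
    return mask.to(device), cu_seqlens, max_seqlen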

@gugarosa Thanks for the quick response, man! I think there may be a bug in the attention masking: the output for a sentence is different if I run two examples instead of one. Here is some code to reproduce it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# HERE IS CODE RUNNING ONE SENTENCE
encoded_inputs = tokenizer(["this is the first sentence"])
print(encoded_inputs)
# {'input_ids': [[5661, 318, 262, 717, 6827]], 'attention_mask': [[1, 1, 1, 1, 1]]}

tokened_words = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']
model(input_ids=torch.tensor(tokened_words), attention_mask=torch.tensor(attention_mask))
OUTPUT:
CausalLMOutputWithPast(loss=None, logits=tensor([[[15.9766, 16.5625, 13.4219, ..., 2.6074, 2.6074, 2.6074],
[12.3047, 15.2344, 10.3672, ..., 2.3027, 2.3047, 2.3027],
[ 8.8672, 11.7188, 6.6055, ..., 1.0361, 1.0371, 1.0371],
[12.4844, 13.6406, 7.1406, ..., 0.2700, 0.2722, 0.2703],
[20.4688, 22.5625, 14.8438, ..., 3.3477, 3.3477, 3.3457]]],

# HERE IS CODE RUNNING TWO SENTENCES
encoded_inputs = tokenizer(["this is the first sentence", "this is another sentence and is longer than the first"], padding='longest')
print(encoded_inputs)
# {'input_ids': [[5661, 318, 262, 717, 6827, 50256, 50256, 50256, 50256, 50256], [5661, 318, 1194, 6827, 290, 318, 2392, 621, 262, 717]],
#  'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}  # MASKING LOOKS CORRECT

tokened_words = encoded_inputs['input_ids']
attention_mask = encoded_inputs['attention_mask']
model(input_ids=torch.tensor(tokened_words), attention_mask=torch.tensor(attention_mask))
OUTPUT:
CausalLMOutputWithPast(loss=None, logits=tensor([[[15.9688, 16.5625, 13.4219, ..., 2.6074, 2.6094, 2.6074],  # NOTICE: SOME VALUES, SUCH AS THE VERY FIRST ONE IN THE
[12.3125, 15.2344, 10.3672, ..., 2.3047, 2.3047, 2.3047],  # UPPER LEFT-HAND CORNER, ARE DIFFERENT FROM THE VALUE
[ 8.8672, 11.7188, 6.6055, ..., 1.0391, 1.0400, 1.0400],   # IN THE SAME LOCATION IN THE FIRST MATRIX
...,
[13.9922, 17.0156, 18.8750, ..., 2.4453, 2.4453, 2.4453],
[13.8750, 16.8750, 18.7500, ..., 2.4082, 2.4062, 2.4062],
[13.7109, 16.6094, 18.5625, ..., 2.3477, 2.3457, 2.3457]],

    [[15.9688, 16.5625, 13.4219,  ...,  2.6074,  2.6094,  2.6074],
     [12.3125, 15.2344, 10.3672,  ...,  2.3047,  2.3047,  2.3047],
     [12.3125, 14.6250,  7.8828,  ...,  0.5962,  0.5967,  0.5972],
     ...,
     [10.6875, 15.7188,  9.0234,  ...,  1.4434,  1.4424,  1.4414],
     [ 8.5469, 12.7188,  6.2656,  ...,  0.2693,  0.2688,  0.2676],
     [17.5000, 20.3906, 12.9453,  ...,  2.2891,  2.2891,  2.2891]]],

Some of the values in the first matrix are different from the values in the first sequence of the second output. For example, the first value in the upper left-hand corner: 15.9766 and 15.9688 should be the same, but they are slightly different.

@bennicholl I found the issue; it was related to precision. torch_dtype="auto" was forcing the model to use FP16 (maybe via model.half()), whereas the model is expected to be used with AMP, as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    model(input_ids=torch.tensor(tokened_words), attention_mask=torch.tensor(attention_mask))

I compared the logits and now they seem to match, and I updated the README with this information. Regarding the root cause, I need to double-check, but it should be related to the RotaryEmbedding class.
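If anyone wants to re-check this on their side, here is a small script along the lines of the repro above (just a sketch; the tolerance is an arbitrary choice for fp16 autocast, and it assumes a device where float16 autocast works, as in the snippet above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

single = tokenizer(["this is the first sentence"], return_tensors="pt")
batch = tokenizer(
    ["this is the first sentence", "this is another sentence and is longer than the first"],
    padding="longest",
    return_tensors="pt",
)

with torch.no_grad(), torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    logits_single = model(**single).logits
    logits_batch = model(**batch).logits

# Compare only the five real (non-padded) positions of the first sequence.
print(torch.allclose(logits_single[0, :5], logits_batch[0, :5], atol=1e-2))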

Awesome, thanks man. Seems to be good now

gugarosa changed discussion status to closed


@gugarosa

Hello, thanks for adding the attention_mask. It seems like it still does not work for fine-tuning. Is it possible to add support for it, or maybe disable it somehow, in the HF Trainer (transformers.Trainer) when the data collator is data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)? See the rough setup sketch below.
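For reference, roughly the setup in question (a sketch; the dataset, output directory, and hyperparameters are placeholders, not anything from this thread):

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,  # placeholder: a pre-tokenized dataset
    args=transformers.TrainingArguments(
        output_dir="phi-1_5-finetune",  # placeholder path
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    # The collator pads each batch, which is what triggers the attention_mask warning.
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)
trainer.train()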

Microsoft org

Maybe cc @muellerzr regarding the trainer question

Are there any plans to support training for the Phi series?

What do we mean when we say this model is not supported for training? If I call

loss = phi_model(x_batch, attention_mask=mask_batch, labels=labels_batch)[0]  # index 0 of the output is the loss when labels are passed
loss.backward()
optimizer.step()
optimizer.zero_grad()

will the gradients not compute properly? Is this due to the mask not zeroing out the tokens that should be masked during backprop?
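In case it helps while this is open: one partial mitigation (my own workaround, not an official fix) is to keep padded positions out of the loss by setting their labels to -100, which the Hugging Face causal-LM loss ignores. Note this only masks the loss; without a working attention_mask, padded tokens can still be attended to in the forward pass.

import torch

pad_token_id = 50256  # EOS reused as the pad token, as earlier in this thread
input_ids = torch.tensor([[5661, 318, 262, 717, 6827, 50256, 50256, 50256]])

# Labels are a copy of input_ids with padded positions replaced by -100,
# the ignore index used by CrossEntropyLoss in the HF causal-LM head.
labels = input_ids.clone()
labels[input_ids == pad_token_id] = -100
print(labels)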

Yeah, maybe. During my fine-tuning, I encountered a warning:

`attention_mask` is not supported during training. Using it might lead to unexpected results.
{'loss': 1.3228, 'learning_rate': 1.999875577156579e-05, 'epoch': 0.02}
  1%|▍ | 300/59745 [06:19<20:47:29,  1.26s/it]`attention_mask` is not supported during training. Using it might lead to unexpected results.
  1%|▍ | 301/59745 [06:20<20:48:14,  1.26s/it]`attention_mask` is not supported during training. Using it might lead to unexpected results.
(... the same warning repeats on every step ...)
  1%|▍ | 309/59745 [06:30<20:49:49,  1.26s/it]`attention_mask` is not supported during training. Using it might lead to unexpected results.
{'loss': 1.5263, 'learning_rate': 1.9998671442394832e-05, 'epoch': 0.02}

@SinclairWang I've encountered that warning as well. While fine-tuning, my loss kept going down, but my outputs for my specific task were clearly not improving. That's why I'm curious about the reason for the warning, and how the attention mask could work for the forward pass but not for backprop.

(Attached: image.png, my training loss curve.)

"While fine tuning my loss was continuing to go down, but my outputs for my specific task were clearly not improving. "

I also observed the same case. It confused me. I may not continue to fine-tune this model as I can not be sure the processing is ok due to the issue of the attention mask until this issue is solved.

I also get bad results when fine-tuning, probably because of the attention mask problem.

I also get the same warning.

So, any solutions?

I hope @gugarosa will help. He said they will fix that as well. It is probably not easy to fix and test such a thing.

I am not aware of any model of similar size and performance.

I am also looking for powerful models with about 1B parameters.
