Masking the image tokens during training

#68
by jchiu1234 - opened

Should the image tokens in the decoder output be masked during training? I'm trying to wrap my head around this. One argument against masking is that you might want the model to learn that the preceding inputs are images. Could someone advise whether the model should be trained with the image tokens masked or not?
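To make the question concrete, here is a minimal sketch of what "masking the image tokens" is assumed to mean here: excluding image positions from the loss while leaving them in the input. It is plain Python for illustration, not this model's training code. `IMAGE_TOKEN_ID`, `build_labels`, and `masked_nll` are hypothetical names, the value `32000` is a made-up placeholder, and `-100` is borrowed from the default `ignore_index` of PyTorch's cross-entropy loss. The sketch also assumes the labels are already aligned with the predictions (no next-token shift shown).

```python
import math

IGNORE_INDEX = -100      # assumption: sentinel that PyTorch's cross_entropy skips by default
IMAGE_TOKEN_ID = 32000   # hypothetical placeholder id, not taken from any real tokenizer


def build_labels(input_ids, mask_image_tokens=True):
    """Copy input_ids into labels, optionally hiding image positions from the loss."""
    if not mask_image_tokens:
        return list(input_ids)
    return [IGNORE_INDEX if tok == IMAGE_TOKEN_ID else tok for tok in input_ids]


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    log_probs[i] is a dict mapping token id -> predicted log-probability at position i.
    """
    kept = [(lp, y) for lp, y in zip(log_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        return 0.0  # every position was masked, so nothing contributes to the loss
    return -sum(lp[y] for lp, y in kept) / len(kept)


# Usage: the two image positions are replaced by IGNORE_INDEX and skipped by the loss.
input_ids = [1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 5, 6]
labels = build_labels(input_ids)
```

In this sketch, masking only changes the loss targets. The image tokens stay in the input sequence, so the text positions are still predicted with the image positions in context.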
