Different tokenization method

#1
by akashe - opened

The tokenizer for the model tokenizes a sequence as:
[CLS] question_tokens [QUESTION] . [SEP] context_tokens [SEP]

whereas the implementation in the paper is:
context_tokens [SEP] question_tokens [QUESTION]

Can someone confirm which sequence format the model was trained with?

Tel Aviv University org

Hi @akashe, thanks for using our model!

First, the Splinter tokenizer by default tokenizes as follows (note the . comes before the [QUESTION] token):

[CLS] question_tokens . [QUESTION] [SEP] context_tokens [SEP]

This is consistent with our repo, but not with Figure 3 from our paper, my bad :)
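
For concreteness, here's a minimal sketch for checking the format directly (assuming the standard transformers API; the question/context strings are arbitrary placeholders):

```python
# Inspect the sequence format that SplinterTokenizer actually produces.
from transformers import SplinterTokenizer

tokenizer = SplinterTokenizer.from_pretrained("tau/splinter-base-qass")

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."

inputs = tokenizer(question, context)
print(tokenizer.decode(inputs["input_ids"]))
# Expected shape of the output, per the format above:
# [CLS] question_tokens . [QUESTION] [SEP] context_tokens [SEP]
```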

In any case, it doesn't really matter, because the model wasn't pretrained in this format - the format is only relevant for fine-tuning.
The model splinter-base-qass isn't fine-tuned; you can fine-tune it with either format and will probably get pretty much the same results.

Best,
Ori

Hi Ori,

So the QASS layer isn't fine-tuned on any dataset and is just randomly initialized? Is there somewhere we can find a fine-tuned QASS head?

Best,
Akash

Tel Aviv University org

It's not fine-tuned, but also not randomly initialized :)
The QASS layer is pretrained along with the model.
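
If you load the checkpoint into the question-answering class, those pretrained QASS weights come along with it. A minimal sketch, assuming the standard transformers API:

```python
# Load splinter-base-qass together with its pretrained QASS head.
from transformers import SplinterForQuestionAnswering

model = SplinterForQuestionAnswering.from_pretrained("tau/splinter-base-qass")
# The span-selection (QASS) head weights come from pretraining, not from
# fine-tuning on any QA dataset, so the model can be fine-tuned as-is.
```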

oriram changed discussion status to closed
