Different tokenization method

#1
by akashe - opened

The tokenizer for the model tokenizes a sequence as:
[CLS] question_tokens [QUESTION] . [SEP] context_tokens [SEP]

whereas the implementation in the paper is:
context_tokens [SEP] question_tokens [QUESTION]

Can someone confirm which sequence format the model was trained with?

Tel Aviv University org

Hi @akashe, thanks for using our model!

First, the Splinter tokenizer by default tokenizes as follows (note the . comes before the [QUESTION] token):

[CLS] question_tokens . [QUESTION] [SEP] context_tokens [SEP]

This is consistent with our repo, but not with Figure 3 from our paper, my bad :)
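
For concreteness, here's a minimal sketch for checking the format directly (assuming the standard transformers API; the question/context strings are arbitrary placeholders):

```python
# Inspect the sequence format that SplinterTokenizer actually produces.
from transformers import SplinterTokenizer

tokenizer = SplinterTokenizer.from_pretrained("tau/splinter-base-qass")

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."

inputs = tokenizer(question, context)
print(tokenizer.decode(inputs["input_ids"]))
# Expected shape of the output, per the format above:
# [CLS] question_tokens . [QUESTION] [SEP] context_tokens [SEP]
```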

In any case, it doesn't really matter, because the model wasn't pretrained in this format - the format is only relevant for fine-tuning.
The model splinter-base-qass isn't fine-tuned; you can fine-tune it with either format and will probably get pretty much the same results.

Best,
Ori

Hi Ori,

So the QASS layer isn't fine-tuned on any dataset and is just randomly initialized? Is there somewhere we can find a fine-tuned QASS head?

Best,
Akash

Tel Aviv University org

It's not fine-tuned, but also not randomly initialized :)
The QASS layer is pretrained along with the model.
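
If you load the checkpoint into the question-answering class, those pretrained QASS weights come along with it. A minimal sketch, assuming the standard transformers API:

```python
# Load splinter-base-qass together with its pretrained QASS head.
from transformers import SplinterForQuestionAnswering

model = SplinterForQuestionAnswering.from_pretrained("tau/splinter-base-qass")
# The span-selection (QASS) head weights come from pretraining, not from
# fine-tuning on any QA dataset, so the model can be fine-tuned as-is.
```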

oriram changed discussion status to closed
