Tokenization of more than 2 sequences

#21
by jaandoui - opened

In the fine-tuning, we have the choice of providing the data in two formats: (sequence, label) or (sequence1, sequence2, label).
I investigated, and sequence1 and sequence2 are separated only by the special token "2", and of course there is only one CLS token (token "1").
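For reference, here is a minimal sketch of what the pair format produces, assuming a Hugging Face tokenizer with the special ids described above ([CLS] = 1, [SEP] = 2); the model name is only a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder model name; substitute the checkpoint you actually fine-tune.
tokenizer = AutoTokenizer.from_pretrained("your-dna-model", trust_remote_code=True)

pair = tokenizer("ACGTACGT", "TTGGCCAA")
print(pair["input_ids"])
# Per the observation above, the layout should look like:
# [1 (CLS), <tokens of seq1>, 2 (SEP), <tokens of seq2>, ...]
```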

I would like to work with more sequences: (seq1, seq2, seq3, ..., seqN, label), where N is predefined.

So I thought I could change the code to make it work:

  • One could loop over the tokenizer and then concatenate the outputs at the end (see the sketch after this list). The problem would be the padding, max_length, and other parameters that one needs to handle.
    Why not just concatenate the sequences beforehand? Answer: I need the special tokens to define where each sequence ends.
    The way I find best is: if an invalid letter appears in the sequence, e.g. X, it will be mapped to "0". We could use that: add an invalid letter at the end of each sequence, then change every "0" to "2".
    I will make sure my sequences don't contain any invalid letters except the intended ones.
    Is this a good way of doing it? Any pitfalls? It is very easy, but I wish I had a more formal way of doing it, maybe appending the separator directly at the end of each sequence.
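For what it's worth, here is a minimal sketch of the first idea (loop over the tokenizer and concatenate), handling padding and max_length by hand instead of going through the invalid-letter trick. It assumes a Hugging Face tokenizer exposing cls_token_id, sep_token_id and pad_token_id with the values described above; the model name is a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder model name; substitute the checkpoint you actually fine-tune.
tokenizer = AutoTokenizer.from_pretrained("your-dna-model", trust_remote_code=True)

def encode_multi(sequences, max_length):
    """Build [CLS] seq1 [SEP] seq2 [SEP] ... seqN [SEP], then truncate and pad."""
    input_ids = [tokenizer.cls_token_id]
    for seq in sequences:
        ids = tokenizer(seq, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        input_ids.append(tokenizer.sep_token_id)      # marks where this sequence ends

    input_ids = input_ids[:max_length]                # truncate if too long
    attention_mask = [1] * len(input_ids)             # attend to every real token, including [SEP]

    pad_len = max_length - len(input_ids)
    input_ids += [tokenizer.pad_token_id] * pad_len   # pad up to max_length
    attention_mask += [0] * pad_len                   # never attend to padding

    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_multi(["ACGTACGT", "TTGGCCAA", "GATTACA"], max_length=32)
```

This way no letter ever has to pass through the "invalid letter → 0 → 2" remapping, and the attention_mask is built consistently with the padding in one place.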

Thank you!

A possible pitfall is the attention_mask: should I change anything in it since I'm introducing special tokens? Especially since, when printing out the attention_mask, it has 1,1,1,... everywhere except at the end. I thought it might be related to the special token "2", meaning that whenever we have "2" as a token in the BPE sequence, we would have "0" in the attention_mask. But it seems the "0" is not for that: the "0" is for the padding, because we don't want to attend to it. So nothing needs to be changed, and the BPE token "2" needs to be attended to.
Please correct me if I'm wrong.
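If it helps, a quick way to check this is to print the ids next to the mask (again with a placeholder model name, and assuming the special ids described above):

```python
from transformers import AutoTokenizer

# Placeholder model name; substitute the checkpoint you actually fine-tune.
tokenizer = AutoTokenizer.from_pretrained("your-dna-model", trust_remote_code=True)

enc = tokenizer("ACGTACGT", "TTGGCCAA", padding="max_length", max_length=16)
for tok_id, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(tok_id, mask)
# The separator id (2 here) should line up with mask == 1,
# while the trailing padding ids line up with mask == 0.
```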
