EOS and PAD tokens

#9
by dvruette - opened

The special_tokens_map.json specifies the eos and pad tokens as # and " respectively, which seems like a weird choice.

{
  "eos_token": "#",
  "pad_token": "\"",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

Is this correct? Has the model been trained on these token maps? Has the model seen the <|endoftext|> token during training?
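For comparison, a GPT-2-style checkpoint would typically declare something like the following (a hypothetical expected map, not confirmed by the repo authors):

```json
{
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
```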

I'm also seeing this, and I don't know how it would affect things downstream. I also don't see a chat template.

In my experience this causes problems after fine-tuning: the model ends up preferring single quotes over double quotes, because having " as the pad token really confuses DataCollatorForLanguageModeling.

from transformers import DataCollatorForLanguageModeling

# tokenizer is this repo's tokenizer, whose pad_token is '"'
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
features = tokenizer('I said "Hi"', return_tensors="pt")
collator([features])

produces
{'input_ids': tensor([[[   40,   531,   220,     1, 17250,     1]]]), 'attention_mask': tensor([[[1, 1, 1, 1, 1, 1]]]), 'labels': tensor([[[   40,   531,   220,  -100, 17250,  -100]]])}

Since every occurrence of the " token (id 1) is masked to -100 in the labels, the model never learns to output a double quote.
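The masking behavior can be sketched in pure Python (a simplified stand-in for DataCollatorForLanguageModeling's label building, using the token ids from the output above; the <|endoftext|> id 50256 is an assumption based on the GPT-2 vocabulary):

```python
# Simplified sketch of how DataCollatorForLanguageModeling builds labels:
# every position whose id equals pad_token_id is set to -100, so the
# loss ignores it.

def build_labels(input_ids, pad_token_id):
    """Copy input ids to labels, masking pad positions with -100."""
    return [t if t != pad_token_id else -100 for t in input_ids]

# Ids for 'I said "Hi"' from the collator output above; '"' is id 1.
ids = [40, 531, 220, 1, 17250, 1]

# With pad_token == '"' (id 1), both quotes vanish from the labels:
print(build_labels(ids, pad_token_id=1))      # [40, 531, 220, -100, 17250, -100]

# If pad were instead <|endoftext|> (assumed id 50256, as in the GPT-2
# vocab), the quote tokens would survive into the labels:
print(build_labels(ids, pad_token_id=50256))  # [40, 531, 220, 1, 17250, 1]
```

This is why repointing the pad token at <|endoftext|> before fine-tuning would avoid the issue, assuming that token exists in the vocabulary.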

Also, I don't see a chat template.

It's not a chat model.


I have the same question. It's very strange.
