model_max_length and max_seq_length

#2
by thusken - opened

Hi!

First of all great job on this sBERT model!

Secondly, it looks like something odd is going on with the model_max_length and max_seq_length attributes when loading this model through AutoTokenizer and SentenceTransformer, respectively.

The sentence-transformers implementation gives a max length of 75:

from sentence_transformers import SentenceTransformer

model_st = SentenceTransformer('NbAiLab/nb-sbert-base')
print(model_st.max_seq_length)

# 75

While loading the tokenizer through HF's AutoTokenizer gives a very different max length:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
print(tokenizer.model_max_length)

# 1000000000000000019884624838656

The second value is clearly not a real limit, but is 75 the correct max sequence length for this model? If I remember correctly, BERT models have a maximum sequence length of 512, or was that changed when fine-tuning this model?

This also means that for inputs longer than 75 tokens, the two implementations will produce different embeddings, which may be worth mentioning (a quick way to check the truncation behaviour is sketched below).
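In case it is useful, here is a minimal sketch of how to make the plain transformers tokenizer truncate the same way sentence-transformers does; the long_text string is just a made-up example, the rest follows the snippets above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
long_text = "dette er en ganske lang setning " * 100  # deliberately much longer than 75 tokens

# Default call: model_max_length is the huge sentinel value, so nothing gets truncated
print(len(tokenizer(long_text)['input_ids']))

# Mirror the sentence-transformers setting by truncating explicitly
print(len(tokenizer(long_text, truncation=True, max_length=75)['input_ids']))  # 75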

Nasjonalbiblioteket AI Lab org
edited Dec 19, 2023

Hi.

The sequence length of 75 comes from the training script we used.
The other value appears when the tokenizer has no max length set; the nb-bert-base model shows the same number.
The correct value is 75, but I wouldn't be surprised if you could raise the max length and feed in sequences of up to 512 tokens with good results.
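For anyone who wants to try that, the max_seq_length attribute on a SentenceTransformer can simply be overridden before encoding; a minimal sketch (the example sentence is made up):

from sentence_transformers import SentenceTransformer

model_st = SentenceTransformer('NbAiLab/nb-sbert-base')
model_st.max_seq_length = 512  # raise the truncation limit towards the underlying BERT maximum

embedding = model_st.encode("En ganske lang norsk tekst ...")
print(embedding.shape)  # the pooled sentence embedding for the single input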

Thanks for the quick reply, good to know! I'll experiment with input sequences of up to 512 tokens to see how they compare with the 75-token ones.

thusken changed discussion status to closed
