pritamdeka/S-PubMedBert-MS-MARCO Maximum input length

#2
by johncrtz - opened

Hello, I would like to know how many words I can input into the pritamdeka/S-PubMedBert-MS-MARCO embedding model before the input gets truncated. The documentation says something about a max-sequence-length of 384, but does that refer to tokens or words? I know that BERT generally allows up to 512 tokens, but since sentence-level embedding doesn't require tokenizing the input yourself and lets you pass in the sentence as it is, how can I make sure that I don't exceed such a token limit?
Please tell me if I mixed something up; I'm quite new to this topic and appreciate any feedback!

Hi @johncrtz. The model will truncate longer sentences to 350 tokens. Tokens in BERT are not exactly words, since the tokenizer splits words into word pieces using special symbols. "A common value for BERT & Co. are 512 word pieces, which correspond to about 300-400 words (for English). Longer texts than this are truncated to the first x word pieces." This is from the SBERT documentation. So as an approximation, if your sentence has more than around 128 words, it's better to shorten it. One thing you can do is split the sentences, calculate the embeddings for the individual split sentences, and then add the embeddings. Without experimenting, though, I am not sure what the resulting embedding quality would be, so you can try both approaches, with and without splitting. Do lemme know how your experiments work out. Cheers!
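To make the token vs. word distinction concrete, here is a minimal sketch (not part of the original reply) that counts word pieces with the model's own tokenizer; the 350-token figure is the truncation limit mentioned above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pritamdeka/S-PubMedBert-MS-MARCO")

text = "Cyclooxygenase inhibitors reduce prostaglandin synthesis."
pieces = tokenizer.tokenize(text)  # word pieces, without special tokens
print(len(text.split()), "words ->", len(pieces), "word pieces")

# encode() also adds [CLS] and [SEP]; this is the length the model actually sees
n_tokens = len(tokenizer.encode(text))
print("truncated" if n_tokens > 350 else "fits within the 350-token limit")
```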

Hi @pritamdeka, thank you very much for the clarification and the suggestions, I really appreciate them.
What I tried so far was first analyzing how large the paragraphs I want to embed are, by running them through the same tokenizer that pritamdeka/S-PubMedBert-MS-MARCO uses internally. It turns out that most of them are below 350 tokens, but there are still some above that, which means I ultimately have to split them up in some way. The approach you suggested seems reasonable to me: split every paragraph after n words, create embeddings for each sub-paragraph, and then combine them in some way, e.g. by pooling.
I was just wondering whether there is any way to control exactly how many tokens are fed into the sentence transformer without having to rely on an approximation. 128 words might work, but I think there will be many cases where a paragraph gets split even though it would actually have fit, which would create unnecessary distortion.
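For reference, roughly the token-counting analysis I ran, as a sketch (the `paragraphs` list here is just a stand-in for my actual data):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pritamdeka/S-PubMedBert-MS-MARCO")

# stand-ins for the actual paragraphs I want to embed
paragraphs = [
    "First paragraph of the corpus ...",
    "Second, much longer paragraph ...",
]

token_counts = [len(tokenizer.encode(p)) for p in paragraphs]
too_long = sum(c > 350 for c in token_counts)
print(f"{too_long} of {len(paragraphs)} paragraphs exceed 350 tokens")
```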

[Attached image: distribution of paragraph sizes in tokens]

I thought about solving this by running each word of a paragraph through the pritamdeka/S-PubMedBert-MS-MARCO tokenizer (via AutoTokenizer.from_pretrained('pritamdeka/S-PubMedBert-MS-MARCO')) while counting the resulting tokens. Whenever we reach 350 tokens, we cut the paragraph at the last word that still fits, and that becomes one chunk. We continue in the same way for the rest of the paragraph, embed each chunk, and later combine the chunk embeddings with a pooling strategy.
This way we make optimal use of the token limit, have exact control over how many tokens we feed in, and avoid unnecessary pooling.
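A rough sketch of what I mean (assuming greedy word-by-word packing; tokenizing words one at a time should closely match tokenizing the whole text, since the BERT tokenizer pre-splits on whitespace, but it is still an approximation):

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
import numpy as np

MODEL = "pritamdeka/S-PubMedBert-MS-MARCO"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = SentenceTransformer(MODEL)

def chunk_by_token_budget(paragraph, budget=348):
    """Greedily pack whole words into chunks of at most `budget` word pieces
    (348 leaves room for the [CLS] and [SEP] special tokens)."""
    chunks, current, used = [], [], 0
    for word in paragraph.split():
        n = len(tokenizer.tokenize(word))  # word pieces for this word alone
        if used + n > budget and current:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

paragraph = "..."  # one of the long paragraphs
chunks = chunk_by_token_budget(paragraph)
chunk_embeddings = model.encode(chunks)                  # one embedding per chunk
paragraph_embedding = np.mean(chunk_embeddings, axis=0)  # e.g. mean pooling
```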

What do you think of this approach? Is it unnecessarily complicated, or are you aware of better ways to tackle this?

Best regards and a happy new year,
John

Hi @johncrtz, happy new year to you too. I guess your approach is suitable; however, instead of tokenizing and counting each individual word, I would suggest using max_seq_length. What you can do is: if the paragraph's tokenized length exceeds 350 tokens, split it up at the last word; if it doesn't exceed that value, you don't need to split anything. So basically the code will involve a simple if-else check, which would be less complicated. You can try both approaches, the one you suggested and the one I suggested, and use the better one. Do lemme know how it works out. Cheers!
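A minimal sketch of the if-else idea (note: splitting recursively at the middle word is just one illustrative choice; splitting at the last word that fits, as described above, would work too):

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

MODEL = "pritamdeka/S-PubMedBert-MS-MARCO"
model = SentenceTransformer(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
limit = model.max_seq_length  # the model's configured maximum sequence length

def split_if_needed(paragraph):
    # if-else check: only paragraphs whose tokenized length exceeds the limit are split
    if len(tokenizer.encode(paragraph)) <= limit:
        return [paragraph]
    words = paragraph.split()
    if len(words) < 2:  # cannot split further; the model will truncate this one
        return [paragraph]
    mid = len(words) // 2  # illustrative choice: split at the middle word
    return (split_if_needed(" ".join(words[:mid]))
            + split_if_needed(" ".join(words[mid:])))
```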

@pritamdeka
Would mean pooling of such chunks be appropriate?

If so, imagine that you are doing this and you end up with chunks of sizes 200, 200, and 50 tokens. Should you weight the chunks by length when pooling?

@JHolmes89 you could try mean pooling as well as max pooling. Without running the experiments it's a bit difficult to say which method will be more effective, but you can try both and do lemme know how it works out. Cheers!
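As a sketch of the pooling options (the embeddings below are random stand-ins for real chunk embeddings from model.encode; the length-weighted variant uses the 200/200/50 example above):

```python
import numpy as np

# stand-ins for three chunk embeddings as returned by model.encode(chunks)
chunk_embeddings = np.random.rand(3, 768)  # 768 = BERT-base hidden size
lengths = np.array([200, 200, 50])         # token counts of the chunks

mean_pooled = chunk_embeddings.mean(axis=0)             # plain mean pooling
max_pooled = chunk_embeddings.max(axis=0)               # element-wise max pooling
length_weighted = np.average(chunk_embeddings, axis=0,  # weight chunks by length
                             weights=lengths)
```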

pritamdeka changed discussion status to closed
