Is using SciBERT as a basis still a good idea?

#2 · opened by cormak

Hello AllenAI, thank you for the great release,
I am highly interested in SPECTER, since to my knowledge it is one of the best models for embedding academic documents. This is probably due to the citation-aware triplet loss, which I think gives the resulting embeddings desirable properties.
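For context, that objective looks roughly like this (my own minimal sketch in PyTorch, not AllenAI's code; the 768-dimensional embeddings and the margin value are placeholders):

```python
import torch
import torch.nn.functional as F

def specter_triplet_loss(query_emb, pos_emb, neg_emb, margin=1.0):
    """Triplet margin loss over paper embeddings, as described in the
    SPECTER paper: a paper's embedding should be closer (in L2 distance)
    to a paper it cites (positive) than to one it does not (negative)."""
    d_pos = F.pairwise_distance(query_emb, pos_emb, p=2)
    d_neg = F.pairwise_distance(query_emb, neg_emb, p=2)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy usage with random 768-d vectors (BERT's hidden size):
q, p, n = (torch.randn(4, 768) for _ in range(3))
loss = specter_triplet_loss(q, p, n)
```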

However, I wonder whether using SciBERT as the base model is still a good idea. The model is getting a bit old, and I think the vocabulary would also benefit from an update (as far as I can tell it has not changed since the original SciBERT paper from 2019), so, for instance, there is no token for "coronavirus". Limiting the number of numeric tokens might also be worthwhile: I noticed tokens such as "30)", "[48]", "1981", and "##/10.100", which take up a sizeable share of the vocabulary and probably contribute little to the final model accuracy.
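As a quick illustration, one can check this on the released vocabulary directly (a rough sketch of my own; the regex heuristic for "numeric-like" tokens is just an approximation):

```python
import re
from transformers import AutoTokenizer

# Load the tokenizer released with the original SciBERT paper.
tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
vocab = tok.get_vocab()

# Rough heuristic: tokens made only of digits, brackets, and punctuation
# (plus the "##" continuation prefix), e.g. "30)", "[48]", "1981", "##/10.100".
numeric_like = [
    t for t in vocab
    if re.fullmatch(r"(##)?[\d\[\]().,/%:-]+", t) and any(c.isdigit() for c in t)
]

print(f"{len(numeric_like)} of {len(vocab)} tokens look numeric")
print(numeric_like[:20])
```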

Thanks again for the great contribution to the academic NLP community, and I hope my suggestions are useful.

Hello,
Thank you very much for listening to my suggestions. I will try the refreshed version ASAP.

Best regards,

@cormak Any suggestions for embedding models to try for scientific papers? I am building a RAG system over them.
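For what it's worth, SPECTER itself is a reasonable starting point for that; its model card embeds a paper by joining title and abstract with the separator token and taking the CLS vector. A minimal sketch along those lines (the sample papers here are just placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    {"title": "BERT", "abstract": "We introduce a new language representation model..."},
    {"title": "Attention Is All You Need", "abstract": "The dominant sequence transduction models..."},
]

# Concatenate title and abstract with the [SEP] token, as in the model card.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# The CLS token embedding serves as the document vector for retrieval.
embeddings = out.last_hidden_state[:, 0, :]  # shape: (num_papers, 768)
```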
