Sentence similarity between spaCy and Sentence Transformers
I am working on a project to check the similarity between two sentences. I have tried a lot of models to compare them and find out which one performs best, especially spaCy's large English model, en_core_web_lg, and all-MiniLM-L6-v2. My questions are:
1. Is there any comparison showing which model performs better (all-MiniLM-L6-v2 vs. en_core_web_lg)?
2. If I am using all-MiniLM-L6-v2, do I still need to do preprocessing (all the usual NLP preprocessing tasks: tokenization, keyword extraction)?
3. How can I look in depth at how both models work?
Hello!
These models are quite different. all-MiniLM-L6-v2 is a Sentence Transformer model trained with the purpose of producing embeddings that can be used to compute sentence similarities. On the other hand, en_core_web_lg is a large spaCy model with a lot of different components. It does technically also seem to have a "token-to-vector" component, but I would be surprised if it was better than all-MiniLM-L6-v2.

In short, all-MiniLM-L6-v2 is created with your task in mind, and en_core_web_lg is not.
1. Not really, but primarily because en_core_web_lg is not on any embedding benchmarks.
2. No. See this documentation for information on the usage; you can just provide full sentences, no tokenization necessary.
3. You can use this guide to get a bit of information on how this Sentence Transformer model works. As for en_core_web_lg, you can try looking at the spaCy models documentation.
- Tom Aarsen
Hi!
Thanks for the reply, I have a better understanding of the difference between these two models now!