Sentence similarity: spaCy vs. sentence-transformers

#42
by YouXiang - opened

I am working on a project to check the similarity between two sentences. I have tried a lot of models to compare which one performs best, especially spaCy's large English model, en_core_web_lg, and all-MiniLM-L6-v2. My questions are:

  1. Is there any comparison showing which model performs better (all-MiniLM-L6-v2 vs. en_core_web_lg)?
  2. If I am using all-MiniLM-L6-v2, do I still need to do preprocessing (all the NLP preprocessing tasks: tokenization, keyword extraction)?
  3. How can I look in depth at how both models work?
Sentence Transformers org

Hello!

These models are quite different. all-MiniLM-L6-v2 is a Sentence Transformer model trained with the purpose of producing embeddings that can be used to compute sentence similarities. On the other hand, en_core_web_lg is a large spaCy model with a lot of different components. It does technically also seem to have a "token-to-vector" component, but I would be surprised if it was better than all-MiniLM-L6-v2.

In short, all-MiniLM-L6-v2 is created with your task in mind, and en_core_web_lg is not.

  1. Not really, but primarily because en_core_web_lg is not on any embedding benchmarks.
  2. No. See this documentation for information on the usage; you can just provide full sentences, no tokenization necessary.
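To illustrate point 2, here is a minimal sketch of computing sentence similarity from embeddings. The cosine-similarity helper and the toy vectors are illustrative; the commented-out lines show how the real embeddings would come from the all-MiniLM-L6-v2 model via the sentence-transformers library (which downloads the model on first use), assuming it is installed:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With sentence-transformers installed (pip install sentence-transformers),
# raw sentences go straight into encode() with no manual tokenization.
# Commented out here because it downloads the model on first run:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# emb_a, emb_b = model.encode(["The cat sat.", "A cat was sitting."])
# print(cosine_similarity(emb_a, emb_b))

# Toy vectors standing in for real 384-dimensional embeddings:
print(round(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 1.0]), 3))  # → 0.816
```

Note that recent versions of sentence-transformers also provide similarity utilities directly, so the manual cosine function above is only there to make the computation explicit.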
  3. You can use this guide to get a bit of information on how this Sentence Transformer model works. As for en_core_web_lg, you can try looking at the spaCy models documentation perhaps.
  • Tom Aarsen

Hi!

Thanks for the reply, I have a better understanding of the difference between these two models now!
