Is it possible to use this model in a vector-similarity-based approach (cosine) to detect paraphrases?

by fikavec - opened

On "merionum/ru_paraphraser" (test samples) paraphrase dataset I got the following results:

  • This model accuracy: 0.8497920997920998
  • sentence-transformers/LaBSE accuracy: 0.7785862785862786
  • sentence-transformers/paraphrase-multilingual-mpnet-base-v2 accuracy: 0.7791060291060291
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 accuracy: 0.7567567567567568
  • sentence-transformers/distiluse-base-multilingual-cased-v1 accuracy: 0.752079002079002

Maybe some other models should be tested?

Is it possible to use this model in a vector-similarity-based approach (cosine) to detect paraphrases?

No, this model is not intended to produce embeddings that can be plugged into a cosine-similarity computation. Instead, you should use it as a cross-encoder (the example code snippet is in the model card).

Maybe some other models should be tested?

If you want an "embeddings + cosine similarities" pipeline, I can recommend looking at models from this benchmark: https://github.com/avidale/encodechka
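
For reference, a minimal sketch of such an "embeddings + cosine" pipeline with sentence-transformers (the model name and the 0.5 threshold below are placeholders, not recommendations; the decision threshold should be tuned on labelled data):

from sentence_transformers import SentenceTransformer, util

# any bi-encoder from the benchmark can be plugged in here
bi_encoder = SentenceTransformer('sentence-transformers/LaBSE')

emb = bi_encoder.encode(['Сегодня на улице хорошая погода',
                         'Отличная погодка сегодня выдалась'],
                        normalize_embeddings=True)

# cosine similarity of the pair, then a hard threshold for the paraphrase decision
similarity = util.cos_sim(emb[0], emb[1]).item()
print(similarity, similarity >= 0.5)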

If you want a cross-encoder, maybe you can try this model (trained by me): https://huggingface.co/s-nlp/ruRoberta-large-paraphrase-v1

Thanks for the reply. Just for scientific understanding (I will be happy even with "yes"/"no" answers):

  • Do cross-encoders exceed the maximum achievable results of "embeddings + cosine similarities" approaches on the paraphrase detection task?
  • Are there any methods for scaling (precomputing) cross-encoders? O(n^2) doesn't look like the best option for many tasks.
  • In the encodechka table, is a model's quality on the paraphrase detection task equivalent to its quality on the STS task?

It's strange, but 's-nlp/ruRoberta-large-paraphrase-v1' showed the lowest accuracy among the compared models: 0.59 (0.66 after threshold tuning):

from datasets import load_dataset

# load the test split and inspect one example
HF_DATASET = "merionum/ru_paraphraser"
records = list(load_dataset(HF_DATASET, split="test"))
for item in records:
    print(item['text_1'], item['text_2'], item['class'])
    break
    
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = 's-nlp/ruRoberta-large-paraphrase-v1'
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def compare_texts(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.inference_mode():
        proba = torch.softmax(model(**batch).logits, -1).cpu().numpy()
    return proba[0] # p(non-paraphrase), p(paraphrase)

print(compare_texts('Сегодня на улице хорошая погода', 'Сегодня на улице отвратительная погода'))
# [0.9936753 0.0063247]
print(compare_texts('Сегодня на улице хорошая погода', 'Отличная погодка сегодня выдалась'))
# [0.00542064 0.99457943]

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from tqdm.auto import tqdm

y_true = []
y_pred = []
y_probas = []
for item in tqdm(records):
    # treat class -1 as non-paraphrase (0), classes 0 and 1 as paraphrase (1)
    y_true.append(0 if int(item['class']) < 0 else 1)

    paraphrase_proba = compare_texts(item['text_1'], item['text_2'])[1]
    y_pred.append(1 if paraphrase_proba >= 0.5 else 0)
    y_probas.append(paraphrase_proba)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_probas))
print(classification_report(y_true, y_pred))
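
One way such a tuned threshold can be found is a simple grid search over the collected probabilities (a sketch reusing y_true, y_probas and accuracy_score from the script above; note that tuning on the test set itself gives an optimistic estimate):

import numpy as np

# try a grid of thresholds and keep the one with the highest accuracy
thresholds = np.linspace(0.05, 0.95, 91)
accuracies = [accuracy_score(y_true, [int(p >= t) for p in y_probas]) for t in thresholds]
best_t = thresholds[int(np.argmax(accuracies))]
print("Best threshold:", best_t, "best accuracy:", max(accuracies))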

[screenshot: accuracy, ROC AUC and classification report for s-nlp/ruRoberta-large-paraphrase-v1]

Replacing model_name with 'cointegrated/rubert-base-cased-dp-paraphrase-detection' in the code above gives an accuracy of 0.85:

[screenshot: accuracy, ROC AUC and classification report for cointegrated/rubert-base-cased-dp-paraphrase-detection]

fikavec changed discussion status to closed

Do cross-encoders exceed the maximum achievable results of "embeddings + cosine similarities" approaches on the paraphrase detection task?

I don't know; probably "the maximum achievable results" depend on the difficulty of the paraphrases that you want to detect.

Are there any methods for scaling (precomputing) cross-encoders? O(n^2) doesn't look like the best option for many tasks.

Not that I'm aware of. If you need to compare many texts to many texts, you'll have to use bi-encoders (or poly-encoders, which are a hybrid).
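
A common practical workaround is retrieve-and-rerank: a cheap bi-encoder prunes the candidate pairs, and the cross-encoder scores only the top-k survivors, so the quadratic part runs on embeddings rather than on the heavy model. A rough sketch (the bi-encoder name and top_k are placeholders; compare_texts is the cross-encoder function from the script above):

from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer('sentence-transformers/LaBSE')  # placeholder bi-encoder

corpus = ['Сегодня хорошая погода', 'Отличная погодка сегодня', 'Курс доллара снова вырос']
queries = ['Сегодня на улице хорошая погода']

corpus_emb = bi_encoder.encode(corpus, normalize_embeddings=True)
query_emb = bi_encoder.encode(queries, normalize_embeddings=True)

# cheap stage: nearest neighbours by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)

# expensive stage: the cross-encoder scores only the retrieved pairs
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        print(query, '<->', corpus[hit['corpus_id']],
              compare_texts(query, corpus[hit['corpus_id']])[1])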

In the encodechka table, is a model's quality on the paraphrase detection task equivalent to its quality on the STS task?

No; performance in these two tasks is correlated, but not identical.
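
If it helps to make that concrete, the relationship can be checked directly on the benchmark results; a sketch assuming the leaderboard table has been loaded into a pandas DataFrame with the original task column names (the file name here is hypothetical):

import pandas as pd

# hypothetical export of the encodechka results with per-task score columns
df = pd.read_csv('encodechka_results.csv')
print(df[['ParaphraserTask', 'STSBTask']].corr(method='spearman'))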

I appreciate your reply. There are a few thoughts I need to report to you:

# mapping the encodechka task columns to the short names used in the leaderboard
tmp = tmp.rename({'FactRuTask': 'NE1',
                  'InappropriatenessTask': 'IA',
                  'IntentsTask': 'IC',
                  'IntentsXTask': 'ICX',
                  'ParaphraserTask': 'PI',
                  'RudrTask': 'NE2',
                  'STSBTask': 'STS',
                  'SentimentTask': 'SA',
                  'ToxicityTask': 'TI',
                  'XnliTask': 'NLI',
                  'cpu_speed': 'CPU',
                  'disk_size': 'size',
                  'gpu_speed': 'GPU',
                  'mean_s': 'Mean S',
                  'mean_sw': 'Mean S+W'}, axis=1)
  • To run the encodechka colab (v2023) in Jupyter under Windows, I had to make a small fix in tasks.py:
# every occurrence of the open() call
with open(find_file(filename), 'r') as f:
# was replaced with an explicit encoding
with open(find_file(filename), 'r', encoding='utf-8') as f:
# otherwise a UnicodeDecodeError was raised
  • Maybe add a link to MTEB (https://huggingface.co/spaces/mteb/leaderboard) to the encodechka README.md (in the "similar projects" paragraph) as an additional source for comparing models on English and other languages.
  • Maybe add two fields to the model descriptions in the encodechka leaderboard, after dim: a "Multilang" field (is the model multilingual, and perhaps how many languages it supports) and a "Multilang alignment" field (are the embeddings cross-lingually aligned, i.e. language-agnostic, like LASER or LaBSE)?
  • Some suggestions for the 'intfloat/multilingual-e5...' model in the encodechka:
    • the 'intfloat/multilingual-e5...' model card says that every input text should start with "query: " or "passage: ", even for non-English texts, but I didn't find this (or the strings "query: " / "passage: ") anywhere in the encodechka source code. If I really haven't overlooked it, then in my e5 tests this prefix really matters, up to +10-15% quality (see the sketch after this list).
    • The small and base multilingual-e5 models are also interesting (maybe add them to the benchmark): the quality does not drop much on many tasks (and still surpasses LaBSE), while the speed, especially on CPU, differs significantly (on CPU they can be twice as fast as LaBSE):
      [screenshot: comparison of multilingual-e5 small/base/large results]
  • P.S. 'cointegrated/rubert-base-cased-dp-paraphrase-detection' still has the best accuracy on my paraphrase test: 0.85 in 37.2 s; multilingual-e5-large reaches accuracy 0.82 (in 6+ minutes), while e5-small reaches 0.82 in 48.7 s.
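
A sketch of the prefixing mentioned above (the model size is illustrative; using "query: " on both sides of a symmetric pair follows the e5 model card's wording quoted earlier, and the sentences are just the earlier examples):

from sentence_transformers import SentenceTransformer, util

e5 = SentenceTransformer('intfloat/multilingual-e5-small')

# e5 expects a "query: " or "passage: " prefix on every input,
# even for non-English texts; without it the quality drops noticeably
emb = e5.encode(['query: Сегодня на улице хорошая погода',
                 'query: Отличная погодка сегодня выдалась'],
                normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item())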
