Is it possible to use this model in a vector-similarity-based approach (cosine) to detect paraphrases?

by fikavec - opened

On "merionum/ru_paraphraser" (test samples) paraphrase dataset I got the following results:

  • This model accuracy: 0.8497920997920998
  • sentence-transformers/LaBSE accuracy: 0.7785862785862786
  • sentence-transformers/paraphrase-multilingual-mpnet-base-v2 accuracy: 0.7791060291060291
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 accuracy: 0.7567567567567568
  • sentence-transformers/distiluse-base-multilingual-cased-v1 accuracy: 0.752079002079002

Maybe some other models should be tested?

Is it possible to use this model in a vector-similarity-based approach (cosine) to detect paraphrases?

No, this model is not intended to produce embeddings that can be plugged into a cosine-similarity computation. Instead, you should use it as a cross-encoder (the example code snippet is in the model card).

Maybe some other models should be tested?

If you want an "embeddings + cosine similarities" pipeline, I can recommend looking at models from this benchmark: https://github.com/avidale/encodechka
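
For reference, a minimal sketch of such an "embeddings + cosine" pipeline with sentence-transformers (the model name and the 0.5 threshold below are placeholders, not recommendations; the decision threshold should be tuned on labelled data):

from sentence_transformers import SentenceTransformer, util

# any bi-encoder from the benchmark can be plugged in here
bi_encoder = SentenceTransformer('sentence-transformers/LaBSE')

emb = bi_encoder.encode(['Сегодня на улице хорошая погода',
                         'Отличная погодка сегодня выдалась'],
                        normalize_embeddings=True)

# cosine similarity of the pair, then a hard threshold for the paraphrase decision
similarity = util.cos_sim(emb[0], emb[1]).item()
print(similarity, similarity >= 0.5)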

If you want a cross-encoder, maybe you can try this model (trained by me): https://huggingface.co/s-nlp/ruRoberta-large-paraphrase-v1

Thanks for the reply. Just for scientific understanding (I will be happy even with "yes"/"no" answers):

  • Do cross-encoders exceed the maximum achievable results of "embeddings + cosine similarities" approaches on the paraphrase detection task?
  • Are there any methods for scaling (precomputing) cross-encoders? O(n^2) doesn't look like the best option for many tasks.
  • In the encodechka table, is a model's quality on the paraphrase detection task equivalent to its quality on the STS task?

It's strange, but 's-nlp/ruRoberta-large-paraphrase-v1' showed the lowest accuracy among the compared models: 0.59 (0.66 after threshold tuning):

from datasets import load_dataset

# load the test split and inspect one example
HF_DATASET = "merionum/ru_paraphraser"
records = list(load_dataset(HF_DATASET, split="test"))
for item in records:
    print(item['text_1'], item['text_2'], item['class'])
    break
    
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = 's-nlp/ruRoberta-large-paraphrase-v1'
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def compare_texts(text1, text2):
    batch = tokenizer(text1, text2, return_tensors='pt').to(model.device)
    with torch.inference_mode():
        proba = torch.softmax(model(**batch).logits, -1).cpu().numpy()
    return proba[0] # p(non-paraphrase), p(paraphrase)

print(compare_texts('Сегодня на улице хорошая погода', 'Сегодня на улице отвратительная погода'))
# [0.9936753 0.0063247]
print(compare_texts('Сегодня на улице хорошая погода', 'Отличная погодка сегодня выдалась'))
# [0.00542064 0.99457943]

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from tqdm.auto import tqdm

y_true = []
y_pred = []
y_probas = []
for item in tqdm(records):
    # treat class -1 as non-paraphrase (0), classes 0 and 1 as paraphrase (1)
    y_true.append(0 if int(item['class']) < 0 else 1)

    paraphrase_proba = compare_texts(item['text_1'], item['text_2'])[1]
    y_pred.append(1 if paraphrase_proba >= 0.5 else 0)
    y_probas.append(paraphrase_proba)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_probas))
print(classification_report(y_true, y_pred))
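
One way such a tuned threshold can be found is a simple grid search over the collected probabilities (a sketch reusing y_true, y_probas and accuracy_score from the script above; note that tuning on the test set itself gives an optimistic estimate):

import numpy as np

# try a grid of thresholds and keep the one with the highest accuracy
thresholds = np.linspace(0.05, 0.95, 91)
accuracies = [accuracy_score(y_true, [int(p >= t) for p in y_probas]) for t in thresholds]
best_t = thresholds[int(np.argmax(accuracies))]
print("Best threshold:", best_t, "best accuracy:", max(accuracies))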

[screenshot: accuracy, ROC AUC and classification report for s-nlp/ruRoberta-large-paraphrase-v1]

Replacing model_name with 'cointegrated/rubert-base-cased-dp-paraphrase-detection' in the code above gives an accuracy of 0.85:

[screenshot: accuracy, ROC AUC and classification report for cointegrated/rubert-base-cased-dp-paraphrase-detection]

fikavec changed discussion status to closed

Do cross-encoders exceed the maximum achievable results of "embeddings + cosine similarities" approaches on the paraphrase detection task?

I don't know; probably "the maximum achievable results" depend on the difficulty of the paraphrases that you want to detect.

Are there any methods for scaling (precomputing) cross-encoders? O(n^2) doesn't look like the best option for many tasks.

Not that I'm aware of. If you need to compare many texts to many texts, you'll have to use bi-encoders (or poly-encoders, which are a hybrid).
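
A common practical workaround is retrieve-and-rerank: a cheap bi-encoder prunes the candidate pairs, and the cross-encoder scores only the top-k survivors, so the quadratic part runs on embeddings rather than on the heavy model. A rough sketch (the bi-encoder name and top_k are placeholders; compare_texts is the cross-encoder function from the script above):

from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer('sentence-transformers/LaBSE')  # placeholder bi-encoder

corpus = ['Сегодня хорошая погода', 'Отличная погодка сегодня', 'Курс доллара снова вырос']
queries = ['Сегодня на улице хорошая погода']

corpus_emb = bi_encoder.encode(corpus, normalize_embeddings=True)
query_emb = bi_encoder.encode(queries, normalize_embeddings=True)

# cheap stage: nearest neighbours by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)

# expensive stage: the cross-encoder scores only the retrieved pairs
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        print(query, '<->', corpus[hit['corpus_id']],
              compare_texts(query, corpus[hit['corpus_id']])[1])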

In the encodechka table, is a model's quality on the paraphrase detection task equivalent to its quality on the STS task?

No; performance in these two tasks is correlated, but not identical.
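
If it helps to make that concrete, the relationship can be checked directly on the benchmark results; a sketch assuming the leaderboard table has been loaded into a pandas DataFrame with the original task column names (the file name here is hypothetical):

import pandas as pd

# hypothetical export of the encodechka results with per-task score columns
df = pd.read_csv('encodechka_results.csv')
print(df[['ParaphraserTask', 'STSBTask']].corr(method='spearman'))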

I appreciate your reply. There are a few thoughts I need to report to you:

# mapping the encodechka task columns to the short names used in the leaderboard
tmp = tmp.rename({'FactRuTask': 'NE1',
                  'InappropriatenessTask': 'IA',
                  'IntentsTask': 'IC',
                  'IntentsXTask': 'ICX',
                  'ParaphraserTask': 'PI',
                  'RudrTask': 'NE2',
                  'STSBTask': 'STS',
                  'SentimentTask': 'SA',
                  'ToxicityTask': 'TI',
                  'XnliTask': 'NLI',
                  'cpu_speed': 'CPU',
                  'disk_size': 'size',
                  'gpu_speed': 'GPU',
                  'mean_s': 'Mean S',
                  'mean_sw': 'Mean S+W'}, axis=1)
  • To run the encodechka colab (v2023) in Jupyter under Windows, I had to make a small fix in tasks.py:
# every occurrence of the open() call
with open(find_file(filename), 'r') as f:
# was replaced with an explicit encoding
with open(find_file(filename), 'r', encoding='utf-8') as f:
# otherwise a UnicodeDecodeError was raised
  • Maybe add a link to MTEB (https://huggingface.co/spaces/mteb/leaderboard) to the encodechka README.md (in the "similar projects" paragraph) as an additional source for comparing models on English and other languages.
  • Maybe add two fields to the model descriptions in the encodechka leaderboard, after dim: a "Multilang" field (is the model multilingual, and perhaps how many languages it supports) and a "Multilang alignment" field (are the embeddings cross-lingually aligned, i.e. language-agnostic, like LASER or LaBSE)?
  • Some suggestions for the 'intfloat/multilingual-e5...' model in the encodechka:
    • the 'intfloat/multilingual-e5...' model card says that every input text should start with "query: " or "passage: ", even for non-English texts, but I didn't find this (or the strings "query: " / "passage: ") anywhere in the encodechka source code. If I really haven't overlooked it, then in my e5 tests this prefix really matters, up to +10-15% quality (see the sketch after this list).
    • The small and base multilingual-e5 models are also interesting (maybe add them to the benchmark): the quality does not drop much on many tasks (and still surpasses LaBSE), while the speed, especially on CPU, differs significantly (on CPU they can be twice as fast as LaBSE):
      [screenshot: comparison of multilingual-e5 small/base/large results]
  • P.S. 'cointegrated/rubert-base-cased-dp-paraphrase-detection' still has the best accuracy on my paraphrase test: 0.85 in 37.2 s; multilingual-e5-large reaches accuracy 0.82 (in 6+ minutes), while e5-small reaches 0.82 in 48.7 s.
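
A sketch of the prefixing mentioned above (the model size is illustrative; using "query: " on both sides of a symmetric pair follows the e5 model card's wording quoted earlier, and the sentences are just the earlier examples):

from sentence_transformers import SentenceTransformer, util

e5 = SentenceTransformer('intfloat/multilingual-e5-small')

# e5 expects a "query: " or "passage: " prefix on every input,
# even for non-English texts; without it the quality drops noticeably
emb = e5.encode(['query: Сегодня на улице хорошая погода',
                 'query: Отличная погодка сегодня выдалась'],
                normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]).item())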
