---
tags: salesken
license: apache-2.0
inference: false
---

We have trained a model to evaluate whether a paraphrase is a semantic variation of the input query or just a surface-level variation. Augmenting data with surface-level variations adds little value to NLP model training. If the approach to paraphrase generation is "overgenerate and rank", it is important to have a robust model for scoring and ranking paraphrases. NLG metrics such as BLEU, BLEURT, GLEU, and METEOR have not proved very effective at scoring paraphrases.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("salesken/paraphrase_diversity_ranker")
model = AutoModelForSequenceClassification.from_pretrained("salesken/paraphrase_diversity_ranker")

input_query = ["tough challenges make you stronger."]
paraphrases = [
    "tough problems make you stronger",
    "tough problems will make you stronger",
    "tough challenges make you stronger",
    "tough challenges will make you a stronger person",
    "tough challenges will make you stronger",
    "tough tasks make you stronger",
    "the tough task makes you stronger",
    "tough stuff makes you stronger",
    "if tough times make you stronger",
    "the tough part makes you stronger",
    "tough issues strengthens you",
    "tough shit makes you stronger",
    "tough tasks force you to be stronger",
    "tough challenge is making you stronger",
    "tough problems make you have more strength"]

# Build (input_query, paraphrase) pairs for the sequence-pair classifier.
para_pairs = list(pd.MultiIndex.from_product([input_query, paraphrases]))

features = tokenizer(para_pairs, padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits

# Index 0 = surface-level variation, index 1 = semantic variation.
label_mapping = ['surface_level_variation', 'semantic_variation']
labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1).tolist()]

# Sort the pairs by the semantic-variation logit, most diverse first.
order = scores[:, 1].sort(descending=True).indices.tolist()
sorted_diverse_paraphrases = np.array(para_pairs)[order].tolist()
print(sorted_diverse_paraphrases)

# to identify the type of paraphrase (surface-level variation or semantic variation)
print("Paraphrase type detection=====", list(zip(para_pairs, labels)))
```

============================================================================

For more robust results, first filter out the paraphrases that are not semantically similar to the input query, using a model trained on NLI or STS tasks, and then apply the ranker.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer, util
import torch
import pandas as pd
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("salesken/paraphrase_diversity_ranker")
model = AutoModelForSequenceClassification.from_pretrained("salesken/paraphrase_diversity_ranker")
embedder = SentenceTransformer('stsb-bert-large')

input_query = ["tough challenges make you stronger."]
paraphrases = [
    "tough problems make you stronger",
    "tough problems will make you stronger",
    "tough challenges make you stronger",
    "tough challenges will make you a stronger person",
    "tough challenges will make you stronger",
    "tough tasks make you stronger",
    "the tough task makes you stronger",
    "tough stuff makes you stronger",
    "tough people make you stronger",
    "if tough times make you stronger",
    "the tough part makes you stronger",
    "tough issues strengthens you",
    "tough shit makes you stronger",
    "tough tasks force you to be stronger",
    "tough challenge is making you stronger",
    "tough problems make you have more strength"]

# Embed the query and the candidates, then score each candidate by cosine similarity.
corpus_embeddings = embedder.encode(paraphrases, convert_to_tensor=True)
query_embedding = embedder.encode(input_query, convert_to_tensor=True)
cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]

# Keep only candidates whose similarity to the query is at least 0.7.
para_set = np.array(paraphrases)
ranked = cos_scores.sort(descending=True)
para = para_set[ranked.indices[ranked.values >= 0.7].cpu().numpy()].tolist()

# Build (input_query, paraphrase) pairs from the surviving candidates.
para_pairs = list(pd.MultiIndex.from_product([input_query, para]))

features = tokenizer(para_pairs, padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores = model(**features).logits

# Index 0 = surface-level variation, index 1 = semantic variation.
label_mapping = ['surface_level_variation', 'semantic_variation']
labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1).tolist()]

# Sort the filtered paraphrases by the semantic-variation logit, most diverse first.
order = scores[:, 1].sort(descending=True).indices.tolist()
sorted_diverse_paraphrases = np.array(para)[order].tolist()
print("Paraphrases sorted by diversity:=======", sorted_diverse_paraphrases)

# to identify the type of paraphrase (surface-level variation or semantic variation)
print("Paraphrase type detection=====", list(zip(para_pairs, labels)))
```
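
The raw logits printed above can be hard to read at a glance. The following is a minimal sketch, not part of the model's API, that turns a list of logit pairs into a ranked list with a label and a softmax probability per paraphrase. The helper name `rank_by_diversity` and the demo logits are made up for illustration; the sketch assumes the same class order as `label_mapping` above (index 0 is surface-level, index 1 is semantic) and sorts by the index-1 logit, as the snippets above do.

```python
import math

# Same class order as label_mapping in the snippets above (assumption).
LABELS = ["surface_level_variation", "semantic_variation"]


def rank_by_diversity(paraphrases, logits):
    """Return (text, label, p_semantic) tuples, most diverse first.

    `logits` is a list of [surface_logit, semantic_logit] pairs,
    one per paraphrase, e.g. `scores.tolist()`.
    """
    rows = []
    for text, (surface, semantic) in zip(paraphrases, logits):
        # Numerically stable two-class softmax for the semantic class.
        peak = max(surface, semantic)
        e_surface = math.exp(surface - peak)
        e_semantic = math.exp(semantic - peak)
        p_semantic = e_semantic / (e_surface + e_semantic)
        # Ties go to index 0, matching argmax over the logits.
        label = LABELS[1] if semantic > surface else LABELS[0]
        rows.append((semantic, text, label, p_semantic))
    # Sort by the semantic-variation logit, highest first.
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(text, label, prob) for _, text, label, prob in rows]


# Hypothetical logits for illustration only, not real model output.
demo = rank_by_diversity(
    ["tough problems make you stronger", "tough tasks force you to be stronger"],
    [[2.0, -1.0], [-0.5, 1.5]],
)
for text, label, prob in demo:
    print(f"{prob:.2f}  {label}  {text}")
```

With the snippets above, this could be called as `rank_by_diversity(para, scores.tolist())` (or with `paraphrases` in the unfiltered case), since the rows of `scores` follow the order of the paraphrases that were tokenized.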