|
--- |
|
tags: salesken |
|
license: apache-2.0 |
|
inference: false |
|
|
|
--- |
|
|
|
We have trained a model to evaluate whether a paraphrase is a semantic variation of the input query or merely a surface-level variation. Augmenting training data with surface-level variations adds little value to NLP model training. If the approach to paraphrase generation is "over-generate and rank", it is important to have a robust model for scoring and ranking paraphrases. NLG metrics such as BLEU, BLEURT, GLEU, and METEOR have not proved very effective at scoring paraphrases.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
import torch |
|
import pandas as pd |
|
import numpy as np |
|
tokenizer = AutoTokenizer.from_pretrained("salesken/paraphrase_diversity_ranker") |
|
model = AutoModelForSequenceClassification.from_pretrained("salesken/paraphrase_diversity_ranker") |
|
|
|
input_query = ["tough challenges make you stronger."] |
|
paraphrases = [ |
|
"tough problems make you stronger", |
|
"tough problems will make you stronger", |
|
"tough challenges make you stronger", |
|
"tough challenges will make you a stronger person", |
|
"tough challenges will make you stronger", |
|
"tough tasks make you stronger", |
|
"the tough task makes you stronger", |
|
"tough stuff makes you stronger", |
|
"if tough times make you stronger", |
|
"the tough part makes you stronger", |
|
"tough issues strengthens you", |
|
"tough shit makes you stronger", |
|
"tough tasks force you to be stronger", |
|
"tough challenge is making you stronger", |
|
"tough problems make you have more strength"] |
|
# Build (input_query, paraphrase) pairs for sequence-pair classification
para_pairs = list(pd.MultiIndex.from_product([input_query, paraphrases]))
|
|
|
|
|
features = tokenizer(para_pairs, padding=True, truncation=True, return_tensors="pt") |
|
model.eval() |
|
with torch.no_grad(): |
|
    scores = model(**features).logits
|
label_mapping = ['surface_level_variation', 'semantic_variation'] |
|
labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1)] |
|
|
|
# Sort the pairs by the 'semantic_variation' logit, most diverse first
sorted_diverse_paraphrases = np.array(para_pairs)[scores[:, 1].sort(descending=True).indices].tolist()
|
print(sorted_diverse_paraphrases) |
|
|
|
|
|
# to identify the type of paraphrase (surface-level variation or semantic variation) |
|
print("Paraphrase type detection=====", list(zip(para_pairs, labels))) |
|
``` |
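
The raw logits above are enough for ranking, but if you prefer scores bounded between 0 and 1, you can apply a softmax over the two classes and use the probability of the `semantic_variation` class as the diversity score. This is a minimal sketch reusing the `scores` tensor and `para_pairs` list computed above; it is not part of the original example.

```python
# Minimal sketch: convert the raw logits computed above into class probabilities
# so each (query, paraphrase) pair gets a 0-1 diversity score.
probs = torch.softmax(scores, dim=1)   # shape: (num_pairs, 2)
semantic_scores = probs[:, 1]          # probability of 'semantic_variation'

for (query, paraphrase), p in zip(para_pairs, semantic_scores.tolist()):
    print(f"{paraphrase!r}: semantic_variation probability = {p:.3f}")
```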
|
|
|
============================================================================ |
|
|
|
For more robust results, first filter out the paraphrases that are not semantically

similar to the input query, using a model trained on NLI and STS tasks, and then apply the ranker.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
from sentence_transformers import SentenceTransformer, util |
|
import torch |
|
import pandas as pd |
|
import numpy as np |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("salesken/paraphrase_diversity_ranker") |
|
model = AutoModelForSequenceClassification.from_pretrained("salesken/paraphrase_diversity_ranker") |
|
embedder = SentenceTransformer('stsb-bert-large') |
|
|
|
input_query = ["tough challenges make you stronger."] |
|
paraphrases = [ |
|
"tough problems make you stronger", |
|
"tough problems will make you stronger", |
|
"tough challenges make you stronger", |
|
"tough challenges will make you a stronger person", |
|
"tough challenges will make you stronger", |
|
"tough tasks make you stronger", |
|
"the tough task makes you stronger", |
|
"tough stuff makes you stronger", |
|
"tough people make you stronger", |
|
"if tough times make you stronger", |
|
"the tough part makes you stronger", |
|
"tough issues strengthens you", |
|
"tough shit makes you stronger", |
|
"tough tasks force you to be stronger", |
|
"tough challenge is making you stronger", |
|
"tough problems make you have more strength"] |
|
|
|
|
|
corpus_embeddings = embedder.encode(paraphrases, convert_to_tensor=True) |
|
query_embedding = embedder.encode(input_query, convert_to_tensor=True) |
|
cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0] |
|
# Keep only paraphrases with cosine similarity >= 0.7 to the input query
para_set = np.array(paraphrases)

a = cos_scores.sort(descending=True)

para = para_set[a.indices[a.values >= 0.7].cpu()].tolist()
|
|
|
|
|
# Build (input_query, paraphrase) pairs from the filtered candidates
para_pairs = list(pd.MultiIndex.from_product([input_query, para]))
|
|
|
|
features = tokenizer(para_pairs, padding=True, truncation=True, return_tensors="pt") |
|
model.eval() |
|
with torch.no_grad(): |
|
    scores = model(**features).logits
|
label_mapping = ['surface_level_variation', 'semantic_variation'] |
|
labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1)] |
|
|
|
# Sort the filtered paraphrases by the 'semantic_variation' logit, most diverse first
sorted_diverse_paraphrases = np.array(para)[scores[:, 1].sort(descending=True).indices].tolist()
|
print("Paraphrases sorted by diversity:=======",sorted_diverse_paraphrases) |
|
|
|
# to identify the type of paraphrase (surface-level variation or semantic variation) |
|
print("Paraphrase type detection=====", list(zip(para_pairs, labels))) |
|
|
|
``` |
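
If you need to run this filter-then-rank procedure for many queries, it can be wrapped in a small helper. The sketch below assumes the `tokenizer`, `model`, and `embedder` objects loaded above; the function name `rank_diverse_paraphrases` and the `sim_threshold` parameter are illustrative choices, not part of any library.

```python
# Illustrative helper (not from the model card): combines the STS-based filter
# and the diversity ranker shown above. Assumes `tokenizer`, `model`, `embedder`,
# `util`, and `torch` from the example are already in scope.
def rank_diverse_paraphrases(query, candidates, sim_threshold=0.7):
    # Step 1: keep only candidates semantically similar to the query.
    cand_emb = embedder.encode(candidates, convert_to_tensor=True)
    query_emb = embedder.encode([query], convert_to_tensor=True)
    sims = util.pytorch_cos_sim(query_emb, cand_emb)[0]
    kept = [c for c, s in zip(candidates, sims.tolist()) if s >= sim_threshold]
    if not kept:
        return []

    # Step 2: score the surviving (query, paraphrase) pairs with the ranker.
    pairs = [(query, c) for c in kept]
    features = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(**features).logits

    # Step 3: sort by the 'semantic_variation' logit, most diverse first.
    order = logits[:, 1].argsort(descending=True)
    return [kept[i] for i in order.tolist()]


print(rank_diverse_paraphrases(input_query[0], paraphrases))
```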