cointegrated/rubert-tiny2-sentence-compression

This model can be used for sentence compression (also known as extractive sentence summarization).

For each word, it predicts whether the word can be dropped from the sentence without severely affecting its meaning.

The resulting sentences are often ungrammatical, but they can still be useful.

The model is [rubert-tiny2]() fine-tuned on the dataset from the paper
[Sentence compression for Russian: dataset and baselines](https://www.dialog-21.ru/media/5106/kuvshinovat-050.pdf)
(the data can be found [here](https://drive.google.com/drive/folders/1WWqq187pN4aHHbRUwlhaKW4JP1FZ_9zh)).
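
As a minimal sketch of the per-word decision described above, the snippet below keeps only the words whose predicted drop probability falls below a cutoff. The helper name `select_words` and the scores are hypothetical and made up for illustration; they are not outputs of the model, and the model is not loaded here.

```python
def select_words(words, drop_proba, threshold=0.5, keep_ratio=None):
    """Keep the words whose drop probability is below the cutoff."""
    if len(words) != len(drop_proba):
        raise ValueError("words and drop_proba must have the same length")
    if keep_ratio is not None and words:
        # Derive the cutoff from the sorted scores so that roughly
        # keep_ratio of the words fall below it (index clamped to stay valid).
        ranked = sorted(drop_proba)
        threshold = ranked[min(int(len(ranked) * keep_ratio), len(ranked) - 1)]
    return [w for w, p in zip(words, drop_proba) if p < threshold]


# Invented scores for illustration only (assumption, not model output).
words = ["Кроме", "того", "можно", "взять", "идею"]
drop_proba = [0.9, 0.8, 0.2, 0.1, 0.3]

print(select_words(words, drop_proba))                  # ['можно', 'взять', 'идею']
print(select_words(words, drop_proba, threshold=0.3))   # ['можно', 'взять']
print(select_words(words, drop_proba, keep_ratio=0.5))  # ['можно', 'взять']
```

A lower `threshold` removes more words. `keep_ratio` replaces the fixed cutoff with one derived from the ranked scores.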

Example usage:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = 'cointegrated/rubert-tiny2-sentence-compression'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def compress(text, threshold=0.5, keep_ratio=None):
    """ Compress a sentence by removing the least important words.
    Parameters:
        threshold: cutoff for predicted probabilities of word removal
        keep_ratio: proportion of words to preserve
    By default, threshold of 0.5 is used.
    """
    with torch.inference_mode():
        tok = tokenizer(text, return_tensors='pt').to(model.device)
        # probability of the "drop" label (index 1) for every token
        proba = torch.softmax(model(**tok).logits, -1).cpu().numpy()[0, :, 1]
    if keep_ratio is not None:
        # choose the cutoff so that roughly keep_ratio of the tokens score below it;
        # the index is clamped so that keep_ratio=1.0 does not raise an IndexError
        threshold = sorted(proba)[min(int(len(proba) * keep_ratio), len(proba) - 1)]
    kept_toks = []
    keep = False
    prev_word_id = None
    for word_id, score, token in zip(tok.word_ids(), proba, tok.input_ids[0]):
        if word_id is None:
            # special tokens are kept here; decode() strips them afterwards
            keep = True
        elif word_id != prev_word_id:
            # decide once per word, using the score of its first sub-token
            keep = score < threshold
        if keep:
            kept_toks.append(token)
        prev_word_id = word_id
    return tokenizer.decode(kept_toks, skip_special_tokens=True)


text = 'Кроме того, можно взять идею, рожденную из сердца, и выразить ее в рамках одной '\
    'из этих структур, без потери искренности идеи и смысла песни.'

print(compress(text))
print(compress(text, threshold=0.3))
print(compress(text, threshold=0.1))
# можно взять идею, рожденную из сердца, и выразить ее в рамках одной из этих структур.
# можно взять идею, рожденную из сердца выразить ее в рамках одной из этих структур.
# можно взять идею рожденную выразить структур.

print(compress(text, keep_ratio=0.5))
# можно взять идею, рожденную из сердца выразить ее в рамках структур.
```
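
The loop in `compress` makes one decision per word and applies it to every sub-token of that word. The sketch below isolates that logic in plain Python so it can be followed without loading the model. The helper name `keep_token_mask`, the word alignment, and the scores are hypothetical and chosen for illustration only.

```python
def keep_token_mask(word_ids, drop_proba, threshold=0.5):
    """Return one keep/drop flag per token, deciding once per word."""
    mask = []
    keep = False
    prev_word_id = None
    for word_id, score in zip(word_ids, drop_proba):
        if word_id is None:
            # Special tokens carry no word id and are always kept.
            keep = True
        elif word_id != prev_word_id:
            # The first sub-token of a word decides for the whole word.
            keep = score < threshold
        mask.append(keep)
        prev_word_id = word_id
    return mask


# Assumed alignment: special token, word 0 (two pieces), word 1,
# word 2 (two pieces), special token. Scores are invented.
word_ids = [None, 0, 0, 1, 2, 2, None]
drop_proba = [0.0, 0.2, 0.9, 0.7, 0.6, 0.1, 0.0]

mask = keep_token_mask(word_ids, drop_proba)
print(mask)  # [True, True, True, False, False, False, True]
```

The second piece of word 0 is kept despite its high score, and the second piece of word 2 is dropped despite its low score. Only the first sub-token's score is consulted, so a word is never split.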