cointegrated/rubert-tiny2-sentence-compression


cointegrated committed on May 19, 2022

Commit e8a4782 • 1 Parent(s): 4226f31

Create README.md

Files changed (1)

README.md +58 -0

README.md ADDED

	@@ -0,0 +1,58 @@

+This model can be used for sentence compression (aka extractive sentence summarization).
+For each word, it predicts whether that word can be dropped from the sentence without severely affecting its meaning.
+The resulting sentences are often ungrammatical, but they can still be useful.
+The model is [rubert-tiny2]() fine-tuned on the dataset from the paper
+[Sentence compression for Russian: dataset and baselines](https://www.dialog-21.ru/media/5106/kuvshinovat-050.pdf).
+Example usage:
+```python
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+model_name = 'cointegrated/rubert-tiny2-sentence-compression'
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+def compress(text, threshold=0.5, keep_ratio=None):
+    """Compress a sentence by removing the least important words.
+    Parameters:
+        threshold: a word is dropped if its predicted removal probability
+            is at or above this cutoff
+        keep_ratio: approximate proportion of tokens to preserve;
+            if given, it overrides threshold
+    By default, a threshold of 0.5 is used.
+    """
+    with torch.inference_mode():
+        tok = tokenizer(text, return_tensors='pt').to(model.device)
+        proba = torch.softmax(model(**tok).logits, -1).cpu().numpy()[0, :, 1]
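+    if keep_ratio is not None:
+        # Clamp the index so that keep_ratio=1.0 does not raise an IndexError
+        ranked = sorted(proba)
+        threshold = ranked[min(int(len(ranked) * keep_ratio), len(ranked) - 1)]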
+    kept_toks = []
+    keep = False
+    prev_word_id = None
+    for word_id, score, token in zip(tok.word_ids(), proba, tok.input_ids[0]):
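+        if word_id is None:
+            # Special tokens have no word id; they are stripped later by decode()
+            keep = True
+        elif word_id != prev_word_id:
+            # First subword of a new word decides for the whole word
+            keep = score < threshold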
+        if keep:
+            kept_toks.append(token)
+        prev_word_id = word_id
+    return tokenizer.decode(kept_toks, skip_special_tokens=True)
+text = 'Кроме того, можно взять идею, рожденную из сердца, и выразить ее в рамках одной '\
+    'из этих структур, без потери искренности идеи и смысла песни.'
+print(compress(text))
+print(compress(text, threshold=0.3))
+print(compress(text, threshold=0.1))
+# можно взять идею, рожденную из сердца, и выразить ее в рамках одной из этих структур.
+# можно взять идею, рожденную из сердца выразить ее в рамках одной из этих структур.
+# можно взять идею рожденную выразить структур.
+print(compress(text, keep_ratio=0.5))
+# можно взять идею, рожденную из сердца выразить ее в рамках структур.
+```
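
A note on the loop in `compress`: the keep/drop decision is made once per word, at its first subword, and the remaining subwords inherit it. The sketch below reproduces that logic on hand-written inputs, so it runs without downloading the model. The helper names `keep_mask` and `threshold_for_ratio` and the sample values are hypothetical and are not part of the README; the second helper also assumes that clamping the index is an acceptable way to handle `keep_ratio=1.0`.

```python
from typing import List, Optional, Sequence


def keep_mask(word_ids: Sequence[Optional[int]],
              drop_proba: Sequence[float],
              threshold: float = 0.5) -> List[bool]:
    """Return one keep/drop flag per token (hypothetical helper)."""
    mask = []
    keep = False
    prev_word_id = None
    for word_id, score in zip(word_ids, drop_proba):
        if word_id is None:
            keep = True  # special tokens carry no word id and are always kept
        elif word_id != prev_word_id:
            keep = score < threshold  # first subword decides for the whole word
        # a continuation subword falls through and inherits its word's decision
        mask.append(keep)
        prev_word_id = word_id
    return mask


def threshold_for_ratio(drop_proba: Sequence[float], keep_ratio: float) -> float:
    """Pick a cutoff from the sorted scores (hypothetical helper)."""
    ranked = sorted(drop_proba)
    # Assumption: clamp the index so keep_ratio=1.0 stays within bounds
    index = min(int(len(ranked) * keep_ratio), len(ranked) - 1)
    return ranked[index]


# Made-up example: [CLS], a two-subword word, a one-subword word,
# another two-subword word, [SEP]
word_ids = [None, 0, 0, 1, 2, 2, None]
drop_proba = [0.0, 0.2, 0.9, 0.7, 0.4, 0.1, 0.0]

mask = keep_mask(word_ids, drop_proba, threshold=0.5)
print(mask)  # [True, True, True, False, True, True, True]
print(threshold_for_ratio(drop_proba, keep_ratio=0.5))  # 0.2
```

The second subword of word 0 has a score of 0.9 but is still kept, because only the first subword's score (0.2) is compared with the threshold. Since the cutoff is taken over all token scores, special tokens and continuation subwords included, `keep_ratio` is best read as an approximate target rather than an exact word count.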