cointegrated committed on
Commit
e8a4782
1 Parent(s): 4226f31

Create README.md

README.md ADDED
This model can be used for sentence compression (also known as extractive sentence summarization).

For each word, it predicts whether the word can be dropped from the sentence without severely affecting its meaning.

The resulting sentences are often ungrammatical, but they can still be useful.
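
As a toy illustration of this per-word decision, the sketch below uses made-up drop probabilities instead of real model output. The `drop_words` helper, the English words, and the numbers are all assumptions for illustration only; they are not part of the model or its API.

```python
def drop_words(words, drop_proba, threshold=0.5):
    """Keep only the words whose (hypothetical) drop probability is below the threshold."""
    return [w for w, p in zip(words, drop_proba) if p < threshold]


# Made-up probabilities for illustration; a real model would predict these.
words = ['Besides', 'you', 'can', 'take', 'an', 'idea']
drop_proba = [0.9, 0.2, 0.1, 0.1, 0.6, 0.05]

print(' '.join(drop_words(words, drop_proba)))
# you can take idea
```

The output drops the article and the discourse marker, which shows why compressed sentences can be ungrammatical while still carrying the main content.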

The model is [rubert-tiny2]() fine-tuned on the dataset from the paper
[Sentence compression for Russian: dataset and baselines](https://www.dialog-21.ru/media/5106/kuvshinovat-050.pdf).

Example usage:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = 'cointegrated/rubert-tiny2-sentence-compression'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def compress(text, threshold=0.5, keep_ratio=None):
    """ Compress a sentence by removing the least important words.
    Parameters:
        threshold: cutoff for predicted probabilities of word removal
        keep_ratio: proportion of words to preserve (overrides threshold if given)
    By default, a threshold of 0.5 is used.
    """
    with torch.inference_mode():
        tok = tokenizer(text, return_tensors='pt').to(model.device)
        # Probability of label 1 for each token, used below as the removal score
        proba = torch.softmax(model(**tok).logits, -1).cpu().numpy()[0, :, 1]
    if keep_ratio is not None:
        # Replace the fixed cutoff with the score at the keep_ratio quantile
        threshold = sorted(proba)[int(len(proba) * keep_ratio)]
    kept_toks = []
    keep = False
    prev_word_id = None
    for word_id, score, token in zip(tok.word_ids(), proba, tok.input_ids[0]):
        if word_id is None:
            # Special tokens are kept here; decode() strips them afterwards
            keep = True
        elif word_id != prev_word_id:
            # The first token of a word decides for the whole word
            keep = score < threshold
        if keep:
            kept_toks.append(token)
        prev_word_id = word_id
    return tokenizer.decode(kept_toks, skip_special_tokens=True)


# English gloss of the input: "Besides, you can take an idea born from the heart
# and express it within one of these structures, without losing the sincerity
# of the idea and the meaning of the song."
text = 'Кроме того, можно взять идею, рожденную из сердца, и выразить ее в рамках одной '\
    'из этих структур, без потери искренности идеи и смысла песни.'

print(compress(text))
print(compress(text, threshold=0.3))
print(compress(text, threshold=0.1))
# можно взять идею, рожденную из сердца, и выразить ее в рамках одной из этих структур.
# можно взять идею, рожденную из сердца выразить ее в рамках одной из этих структур.
# можно взять идею рожденную выразить структур.

print(compress(text, keep_ratio=0.5))
# можно взять идею, рожденную из сердца выразить ее в рамках структур.
```
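
To make the `keep_ratio` option easier to follow, here is a minimal sketch of how a ratio can be turned into a cutoff, using the same indexing expression as `compress` but on made-up scores. The `ratio_threshold` helper and the numbers are hypothetical and exist only for this illustration.

```python
def ratio_threshold(proba, keep_ratio):
    """Pick a cutoff so that roughly `keep_ratio` of the scores fall below it."""
    ranked = sorted(proba)
    # Same indexing as in compress(); keep_ratio=1.0 would index past the end.
    return ranked[int(len(ranked) * keep_ratio)]


# Made-up removal scores, one per token; a real run would take them from the model.
proba = [0.05, 0.9, 0.3, 0.6, 0.1, 0.2, 0.8, 0.4]

threshold = ratio_threshold(proba, keep_ratio=0.5)
kept = [p for p in proba if p < threshold]
print(threshold, len(kept))
# 0.4 4
```

In `compress` the scores are computed per token, special tokens included, while the keep decision is made once per word, so the share of words actually preserved is only approximately `keep_ratio`. With this indexing, `keep_ratio` is assumed to be below 1.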