Commit 4d2aa5a, committed by unikei
1 Parent(s): eaf8fb8

Task + how to run

Files changed (1)
  1. README.md +53 -0
README.md CHANGED
@@ -1,3 +1,56 @@
  ---
  license: bigscience-openrail-m
+ tags:
+ - split and rephrase
  ---
+
+
+ # T5 model for splitting complex sentences into simple sentences in English
+ Split-and-rephrase is the task of splitting a complex input sentence into shorter sentences while preserving its meaning (Narayan et al., 2017).
+
+ For example, the sentence
+ ```
+ Cystic Fibrosis (CF) is an autosomal recessive disorder that affects multiple organs,
+ which is common in the Caucasian population, symptomatically affecting 1 in 2500 newborns in the UK,
+ and more than 80,000 individuals globally.
+ ```
+ could be split into:
+ ```
+ Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs.
+ ```
+ ```
+ Cystic Fibrosis is common in the Caucasian population.
+ ```
+ ```
+ Cystic Fibrosis affects 1 in 2500 newborns in the UK.
+ ```
+ ```
+ Cystic Fibrosis affects more than 80,000 individuals globally.
+ ```
+
+ ## How to use it in your code
+ ```python
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+ # Load the tokenizer and the fine-tuned model from the Hugging Face Hub
+ checkpoint = "unikei/t5-base-split-and-rephrase"
+ tokenizer = T5Tokenizer.from_pretrained(checkpoint)
+ model = T5ForConditionalGeneration.from_pretrained(checkpoint)
+
+ # Adjacent string literals are concatenated, so indentation never leaks into the input
+ complex_sentence = (
+     "Cystic Fibrosis (CF) is an autosomal recessive disorder that "
+     "affects multiple organs, which is common in the Caucasian "
+     "population, symptomatically affecting 1 in 2500 newborns in "
+     "the UK, and more than 80,000 individuals globally."
+ )
+ complex_tokenized = tokenizer(complex_sentence,
+                               padding="max_length",
+                               truncation=True,
+                               max_length=256,
+                               return_tensors="pt")
+
+ # Generate the simplified text with beam search (5 beams, at most 256 tokens)
+ simple_tokenized = model.generate(complex_tokenized["input_ids"],
+                                   attention_mask=complex_tokenized["attention_mask"],
+                                   max_length=256,
+                                   num_beams=5)
+ # batch_decode returns a list with one string per input; print the only entry
+ simple_sentences = tokenizer.batch_decode(simple_tokenized, skip_special_tokens=True)
+ print(simple_sentences[0])
+
+ """
+ Output:
+ Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs. Cystic Fibrosis affects 1 in 2500 newborns in the UK. Cystic Fibrosis affects more than 80,000 individuals globally. Cystic Fibrosis is common in the Caucasian population.
+ """
+ ```
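
Note (not part of the commit): the model returns all simple sentences as one string, so a caller who wants them individually has to split that string. The following is a minimal sketch of one way to do this, assuming sentences end in `.`, `!` or `?` followed by whitespace. The helper name `split_sentences` is hypothetical and does not come from the README or from `transformers`.

```python
import re
from typing import List


def split_sentences(generated: str) -> List[str]:
    """Split generated text at sentence-final punctuation followed by whitespace.

    Hypothetical helper for illustration; a naive rule that will also split
    after abbreviations such as "e.g." or "Dr.".
    """
    parts = re.split(r"(?<=[.!?])\s+", generated.strip())
    return [part for part in parts if part]


# The output string quoted in the README example above
output = (
    "Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs. "
    "Cystic Fibrosis affects 1 in 2500 newborns in the UK. "
    "Cystic Fibrosis affects more than 80,000 individuals globally. "
    "Cystic Fibrosis is common in the Caucasian population."
)
sentences = split_sentences(output)
for sentence in sentences:
    print(sentence)
```

For text with abbreviations or decimal-heavy content, a dedicated sentence segmenter would likely be more reliable than this regular expression.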