bhadresh-savani committed on
Commit
92b41cc
2 Parent(s): 5e5f37d ffb1729

Merge branch 'main' of https://huggingface.co/flax-community/t5-v1_1-base-wikisplit into main

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +53 -0
  3. config.json +1 -0
.gitattributes CHANGED
@@ -14,3 +14,4 @@
  *.pb filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ datasets:
+ - wiki_split
+
+ widget:
+ - text: "Mary likes to play football in her free time whenever she meets with her friends that are very nice people."
+
+ license: mit
+ ---
+ # T5 model for sentence splitting in English
+
+ Sentence splitting is the task of dividing a long sentence into multiple shorter sentences.
+ For example:
+ ```
+ Mary likes to play football in her free time whenever she meets with her friends that are very nice people.
+ ```
+ could be split into
+ ```
+ Mary likes to play football in her free time whenever she meets with her friends.
+ ```
+ ```
+ Her friends are very nice people.
+ ```
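The split above mirrors how a pair looks in the WikiSplit training data. A minimal sketch of recovering the individual sentences from a target string, assuming the corpus joins the simple sentences with a `<::::>` delimiter (the delimiter and the `split_target` helper are assumptions for illustration):

```python
# Hypothetical helper: assumes WikiSplit joins the simple sentences of one
# example into a single target string using the "<::::>" delimiter.
def split_target(target: str) -> list[str]:
    """Split a WikiSplit-style target string into individual sentences."""
    return [part.strip() for part in target.split("<::::>") if part.strip()]

sentences = split_target(
    "Mary likes to play football in her free time whenever she meets with her friends. "
    "<::::> Her friends are very nice people."
)
print(sentences)
```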
+
+ ## How to use it in your code
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("flax-community/t5-v1_1-base-wikisplit")
+ model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-v1_1-base-wikisplit")
+
+ complex_sentence = "This comedy drama is produced by Tidy , the company she co-founded in 2008 with her husband David Peet , who is managing director ."
+ sample_tokenized = tokenizer(complex_sentence, return_tensors="pt")
+
+ answer = model.generate(sample_tokenized["input_ids"], attention_mask=sample_tokenized["attention_mask"], max_length=256, num_beams=5)
+ gene_sentence = tokenizer.decode(answer[0], skip_special_tokens=True)
+ print(gene_sentence)
+
+ """
+ Output:
+ This comedy drama is produced by Tidy. She co-founded Tidy in 2008 with her husband David Peet, who is managing director.
+ """
+ ```
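The model returns the split as one decoded string. A minimal, model-free sketch of separating that string back into individual sentences, using a naive period heuristic (the `to_sentences` helper is an assumption for illustration, not part of the model; it would break on abbreviations such as "Dr."):

```python
# Naive post-processing sketch: split the generated paragraph on ". "
# boundaries and restore the trailing period on each sentence.
def to_sentences(text: str) -> list[str]:
    return [s.strip().rstrip(".") + "." for s in text.split(". ") if s.strip()]

generated = ("This comedy drama is produced by Tidy. She co-founded Tidy in 2008 "
             "with her husband David Peet, who is managing director.")
for sentence in to_sentences(generated):
    print(sentence)
```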
+ ## Datasets
+ [Wiki_Split](https://research.google/tools/datasets/wiki-split/)
+
+ ## Current baseline from the [paper](https://arxiv.org/abs/1907.12461)
+ ![baseline](./baseline.png)
+
+ ## Our Results
+ | Model | Exact | SARI | BLEU |
+ | --- | --- | --- | --- |
+ | t5-base-wikisplit | 17.93 | 67.5438 | 76.9 |
+ | t5-v1_1-base-wikisplit | 16.84 | 66.38 | 76.32 |
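A sketch of how the "Exact" column above could be computed, assuming it is the percentage of predictions that match the reference split exactly after whitespace normalization (the normalization used for the reported numbers is an assumption):

```python
# Exact-match sketch: percentage of predictions identical to their reference
# after collapsing runs of whitespace.
def exact_match(predictions: list[str], references: list[str]) -> float:
    normalize = lambda s: " ".join(s.split())
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

preds = ["Mary likes football. Her friends are nice.",
         "He went home. He slept."]
refs  = ["Mary likes football.  Her friends are nice.",
         "He went home early. He slept."]
print(exact_match(preds, refs))  # only the first pair matches
```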
config.json CHANGED
@@ -6,6 +6,7 @@
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
+ "max_length": 256,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,