zolekode committed on
Commit
b31c62a
1 Parent(s): 1b465c9

t5-small wav2vec2 grammar fixer model and tokenizer

Files changed (1)
  1. README.md +54 -0
README.md ADDED
@@ -0,0 +1,54 @@
# t5-small-wav2vec2-grammar-fixer
After transcribing your audio with Wav2Vec2, you may want a post-processor that turns the raw transcript into cased, punctuated text.

I trained this model on only 42K paragraphs from the SQuAD dataset. Each paragraph had at most 128 tokens (separated by whitespace).

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "flexudy/t5-small-wav2vec2-grammar-fixer"

# Load the tokenizer and the fine-tuned T5 model from the Hub.
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Raw Wav2Vec2 output: uppercase, no punctuation.
sent = """GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS"""

# Wrap the transcript in the "fix: { ... }" prompt used in this example.
input_text = "fix: { " + sent + " } </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=256, truncation=True, add_special_tokens=True)

# Beam search with 4 beams; both penalties stay at 1.0 (the library defaults).
outputs = model.generate(
    input_ids=input_ids,
    max_length=256,
    num_beams=4,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True
)

sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

print(sentence)
```

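Because the training paragraphs had at most 128 whitespace-separated tokens, it may help to split longer transcripts into chunks of that size before wrapping each one in the prompt. The sketch below is only an illustration: `chunk_transcript` and its `max_tokens` default are hypothetical names, not part of this repository, and cutting at fixed positions can split a sentence in the middle.

```python
from typing import List


def chunk_transcript(transcript: str, max_tokens: int = 128) -> List[str]:
    """Split a transcript into chunks of at most `max_tokens` whitespace-separated tokens.

    Hypothetical helper: the 128 default mirrors the training-paragraph limit
    stated above; it is an assumption, not a tested optimum.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = transcript.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# Each chunk can then be wrapped in the same prompt as in the example above.
chunks = chunk_transcript("GOING ALONG SLUSHY COUNTRY ROADS", max_tokens=3)
prompts = ["fix: { " + chunk + " } </s>" for chunk in chunks]
print(prompts)
```

The corrected chunks can be joined with spaces afterwards. Splitting on pauses or other boundaries from the audio, if they are available, would likely give cleaner sentence breaks than a fixed token count.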
INPUT 1:
```
BEFORE HE HAD TIME TO ANSWER A MUCH ENCUMBERED VERA BURST INTO THE ROOM WITH THE QUESTION I SAY CAN I LEAVE THESE HERE IN TWO THOUSAND AND TWO THESE WERE A SMALL BLACK PIG AND A LUSTY SPECIMEN OF BLACK RED GAME COCK
```
OUTPUT 1:
```
Before he had time to answer a much-enumbered era burst into the room with the question, I say, "Can I leave these here?" In 2002, these were a small black pig and a dusty specimen of black red game cock.
```

INPUT 2:
```
GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS
```

OUTPUT 2:
```
Going along Slushy Country Roads and speaking to damp audiences in Droughty School rooms day after day for a fortnight, he'll have to put in an appearance at some place of worship on Sunday morning and he can come to us immediately afterwards.
```
I strongly recommend improving performance through further fine-tuning or by training on more examples.
- Possible quick rule-based improvement: align the transcribed version with the generated version. If two aligned words (compared case-insensitively) differ by more than some threshold under a similarity metric (e.g. Levenshtein distance), keep the transcribed word.
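
The rule-based idea above could be sketched roughly as follows. This is one possible reading, not a tested part of the model: `restore_divergent_words`, the `max_ratio` threshold and its default value are all hypothetical, alignment is done with the standard library's `difflib.SequenceMatcher`, and the distance is a plain Levenshtein edit distance normalised by the longer word. A restored word takes the punctuation and initial capitalisation of the generated token, which is an extra assumption.

```python
import re
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def _split(token: str):
    """Split a token into (leading punctuation, core word, trailing punctuation)."""
    return re.match(r"^(\W*)(.*?)(\W*)$", token).groups()


def restore_divergent_words(transcript: str, generated: str, max_ratio: float = 0.34) -> str:
    """Keep the transcribed word wherever the generated word drifts too far from it.

    Hypothetical helper: `max_ratio` is an arbitrary default, not a tuned value.
    """
    src, gen = transcript.split(), generated.split()
    src_keys = [_split(w)[1].lower() for w in src]
    gen_keys = [_split(w)[1].lower() for w in gen]
    matcher = SequenceMatcher(a=src_keys, b=gen_keys, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only one-to-one substitutions are compared; every other case
        # (equal, insert, delete, uneven replace) keeps the generated tokens.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            out.extend(gen[j1:j2])
            continue
        for k in range(i2 - i1):
            s_key, g_key = src_keys[i1 + k], gen_keys[j1 + k]
            ratio = levenshtein(s_key, g_key) / (max(len(s_key), len(g_key)) or 1)
            if ratio > max_ratio:
                prefix, core, suffix = _split(gen[j1 + k])
                word = s_key.capitalize() if core[:1].isupper() else s_key
                out.append(prefix + word + suffix)
            else:
                out.append(gen[j1 + k])
    return " ".join(out)


print(restore_divergent_words(
    "HE CAN COME TO US IMMEDIATELY AFTERWARDS",
    "He can go to us immediately afterwards.",
))
```

Small drifts such as `LUSTY` → `dusty` in OUTPUT 1 are only one edit apart, so they fall below a loose threshold and are reverted only when `max_ratio` is set close to 0. Substitutions that change the number of tokens, such as `MUCH ENCUMBERED VERA` → `much-enumbered era`, are left untouched by this sketch.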