cfinley committed on
Commit
7abcf18
1 Parent(s): 686253f

Update README.md

Files changed (1)
  1. README.md +14 -4
README.md CHANGED
@@ -24,7 +24,7 @@ should probably proofread and complete it, then remove this comment. -->
24
 
25
  # punct_restore_fr
26
 
27
- This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on an unknown dataset.
28
  It achieves the following results on the evaluation set:
29
  - Loss: 0.0301
30
  - Precision: 0.9601
@@ -34,15 +34,25 @@ It achieves the following results on the evaluation set:
34
 
35
  ## Model description
36
 
37
- More information needed
38
 
39
  ## Intended uses & limitations
40
 
41
- More information needed
42
 
43
  ## Training and evaluation data
44
 
45
- More information needed
46
 
47
  ## Training procedure
48
 
 
24
 
25
  # punct_restore_fr
26
 
27
+ This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on a raw OpenSubtitles dataset.
28
  It achieves the following results on the evaluation set:
29
  - Loss: 0.0301
30
  - Precision: 0.9601
 
34
 
35
  ## Model description
36
 
37
+ Classifies each token as either the beginning of a sentence (B-SENT) or not (O).
38
 
39
  ## Intended uses & limitations
40
 
41
+ This model is intended to support punctuation restoration for auto-generated French YouTube subtitles.
42
 
43
  ## Training and evaluation data
44
 
45
+ 1 million French OpenSubtitles sentences, split 80%/10%/10% into training/validation/test sets.
46
+
47
+ The sentences:
48
+
49
+ - were lower-cased
50
+ - had end punctuation (.?!) removed
51
+ - were of length between 7 and 70 words
52
+ - had the first word of the sentence tagged B-SENT
53
+ - had all other words tagged O
54
+
55
+ Token/tag pairs were batched together in groups of 64. This exposes the model to B-SENT and O tags at a variety of positions and keeps a training example from being a single sentence, which would otherwise leave the first word, and only the first word, of every sequence labeled B-SENT.
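The preprocessing described in this section could be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function names `tag_sentence` and `build_examples` are hypothetical, the 7 to 70 word bounds are assumed to be inclusive and applied after punctuation removal, only trailing `.?!` is stripped, and "groups of 64" is read here as 64 token/tag pairs per training example (the README does not say whether the unit is pairs or sentences).

```python
import re

# Trailing sentence-final punctuation (., ?, !), possibly repeated.
END_PUNCT = re.compile(r"[.?!]+$")


def tag_sentence(sentence, min_words=7, max_words=70):
    """Lower-case a sentence, strip end punctuation, and tag its words.

    Returns a list of (word, tag) pairs, or None when the sentence falls
    outside the assumed inclusive length range.
    """
    words = END_PUNCT.sub("", sentence.strip().lower()).split()
    if not (min_words <= len(words) <= max_words):
        return None
    # Only the first word of the sentence starts a sentence.
    return [(w, "B-SENT" if i == 0 else "O") for i, w in enumerate(words)]


def build_examples(sentences, group_size=64):
    """Concatenate tagged sentences, then cut the stream into fixed-size groups.

    Assumption: each group holds `group_size` token/tag pairs, so B-SENT can
    land at any position inside an example rather than only at index 0.
    """
    pairs = []
    for sentence in sentences:
        tagged = tag_sentence(sentence)
        if tagged is not None:
            pairs.extend(tagged)
    return [pairs[i:i + group_size] for i in range(0, len(pairs), group_size)]


if __name__ == "__main__":
    demo = ["Bonjour, je ne sais pas ce que tu veux dire."] * 13
    for example in build_examples(demo)[:1]:
        print(example[:12])
```

Because sentences are concatenated before grouping, a group boundary can fall mid-sentence, which is what spreads B-SENT across positions within an example.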
56
 
57
  ## Training procedure
58