oliverguhr committed on
Commit
6bf504c
1 Parent(s): 381031e

added new readme

  1. README.md +64 -47
README.md CHANGED
@@ -1,47 +1,64 @@
- ---
- language:
- - nl
- tags:
- - punctuation prediction
- - punctuation
- datasets: wmt/europarl
- license: mit
- widget:
- - text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
-   example_title: "Euro Parl"
- metrics:
- - f1
- ---
-
- ## Performance
-
- ```
-               precision    recall  f1-score   support
-
-            0   0.992584  0.994595  0.993588   9627605
-            .   0.960450  0.962452  0.961450    433554
-            ,   0.816974  0.804882  0.810883    379759
-            ?   0.871368  0.826812  0.848506     13494
-            -   0.619905  0.367690  0.461591     27341
-            :   0.718636  0.602076  0.655212     18305
-
-     accuracy                       0.983874  10500058
-    macro avg   0.829986  0.759751  0.788538  10500058
- weighted avg   0.983302  0.983874  0.983492  10500058
-
- ```
-
- Usage:
-
- ```bash
- pip install deepmultilingualpunctuation
- ```
-
- ```python
- from deepmultilingualpunctuation import PunctuationModel
-
- model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
- text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
- result = model.restore_punctuation(text)
- print(result)
- ```
+ ---
+ language:
+ - nl
+ tags:
+ - punctuation prediction
+ - punctuation
+ datasets: wmt/europarl
+ license: mit
+ widget:
+ - text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
+   example_title: "Euro Parl"
+ metrics:
+ - f1
+ ---
+
+ This model predicts punctuation for Dutch text. We developed it to restore punctuation in transcribed spoken language.
+ It was trained on the [Europarl dataset](https://huggingface.co/datasets/wmt/europarl).
+ The model restores the following punctuation marks: **"." "," "?" "-" ":"**
+ ## Sample Code
+ We provide a simple Python package that allows you to process text of any length.
+ ## Install
+ To get started, install the package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):
+ ```bash
+ pip install deepmultilingualpunctuation
+ ```
+ ### Restore Punctuation
+ ```python
+ from deepmultilingualpunctuation import PunctuationModel
+ model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
+ text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
+ result = model.restore_punctuation(text)
+ print(result)
+ ```
+ **output**
+ > hervatting van de zitting ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.
+ ### Predict Labels
+ ```python
+ from deepmultilingualpunctuation import PunctuationModel
+
+ model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
+ text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
+ clean_text = model.preprocess(text)
+ labeled_words = model.predict(clean_text)
+ print(labeled_words)
+ ```
+ **output**
+ > [['hervatting', '0', 0.9999777], ['van', '0', 0.99998415], ['de', '0', 0.999987], ['zitting', '0', 0.9992779], ['ik', '0', 0.9999889], ['verklaar', '0', 0.99998295], ['de', '0', 0.99998856], ['zitting', '0', 0.9999895], ['van', '0', 0.9999902], ['het', '0', 0.999992], ['europees', '0', 0.9999924], ['parlement', ',', 0.9915131], ['die', '0', 0.99997807], ['op', '0', 0.9999882], ['vrijdag', '0', 0.9999746], ['17', '0', 0.99998784], ['december', '0', 0.99997866], ['werd', '0', 0.9999888], ['onderbroken', ',', 0.99287957], ['te', '0', 0.9999864], ['zijn', '0', 0.99998176], ['hervat', '.', 0.99762934]]
+
+ ## Results
+ Performance differs between punctuation marks: hyphens and colons are often optional and can be replaced by a comma or a full stop, so the model scores lower on them. The model achieves the following F1 scores:
+
+ | Label            | Dutch    |
+ | ---------------- | -------- |
+ | 0                | 0.993588 |
+ | .                | 0.961450 |
+ | ?                | 0.848506 |
+ | ,                | 0.810883 |
+ | :                | 0.655212 |
+ | -                | 0.461591 |
+ | macro average    | 0.788538 |
+ | weighted average | 0.983492 |
+
+ ## References
+ TBD
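
Note on the summary rows: the removed classification report lists a support count for every label, so the averaged F1 scores can be cross-checked. The sketch below is not part of the commit. It copies the per-label F1 scores and support counts from the removed report and recomputes the unweighted (macro) and support-weighted means; the names `report`, `macro_f1`, and `weighted_f1` are illustrative. The value 0.983492 appears as `weighted avg` in the removed report. Micro-averaged F1 in a single-label task equals accuracy, which that report gives as 0.983874.

```python
# Per-label (F1, support) pairs copied from the classification report
# in the removed README.
report = {
    "0": (0.993588, 9627605),
    ".": (0.961450, 433554),
    ",": (0.810883, 379759),
    "?": (0.848506, 13494),
    "-": (0.461591, 27341),
    ":": (0.655212, 18305),
}


def macro_f1(rows):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-label F1 scores, weighted by label support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


total_support = sum(support for _, support in report.values())

print(f"macro F1:      {macro_f1(report):.6f}")
print(f"weighted F1:   {weighted_f1(report):.6f}")
print(f"total support: {total_support}")
```

Because the `0` (no punctuation) label accounts for most of the support, the weighted mean sits close to that label's score, while the macro mean is pulled down by the hyphen and colon labels.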