---
language:
- en
- de
- fr
- it
- nl
- multilingual
tags:
- punctuation prediction
- punctuation
datasets: 
- wmt/europarl
- SoNaR
license: mit
widget:
- text: "Ho sentito che ti sei laureata il che mi fa molto piacere"
  example_title: "Italian"
- text: "Tous les matins vers quatre heures mon père ouvrait la porte de ma chambre"
  example_title: "French"
- text: "Ist das eine Frage Frau Müller"
  example_title: "German"
- text: "My name is Clara and I live in Berkeley California"
  example_title: "English"  
- text: "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
  example_title: "Dutch"
metrics:
- f1
---



This model predicts the punctuation of English, Italian, French, German and Dutch texts. We developed it to restore the punctuation of transcribed spoken language.

This multilingual model was trained on the [Europarl Dataset](https://huggingface.co/datasets/wmt/europarl) provided by the [SEPP-NLG Shared Task](https://sites.google.com/view/sentence-segmentation). For Dutch, we additionally included the [SoNaR Dataset](http://hdl.handle.net/10032/tm-a2-h5). *Please note that the Europarl dataset consists of political speeches. Therefore the model might perform differently on texts from other domains.*

The model restores the following punctuation markers: **"." "," "?" "-" ":"**
## Sample Code
We provide a simple Python package that allows you to process text of any length.

## Install 

To get started, install the package from [PyPI](https://pypi.org/project/deepmultilingualpunctuation/):

```bash
pip install deepmultilingualpunctuation
```
### Restore Punctuation
```python
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
```

**output**
> My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?


### Predict Labels 
```python
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
```

**output**

> [['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
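
The label output can be turned back into punctuated text by appending each predicted marker to its word. The following is a minimal sketch that works on the sample output above, so it runs without downloading the model. The helper names `join_labeled_words` and `low_confidence` and the `0.9` threshold are illustrative assumptions and not part of the package.

```python
# Labeled words copied from the sample output above: [word, label, score].
labeled_words = [
    ["My", "0", 0.99998856], ["name", "0", 0.9999708], ["is", "0", 0.99975926],
    ["Clara", "0", 0.6117834], ["and", "0", 0.9999014], ["I", "0", 0.9999808],
    ["live", "0", 0.9999666], ["in", "0", 0.99990165], ["Berkeley", ",", 0.9941764],
    ["California", ".", 0.9952892], ["Ist", "0", 0.9999577], ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224], ["Frage", ",", 0.9952265], ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]


def join_labeled_words(words):
    """Append each predicted marker to its word; the label "0" means no marker.

    Simplification (assumption): every marker, including "-", is attached
    directly to the preceding word.
    """
    parts = []
    for word, label, _score in words:
        parts.append(word if label == "0" else word + label)
    return " ".join(parts)


def low_confidence(words, threshold=0.9):
    """Return the entries whose score is below the (hypothetical) threshold."""
    return [(word, label, score) for word, label, score in words if score < threshold]


print(join_labeled_words(labeled_words))
print(low_confidence(labeled_words))
```

For the sample above, the joined text matches the output of `restore_punctuation`, and the only entry below the threshold is `Clara`. Inspecting low scores like this can help to find positions where the prediction may need manual review.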



## Results 

The performance differs between the punctuation markers: hyphens and colons are optional in many cases and can be substituted by either a comma or a full stop, so their scores are lower. The label "0" means that no punctuation marker follows the word. The model achieves the following F1 scores for the different languages:

| Label         | English  | German | French | Italian | Dutch |
| ------------- | -------- | ------ | ------ | ------- | ----- |
| 0             | 0.990    |  0.996 | 0.991 | 0.988 | 0.994 |
| .             | 0.924    |  0.951 | 0.921 | 0.917 | 0.959 |
| ?             | 0.825    |  0.829 | 0.800 | 0.736 | 0.817 |
| ,             | 0.798    |  0.937 | 0.811 | 0.778 | 0.813 |
| :             | 0.535    |  0.608 | 0.578 | 0.544 | 0.657 |
| -             | 0.345    |  0.384 | 0.353 | 0.344 | 0.464 |
| macro average | 0.736    |  0.784 | 0.742 | 0.718 | 0.784 |
| micro average | 0.975    |  0.987 | 0.977 | 0.972 | 0.983 |
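
As a worked example, the macro average row can be reproduced from the per-label scores, assuming the usual definition of the macro average as the unweighted mean of the per-label F1 scores. The sketch below copies the values from the table. The micro average cannot be recomputed this way because it depends on the per-label counts, which are not listed here.

```python
# Per-label F1 scores copied from the table above, in the order: 0 . ? , : -
f1_scores = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German": [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French": [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch": [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


macro = {lang: round(macro_average(scores), 3) for lang, scores in f1_scores.items()}
print(macro)
```

Because every label has the same weight, the low scores for ":" and "-" pull the macro average well below the micro average, which is dominated by the frequent label "0".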

## Languages

### Models

| Languages                                  | Model                                                        |
| ------------------------------------------ | ------------------------------------------------------------ |
| English, Italian, French and German        | [oliverguhr/fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large) |
| English, Italian, French, German and Dutch | [oliverguhr/fullstop-punctuation-multilingual-sonar-base](https://huggingface.co/oliverguhr/fullstop-punctuation-multilingual-sonar-base) |
| Dutch                                      | [oliverguhr/fullstop-dutch-sonar-punctuation-prediction](https://huggingface.co/oliverguhr/fullstop-dutch-sonar-punctuation-prediction) |

### Community Models

| Languages                                  | Model                                                        |
| ------------------------------------------ | ------------------------------------------------------------ |
| English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian | [kredor/punctuate-all](https://huggingface.co/kredor/punctuate-all) |
| Catalan                                    | [softcatala/fullstop-catalan-punctuation-prediction](https://huggingface.co/softcatala/fullstop-catalan-punctuation-prediction) |

You can use different models by setting the model parameter:

```python
model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
```
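
If the language of the input is known in advance, the model name can be looked up from a small table. This is only a sketch: the mapping and the helper `pick_model` are hypothetical, and the choice of which model to use per language is an assumption based on the model tables above, not a recommendation from the package.

```python
# Hypothetical mapping from ISO language codes to model names from the tables above.
MULTILINGUAL = "oliverguhr/fullstop-punctuation-multilingual-sonar-base"
MODELS_BY_LANGUAGE = {
    "en": MULTILINGUAL,
    "de": MULTILINGUAL,
    "fr": MULTILINGUAL,
    "it": MULTILINGUAL,
    "nl": "oliverguhr/fullstop-dutch-sonar-punctuation-prediction",
    "ca": "softcatala/fullstop-catalan-punctuation-prediction",
}


def pick_model(language_code):
    """Return the model name for a language code, or raise for unknown codes."""
    code = language_code.lower()
    if code not in MODELS_BY_LANGUAGE:
        raise ValueError(f"No model configured for language: {language_code}")
    return MODELS_BY_LANGUAGE[code]


print(pick_model("nl"))
```

The returned name can then be passed as the `model` parameter of `PunctuationModel`.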


## How to cite us

```
@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

```

```
@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7},
  title = {FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}

```