metadata

language:
  - en
  - de
  - fr
  - it
  - nl
  - multilingual
tags:
  - punctuation prediction
  - punctuation
datasets:
  - wmt/europarl
  - SoNaR
license: mit
widget:
  - text: Ho sentito che ti sei laureata il che mi fa molto piacere
    example_title: Italian
  - text: Tous les matins vers quatre heures mon père ouvrait la porte de ma chambre
    example_title: French
  - text: Ist das eine Frage Frau Müller
    example_title: German
  - text: My name is Clara and I live in Berkeley California
    example_title: English
  - text: >-
      hervatting van de zitting ik verklaar de zitting van het europees
      parlement die op vrijdag 17 december werd onderbroken te zijn hervat
    example_title: Dutch
metrics:
  - f1

This model predicts the punctuation of English, Italian, French, German and Dutch texts. We developed it to restore punctuation in transcribed spoken language.

This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG Shared Task; for Dutch, we additionally included the SoNaR dataset. Please note that the Europarl data consists of political speeches, so the model may perform differently on texts from other domains.

The model restores the following punctuation markers: "." "," "?" "-" ":"

Sample Code

We provide a simple Python package that allows you to process text of any length.

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

output

[['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
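
The labeled output can be turned back into punctuated text by appending each predicted marker to its word. The following is a minimal sketch, assuming (as the sample output above suggests) that each entry has the form `[word, label, score]` and that the label `'0'` means "no punctuation". The function name `join_labeled_words` is hypothetical and not part of the package.

```python
def join_labeled_words(labeled_words):
    """Rebuild a punctuated string from [word, label, score] entries."""
    parts = []
    for word, label, _score in labeled_words:
        # Assumption: the label '0' marks a word with no trailing punctuation.
        parts.append(word if label == "0" else word + label)
    return " ".join(parts)


# A slice of the sample output shown above.
sample = [
    ["Berkeley", ",", 0.9941764],
    ["California", ".", 0.9952892],
    ["Ist", "0", 0.9999577],
    ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224],
    ["Frage", ",", 0.9952265],
    ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]
print(join_labeled_words(sample))
# Berkeley, California. Ist das eine Frage, Frau Müller?
```

Working on the labeled list rather than the restored string also makes the third field available, for example to skip markers whose score falls below a threshold of your choosing.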

Results

Performance differs across the individual punctuation markers: hyphens and colons are often optional and can be replaced by either a comma or a full stop, so their scores are lower. The model achieves the following F1 scores for the different languages:

Label	English	German	French	Italian	Dutch
0	0.990	0.996	0.991	0.988	0.994
.	0.924	0.951	0.921	0.917	0.959
?	0.825	0.829	0.800	0.736	0.817
,	0.798	0.937	0.811	0.778	0.813
:	0.535	0.608	0.578	0.544	0.657
-	0.345	0.384	0.353	0.344	0.464
macro average	0.736	0.784	0.742	0.718	0.784
micro average	0.975	0.987	0.977	0.972	0.983
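
The macro average row is consistent with the unweighted mean of the six per-label F1 scores, which is why the weak hyphen and colon scores pull it well below the micro average. A short check, using only values copied from the table (the variable and function names are illustrative):

```python
# Per-label F1 scores copied from the table above, in label order: 0 . ? , : -
f1_scores = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German": [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French": [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch": [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


macro = {lang: round(macro_average(s), 3) for lang, s in f1_scores.items()}
print(macro)
# {'English': 0.736, 'German': 0.784, 'French': 0.742, 'Italian': 0.718, 'Dutch': 0.784}
```

The micro average cannot be recomputed from this table alone, since it depends on how often each label occurs in the test data.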

Languages

Models

Languages	Model
English, Italian, French and German	oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch	oliverguhr/fullstop-punctuation-multilingual-sonar-base
Dutch	oliverguhr/fullstop-dutch-sonar-punctuation-prediction

Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian	kredor/punctuate-all
Catalan	softcatala/fullstop-catalan-punctuation-prediction

You can use different models by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")
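
If you select the model programmatically, a small lookup over the tables above may help. This is a sketch only: the ISO 639-1 keys, the `MODEL_BY_LANGUAGE` mapping and the `pick_model` helper are hypothetical conveniences and not part of `deepmultilingualpunctuation`. The model names are the ones listed in the tables, and the choice of which model to prefer for a language covered by several models is an assumption.

```python
# Hypothetical mapping from ISO 639-1 codes to the model names listed above.
MULTILINGUAL = "oliverguhr/fullstop-punctuation-multilingual-sonar-base"
COMMUNITY_ALL = "kredor/punctuate-all"

MODEL_BY_LANGUAGE = {
    "nl": "oliverguhr/fullstop-dutch-sonar-punctuation-prediction",
    "ca": "softcatala/fullstop-catalan-punctuation-prediction",
}
# Languages of the multilingual model documented on this page.
for code in ("en", "de", "fr", "it"):
    MODEL_BY_LANGUAGE[code] = MULTILINGUAL
# Languages that only the community model in the table lists.
for code in ("es", "bg", "pl", "cs", "pt", "sk", "sl"):
    MODEL_BY_LANGUAGE[code] = COMMUNITY_ALL


def pick_model(language_code):
    """Return a model name for a language code, or raise if none is listed."""
    key = language_code.lower()
    if key not in MODEL_BY_LANGUAGE:
        raise ValueError(f"No model listed for language code: {language_code!r}")
    return MODEL_BY_LANGUAGE[key]


print(pick_model("nl"))
# oliverguhr/fullstop-dutch-sonar-punctuation-prediction
```

The returned name can then be passed on as `PunctuationModel(model=pick_model("nl"))`.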

How to cite us

@inproceedings{guhr-EtAl:2021:fullstop,
  title = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month = {June},
  year = {2021},
  address = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, I.2.7},
  title = {FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}