techiaith/fullstop-welsh-punctuation-prediction

This model predicts punctuation for Welsh-language text. It was created to restore punctuation in transcripts produced by speech recognition models such as https://huggingface.co/techiaith/wav2vec2-xlsr-ft-cy. The model restores the following punctuation markers: "." "," "?" "-" ":"

The model was trained on Welsh texts extracted from the Welsh Parliament / Senedd Record of Proceedings for 1999-2010 and from 2016 to the present day. Please note that the training data consists of originally spoken and translated political speeches, so the model might perform differently on texts from other domains.

Based on the work of https://github.com/oliverguhr/fullstop-deep-punctuation-prediction and softcatala/fullstop-catalan-punctuation-prediction

Install

To get started, install the deepmultilingualpunctuation package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

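from deepmultilingualpunctuation import PunctuationModel

# Load the Welsh model by its Hugging Face model ID
model = PunctuationModel("techiaith/fullstop-welsh-punctuation-prediction")
# Unpunctuated input, for example a speech recognition transcript
text = "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn ne ddwyrain Cymru"
# Returns the input text with the predicted punctuation inserted
result = model.restore_punctuation(text)
print(result)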

output

[
  {
    "entity_group": "LABEL_0",
    "score": 0.9999812841415405,
    "word": "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn",
    "start": 0,
    "end": 58
  },
  {
    "entity_group": "LABEL_4",
    "score": 0.9787278771400452,
    "word": "ne",
    "start": 59,
    "end": 61
  },
  {
    "entity_group": "LABEL_0",
    "score": 0.9999902248382568,
    "word": "ddwyrain",
    "start": 62,
    "end": 70
  },
  {
    "entity_group": "LABEL_3",
    "score": 0.9484745860099792,
    "word": "Cymru",
    "start": 71,
    "end": 76
  }
]

A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn ne-ddwyrain Cymru?
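
The relationship between the labelled groups and the final sentence above can be sketched in a few lines. This is only an illustration, not the package's own implementation: `join_groups` and `LABEL_TO_MARKER` are hypothetical names, the mappings for LABEL_3 ("?") and LABEL_4 ("-") are read off the example output, and the remaining mappings are an assumption based on the order of the markers in the results table.

```python
# Assumed label-to-marker mapping. LABEL_0, LABEL_3 and LABEL_4 follow from
# the example output; LABEL_1, LABEL_2 and LABEL_5 are guesses.
LABEL_TO_MARKER = {
    "LABEL_0": "",
    "LABEL_1": ".",
    "LABEL_2": ",",
    "LABEL_3": "?",
    "LABEL_4": "-",
    "LABEL_5": ":",
}


def join_groups(groups):
    """Rebuild a punctuated string from labelled word groups."""
    pieces = []
    for group in groups:
        marker = LABEL_TO_MARKER[group["entity_group"]]
        # A hyphen attaches directly to the next word; every other marker
        # (including "no marker") is followed by a space.
        separator = "" if marker == "-" else " "
        pieces.append(group["word"] + marker + separator)
    return "".join(pieces).strip()


groups = [
    {"entity_group": "LABEL_0",
     "word": "A yw'r gweinidog yn cytuno bod angen gwell gwasanaethau yn"},
    {"entity_group": "LABEL_4", "word": "ne"},
    {"entity_group": "LABEL_0", "word": "ddwyrain"},
    {"entity_group": "LABEL_3", "word": "Cymru"},
]

print(join_groups(groups))
```

Each label describes the marker that follows the last word of its group, which is why "ne" carries the hyphen label and "Cymru" carries the question mark label.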

Results

The model achieves the following precision, recall and F1 scores for the different punctuation markers (label 0 means no punctuation follows the word):

Label	Precision	Recall	F1-score	Support
0	0.99	0.99	0.99	12124280
.	0.88	0.89	0.88	455896
,	0.84	0.82	0.83	771813
?	0.92	0.88	0.90	54878
-	0.95	0.94	0.95	31545
:	0.91	0.87	0.89	39618

accuracy			0.98	13478030
macro avg	0.91	0.90	0.91	13478030
weighted avg	0.97	0.98	0.97	13478030
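
As a rough check on how the summary rows relate to the per-label rows, the snippet below recomputes the total support and the macro-averaged F1 from the rounded figures in the table. It is a sketch over the published numbers only; because the inputs are rounded to two decimals, recomputed values may differ slightly from the reported ones.

```python
# (label, precision, recall, f1, support) rows copied from the table above
rows = [
    ("0", 0.99, 0.99, 0.99, 12124280),
    (".", 0.88, 0.89, 0.88, 455896),
    (",", 0.84, 0.82, 0.83, 771813),
    ("?", 0.92, 0.88, 0.90, 54878),
    ("-", 0.95, 0.94, 0.95, 31545),
    (":", 0.91, 0.87, 0.89, 39618),
]

# Total number of evaluated tokens across all labels
total_support = sum(row[4] for row in rows)

# Macro average: unweighted mean of the per-label F1 scores
macro_f1 = sum(row[3] for row in rows) / len(rows)

# Share of tokens that carry the "no punctuation" label
no_punct_share = rows[0][4] / total_support

print(total_support)
print(round(macro_f1, 2))
print(round(no_punct_share, 2))
```

Roughly nine in ten tokens carry label 0, so the accuracy and weighted average are dominated by that label. The per-marker rows and the macro average give a better picture of how well punctuation itself is restored.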
