---
license: mit
base_model: xlm-roberta-base
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
metrics:
- f1
---
⚠️ Warning: An updated version of this model is available; this model is no longer maintained. Please refer to our Segment any Text paper for more details: https://arxiv.org/abs/2406.16678
# xlmr-multilingual-sentence-segmentation
This model is a fine-tuned version of xlm-roberta-base on a corrupted version of the Universal Dependencies datasets. It achieves the following results on the (also corrupted) evaluation set; a minimal usage sketch follows the results:
- Loss: 0.0074
- Precision: 0.9664
- Recall: 0.9677
- F1: 0.9670
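The card does not include a usage snippet, so the following is a minimal inference sketch only. It assumes the checkpoint loads as a standard Hugging Face token-classification model and that label id 1 marks a token after which a sentence ends; both the repository id and the label convention are assumptions to verify against the model's config.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "igorsterner/xlmr-multilingual-sentence-segmentation"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

text = "this model segments text it needs no punctuation or casing to do so"
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits[0]
preds = logits.argmax(dim=-1)

# Cut the input after every token predicted as a sentence boundary
# (label id 1 is assumed to be the boundary class).
sentences, start = [], 0
for (tok_start, tok_end), label in zip(offsets.tolist(), preds.tolist()):
    if tok_start == tok_end:  # special tokens have empty (0, 0) offsets
        continue
    if label == 1:
        sentences.append(text[start:tok_end].strip())
        start = tok_end
if start < len(text):
    sentences.append(text[start:].strip())
print(sentences)
```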
## Test set performance
All results below are F1 scores, reported as percentages.
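For concreteness, here is a toy example of how a single table entry could be computed, under the (assumed) simplification that evaluation reduces each text to per-token binary labels, with 1 meaning a sentence boundary follows that token; the exact protocols are those of the cited papers.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted boundary labels for a nine-token text.
gold = [0, 0, 1, 0, 0, 0, 1, 0, 1]
pred = [0, 0, 1, 0, 1, 0, 1, 0, 0]

# Two true positives, one false positive, one false negative:
# precision = recall = 2/3, so F1 = 66.67 as a percentage.
print(round(100 * f1_score(gold, pred), 2))
```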
### OPUS-100 [2]
Who wins most? XLM-RoBERTa: 56, WtPSplit [1]: 12, Spacy (multilingual): 8
Model | af | am | ar | az | be | bg | bn | ca | cs | cy | da | de | el | en | eo | es | et | eu | fa | fi | fr | fy | ga | gd | gl | gu | ha | he | hi | hu | hy | id | is | it | ja | ka | kk | km | kn | ko | ku | ky | lt | lv | mg | mk | ml | mn | mr | ms | my | ne | nl | pa | pl | ps | pt | ro | ru | si | sk | sl | sq | sr | sv | ta | te | th | tr | uk | ur | uz | vi | xh | yi | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Spacy (multilingual) | 42.61 | 6.69 | 58.52 | 73.59 | 34.78 | 93.74 | 38.04 | 88.76 | 87.70 | 26.30 | 90.52 | 74.15 | 89.75 | 89.25 | 88.77 | 90.95 | 87.26 | 81.20 | 55.40 | 93.28 | 85.77 | 21.49 | 60.61 | 36.83 | 88.77 | 5.59 | 89.39 | 92.21 | 53.33 | 93.26 | 24.14 | 90.13 | 95.38 | 86.32 | 0.20 | 38.24 | 42.39 | 0.10 | 9.66 | 51.79 | 27.64 | 21.77 | 76.91 | 77.02 | 83.60 | 93.74 | 39.09 | 33.23 | 86.56 | 87.39 | 0.10 | 6.59 | 93.65 | 5.26 | 92.42 | 2.41 | 92.07 | 91.63 | 75.95 | 75.91 | 92.13 | 93.00 | 92.96 | 95.01 | 93.52 | 36.97 | 64.59 | 21.64 | 94.05 | 89.68 | 29.17 | 64.99 | 90.59 | 64.89 | 4.14 | 0.09 |
WtPSplit | 76.90 | 59.08 | 68.08 | 76.42 | 71.29 | 93.97 | 79.76 | 89.79 | 89.36 | 73.21 | 90.02 | 80.74 | 92.80 | 91.91 | 92.24 | 92.11 | 84.47 | 87.24 | 59.97 | 91.96 | 88.53 | 65.84 | 79.49 | 83.33 | 90.31 | 70.51 | 82.43 | 90.58 | 66.70 | 93.00 | 87.14 | 89.80 | 94.77 | 87.43 | 41.79 | 91.26 | 73.25 | 69.54 | 68.98 | 56.21 | 79.12 | 83.94 | 81.33 | 82.70 | 89.33 | 92.87 | 80.81 | 73.26 | 89.20 | 88.51 | 65.54 | 71.33 | 92.63 | 64.11 | 92.72 | 62.84 | 91.05 | 90.91 | 84.23 | 80.32 | 92.30 | 92.19 | 90.32 | 94.76 | 92.08 | 63.48 | 76.49 | 68.88 | 93.30 | 89.60 | 52.59 | 77.79 | 91.29 | 80.28 | 75.70 | 71.64 |
XLM-RoBERTa (ours) | 83.97 | 41.59 | 81.56 | 81.30 | 85.68 | 94.34 | 84.10 | 91.80 | 91.23 | 78.72 | 92.64 | 86.73 | 93.87 | 94.50 | 94.57 | 93.18 | 90.19 | 90.28 | 74.79 | 94.06 | 90.46 | 81.76 | 84.33 | 85.62 | 92.55 | 67.26 | 86.61 | 91.22 | 72.69 | 94.53 | 89.83 | 92.24 | 93.78 | 89.27 | 41.43 | 78.39 | 89.15 | 36.60 | 70.51 | 82.77 | 58.14 | 89.41 | 89.99 | 88.25 | 86.82 | 92.81 | 86.14 | 94.73 | 93.25 | 92.44 | 49.39 | 66.02 | 93.60 | 69.22 | 93.51 | 61.86 | 92.84 | 93.19 | 89.47 | 86.24 | 92.95 | 93.46 | 91.79 | 94.16 | 93.93 | 72.74 | 81.77 | 74.49 | 93.17 | 92.15 | 62.92 | 75.65 | 93.41 | 84.89 | 56.85 | 77.07 |
### Universal Dependencies [3]
Who wins most? XLM-RoBERTa: 24, WtPSplit: 17, Spacy (multilingual): 13
Model | af | ar | be | bg | bn | ca | cs | cy | da | de | el | en | es | et | eu | fa | fi | fr | ga | gd | gl | he | hi | hu | hy | id | is | it | ja | jv | kk | ko | la | lt | lv | mr | nl | pl | pt | ro | ru | sk | sl | sq | sr | sv | ta | th | tr | uk | ur | vi | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Spacy (multilingual) | 98.47 | 80.38 | 80.27 | 93.62 | 51.85 | 98.95 | 89.68 | 98.89 | 94.96 | 88.02 | 94.16 | 92.20 | 98.70 | 93.77 | 95.79 | 99.83 | 92.88 | 96.33 | 96.67 | 63.04 | 92.37 | 94.37 | 0.32 | 98.45 | 11.39 | 98.01 | 95.41 | 92.49 | 0.37 | 98.03 | 96.21 | 99.80 | 0.09 | 93.86 | 98.52 | 92.13 | 92.86 | 97.02 | 94.91 | 98.05 | 84.31 | 90.26 | 98.23 | 100.00 | 97.84 | 94.91 | 66.67 | 1.95 | 97.63 | 94.16 | 0.37 | 96.40 | 0.40 |
WtPSplit | 98.27 | 83.00 | 89.28 | 98.16 | 99.12 | 98.52 | 92.98 | 99.26 | 94.56 | 96.13 | 96.94 | 94.73 | 97.60 | 94.09 | 97.24 | 97.29 | 94.69 | 96.71 | 86.60 | 72.17 | 98.87 | 95.79 | 96.78 | 96.08 | 96.80 | 98.41 | 86.39 | 95.45 | 95.84 | 98.18 | 96.28 | 99.11 | 91.43 | 97.67 | 96.42 | 91.84 | 93.61 | 95.92 | 96.13 | 81.50 | 86.28 | 95.57 | 96.85 | 99.17 | 98.45 | 95.86 | 97.54 | 70.26 | 96.00 | 92.08 | 93.79 | 92.97 | 97.25 |
XLM-RoBERTa (ours) | 96.81 | 78.99 | 91.60 | 97.89 | 99.12 | 95.99 | 96.05 | 97.17 | 96.62 | 96.29 | 94.33 | 94.76 | 95.73 | 96.20 | 97.37 | 97.49 | 96.34 | 95.70 | 89.78 | 84.20 | 95.72 | 95.95 | 97.51 | 96.24 | 95.62 | 97.22 | 92.93 | 96.88 | 94.23 | 96.29 | 98.40 | 97.46 | 96.35 | 95.82 | 96.91 | 95.92 | 96.27 | 97.24 | 95.83 | 94.63 | 91.59 | 95.88 | 96.43 | 98.36 | 96.83 | 94.95 | 95.93 | 89.26 | 96.52 | 94.59 | 96.20 | 97.31 | 95.12 |
### Ersatz [4]
Who wins most? XLM-RoBERTa: 10, WtPSplit: 8, Spacy (multilingual): 4
Model | ar | cs | de | en | es | et | fi | fr | gu | hi | ja | kk | km | lt | lv | pl | ps | ro | ru | ta | tr | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Spacy (multilingual) | 91.26 | 96.46 | 93.89 | 94.40 | 97.31 | 97.15 | 94.99 | 96.43 | 4.44 | 18.41 | 0.18 | 97.11 | 0.08 | 93.53 | 98.73 | 93.69 | 94.44 | 94.87 | 93.45 | 68.65 | 95.39 | 0.10 |
WtPSplit | 89.45 | 93.41 | 95.93 | 97.16 | 98.74 | 95.84 | 97.10 | 97.61 | 90.62 | 94.87 | 82.14 | 95.94 | 82.89 | 96.74 | 97.22 | 95.16 | 86.99 | 97.55 | 97.82 | 94.76 | 93.53 | 89.02 |
XLM-RoBERTa (ours) | 79.78 | 96.94 | 97.02 | 96.10 | 97.06 | 96.80 | 97.67 | 96.33 | 93.73 | 95.34 | 77.54 | 97.28 | 78.94 | 96.13 | 96.45 | 96.71 | 92.33 | 96.24 | 97.15 | 95.94 | 95.76 | 90.11 |
### German-English code-switching [5]
Model | de |
---|---|
Spacy (multilingual) | 79.55 |
WtPSplit | 77.41 |
XLM-RoBERTa (ours) | 85.78 |
[1] Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation (Minixhofer et al., ACL 2023)
[2] Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation (Zhang et al., ACL 2020)
[3] Universal Dependencies (de Marneffe et al., CL 2021)
[4] A unified approach to sentence segmentation of punctuated text in many languages (Wicks & Post, ACL-IJCNLP 2021)
[5] The Denglisch Corpus of German-English Code-Switching (Osmelak & Wintner, SIGTYP 2023)
## Training hyperparameters
The following hyperparameters were used during training (a TrainingArguments sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
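For convenience, here is how these settings map onto transformers' TrainingArguments. This is a sketch, not the original training script; the output_dir is hypothetical, and the Adam settings shown are both the reported values and the library defaults.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlmr-multilingual-sentence-segmentation",  # hypothetical
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    adam_beta1=0.9,     # reported betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,  # reported epsilon=1e-08
)
```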
## Training results
Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 |
---|---|---|---|---|---|---|
No log | 0.2 | 100 | 0.0125 | 0.9320 | 0.9487 | 0.9403 |
No log | 0.4 | 200 | 0.0099 | 0.9547 | 0.9513 | 0.9530 |
No log | 0.6 | 300 | 0.0092 | 0.9616 | 0.9506 | 0.9561 |
No log | 0.81 | 400 | 0.0083 | 0.9584 | 0.9618 | 0.9601 |
0.0212 | 1.01 | 500 | 0.0082 | 0.9551 | 0.9642 | 0.9596 |
0.0212 | 1.21 | 600 | 0.0084 | 0.9630 | 0.9614 | 0.9622 |
0.0212 | 1.41 | 700 | 0.0079 | 0.9606 | 0.9648 | 0.9627 |
0.0212 | 1.61 | 800 | 0.0077 | 0.9609 | 0.9661 | 0.9635 |
0.0212 | 1.81 | 900 | 0.0076 | 0.9623 | 0.9649 | 0.9636 |
0.0067 | 2.02 | 1000 | 0.0077 | 0.9598 | 0.9689 | 0.9643 |
0.0067 | 2.22 | 1100 | 0.0075 | 0.9614 | 0.9680 | 0.9647 |
0.0067 | 2.42 | 1200 | 0.0073 | 0.9626 | 0.9682 | 0.9654 |
0.0067 | 2.62 | 1300 | 0.0075 | 0.9617 | 0.9692 | 0.9654 |
0.0067 | 2.82 | 1400 | 0.0073 | 0.9658 | 0.9648 | 0.9653 |
0.0054 | 3.02 | 1500 | 0.0076 | 0.9656 | 0.9663 | 0.9660 |
0.0054 | 3.23 | 1600 | 0.0073 | 0.9625 | 0.9703 | 0.9664 |
0.0054 | 3.43 | 1700 | 0.0073 | 0.9658 | 0.9659 | 0.9658 |
0.0054 | 3.63 | 1800 | 0.0073 | 0.9626 | 0.9707 | 0.9666 |
0.0054 | 3.83 | 1900 | 0.0073 | 0.9659 | 0.9677 | 0.9668 |
0.0046 | 4.03 | 2000 | 0.0075 | 0.9671 | 0.9659 | 0.9665 |
0.0046 | 4.23 | 2100 | 0.0075 | 0.9654 | 0.9687 | 0.9671 |
0.0046 | 4.44 | 2200 | 0.0075 | 0.9662 | 0.9676 | 0.9669 |
0.0046 | 4.64 | 2300 | 0.0074 | 0.9657 | 0.9684 | 0.9670 |
0.0046 | 4.84 | 2400 | 0.0074 | 0.9664 | 0.9678 | 0.9671 |
## Framework versions
- Transformers 4.39.1
- Pytorch 2.2.1+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
## Citation
Please consider citing our paper if this model has helped you:
@inproceedings{frohmann-etal-2024-segment,
    title = "Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation",
    author = "Frohmann, Markus and Sterner, Igor and Vulić, Ivan and Minixhofer, Benjamin and Schedl, Markus",
    month = nov,
    year = "2024",
    publisher = "Association for Computational Linguistics",
}