--- license: mit base_model: xlm-roberta-base language: - multilingual - af - am - ar - as - az - be - bg - bn - br - bs - ca - cs - cy - da - de - el - en - eo - es - et - eu - fa - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - hu - hy - id - is - it - ja - jv - ka - kk - km - kn - ko - ku - ky - la - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - 'no' - om - or - pa - pl - ps - pt - ro - ru - sa - sd - si - sk - sl - so - sq - sr - su - sv - sw - ta - te - th - tl - tr - ug - uk - ur - uz - vi - xh - yi - zh metrics: - f1 --- # xlmr-multilingual-sentence-segmentation This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on a corrupted version of the universal dependency datasets. It achieves the following results on the (also corrupted) evaluation set: - Loss: 0.0074 - Precision: 0.9664 - Recall: 0.9677 - F1: 0.9670 # Test set performance # Results All results here are percentage F1: ## Opus100 [2] Who wins most? XLM-RoBERTa: 56, WtPSplit: 12, Spacy (multilingual): 8 | | af | am | ar | az | be | bg | bn | ca | cs | cy | da | de | el | en | eo | es | et | eu | fa | fi | fr | fy | ga | gd | gl | gu | ha | he | hi | hu | hy | id | is | it | ja | ka | kk | km | kn | ko | ku | ky | lt | lv | mg | mk | ml | mn | mr | ms | my | ne | nl | pa | pl | ps | pt | ro | ru | si | sk | sl | sq | sr | sv | ta | te | th | tr | uk | ur | uz | vi | xh | yi | zh | |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------| | Spacy (multilingual) | 42.61 | 6.69 | 58.52 | 73.59 | 34.78 | 93.74 | 38.04 | 88.76 | 87.70 | 26.30 | 90.52 | 74.15 | 89.75 | 89.25 | 88.77 | 90.95 | 87.26 | 81.20 | 55.40 | 93.28 | 85.77 | 21.49 | 60.61 | 36.83 | 88.77 | 5.59 | **89.39** | **92.21** | 53.33 | 93.26 | 24.14 | 90.13 | **95.38** | 86.32 | 0.20 | 38.24 | 42.39 | 0.10 | 9.66 | 51.79 | 27.64 | 21.77 | 76.91 | 77.02 | 83.60 | **93.74** | 39.09 | 33.23 | 86.56 | 87.39 | 0.10 | 6.59 | **93.65** | 5.26 | 92.42 | 2.41 | 92.07 | 91.63 | 75.95 | 75.91 | 92.13 | 93.00 | **92.96** | **95.01** | 93.52 | 36.97 | 64.59 | 21.64 | **94.05** | 89.68 | 29.17 | 64.99 | 90.59 | 64.89 | 4.14 | 0.09 | | WtPSplit | 76.90 | **59.08** | 68.08 | 76.42 | 71.29 | 93.97 | 79.76 | 89.79 | 89.36 | 73.21 | 90.02 | 80.74 | 92.80 | 91.91 | 92.24 | 92.11 | 84.47 | 87.24 | 59.97 | 91.96 | 88.53 | 65.84 | 79.49 | 83.33 | 90.31 | **70.51** | 82.43 | 90.58 | 66.70 | 93.00 | 87.14 | 89.80 | 94.77 | 87.43 | **41.79** | **91.26** | 73.25 | **69.54** | 68.98 | 56.21 | **79.12** | 83.94 | 81.33 | 82.70 | **89.33** | 92.87 | 80.81 | 73.26 | 89.20 | 88.51 | **65.54** | **71.33** | 92.63 | 64.11 | 92.72 | **62.84** | 91.05 | 90.91 | 84.23 | 80.32 | 92.30 | 92.19 | 90.32 | 94.76 | 92.08 | 63.48 | 76.49 | 68.88 | 93.30 | 89.60 | 52.59 | **77.79** | 91.29 | 80.28 | **75.70** | 71.64 | | XLM-RoBERTa (ours) | **83.97** | 41.59 | **81.56** | **81.30** | **85.68** | **94.34** | **84.10** | **91.80** | **91.23** | **78.72** | **92.64** | **86.73** | **93.87** | **94.50** | **94.57** | **93.18** | **90.19** | **90.28** | **74.79** | **94.06** | **90.46** | **81.76** | **84.33** | **85.62** | **92.55** | 67.26 | 86.61 | 91.22 | **72.69** | **94.53** | **89.83** | **92.24** | 93.78 | **89.27** | 41.43 | 78.39 | **89.15** | 36.60 | **70.51** | **82.77** | 58.14 | **89.41** | **89.99** | **88.25** | 86.82 | 92.81 | **86.14** | **94.73** | **93.25** | **92.44** | 49.39 | 66.02 | 93.60 | **69.22** | **93.51** | 61.86 | **92.84** | **93.19** | **89.47** | **86.24** | **92.95** | **93.46** | 91.79 | 94.16 | **93.93** | **72.74** | **81.77** | **74.49** | 93.17 | **92.15** | **62.92** | 75.65 | **93.41** | **84.89** | 56.85 | **77.07** | ## Universal Dependencies [3] Who wins most? XLM-RoBERTa: 24, WtPSplit: 17 Spacy (multilingual): 13 | | af | ar | be | bg | bn | ca | cs | cy | da | de | el | en | es | et | eu | fa | fi | fr | ga | gd | gl | he | hi | hu | hy | id | is | it | ja | jv | kk | ko | la | lt | lv | mr | nl | pl | pt | ro | ru | sk | sl | sq | sr | sv | ta | th | tr | uk | ur | vi | zh | |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:-----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------| | Spacy (multilingual) | **98.47** | 80.38 | 80.27 | 93.62 | 51.85 | **98.95** | 89.68 | 98.89 | 94.96 | 88.02 | 94.16 | 92.20 | **98.70** | 93.77 | 95.79 | **99.83** | 92.88 | 96.33 | **96.67** | 63.04 | 92.37 | 94.37 | 0.32 | **98.45** | 11.39 | 98.01 | **95.41** | 92.49 | 0.37 | 98.03 | 96.21 | **99.80** | 0.09 | 93.86 | **98.52** | 92.13 | 92.86 | 97.02 | 94.91 | **98.05** | 84.31 | 90.26 | **98.23** | **100.00** | 97.84 | 94.91 | 66.67 | 1.95 | **97.63** | 94.16 | 0.37 | 96.40 | 0.40 | | WtPSplit | 98.27 | **83.00** | 89.28 | **98.16** | **99.12** | 98.52 | 92.98 | **99.26** | 94.56 | 96.13 | **96.94** | 94.73 | 97.60 | 94.09 | 97.24 | 97.29 | 94.69 | **96.71** | 86.60 | 72.17 | **98.87** | 95.79 | 96.78 | 96.08 | **96.80** | **98.41** | 86.39 | 95.45 | **95.84** | **98.18** | 96.28 | 99.11 | 91.43 | **97.67** | 96.42 | 91.84 | 93.61 | 95.92 | **96.13** | 81.50 | 86.28 | 95.57 | 96.85 | 99.17 | **98.45** | **95.86** | **97.54** | 70.26 | 96.00 | 92.08 | 93.79 | 92.97 | **97.25** | | XLM-RoBERTa (ours) | 96.81 | 78.99 | **91.60** | 97.89 | **99.12** | 95.99 | **96.05** | 97.17 | **96.62** | **96.29** | 94.33 | **94.76** | 95.73 | **96.20** | **97.37** | 97.49 | **96.34** | 95.70 | 89.78 | **84.20** | 95.72 | **95.95** | **97.51** | 96.24 | 95.62 | 97.22 | 92.93 | **96.88** | 94.23 | 96.29 | **98.40** | 97.46 | **96.35** | 95.82 | 96.91 | **95.92** | **96.27** | **97.24** | 95.83 | 94.63 | **91.59** | **95.88** | 96.43 | 98.36 | 96.83 | 94.95 | 95.93 | **89.26** | 96.52 | **94.59** | **96.20** | **97.31** | 95.12 | ## Ersatz [4] Who wins most? XLM-RoBERTa: 10, WtPSplit: 8, Spacy (multilingual): 4 | | ar | cs | de | en | es | et | fi | fr | gu | hi | ja | kk | km | lt | lv | pl | ps | ro | ru | ta | tr | zh | |:---------------------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------|:----------| | Spacy (multilingual) | **91.26** | 96.46 | 93.89 | 94.40 | 97.31 | **97.15** | 94.99 | 96.43 | 4.44 | 18.41 | 0.18 | 97.11 | 0.08 | 93.53 | **98.73** | 93.69 | **94.44** | 94.87 | 93.45 | 68.65 | 95.39 | 0.10 | | WtPSplit | 89.45 | 93.41 | 95.93 | **97.16** | **98.74** | 95.84 | 97.10 | **97.61** | 90.62 | 94.87 | **82.14** | 95.94 | **82.89** | **96.74** | 97.22 | 95.16 | 86.99 | **97.55** | **97.82** | 94.76 | 93.53 | 89.02 | | XLM-RoBERTa (ours) | 79.78 | **96.94** | **97.02** | 96.10 | 97.06 | 96.80 | **97.67** | 96.33 | **93.73** | **95.34** | 77.54 | **97.28** | 78.94 | 96.13 | 96.45 | **96.71** | 92.33 | 96.24 | 97.15 | **95.94** | **95.76** | **90.11** | ## German--English code-switching [5] | | de | |:---------------------|:----------| | Spacy (multilingual) | 79.55 | | WtPSplit | 77.41 | | XLM-RoBERTa (ours) | **85.78** | [1] [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398) (Minixhofer et al., ACL 2023) [2] [Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation](https://aclanthology.org/2020.acl-main.148) (Zhang et al., ACL 2020) [3] [Universal Dependencies](https://aclanthology.org/2021.cl-2.11) (de Marneffe et al., CL 2021) [4] [A unified approach to sentence segmentation of punctuated text in many languages](https://aclanthology.org/2021.acl-long.309) (Wicks & Post, ACL-IJCNLP 2021) [5] [The Denglisch Corpus of German-English Code-Switching](https://aclanthology.org/2023.sigtyp-1.5) (Osmelak & Wintner, SIGTYP 2023) ### Training hyperparameters The following hyperparameters were used during training: - learning_rate: 2e-05 - train_batch_size: 64 - eval_batch_size: 64 - seed: 42 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: linear - num_epochs: 5 ### Training results | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:| | No log | 0.2 | 100 | 0.0125 | 0.9320 | 0.9487 | 0.9403 | | No log | 0.4 | 200 | 0.0099 | 0.9547 | 0.9513 | 0.9530 | | No log | 0.6 | 300 | 0.0092 | 0.9616 | 0.9506 | 0.9561 | | No log | 0.81 | 400 | 0.0083 | 0.9584 | 0.9618 | 0.9601 | | 0.0212 | 1.01 | 500 | 0.0082 | 0.9551 | 0.9642 | 0.9596 | | 0.0212 | 1.21 | 600 | 0.0084 | 0.9630 | 0.9614 | 0.9622 | | 0.0212 | 1.41 | 700 | 0.0079 | 0.9606 | 0.9648 | 0.9627 | | 0.0212 | 1.61 | 800 | 0.0077 | 0.9609 | 0.9661 | 0.9635 | | 0.0212 | 1.81 | 900 | 0.0076 | 0.9623 | 0.9649 | 0.9636 | | 0.0067 | 2.02 | 1000 | 0.0077 | 0.9598 | 0.9689 | 0.9643 | | 0.0067 | 2.22 | 1100 | 0.0075 | 0.9614 | 0.9680 | 0.9647 | | 0.0067 | 2.42 | 1200 | 0.0073 | 0.9626 | 0.9682 | 0.9654 | | 0.0067 | 2.62 | 1300 | 0.0075 | 0.9617 | 0.9692 | 0.9654 | | 0.0067 | 2.82 | 1400 | 0.0073 | 0.9658 | 0.9648 | 0.9653 | | 0.0054 | 3.02 | 1500 | 0.0076 | 0.9656 | 0.9663 | 0.9660 | | 0.0054 | 3.23 | 1600 | 0.0073 | 0.9625 | 0.9703 | 0.9664 | | 0.0054 | 3.43 | 1700 | 0.0073 | 0.9658 | 0.9659 | 0.9658 | | 0.0054 | 3.63 | 1800 | 0.0073 | 0.9626 | 0.9707 | 0.9666 | | 0.0054 | 3.83 | 1900 | 0.0073 | 0.9659 | 0.9677 | 0.9668 | | 0.0046 | 4.03 | 2000 | 0.0075 | 0.9671 | 0.9659 | 0.9665 | | 0.0046 | 4.23 | 2100 | 0.0075 | 0.9654 | 0.9687 | 0.9671 | | 0.0046 | 4.44 | 2200 | 0.0075 | 0.9662 | 0.9676 | 0.9669 | | 0.0046 | 4.64 | 2300 | 0.0074 | 0.9657 | 0.9684 | 0.9670 | | 0.0046 | 4.84 | 2400 | 0.0074 | 0.9664 | 0.9678 | 0.9671 | ### Framework versions - Transformers 4.39.1 - Pytorch 2.2.1+cu121 - Datasets 2.18.0 - Tokenizers 0.15.2