license: cc-by-sa-4.0
datasets:
  - cjvt/cc_gigafida
  - cjvt/solar3
  - cjvt/sloleks
language:
  - sl
tags:
  - word spelling error annotator

SloBERTa-Incorrect-Spelling-Annotator

This SloBERTa model annotates incorrectly spelled words in Slovenian text. It uses the following labels:

  • 1: Indicates an incorrectly spelled word.
  • 2: Denotes words that should be written together as one word.
  • 3: Denotes a word that should be written as separate words.

All other tokens are labeled 0 in the example below.

Model Output Example

Imagine we have the following Slovenian text:

Model vbesedilu o znači besede, v katerih se najajajo napake.

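Before passing the text to the SloBERTa model, we convert it to the expected input format, in which every token (word or punctuation mark) is followed by a <mask> token: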

Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>
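The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the regex tokenizer and the function name `to_model_input` are assumptions, chosen only because they reproduce the format shown in the example (words and punctuation marks as separate tokens, each followed by `<mask>`).

```python
import re

MASK = "<mask>"


def to_model_input(text: str) -> str:
    """Follow every word and punctuation mark with a mask token.

    Assumption: a simple regex tokenizer stands in for whatever
    tokenization the original pipeline uses.
    """
    # \w+ matches a run of letters or digits (Unicode-aware, so "znači" stays whole);
    # [^\w\s] matches a single punctuation character.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)


print(to_model_input("Model vbesedilu o znači besede, v katerih se najajajo napake."))
```

Real text with abbreviations, numbers, or hyphenated words may need a proper Slovenian tokenizer instead of this regex.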

The model might return the following predictions (note: these predictions were chosen for demonstration and explanation, not for reproducibility):

Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0

We can observe the following:

  1. In the input sentence, the word najajajo is spelled incorrectly, so the model marks it with the token (1).
  2. The word vbesedilu should be written as two words v and besedilu, so the model marks it with the token (3).
  3. The words o and znači should be written as one word označi, so the model marks them with the tokens (2).
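The observations above can also be read off programmatically. The sketch below assumes the predictions arrive as a single whitespace-separated string of alternating tokens and labels, as in the example, and that label 0 means "no error" (inferred from the example, not stated by the authors). The names `parse_predictions`, `flagged`, and `LABEL_MEANINGS` are hypothetical helpers, not part of the model's API.

```python
from typing import List, Tuple

# Label descriptions taken from the label list of this model card.
LABEL_MEANINGS = {
    1: "incorrectly spelled",
    2: "should be written together with a neighbouring word",
    3: "should be written as separate words",
}


def parse_predictions(output: str) -> List[Tuple[str, int]]:
    """Turn 'token label token label ...' into (token, label) pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating tokens and labels")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]


def flagged(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int, str]]:
    """Keep only tokens with a non-zero label (assumption: 0 means no error)."""
    return [
        (token, label, LABEL_MEANINGS.get(label, "unknown label"))
        for token, label in pairs
        if label != 0
    ]


predictions = ("Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 "
               "katerih 0 se 0 najajajo 1 napake 0 . 0")
for token, label, meaning in flagged(parse_predictions(predictions)):
    print(f"{token}: label {label} ({meaning})")
```

For the example sentence this lists vbesedilu, o, znači and najajajo, matching the three observations.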

More details

Testing the model with generated test sets provides the following result:

  • Label 1 prediction -> Precision: 0.911; Recall: 0.975; F1: 0.942

Testing the model with test sets constructed using the Šolar Eval dataset provides the following results:

  • Label 1 prediction -> Precision: 0.900; Recall: 0.860; F1: 0.880
  • Label 2 prediction -> Precision: 0.826; Recall: 0.853; F1: 0.839
  • Label 3 prediction -> Precision: 0.518; Recall: 0.671; F1: 0.585
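The F1 values are the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes them from the precision and recall figures reported above. It is only a consistency check on the published numbers, not the authors' evaluation code, and the dictionary keys are illustrative names.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results reported above.
reported = {
    "generated, label 1": (0.911, 0.975),
    "Šolar Eval, label 1": (0.900, 0.860),
    "Šolar Eval, label 2": (0.826, 0.853),
    "Šolar Eval, label 3": (0.518, 0.671),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

The recomputed values agree with the reported F1 scores to three decimal places.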

Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.