BERTić-COMtext-SR-legal-MSD-ijekavica

BERTić-COMtext-SR-legal-MSD-ijekavica is a variant of the BERTić model, fine-tuned for morphosyntactic description (MSD) tag prediction in Serbian legal texts written in the Ijekavian pronunciation. The model was fine-tuned for 15 epochs on the Ijekavian variant of the COMtext.SR.legal dataset.

Benchmarking

This model was evaluated on the tasks of MSD prediction and lemmatization of Serbian legal texts. Lemmatization was performed using the predicted MSD tags and the hrLex inflectional lexicon.
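The lexicon-based lemmatization step can be pictured as a lookup keyed on the word form and its predicted MSD tag. The following is a minimal sketch, not the code used in the experiments: the toy lexicon, the function name `lemmatize`, and the fallback to the unchanged token when no entry is found are all assumptions made for illustration, and the real hrLex file format is not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon: (lower-cased word form, MSD tag) -> lemma.
# A real run would load these entries from the hrLex inflectional lexicon.
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("zakona", "Ncmsg"): "zakon",
    ("člana", "Ncmsg"): "član",
    ("propisuje", "Vmr3s"): "propisivati",
}


def lemmatize(
    tokens: Sequence[str],
    msd_tags: Sequence[str],
    lexicon: Dict[Tuple[str, str], str],
) -> List[str]:
    """Look up each (word form, predicted MSD tag) pair in the lexicon.

    Assumption: tokens without a matching entry are returned unchanged.
    """
    if len(tokens) != len(msd_tags):
        raise ValueError("tokens and msd_tags must have the same length")
    lemmas = []
    for token, tag in zip(tokens, msd_tags):
        lemmas.append(lexicon.get((token.lower(), tag), token))
    return lemmas


if __name__ == "__main__":
    print(lemmatize(["Zakona", "propisuje"], ["Ncmsg", "Vmr3s"], TOY_LEXICON))
```

Because the lookup key includes the tag, a wrong MSD prediction can lead to a missing or wrong lemma, which is why lemmatization quality depends on the tagger being evaluated.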

Accuracy (ACC) and Word Error Rate (WER) were used as evaluation metrics.
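As an illustration of the two metrics, here is a minimal sketch that computes them over sequences of labels (MSD tags or lemmas). It assumes the usual definitions: accuracy as the share of matching positions in aligned sequences, and WER as the Levenshtein distance between the predicted and reference sequences divided by the reference length. The function names are hypothetical, and the evaluation scripts in the repository may differ in detail.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("accuracy needs aligned sequences of equal length")
    if not gold:
        raise ValueError("reference sequence is empty")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def word_error_rate(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Levenshtein distance between the sequences over the reference length."""
    if not gold:
        raise ValueError("reference sequence is empty")
    prev = list(range(len(pred) + 1))  # distances against an empty reference
    for i, g in enumerate(gold, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            curr.append(min(
                prev[j] + 1,             # gold item missing from prediction
                curr[j - 1] + 1,         # extra item in prediction
                prev[j - 1] + (g != p),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(gold)


if __name__ == "__main__":
    gold = ["Ncmsn", "Vmr3s", "Sl", "Ncfsl"]
    pred = ["Ncmsn", "Vmr3s", "Sg", "Ncfsl"]
    print(accuracy(gold, pred))         # 0.75
    print(word_error_rate(gold, pred))  # 0.25
```

Unlike accuracy, this WER is still defined when the predicted sequence has a different length from the reference, as can happen when tokenization differs from the gold standard.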

This model was compared to:

The CLASSLA library
A variant of BERTić fine-tuned for MSD prediction using the SETimes.SR 2.0 corpus of newswire texts
SrBERTa, a model specially trained on Serbian legal texts

All large language models were fine-tuned for 15 epochs. CLASSLA and BERTić-SETimes.SR were tested directly on the entire COMtext.SR.legal.ijekavica corpus. BERTić-COMtext-SR-legal-MSD-ijekavica and SrBERTa were fine-tuned and evaluated on the COMtext.SR.legal.ijekavica corpus using 10-fold cross-validation.
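For readers unfamiliar with the protocol, a 10-fold split can be sketched as follows. This is only an illustration: the unit being split (sentences here), the shuffling, the seed, and the helper name `k_fold_indices` are assumptions, and the actual folds used in the experiments are defined by the repository code.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_items: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists so each item is held out exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random fold assignment
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        )
        yield train, test


if __name__ == "__main__":
    # Hypothetical corpus of 25 sentences: fine-tune on `train`, evaluate on `test`.
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

Under this scheme every sentence is scored by a model that did not see it during fine-tuning, and the per-fold predictions together cover the whole corpus.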

The code and data needed to run these experiments are available in the COMtext.SR GitHub repository.

Results

Model	MSD ACC	MSD WER	Lemma ACC	Lemma WER
CLASSLA-SR (gold tokens)	0.9150	0.0850	0.9036	0.0964
CLASSLA-SR (CLASSLA tokenizer)	/	0.0977	/	0.1135
CLASSLA-HR (gold tokens)	0.9062	0.0938	0.9353	0.0647
CLASSLA-HR (CLASSLA tokenizer)	/	0.1076	/	0.0827
BERTić-SETimes.SR (gold tokens)	0.9234	0.0766	0.9412	0.0588
BERTić-SETimes.SR (CLASSLA tokenizer)	/	0.0883	/	0.0780
BERTić-COMtext-SR-legal-MSD-ijekavica (gold tokens)	0.9674	0.0326	0.9429	0.0571
BERTić-COMtext-SR-legal-MSD-ijekavica (CLASSLA tokenizer)	/	0.0447	/	0.0763
SrBERTa (gold tokens)	0.9300	0.0700	0.9187	0.0813
SrBERTa (CLASSLA tokenizer)	/	0.0840	/	0.1024