## Description

The idea of lexicon-enhanced lemmatization is to use the output of an external resource, such as a rule-based analyzer (a `lexicon`; in this case, the Vabamorf morphological analyzer), as an additional input that improves the results of a neural lemmatization model. This additional input is a concatenation of one or more lemma candidates provided by Vabamorf.

The lexicon-enhanced lemmatizer is a modification of an older version of the Stanza lemmatizer, a neural model that predicts a lemma from the surface form together with morphological features and the part of speech. In this demo, the morphological features and the part of speech are provided by a more recent version of Stanza. In principle, native Vabamorf features can be used as well, although the results are slightly worse. The original Stanza model is modified by adding a distinct encoder that processes the Vabamorf input. See the scheme below.

Scheme
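
The additional lexicon input described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_lexicon_input` and `make_example`, the space separator, the removal of duplicate candidates, and the dictionary layout are all hypothetical and not taken from the actual implementation. The candidates are hard-coded here instead of being queried from Vabamorf.

```python
def build_lexicon_input(candidates, separator=" "):
    """Concatenate lemma candidates into one string for the extra encoder.

    Assumption: duplicates are dropped and the original order is kept;
    the real model may format its candidates differently.
    """
    seen = set()
    unique = []
    for candidate in candidates:
        if candidate and candidate not in seen:
            seen.add(candidate)
            unique.append(candidate)
    return separator.join(unique)


def make_example(surface, upos, feats, candidates):
    """Bundle the inputs of the enhanced lemmatizer (hypothetical layout)."""
    return {
        "surface": surface,   # word form as it appears in the text
        "upos": upos,         # part of speech, e.g. from Stanza
        "feats": feats,       # morphological features, e.g. from Stanza
        "lexicon": build_lexicon_input(candidates),  # Vabamorf candidates
    }


# Illustrative candidates for an ambiguous word form, typed in by hand.
example = make_example("teed", "NOUN", "Case=Nom|Number=Plur",
                       ["tee", "tegema", "tee"])
print(example["lexicon"])
```

In this sketch the surface form and its tags go to the original encoder, while the `lexicon` string is what the additional encoder would read.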

Lemmatization models were trained on version 2.7 of the Estonian Dependency Treebank. Three variants of lemmatization are provided in the demo: regular lemmatization, lemmatization with compound separators, and lemmatization in UD format, which includes both compound separators and morphological derivation symbols. The compound separator (`_`) marks boundaries between the parts of a compound word. The morphological derivation symbol (`=`) signifies that a given word is produced by means of morphological derivation. Each lemmatization mode uses a separate model, trained on the corresponding data format (i.e. true lemmas and Vabamorf candidates with `_` and `=` removed or present).

See the results on the `dev` and `test` sets in the table below (models trained on Vabamorf features are not included in the demo).

| Model Name                         | Dev Score | Test Score |
|------------------------------------|-----------|------------|
| Identity Baseline                  | 51.62     | 51.12      |
| Identity Baseline Compound         | 48.80     | 48.12      |
| Identity Baseline Symbols          | 48.15     | 47.52      |
| Vabamorf Baseline                  | 97.15     | 97.15      |
| Vabamorf Baseline Compound         | 96.12     | 96.03      |
| Vabamorf Baseline Symbols          | 96.04     | 95.97      |
| Stanza Baseline                    | 96.98     | 97.16      |
| Stanza Baseline Compound           | 96.01     | 96.58      |
| Stanza Baseline Symbols            | 95.40     | 95.99      |
| Enhanced Vabamorf Feats            | 97.23     | 97.44      |
| Enhanced Vabamorf Feats Compound   | 97.05     | 97.17      |
| Enhanced Vabamorf Feats Symbols    | 96.98     | 97.23      |
| **Enhanced Stanza Feats**          | **98.12** | **98.14**  |
| **Enhanced Stanza Feats Compound** | **97.85** | **97.98**  |
| **Enhanced Stanza Feats Symbols**  | **97.84** | **98.01**  |
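
The three data formats differ only in which of the two symbols are kept, so the regular and compound-only variants can in principle be derived from a UD-style lemma by stripping symbols. The following is a minimal sketch of that idea, not the preprocessing used to train the models: the function names are hypothetical, the example lemmas are illustrative, and the guard that leaves symbol-only lemmas untouched is an assumption.

```python
COMPOUND_SEP = "_"     # boundary between the parts of a compound word
DERIVATION_SYM = "="   # marks morphological derivation


def _strip(lemma, symbols):
    """Remove the given symbols, but keep lemmas made only of symbols."""
    stripped = lemma
    for symbol in symbols:
        stripped = stripped.replace(symbol, "")
    # Assumption: a token such as "=" or "_" is its own lemma, so an
    # empty result means the original string should be kept as is.
    return stripped if stripped else lemma


def to_compound_format(ud_lemma):
    """Keep compound separators, drop derivation symbols."""
    return _strip(ud_lemma, [DERIVATION_SYM])


def to_regular_format(ud_lemma):
    """Drop both compound separators and derivation symbols."""
    return _strip(ud_lemma, [DERIVATION_SYM, COMPOUND_SEP])


for lemma in ["raamatu_kogu", "õpeta=ja", "="]:
    print(lemma, to_compound_format(lemma), to_regular_format(lemma))
```

The same functions could be applied to the Vabamorf candidates so that the gold lemmas and the lexicon input share one format within each mode.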