Update description and scores
- article.md +30 -18
- description.txt +4 -2
article.md
CHANGED
@@ -2,32 +2,44 @@
|
|
2 |
|
3 |
The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
|
4 |
a rule-based analyzer (a `lexicon`, the Vabamorf morphological analyzer in this particular case) as an additional input to
|
5 |
-
improve the results of a neural lemmatization model.
|
6 |
-
lemma candidates provided by Vabamorf.
|
7 |
|
8 |
The lexicon-enhanced lemmatizer itself is a modification of an older version of Stanza, which is a neural model that takes
|
9 |
morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
|
10 |
-
morphological features and the part of speech are provided by a more recent version of Stanza, although
|
11 |
-
to use native Vabamorf features as well (the results, however, are going to be slightly worse).
|
12 |
-
|
13 |
|
14 |
<p align="center">
|
15 |
<img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
|
16 |
</p>
|
17 |
|
18 |
-
|
19 |
|
20 |
-
|
21 |
-
|
22 |
-
|
23 |
-
|
24 |
-
vabamorf output. See the results on `dev` set in the table below (models trained on vabamorf features are
|
25 |
-
not included in the demo).
|
26 |
|
27 |
-
|
28 |
-
|
29 |
-
|
30 |
-
| Stanza features and special symbols | 97.28 |
|
31 |
-
| Vabamorf features | 97.32 |
|
32 |
-
| Vabamorf features and special symbols | 96.34 |
|
33 |
|
|
2 |
|
3 |
The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
|
4 |
a rule-based analyzer (a `lexicon`, the Vabamorf morphological analyzer in this particular case) as an additional input to
|
5 |
+
improve the results of a neural lemmatization model. This additional input is a concatenation of one or more
|
6 |
+
lemma candidates provided by Vabamorf. See the scheme below.
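As a rough illustration of that additional input, the sketch below joins a word's lemma candidates into one string that a second encoder could consume. The candidate list is hard-coded rather than produced by Vabamorf, and the function name, the separator, and the deduplication step are assumptions made for this example, not details of the actual implementation.

```python
from typing import Iterable

# Hypothetical separator placed between candidates; the real model may
# join them differently.
CANDIDATE_SEPARATOR = " "


def build_lexicon_input(candidates: Iterable[str]) -> str:
    """Concatenate lemma candidates into a single lexicon input string.

    Duplicates are dropped while the original order is preserved.
    """
    unique = list(dict.fromkeys(candidates))
    return CANDIDATE_SEPARATOR.join(unique)


# Hard-coded candidates standing in for analyzer output for an ambiguous form.
lexicon_input = build_lexicon_input(["tee", "tegema", "tee"])
print(lexicon_input)
```

An ambiguous surface form yields several candidates, and the neural model is then left to choose among them (or to ignore them) using the context-dependent morphological features.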
|
7 |
|
8 |
The lexicon-enhanced lemmatizer itself is a modification of an older version of Stanza, which is a neural model that takes
|
9 |
morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
|
10 |
+
morphological features and the part of speech are provided by a more recent version of Stanza, although in principle
|
11 |
+
it's possible to use native Vabamorf features as well (the results, however, are roughly 0.7 to 0.9 points lower; see the table below).
|
12 |
+
The original Stanza model is modified: a distinct encoder is added to process Vabamorf input. See the scheme below.
|
13 |
|
14 |
<p align="center">
|
15 |
<img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
|
16 |
</p>
|
17 |
|
18 |
+
Lemmatization models were trained on version 2.7 of the Estonian Dependency Treebank.
|
19 |
|
20 |
+
Three variants of lemmatization are provided in the demo: regular lemmatization, lemmatization with
|
21 |
+
compound separators, and lemmatization in UD format, which includes both compound separators and morphological
|
22 |
+
derivation symbols. The compound separator (`_`) marks boundaries between the parts of a compound word. The morphological
|
23 |
+
derivation symbol (`=`) signifies that a given word is produced by means of morphological derivation.
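The relationship between the three output formats can be sketched as plain string operations on a fully annotated lemma. This is a minimal illustration only: the mode names and the `format_lemma` helper are hypothetical, and the example lemmas are chosen for illustration rather than taken from the treebank.

```python
COMPOUND_SEPARATOR = "_"
DERIVATION_SYMBOL = "="


def format_lemma(annotated: str, mode: str) -> str:
    """Convert a lemma carrying both symbols into one of three formats.

    Mode names are hypothetical labels for this sketch:
      "ud"       keeps compound separators and derivation symbols,
      "compound" keeps only compound separators,
      "regular"  removes both kinds of symbols.
    """
    if mode == "ud":
        return annotated
    if mode == "compound":
        return annotated.replace(DERIVATION_SYMBOL, "")
    if mode == "regular":
        without_derivation = annotated.replace(DERIVATION_SYMBOL, "")
        return without_derivation.replace(COMPOUND_SEPARATOR, "")
    raise ValueError(f"unknown mode: {mode!r}")


# Illustrative lemma with one compound boundary and one derivation boundary.
annotated_lemma = "kooli_õpeta=ja"
for mode in ("regular", "compound", "ud"):
    print(mode, format_lemma(annotated_lemma, mode))
```

A real conversion would also need to leave tokens that are literally `_` or `=` untouched, which this sketch does not handle.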
|
24 |
|
25 |
+
Each lemmatization mode uses a separate model, trained on the corresponding data format (i.e. true lemmas and Vabamorf
|
26 |
+
candidates with `_` and `=` removed or present). See the results on the `dev` and `test` sets in the table below (models trained on
|
27 |
+
Vabamorf features are not included in the demo).
|
28 |
|
29 |
+
| Model Name                         | Dev Score | Test Score |
|------------------------------------|-----------|------------|
| Identity Baseline                  | 51.62     | 51.12      |
| Identity Baseline Compound         | 48.80     | 48.12      |
| Identity Baseline Symbols          | 48.15     | 47.52      |
| Vabamorf Baseline                  | 97.15     | 97.15      |
| Vabamorf Baseline Compound         | 96.12     | 96.03      |
| Vabamorf Baseline Symbols          | 96.04     | 95.97      |
| Stanza Baseline                    | 96.98     | 97.16      |
| Stanza Baseline Compound           | 96.01     | 96.58      |
| Stanza Baseline Symbols            | 95.40     | 95.99      |
| Enhanced Vabamorf Feats            | 97.23     | 97.44      |
| Enhanced Vabamorf Feats Compound   | 97.05     | 97.17      |
| Enhanced Vabamorf Feats Symbols    | 96.98     | 97.23      |
| **Enhanced Stanza Feats**          | **98.12** | **98.14**  |
| **Enhanced Stanza Feats Compound** | **97.85** | **97.98**  |
| **Enhanced Stanza Feats Symbols**  | **97.84** | **98.01**  |
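To make the baseline rows easier to read, here is a small sketch of how such a score could be computed. It assumes the scores are token-level accuracy in percent and that the identity baseline simply returns the surface form as the lemma; neither assumption is confirmed by the article, and the four-token sample is made up for illustration.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of positions where prediction equals gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Made-up sample: surface forms and their gold lemmas.
tokens = ["Ta", "luges", "raamatut", "."]
gold_lemmas = ["tema", "lugema", "raamat", "."]

# Assumed identity baseline: predict each surface form unchanged.
identity_score = accuracy(tokens, gold_lemmas)
print(f"{identity_score:.2f}")
```

Under these assumptions, only tokens whose surface form already equals the lemma count as correct, which is why an identity baseline scores far below the analyzer-based and neural models for a morphologically rich language.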
|
description.txt
CHANGED
@@ -1,3 +1,5 @@
|
|
1 |
The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language
|
2 |
-
developed by TartuNLP research group.
|
3 |
-
|
1 |
The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language
|
2 |
+
developed by the TartuNLP research group. Three distinct lemmatization modes are offered: base lemmatization with no
|
3 |
+
additional symbols, lemmatization with compound separators, and lemmatization in Estonian Universal Dependencies
|
4 |
+
Treebank format: with compound separators and morphological derivation symbols. Note that each mode uses a separate
|
5 |
+
pre-trained model, so results may vary between modes. For more details, see the Description below.
|