adorkin committed on
Commit
4442abc
•
1 Parent(s): 588eb52

Update description and scores

Files changed (2)
  1. article.md +30 -18
  2. description.txt +4 -2
article.md CHANGED
@@ -2,32 +2,44 @@
2
 
3
  The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
4
  a rule-based analyzer (a `lexicon` - Vabamorf morphological analyzer in this particular case) as an additional input to
5
- improve the results of a neural lemmatization model. Said additional input is a concatenation of one or more
6
- lemma candidates provided by Vabamorf. A second encoder is used to process this input. See the scheme below.
7
 
8
  The lexicon-enhanced lemmatizer itself is a modification on an older version of Stanza, which is a neural model that takes
9
  morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
10
- morphological features and the part of speech are provided by a more recent version of Stanza, although it's possible
11
- to use native Vabamorf features as well (the results, however, are going to be slightly worse). Additional lexicon
12
- input is processed by a separate encoder.
13
 
14
  <p align="center">
15
  <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
16
  </p>
17
 
18
- The models were trained on version 2.7 of the Estonian Dependency Treebank.
19
 
20
- Two variants of lemmatization are provided in the demo: regular lemmatization and lemmatization with
21
- special symbols. Special symbols are `=` and `_`, denoting morphological derivation and separating parts of
22
- compound words respectively. The latter was trained on the original data with Vabamorf set to output
23
- these special symbols, while the former was trained with `=` and `_` removed from the data and
24
- Vabamorf output. See the results on the `dev` set in the table below (models trained on Vabamorf features are
25
- not included in the demo).
26
 
27
- | Model | Token-wise accuracy |
28
- |---------------------------------------|---------------------|
29
- | Stanza features | 98.13 |
30
- | Stanza features and special symbols | 97.28 |
31
- | Vabamorf features | 97.32 |
32
- | Vabamorf features and special symbols | 96.34 |
33
 
2
 
3
  The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
4
  a rule-based analyzer (a `lexicon` - Vabamorf morphological analyzer in this particular case) as an additional input to
5
+ improve the results of a neural lemmatization model. This additional input is a concatenation of one or more
6
+ lemma candidates provided by Vabamorf. See the scheme below.
7
 
8
  The lexicon-enhanced lemmatizer itself is a modification on an older version of Stanza, which is a neural model that takes
9
  morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
10
+ morphological features and the part of speech are provided by a more recent version of Stanza, although in principle
11
+ it's possible to use native Vabamorf features as well (the results, however, are going to be slightly worse).
12
+ The original Stanza model is modified: a distinct encoder is added to process Vabamorf input. See the scheme below.
13
 
14
  <p align="center">
15
  <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
16
  </p>
17
 
18
+ Lemmatization models were trained on version 2.7 of the Estonian Dependency Treebank.
19
 
20
+ Three variants of lemmatization are provided in the demo: regular lemmatization, lemmatization with
21
+ compound separators, and lemmatization in UD format, which includes both compound separators and morphological
22
+ derivation symbols. The compound separator (`_`) marks boundaries between the parts of a compound word. The morphological
23
+ derivation symbol (`=`) signifies that a given word is produced by means of morphological derivation.
 
 
24
 
25
+ Each lemmatization mode uses a separate model, trained on the corresponding data format (i.e. true lemmas and Vabamorf
26
+ candidates with `_` and `=` removed or present). See the results on the `dev` and `test` sets in the table below (models trained on
27
+ Vabamorf features are not included in the demo).
 
 
 
28
 
29
+ | Model Name | Dev Score | Test Score |
30
+ |------------------------------------|-----------|------------|
31
+ | Identity Baseline | 51.62 | 51.12 |
32
+ | Identity Baseline Compound | 48.80 | 48.12 |
33
+ | Identity Baseline Symbols | 48.15 | 47.52 |
34
+ | Vabamorf Baseline | 97.15 | 97.15 |
35
+ | Vabamorf Baseline Compound | 96.12 | 96.03 |
36
+ | Vabamorf Baseline Symbols | 96.04 | 95.97 |
37
+ | Stanza Baseline | 96.98 | 97.16 |
38
+ | Stanza Baseline Compound | 96.01 | 96.58 |
39
+ | Stanza Baseline Symbols | 95.40 | 95.99 |
40
+ | Enhanced Vabamorf Feats | 97.23 | 97.44 |
41
+ | Enhanced Vabamorf Feats Compound | 97.05 | 97.17 |
42
+ | Enhanced Vabamorf Feats Symbols | 96.98 | 97.23 |
43
+ | **Enhanced Stanza Feats** | **98.12** | **98.14** |
44
+ | **Enhanced Stanza Feats Compound** | **97.85** | **97.98** |
45
+ | **Enhanced Stanza Feats Symbols** | **97.84** | **98.01** |
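
As a reading aid for the table above, here is a minimal sketch of how such scores could be computed. It assumes the scores are token-wise accuracy percentages (the previous version of the table labeled its column that way) and that the identity baseline simply returns each surface form as its own lemma. The function names `token_accuracy` and `identity_baseline` and the toy word list are hypothetical and not taken from the repository.

```python
from typing import List, Sequence


def token_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def identity_baseline(forms: Sequence[str]) -> List[str]:
    """Assumed baseline: return every surface form unchanged as its lemma."""
    return list(forms)


# Toy data made up for illustration, not drawn from the treebank.
forms = ["kass", "kassid", "maja", "majas"]
gold = ["kass", "kass", "maja", "maja"]

score = token_accuracy(identity_baseline(forms), gold)
print(f"{score:.2f}")  # prints 50.00
```

Under these assumptions the identity baseline is correct only for tokens that already appear in their dictionary form, which fits the roughly 50% scores reported for it.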
description.txt CHANGED
@@ -1,3 +1,5 @@
1
  The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language
2
- developed by the TartuNLP research group. Options to lemmatize with or without special symbols are available (do note that
3
- this functionality is implemented by means of using two separate models). For more details see Description below.
 
 
 
1
  The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for the Estonian language
2
+ developed by the TartuNLP research group. Three distinct lemmatization modes are offered: base lemmatization with no
3
+ additional symbols, lemmatization with compound separators, and lemmatization in Estonian Universal Dependencies
4
+ Treebank format: with compound separators and morphological derivation symbols. Note that each mode uses a separate
5
+ pre-trained model, thus results may vary. For more details see Description below.
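
The three modes described above differ only in which special symbols remain in the lemma. The sketch below shows one way the three target formats could be derived from a lemma in UD format by stripping symbols. It is an assumption about the preprocessing, not the project's actual code: the mode labels `"base"`, `"compound"`, and `"ud"`, the function `to_mode`, and the example strings are all hypothetical.

```python
def to_mode(lemma: str, mode: str) -> str:
    """Convert a UD-format lemma (with `_` and `=`) to the requested format."""
    if mode == "ud":
        # Keep both compound separators and derivation symbols.
        converted = lemma
    elif mode == "compound":
        # Keep compound separators, drop derivation symbols.
        converted = lemma.replace("=", "")
    elif mode == "base":
        # Drop both kinds of symbols.
        converted = lemma.replace("=", "").replace("_", "")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumed guard: a token made up only of symbols (e.g. a literal "=")
    # is left untouched instead of becoming an empty string.
    return converted if converted else lemma


# Illustrative strings only, not quoted from the treebank.
example = "raamatu_luge=ja"
for mode in ("base", "compound", "ud"):
    print(mode, to_mode(example, mode))
```

Applying the same conversion to both the gold lemmas and the Vabamorf candidates would keep the two inputs in the same format for each model, which is what the description implies.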