Spaces: tartuNLP / LexiconEnhancedLemmatization

adorkin committed on Jun 8, 2022

Commit 4442abc • 1 Parent(s): 588eb52

Update description and scores

Files changed (2)

article.md +30 -18
description.txt +4 -2

article.md

@@ -2,32 +2,44 @@
 The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
 a rule-based analyzer (a `lexicon` — Vabamorf morphological analyzer in this particular case) as an additional input to
-improve the results of a neural lemmatization model. Said additional input is a concatenation of one or more
-lemma candidates provided by Vabamorf. A second encoder is used to process this input. See the scheme below.
 The lexicon-enhanced lemmatizer itself is a modification on an older version of Stanza, which is a neural model that takes
 morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
-morphological features and the part of speech are provided by a more recent version of Stanza, although it's possible
-to use native Vabamorf features as well (the results, however, are going to be slightly worse). Additional lexicon
-input is processed by a separate encoder.
 <p align="center">
     <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
 </p>
-The models were trained on version 2.7 of the Estonian Dependency Treebank.
-Two variants of lemmatization are provided in the demo: regular lemmatization and lemmatization with
-special symbols. Special symbols are `=` and `_`, denoting morphological derivation and separating parts of
-compound words respectively. The latter was trained on the original data with Vabamorf set to output
-these special symbols, while the latter was trained with `=` and `_` removed from the data and
-vabamorf output. See the results on `dev` set in the table below (models trained on vabamorf features are
-not included in the demo).
-| Model                                 | Token-wise accuracy |
-|---------------------------------------|---------------------|
-| Stanza features                       | 98.13               |
-| Stanza features and special symbols   | 97.28               |
-| Vabamorf features                     | 97.32               |
-| Vabamorf features and special symbols | 96.34               |

 The idea of lexicon-enhanced lemmatization is to utilize the output of an external resource such as
 a rule-based analyzer (a `lexicon` — Vabamorf morphological analyzer in this particular case) as an additional input to
+improve the results of a neural lemmatization model. This additional input is a concatenation of one or more
+lemma candidates provided by Vabamorf.
 The lexicon-enhanced lemmatizer itself is a modification on an older version of Stanza, which is a neural model that takes
 morphological features and parts of speech as input in addition to surface forms to predict a lemma. In this demo
+morphological features and the part of speech are provided by a more recent version of Stanza, although in principle
+it's possible to use native Vabamorf features as well (the results, however, are going to be slightly worse).
+The original Stanza model is modified: a distinct encoder is added to process Vabamorf input. See the scheme below.
 <p align="center">
     <img alt="Scheme" src="https://raw.githubusercontent.com/slowwavesleep/lexicon-enhanced-lemmatization/master/img/StanfordNLP_Lemmatizer-Overall_Modified.jpg" >
 </p>
+Lemmatization models were trained on version 2.7 of the Estonian Dependency Treebank.
+Three variants of lemmatization are provided in the demo: regular lemmatization, lemmatization with
+compound separators, and lemmatization in UD format, which includes both compound separators and morphological
+derivation symbols. The compound separator (`_`) marks the boundaries between the parts of a compound word. The morphological
+derivation symbol (`=`) signifies that a given word is produced by means of morphological derivation.
+Each lemmatization mode uses a separate model, trained on the corresponding data format (i.e. true lemmas and Vabamorf
+candidates with `_` and `=` removed or present). See the results on `dev` and `test` sets in the table below (models trained on
+Vabamorf features are not included in the demo).
+| Model Name                         | Dev Score | Test Score |
+|------------------------------------|-----------|------------|
+| Identity Baseline                  | 51.62     | 51.12      |
+| Identity Baseline Compound         | 48.80     | 48.12      |
+| Identity Baseline Symbols          | 48.15     | 47.52      |
+| Vabamorf Baseline                  | 97.15     | 97.15      |
+| Vabamorf Baseline Compound         | 96.12     | 96.03      |
+| Vabamorf Baseline Symbols          | 96.04     | 95.97      |
+| Stanza Baseline                    | 96.98     | 97.16      |
+| Stanza Baseline Compound           | 96.01     | 96.58      |
+| Stanza Baseline Symbols            | 95.40     | 95.99      |
+| Enhanced Vabamorf Feats            | 97.23     | 97.44      |
+| Enhanced Vabamorf Feats Compound   | 97.05     | 97.17      |
+| Enhanced Vabamorf Feats Symbols    | 96.98     | 97.23      |
+| **Enhanced Stanza Feats**          | **98.12** | **98.14**  |
+| **Enhanced Stanza Feats Compound** | **97.85** | **97.98**  |
+| **Enhanced Stanza Feats Symbols**  | **97.84** | **98.01**  |
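
The three data formats and the identity baseline rows above can be illustrated with a small sketch. It is not taken from the repository: the function names `convert_lemma` and `token_accuracy`, the mode names, and the toy word forms and lemmas are all hypothetical, and it assumes that the non-UD formats are obtained by simply deleting `=` and/or `_` from UD-format lemmas and that the reported scores are token-wise accuracy, as in the previous version of the table.

```python
from typing import List, Sequence

COMPOUND_SEP = "_"    # marks boundaries between the parts of a compound word
DERIVATION_SYM = "="  # marks morphological derivation


def convert_lemma(ud_lemma: str, mode: str) -> str:
    """Convert a UD-format lemma to one of three formats (hypothetical mode names).

    "ud" keeps both symbols, "compound" keeps only `_`, "plain" removes both.
    """
    if mode == "ud":
        return ud_lemma
    if mode == "compound":
        stripped = ud_lemma.replace(DERIVATION_SYM, "")
    elif mode == "plain":
        stripped = ud_lemma.replace(DERIVATION_SYM, "").replace(COMPOUND_SEP, "")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # A token that is itself "=" or "_" would become empty; keep it unchanged.
    return stripped or ud_lemma


def token_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Toy data, for illustration only (not verified Vabamorf or treebank output).
forms: List[str] = ["raudteed", "õpetaja", "on"]
ud_lemmas: List[str] = ["raud_tee", "õpeta=ja", "olema"]

for mode in ("plain", "compound", "ud"):
    gold = [convert_lemma(lemma, mode) for lemma in ud_lemmas]
    # Identity baseline: predict the surface form itself as the lemma.
    print(mode, gold, round(token_accuracy(gold, forms), 2))
```

Under these assumptions, the identity baseline can only lose accuracy as symbols are added to the gold lemmas, because surface forms never contain `_` or `=`. This matches the ordering of the three Identity Baseline rows.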

description.txt

@@ -1,3 +1,5 @@
 The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for Estonian language
-developed by TartuNLP research group. Options to lemmatize with or without special symbols are available (do note that
-this functionality is implemented by means of using two separate models). For more details see Description below.

 The purpose of this demo is to demonstrate the results of Lexicon-Enhanced Neural Lemmatization for Estonian language
+developed by the TartuNLP research group. Three distinct lemmatization modes are offered: base lemmatization with no
+additional symbols, lemmatization with compound separators, and lemmatization in Estonian Universal Dependencies
+Treebank format: with compound separators and morphological derivation symbols. Note that each mode uses a separate
+pre-trained model, so results may vary between modes. For more details see Description below.