---
license: cc-by-4.0
library_name: span-marker
base_model: gwlms/teams-base-dewiki-v1-discriminator
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
pipeline_tag: token-classification
widget:
- text: "Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München ."
  example_title: "Wikipedia"
datasets:
- gwlms/germeval2014
language:
- de
model-index:
  - name: SpanMarker with GWLMS TEAMS on GermEval 2014 NER Dataset by Stefan Schweter (@stefan-it)
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          type: gwlms/germeval2014
          name: GermEval 2014
          split: test
          revision: f3647c56803ce67c08ee8d15f4611054c377b226
        metrics:
          - type: f1
            value: 0.8781
            name: F1
metrics:
  - f1
---

# SpanMarker for GermEval 2014 NER

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that
was fine-tuned on the [GermEval 2014 NER Dataset](https://sites.google.com/site/germeval2014ner/home).

The GermEval 2014 NER Shared Task builds on a dataset of German named entity annotations with the following
properties: The data was sampled from German Wikipedia and news corpora as a collection of citations. The dataset
covers over 31,000 sentences corresponding to over 590,000 tokens. The NER annotation uses the NoSta-D guidelines,
which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating
embeddings among NEs (nested entities) such as `[ORG FC Kickers [LOC Darmstadt]]`.

12 classes of named entities are annotated and must be recognized: the four main classes `PER`son, `LOC`ation,
`ORG`anisation, and `OTH`er, plus their subclasses, which are formed with two fine-grained labels: `-deriv` marks
derivations from NEs such as "englisch" ("English"), and `-part` marks compounds that include a NE as a subsequence,
such as "deutschlandweit" ("Germany-wide").
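
The count of 12 classes follows from combining the four main classes with three variants each (plain, `-deriv`, `-part`). A minimal sketch of that arithmetic is shown below; the exact label strings stored in the dataset files may be spelled differently, so the concatenated names here are an assumption made for illustration.

```python
from itertools import product

# Four main classes named in the task description.
MAIN_CLASSES = ["PER", "LOC", "ORG", "OTH"]

# Plain class plus the two fine-grained variants.
# Assumption: the suffix spelling follows the prose above, not the dataset files.
SUFFIXES = ["", "-deriv", "-part"]

labels = [main + suffix for main, suffix in product(MAIN_CLASSES, SUFFIXES)]

print(len(labels))
print(labels)
```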

# Fine-Tuning

We use the same hyperparameters as the
["German's Next Language Model"](https://aclanthology.org/2020.coling-main.598/) paper, with the
[GWLMS TEAMS](https://huggingface.co/gwlms/teams-base-dewiki-v1-discriminator) model as the backbone.

Evaluation is performed with SpanMarker's internal evaluation code, which uses `seqeval`.

We fine-tune 5 models and upload the one with the best F1-score on the development set. F1-scores on the development
set are given in brackets, followed by the F1-score on the test set:

| Model       | Run 1           | Run 2           | Run 3           | Run 4               | Run 5           | Avg.            |
| ----------- | --------------- | --------------- | --------------- | ------------------- | --------------- | --------------- |
| GWLMS TEAMS | (88.76) / 87.85 | (88.54) / 87.77 | (88.41) / 87.98 | (**88.86**) / 87.81 | (88.83) / 88.50 | (88.68) / 87.98 |

The uploaded model (Run 4, selected for its best development F1-score) achieves a final test F1-score of 87.81%.
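
The selection rule and the averages can be reproduced from the numbers in the table. The sketch below only re-derives them from the reported scores; the variable names are illustrative and are not taken from the training or evaluation scripts.

```python
from statistics import mean

# (development F1, test F1) per run, copied from the table above.
runs = {
    "Run 1": (88.76, 87.85),
    "Run 2": (88.54, 87.77),
    "Run 3": (88.41, 87.98),
    "Run 4": (88.86, 87.81),
    "Run 5": (88.83, 88.50),
}

# Model selection uses the development score only, never the test score.
best_run = max(runs, key=lambda name: runs[name][0])
best_dev, best_test = runs[best_run]

dev_avg = round(mean(dev for dev, _ in runs.values()), 2)
test_avg = round(mean(test for _, test in runs.values()), 2)

print(f"Selected: {best_run} (dev {best_dev}, test {best_test})")
print(f"Average: ({dev_avg}) / {test_avg}")
```

Selecting on the development set means the uploaded run is not necessarily the one with the highest test score.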

Scripts for [training](trainer.py) and [evaluation](evaluator.py) are also available.

# Usage

The fine-tuned model can be used as follows:

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("gwlms/span-marker-teams-germeval14")

# Run inference
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
```
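
The predictions can then be post-processed in plain Python. The sketch below assumes that each prediction is a dictionary with `span`, `label`, and `score` keys, as described in the SpanMarker documentation; please verify this against the installed version. The helper names and the example list are hypothetical placeholders, not real model output.

```python
from typing import Dict, List, Union

# Assumed shape of one prediction: {"span": str, "label": str, "score": float, ...}
Entity = Dict[str, Union[str, float, int]]


def filter_entities(entities: List[Entity], min_score: float = 0.5) -> List[Entity]:
    """Keep only predictions whose confidence is at least min_score."""
    return [entity for entity in entities if float(entity["score"]) >= min_score]


def format_entities(entities: List[Entity]) -> List[str]:
    """Render each prediction as 'LABEL: span (score)'."""
    return [
        f"{entity['label']}: {entity['span']} ({float(entity['score']):.2f})"
        for entity in entities
    ]


# Placeholder values for illustration only; these are NOT real model predictions.
example_entities: List[Entity] = [
    {"span": "Jürgen Schmidhuber", "label": "PER", "score": 0.9},
    {"span": "TU München", "label": "ORG", "score": 0.4},
]

for line in format_entities(filter_entities(example_entities, min_score=0.5)):
    print(line)
```

In practice, `example_entities` would be replaced by the `entities` list returned by `model.predict(...)`.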