
	---
	license: cc-by-4.0
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	pipeline_tag: token-classification
	widget:
	- text: "Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München ."
	  example_title: "Wikipedia"
	datasets:
	- gwlms/germeval2014
	language:
	- de
	model-index:
	- name: SpanMarker with GWLMS Token Dropping BERT on GermEval 2014 NER Dataset by Stefan Schweter (@stefan-it)
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      type: gwlms/germeval2014
	      name: GermEval 2014
	      split: test
	      revision: f3647c56803ce67c08ee8d15f4611054c377b226
	    metrics:
	    - type: f1
	      value: 0.8744
	      name: F1
	metrics:
	- f1
	---

	# SpanMarker for GermEval 2014 NER

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that
	was fine-tuned on the [GermEval 2014 NER Dataset](https://sites.google.com/site/germeval2014ner/home).

	The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following
	properties: The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset
	covers over 31,000 sentences corresponding to over 590,000 tokens. The NER annotation uses the NoSta-D guidelines,
	which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating
	embeddings among NEs such as `[ORG FC Kickers [LOC Darmstadt]]`.

	12 classes of Named Entities are annotated and must be recognized: the four main classes `PER`son, `LOC`ation,
	`ORG`anisation, and `OTH`er, plus their subclasses, which are formed with two fine-grained labels: `-deriv` marks
	derivations from NEs such as "englisch" (“English”), and `-part` marks compounds that include a NE as a subsequence,
	such as "deutschlandweit" (“Germany-wide”).
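The label inventory described above (four main classes, each with a plain, a derivation, and a compound-part variant) can be enumerated programmatically. The following is a minimal sketch; the concatenated spelling (for example `LOCderiv`) is an assumption about how the dataset writes these labels, so check the dataset's own label list before relying on it.

```python
# Enumerate the 12 GermEval 2014 entity classes: 4 main classes x 3 variants.
# NOTE: the exact label spelling (e.g. "LOCderiv") is an assumption.
MAIN_CLASSES = ["PER", "LOC", "ORG", "OTH"]
SUFFIXES = ["", "deriv", "part"]  # plain, derivation, compound part

LABELS = [main + suffix for main in MAIN_CLASSES for suffix in SUFFIXES]

print(len(LABELS))
print(LABELS)
```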

	# Fine-Tuning

	We use the same hyper-parameters as the
	["German's Next Language Model"](https://aclanthology.org/2020.coling-main.598/) paper, with the
	[GWLMS Token Dropping BERT](https://huggingface.co/gwlms/bert-base-token-dropping-dewiki-v1) model as the backbone.

	Evaluation is performed with SpanMarker's internal evaluation code, which uses `seqeval`.
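Entity-level F1, as reported by `seqeval`, counts a prediction as correct only when both the span boundaries and the label match a gold entity. The sketch below illustrates that idea in plain Python. It is not SpanMarker's or `seqeval`'s code; the `span_f1` helper and the `(start, end, label)` tuple format are hypothetical, and the spans are made up.

```python
def span_f1(gold, predicted):
    """Micro-averaged F1 over exact-match entity spans.

    Both arguments are lists (one entry per sentence) of sets of
    (start, end, label) tuples.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in predicted)
    if true_positives == 0:
        return 0.0  # also avoids division by zero on empty inputs
    precision = true_positives / n_pred
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up spans (not GermEval data):
# the second prediction has the right boundaries but the wrong label.
gold = [{(0, 2, "PER"), (10, 12, "ORG")}]
predicted = [{(0, 2, "PER"), (10, 12, "LOC")}]
print(span_f1(gold, predicted))
```

A span with correct boundaries but a wrong label counts as both a false positive and a false negative, which is why exact-match scoring is stricter than token-level accuracy.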

	We fine-tune 5 models and upload the one with the best F1 score on the development set. Each cell shows the
	development F1 in brackets, followed by the test F1:

	| Model                     | Run 1           | Run 2           | Run 3           | Run 4           | Run 5           | Avg.            |
	| ------------------------- | --------------- | --------------- | --------------- | --------------- | --------------- | --------------- |
	| GWLMS Token Dropping BERT | (87.85) / 87.28 | (88.09) / 87.44 | (87.59) / 87.26 | (87.71) / 87.43 | (87.83) / 87.24 | (87.81) / 87.33 |

	The uploaded model (Run 2, with the best development F1 of 88.09%) achieves a final test F1 score of 87.44%.
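The model selection and the averages in the table can be reproduced with a few lines of Python. This is a small sketch that uses only the numbers reported in the table; the variable names are illustrative.

```python
from statistics import mean

# (development F1, test F1) per run, copied from the results table
runs = {
    "Run 1": (87.85, 87.28),
    "Run 2": (88.09, 87.44),
    "Run 3": (87.59, 87.26),
    "Run 4": (87.71, 87.43),
    "Run 5": (87.83, 87.24),
}

# Select the run with the highest development F1, then read off its test F1
best_run = max(runs, key=lambda name: runs[name][0])
best_test_f1 = runs[best_run][1]

# Averages over all five runs, rounded to two decimals as in the table
dev_avg = round(mean(dev for dev, _ in runs.values()), 2)
test_avg = round(mean(test for _, test in runs.values()), 2)

print(best_run, best_test_f1, dev_avg, test_avg)
```

Selecting on the development set rather than the test set keeps the reported test score free of selection bias.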

	Scripts for [training](trainer.py) and [evaluation](evaluator.py) are also available.

	# Usage

	The fine-tuned model can be used as follows:

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("gwlms/span-marker-token-dropping-bert-germeval14")

	# Run inference
	entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
	```
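The list returned by `model.predict` can then be post-processed. The sketch below assumes each prediction is a dictionary with `span`, `label`, and `score` keys, which is how the SpanMarker documentation presents its output. The `filter_entities` helper and the sample values are hypothetical placeholders, not real model output.

```python
def filter_entities(entities, min_score=0.5):
    """Keep (span, label) pairs whose confidence is at least min_score."""
    return [
        (entity["span"], entity["label"])
        for entity in entities
        if entity["score"] >= min_score
    ]


# Placeholder predictions for illustration only (NOT actual model output)
sample = [
    {"span": "Jürgen Schmidhuber", "label": "PER", "score": 0.99},
    {"span": "TU München", "label": "ORG", "score": 0.30},
]

print(filter_entities(sample))
```

In practice, pass the `entities` list from the usage example in place of `sample` and choose `min_score` to suit the application.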