---
language: en
license: cc-by-sa-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
base_model: FacebookAI/xlm-roberta-base
datasets:
- norne
metrics:
- precision
- recall
- f1
widget:
- text: Av Boethius hand förelåg De institutione arithmetica (" Om aritmetikens grunder ") i två böcker.
- text: Hans hovedmotstander var lederen for opposisjonspartiet Movement for Democratic Change, Morgan Tsvangirai.
- text: Roddarn blir proffs efter OS.
- text: Han blev dog diskvalificeret for at have trådt på banelinjen, og bronzemedaljen gik i stedet til landsmanden Walter Dix.
- text: Stillingen var på dette tidspunkt 1-1, men Almunias redning banede vejen for et sejrsmål af danske Nicklas Bendtner.
pipeline_tag: token-classification
model-index:
- name: SpanMarker with FacebookAI/xlm-roberta-base on norne
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: norne
      type: norne
      split: test
    metrics:
    - type: f1
      value: 0.9181825779313034
      name: F1
    - type: precision
      value: 0.9217689611454993
      name: Precision
    - type: recall
      value: 0.9146239940801036
      name: Recall
---

# SpanMarker with xlm-roberta-base

Trained on various Nordic-language datasets; see https://huggingface.co/datasets/tollefj/nordic-ner.

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [norne](https://huggingface.co/datasets/norne) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the underlying encoder.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [norne](https://huggingface.co/datasets/norne)
- **Language:** en
- **License:** cc-by-sa-4.0

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label | Examples                                                     |
|:------|:-------------------------------------------------------------|
| LOC   | "Gran", "Leicestershire", "Den tyske antarktisekspedisjonen" |
| MISC  | "socialdemokratiske", "nationalist", "Living Legend"         |
| ORG   | "Stabæk", "Samlaget", "Marillion"                            |
| PER   | "Fish", "Dmitrij Medvedev", "Guru Ardjan Dev"                |

## Evaluation

### Metrics

| Label   | Precision | Recall | F1     |
|:--------|:----------|:-------|:-------|
| **all** | 0.9218    | 0.9146 | 0.9182 |
| LOC     | 0.9284    | 0.9433 | 0.9358 |
| MISC    | 0.6515    | 0.6047 | 0.6272 |
| ORG     | 0.8951    | 0.8547 | 0.8745 |
| PER     | 0.9513    | 0.9526 | 0.9520 |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference
entities = model.predict("Roddarn blir proffs efter OS.")
```

### Downstream Use

You can finetune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
```

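Whether the model is used directly or after finetuning, the list returned by `model.predict` usually needs a little post-processing. The sketch below groups predicted spans by label and drops low-confidence ones. It assumes each prediction is a dictionary with `span`, `label` and `score` keys, which should be checked against the installed SpanMarker version. `group_entities`, `min_score` and the sample predictions are hypothetical and only there for illustration; the scores are not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group predicted spans by label, dropping low-confidence predictions.

    Assumes each item is a dict with "span", "label" and "score" keys.
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the output of model.predict(...); scores are made up.
example_entities = [
    {"span": "Morgan Tsvangirai", "label": "PER", "score": 0.98},
    {"span": "Movement for Democratic Change", "label": "ORG", "score": 0.95},
    {"span": "opposisjonspartiet", "label": "MISC", "score": 0.31},
]

print(group_entities(example_entities))
# {'PER': ['Morgan Tsvangirai'], 'ORG': ['Movement for Democratic Change']}
```

With a loaded model, the same helper can be called as `group_entities(model.predict(text))`. The 0.5 threshold is an arbitrary default, not a value tuned for this model.
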
## Training Details

### Training Set Metrics

| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 12.8175 | 331 |
| Entities per sentence | 0   | 1.0055  | 54  |

### Training Hyperparameters

- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3

### Training Results

| Epoch  | Step  | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.5711 | 3000  | 0.0146          | 0.8650               | 0.8725            | 0.8687        | 0.9722              |
| 1.1422 | 6000  | 0.0123          | 0.8994               | 0.8920            | 0.8957        | 0.9778              |
| 1.7133 | 9000  | 0.0101          | 0.9184               | 0.8984            | 0.9083        | 0.9805              |
| 2.2844 | 12000 | 0.0101          | 0.9198               | 0.9110            | 0.9154        | 0.9818              |
| 2.8555 | 15000 | 0.0089          | 0.9245               | 0.9150            | 0.9197        | 0.9830              |

### Framework Versions

- Python: 3.12.2
- SpanMarker: 1.5.0
- Transformers: 4.38.2
- PyTorch: 2.2.1+cu121
- Datasets: 2.18.0
- Tokenizers: 0.15.2

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```