---
language:
- en
license: apache-2.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
base_model: roberta-large
datasets:
- Jerado/enron_intangibles_ner
metrics:
- precision
- recall
- f1
widget:
- text: Negotiated rates in these types of deals (basis for new builds) have been allowed to stand for the life of the contracts, in the case of Kern River and Mojave.
- text: It seems that there is a single significant policy concern for the ASIC policy committee.
- text: 'The appropriate price is in Enpower, but the revenue has never appeared (Deal #590753).'
- text: FYI, to me, a prepayment for a service contract would generally be amortized over the life of the contract.
- text: 'From: d..steffes @ enron.com To: john.shelk @ enron.com, l..nicolay @ enron.com, richard.shapiro @ enron.com, sarah.novosel @ enron.com Subject: Southern Co.''s Testimony The first order of business is getting the cost / benefit analysis done.'
pipeline_tag: token-classification
model-index:
- name: SpanMarker with roberta-large on Jerado/enron_intangibles_ner
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: Jerado/enron_intangibles_ner
      split: test
    metrics:
    - type: f1
      value: 0.4390243902439024
      name: F1
    - type: precision
      value: 0.42857142857142855
      name: Precision
    - type: recall
      value: 0.45
      name: Recall
---

# SpanMarker with roberta-large on Jerado/enron_intangibles_ner

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [Jerado/enron_intangibles_ner](https://huggingface.co/datasets/Jerado/enron_intangibles_ner) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/roberta-large) as the underlying encoder.
## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [roberta-large](https://huggingface.co/roberta-large)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 6 words
- **Training Dataset:** [Jerado/enron_intangibles_ner](https://huggingface.co/datasets/Jerado/enron_intangibles_ner)
- **Language:** en
- **License:** apache-2.0

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels

| Label      | Examples                                    |
|:-----------|:--------------------------------------------|
| Intangible | "deal", "sample EES deal", "Enpower system" |

## Evaluation

### Metrics

| Label      | Precision | Recall | F1     |
|:-----------|:----------|:-------|:-------|
| **all**    | 0.4286    | 0.4500 | 0.4390 |
| Intangible | 0.4286    | 0.4500 | 0.4390 |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference
entities = model.predict("It seems that there is a single significant policy concern for the ASIC policy committee.")
```

### Downstream Use

You can finetune this model on your own dataset.
```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
```
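Whether the model is used as-is or after fine-tuning, the test-set precision of roughly 0.43 suggests that a fair share of predicted spans may be spurious, so it can help to post-filter predictions by confidence. The sketch below assumes that `model.predict` returns, for a single string, a list of dicts with at least `span`, `label`, and `score` keys. The helper `filter_entities`, its `min_score` parameter, and the sample predictions are hypothetical illustrations, not part of SpanMarker and not actual outputs of this model.

```python
from typing import Any


def filter_entities(
    entities: list[dict[str, Any]], min_score: float = 0.5
) -> list[dict[str, Any]]:
    """Keep predictions scoring at least `min_score`, highest score first."""
    kept = [entity for entity in entities if entity["score"] >= min_score]
    return sorted(kept, key=lambda entity: entity["score"], reverse=True)


# Hypothetical, abridged predictions shaped like the assumed output of
# `model.predict(...)`; the scores are made up for illustration.
sample_entities = [
    {"span": "policy concern", "label": "Intangible", "score": 0.34},
    {"span": "ASIC policy committee", "label": "Intangible", "score": 0.91},
]

confident = filter_entities(sample_entities, min_score=0.5)
for entity in confident:
    print(f'{entity["label"]}: {entity["span"]} ({entity["score"]:.2f})')
```

A stricter `min_score` trades recall for precision; a suitable threshold would need to be chosen on held-out data rather than taken from this sketch.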
## Training Details

### Training Set Metrics

| Training set          | Min | Mean    | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 19.8706 | 216 |
| Entities per sentence | 0   | 0.1865  | 6   |

### Training Hyperparameters

- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 11
- mixed_precision_training: Native AMP

### Training Results

| Epoch   | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:-------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 3.3557  | 500  | 0.0075          | 0.4444               | 0.1667            | 0.2424        | 0.9753              |
| 6.7114  | 1000 | 0.0084          | 0.5714               | 0.3333            | 0.4211        | 0.9793              |
| 10.0671 | 1500 | 0.0098          | 0.6111               | 0.4583            | 0.5238        | 0.9815              |

### Framework Versions

- Python: 3.10.12
- SpanMarker: 1.5.0
- Transformers: 4.40.0
- PyTorch: 2.2.1+cu121
- Datasets: 2.19.0
- Tokenizers: 0.19.1

## Citation

### BibTeX

```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```