tomaarsen/span-marker-bert-base-uncased-bionlp

@@ -1,4 +1,6 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +8,102 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
-widget: []
 pipeline_tag: token-classification
 ---
-# SpanMarker
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 ## Model Details
 ### Model Description
 - **Model Type:** SpanMarker
-<!-- - **Encoder:** [Unknown](https://huggingface.co/models/unknown) -->
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
 ### Model Sources
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 ## Uses
 ### Direct Use
@@ -43,9 +112,9 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Run inference
-entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
 ```
 ### Downstream Use
@@ -57,7 +126,7 @@ You can finetune this model on your own dataset.
 from datasets import load_dataset
 from span_marker import SpanMarkerModel, Trainer
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Specify a Dataset with "tokens" and "ner_tags" columns
 dataset = load_dataset("conll2003") # For example CoNLL2003
@@ -69,12 +138,49 @@ trainer = Trainer(
     eval_dataset=dataset["validation"],
 )
 trainer.train()
-trainer.save_model("span_marker_model_id-finetuned")
 ```
 </details>
 ## Training Details
 ### Framework Versions
 - Python: 3.9.16

 ---
+language: en
+license: other
 library_name: span-marker
 tags:
 - span-marker
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+datasets:
+- tner/bionlp2004
 metrics:
 - precision
 - recall
 - f1
+widget:
+- text: Coexpression of HMG I/Y and Oct-2 in cell lines lacking Oct-2 results in high
+    levels of HLA-DRA gene expression , and in vitro DNA-binding studies reveal that
+    HMG I/Y stimulates Oct-2A binding to the HLA-DRA promoter .
+- text: In erythroid cells most of the transcription activity was contained in a 150
+    bp promoter fragment with binding sites for transcription factors AP2 , Sp1 and
+    the erythroid-specific GATA-1 .
+- text: 'Synergy between signal transduction pathways is obligatory for expression
+    of c-fos in B and T cell lines : implication for c-fos control via surface immunoglobulin
+    and T cell antigen receptors .'
+- text: CIITA mRNA is normally inducible by IFN-gamma in class II non-inducible ,
+    RB-defective lines , and in one line , re-expression of RB has no effect on CIITA
+    mRNA induction levels .
+- text: As we reported previously , MNDA mRNA level in adherent monocytes is elevated
+    by IFN-alpha ; in this study , we further assessed MNDA expression in in vitro
+    monocyte-derived macrophages .
 pipeline_tag: token-classification
+co2_eq_emissions:
+  emissions: 45.104
+  source: codecarbon
+  training_type: fine-tuning
+  on_cloud: false
+  gpu_model: 1 x NVIDIA GeForce RTX 3090
+  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
+  ram_total_size: 31.777088165283203
+  hours_used: 0.296
+model-index:
+- name: SpanMarker with bert-base-uncased on BioNLP2004
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: BioNLP2004
+      type: tner/bionlp2004
+      split: test
+    metrics:
+    - type: f1
+      value: 0.7620637836032726
+      name: F1
+    - type: precision
+      value: 0.7289958470876371
+      name: Precision
+    - type: recall
+      value: 0.7982742537313433
+      name: Recall
 ---
+# SpanMarker with bert-base-uncased on BioNLP2004
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model for Named Entity Recognition, trained on the [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004) dataset with [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.
 ## Model Details
 ### Model Description
 - **Model Type:** SpanMarker
+- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
+- **Training Dataset:** [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004)
+- **Language:** en
+- **License:** other
 ### Model Sources
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
+### Model Labels
+| Label     | Examples                                                                                         |
+|:----------|:-------------------------------------------------------------------------------------------------|
+| DNA       | "immunoglobulin heavy-chain enhancer", "enhancer", "immunoglobulin heavy-chain ( IgH ) enhancer" |
+| RNA       | "GATA-1 mRNA", "c-myb mRNA", "antisense myb RNA"                                                 |
+| cell_line | "monocytic U937 cells", "TNF-treated HUVECs", "HUVECs"                                           |
+| cell_type | "B cells", "non-B cells", "human red blood cells"                                                |
+| protein   | "ICAM-1", "VCAM-1", "NADPH oxidase"                                                              |
+## Evaluation
+### Metrics
+| Label     | Precision | Recall | F1     |
+|:----------|:----------|:-------|:-------|
+| **all**   | 0.7290    | 0.7983 | 0.7621 |
+| DNA       | 0.7174    | 0.7505 | 0.7336 |
+| RNA       | 0.6977    | 0.7692 | 0.7317 |
+| cell_line | 0.5831    | 0.7020 | 0.6370 |
+| cell_type | 0.8222    | 0.7381 | 0.7779 |
+| protein   | 0.7196    | 0.8407 | 0.7755 |
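The F1 column in the table above is consistent with the harmonic mean of the precision and recall columns. A minimal sketch that recomputes it from the rounded table values (the helper name `f1_score` is introduced here for illustration and is not part of SpanMarker):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table.
reported = {
    "all": (0.7290, 0.7983, 0.7621),
    "DNA": (0.7174, 0.7505, 0.7336),
    "RNA": (0.6977, 0.7692, 0.7317),
    "cell_line": (0.5831, 0.7020, 0.6370),
    "cell_type": (0.8222, 0.7381, 0.7779),
    "protein": (0.7196, 0.8407, 0.7755),
}

for label, (p, r, f1) in reported.items():
    print(f"{label:<10} computed={f1_score(p, r):.4f} reported={f1:.4f}")
```

Because the inputs are rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.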
 ## Uses
 ### Direct Use
 from span_marker import SpanMarkerModel
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")
 # Run inference
+entities = model.predict("In erythroid cells most of the transcription activity was contained in a 150 bp promoter fragment with binding sites for transcription factors AP2 , Sp1 and the erythroid-specific GATA-1 .")
 ```
 ### Downstream Use
 from datasets import load_dataset
 from span_marker import SpanMarkerModel, Trainer
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")
 # Specify a Dataset with "tokens" and "ner_tags" columns
 dataset = load_dataset("conll2003") # For example CoNLL2003
     eval_dataset=dataset["validation"],
 )
 trainer.train()
+trainer.save_model("tomaarsen/span-marker-bert-base-uncased-bionlp-finetuned")
 ```
 </details>
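The result of `model.predict` in the Direct Use snippet can be post-processed with plain Python. The sketch below assumes each prediction is a dict with `"span"`, `"label"`, and `"score"` keys, which is how SpanMarker's documentation describes the output; the sample predictions and the helper `group_entities` are hypothetical and hand-written for illustration, not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group predicted spans by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hypothetical predictions (assumed output shape, made-up scores).
sample_entities = [
    {"span": "erythroid cells", "label": "cell_type", "score": 0.95},
    {"span": "150 bp promoter fragment", "label": "DNA", "score": 0.80},
    {"span": "AP2", "label": "protein", "score": 0.99},
    {"span": "Sp1", "label": "protein", "score": 0.40},
]

print(group_entities(sample_entities, min_score=0.5))
```

Given the per-label precision reported in the Evaluation section, a score threshold like this is one simple way to trade recall for precision on labels such as `cell_line`.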
 ## Training Details
+### Training Set Metrics
+| Training set          | Min | Median  | Max |
+|:----------------------|:----|:--------|:----|
+| Sentence length       | 2   | 26.5790 | 166 |
+| Entities per sentence | 0   | 2.7528  | 23  |
+### Training Hyperparameters
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 3
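To make the `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` settings concrete, here is a minimal sketch of a linear warmup followed by linear decay to zero. It is an illustration of the general schedule shape, not the trainer's own implementation. The total of 1998 steps is an assumption inferred from the Training Results table (step 300 at epoch 0.4505 suggests roughly 666 steps per epoch over 3 epochs), and `linear_schedule_lr` is a hypothetical helper.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1998  # assumption: ~666 steps per epoch x 3 epochs

for step in (0, 100, 200, 1099, 1998):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

Under these assumptions the peak rate of 5e-05 is reached at step 200, before the first evaluation at step 300.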
+### Training Results
+| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+| 0.4505 | 300  | 0.0210          | 0.7497               | 0.7659            | 0.7577        | 0.9254              |
+| 0.9009 | 600  | 0.0162          | 0.8048               | 0.8217            | 0.8131        | 0.9432              |
+| 1.3514 | 900  | 0.0154          | 0.8126               | 0.8249            | 0.8187        | 0.9434              |
+| 1.8018 | 1200 | 0.0149          | 0.8148               | 0.8451            | 0.8296        | 0.9481              |
+| 2.2523 | 1500 | 0.0150          | 0.8297               | 0.8438            | 0.8367        | 0.9501              |
+| 2.7027 | 1800 | 0.0145          | 0.8280               | 0.8443            | 0.8361        | 0.9501              |
+### Environmental Impact
+Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
+- **Carbon Emitted**: 0.045 kg of CO2
+- **Hours Used**: 0.296 hours
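The figures above follow from the `co2_eq_emissions` metadata, where `emissions` is given in grams of CO2-equivalent. A small sketch of the unit conversions, with a derived per-hour rate that is not stated in the card itself (variable names are chosen here for illustration):

```python
# Values copied from the co2_eq_emissions metadata block.
emissions_g = 45.104   # grams of CO2-equivalent
hours_used = 0.296

emissions_kg = emissions_g / 1000          # value shown as "Carbon Emitted"
minutes_used = hours_used * 60             # training time in minutes
grams_per_hour = emissions_g / hours_used  # derived rate, not in the card

print(f"{emissions_kg:.3f} kg CO2 over {minutes_used:.1f} minutes")
print(f"{grams_per_hour:.1f} g CO2 per hour")
```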
+### Training Hardware
+- **On Cloud**: No
+- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
+- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
+- **RAM Size**: 31.78 GB
 ### Framework Versions
 - Python: 3.9.16