Dr. Jorge Abreu Vicente committed on
Commit 3febf7d
1 Parent(s): cbb69b4

update model card README.md

Files changed (1)
  1. README.md +16 -48
README.md CHANGED
@@ -21,16 +21,13 @@ model-index:
  metrics:
  - name: Precision
  type: precision
- value: 0.9243747400938632
+ value: 0.9218777784363701
  - name: Recall
  type: recall
- value: 0.9284563518109672
+ value: 0.9280386657915151
  - name: F1
  type: f1
- value: 0.9264110502500595
+ value: 0.9249479631281595
- widget:
- - text: "XPT of siRNA treated [MASK] cells after 48 hours of knockdown. Treated cells were fed with the indicated amounts of C8L peptid conjugated to iron oxide beads via a disulfide bond. The cells were then exposed to RF33. 70-Luc Reporter [MASK] T cells overnight. Error bars show SD of >3 replicate wells. * p<0.05 for siRNA vs control [MASK] using two-way ANOVA. Representative plot of 3 independent experiments."
- - text: "The [MASK] intensity along the line across a lipid droplet in (A) was measured by ImageJ.The lipid droplet localization of [MASK]-[MASK], represented by two peaks, is clearly visible in fat cells from ppl > [MASK] larvae , but it is lost in fat cells from ppl > [MASK] larvae with [MASK] RNAi or overexpression of [MASK]/[MASK]. More than 30 lipid droplets of each genotype were measured. One typical image curve is shown for each genotype."
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -40,36 +37,23 @@ should probably proofread and complete it, then remove this comment. -->

  This model is a fine-tuned version of [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large) on the source_data_nlp dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.0118
- - Accuracy Score: 0.9959
- - Precision: 0.9244
- - Recall: 0.9285
- - F1: 0.9264
+ - Loss: 0.0141
+ - Accuracy Score: 0.9950
+ - Precision: 0.9219
+ - Recall: 0.9280
+ - F1: 0.9249

  ## Model description

- The generation of this model is explained in more detail in Abreu-Vicente & Lemberger (in prep).
- The model is fine-tuned from [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large).
- The use of [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large) was decided after analyzing 14 different models
- on the [SourceData](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) dataset.
-
- ### The SourceData dataset
-
- This dataset is based on the content of the SourceData (https://sourcedata.embo.org) database, which contains manually annotated figure legends written in English and extracted from scientific papers in the domain of cell and molecular biology (Liechti et al, Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471). Unlike the sd-nlp dataset, which is pre-tokenized with the roberta-base tokenizer, this dataset is not pre-tokenized but only split into words. Users can therefore use it to fine-tune other models. Additional details at https://github.com/source-data/soda-roberta
-
- The dataset in the 🤗 Hub is a processed version of the entire annotated dataset, which is also presented in Abreu-Vicente & Lemberger (in prep).
- Further details on the entire dataset can be found in the associated [BCVI BIO-ID track](https://biocreative.bioinformatics.udel.edu/resources/corpora/bcvi-bio-id-track/) task.
-
- This model is fine-tuned on the biological `GENEPROD_ROLES` task, in which gene products are masked and the model classifies them as `CONTROLLED_VAR` or `MEASURED_VAR`.
- The performance of the model is similar for both classes.
+ More information needed

  ## Intended uses & limitations

- The intended use of this model is to infer the semantic role of gene products (genes and proteins) with regard to the causal hypotheses tested in experiments reported in scientific papers. Although the model could be trained without masking the entities, its performance is considerably better (F1 roughly 0.2 higher) when they are masked. This requires a prior step of running a NER model to identify gene products.
+ More information needed

  ## Training and evaluation data

- The training, evaluation, and test splits of the data used can be found in the [SourceData dataset](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized).
+ More information needed

  ## Training procedure

@@ -82,34 +66,18 @@ The following hyperparameters were used during training:
  - seed: 42
  - optimizer: Adafactor
  - lr_scheduler_type: linear
- - num_epochs: 2.0
+ - num_epochs: 1.0

  ### Training results

  | Training Loss | Epoch | Step | Validation Loss | Accuracy Score | Precision | Recall | F1 |
  |:-------------:|:-----:|:----:|:---------------:|:--------------:|:---------:|:------:|:------:|
- | 0.0115 | 1.0 | 2066 | 0.0126 | 0.9955 | 0.9130 | 0.9216 | 0.9173 |
- | 0.0074 | 2.0 | 4132 | 0.0118 | 0.9959 | 0.9244 | 0.9285 | 0.9264 |
-
- ### Test results
-
- ```
-                  precision    recall  f1-score   support
-
-  CONTROLLED_VAR       0.91      0.93      0.92      7241
-    MEASURED_VAR       0.94      0.93      0.93      8720
-
-       micro avg       0.93      0.93      0.93     15961
-       macro avg       0.92      0.93      0.93     15961
-    weighted avg       0.93      0.93      0.93     15961
-
- {'test_loss': 0.011081044562160969, 'test_accuracy_score': 0.9962086330220685, 'test_precision': 0.925242960378769, 'test_recall': 0.9305181379612806, 'test_f1': 0.9278730515727985, 'test_runtime': 87.5388, 'test_samples_per_second': 93.958, 'test_steps_per_second': 0.377}
- ```
+ | 0.0129 | 1.0 | 1569 | 0.0141 | 0.9950 | 0.9219 | 0.9280 | 0.9249 |
+

  ### Framework versions

- - Transformers 4.15.0
+ - Transformers 4.20.0
  - Pytorch 1.11.0a0+bfe5ad2
  - Datasets 1.17.0
- - Tokenizers 0.10.3
+ - Tokenizers 0.12.1
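For context on the `GENEPROD_ROLES` usage described in the card text removed above (gene products are first detected by a NER step, masked, and then classified as `CONTROLLED_VAR` or `MEASURED_VAR`), here is a minimal, illustrative sketch of querying such a checkpoint with the 🤗 Transformers token-classification pipeline. The checkpoint path and the example sentence are placeholders, not taken from this repository.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Placeholder path: substitute the actual fine-tuned GENEPROD_ROLES checkpoint.
checkpoint = "path/to/geneprod-roles-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

role_tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Gene products found by a prior NER step are replaced with the mask token,
# as in the widget examples; the model then labels each masked position
# as a controlled or measured variable.
text = "Western blot of [MASK] levels in HeLa cells after 48 hours of [MASK] knockdown by siRNA."
for entity in role_tagger(text):
    print(entity["entity_group"], round(entity["score"], 3), entity["word"])
```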