---
language:
- en
tags:
- ner
- chemical
- bionlp
- bc4cdr
- bioinformatics
license: apache-2.0
datasets:
- bionlp
- bc4cdr
widget:
- text: "Serotonin receptor 2A (HTR2A) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder."

---

# NER to find Chemical substances
> The model was trained on the bionlp and bc4cdr datasets, starting from this [pubmed-pretrained roberta model](/raynardj/roberta-pubmed).

All the labels (the possible token classes):
```json
{"label2id":
  {
    "O": 0,
    "Chemical": 1,
  }
 }
```
 
Notice that we removed the 'B-', 'I-' etc. prefixes from the data labels.🗡
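For reference, here is a minimal sketch (my illustration, not part of the original card) of raw token-classification inference with `transformers`, showing how the `label2id` mapping above is used; the model id is the one from the suggested template below:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# model id as used in the suggested template below
name = "raynardj/ner-chemical-bionlp-bc5cdr-pubmed"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

text = "Aspirin may lower the risk of heart attacks."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# map each subword token to its predicted class ("O" or "Chemical")
pred_ids = logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred])
```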
 
## This is the template we suggest for using the model
Of course I'm well aware of the `aggregation_strategy` argument offered by the Hugging Face pipeline, but because of the way the model was trained, subwords after the first carry no label: only the first subword token of each word keeps a real label, and the rest are set to -100 so they contribute no loss. After much searching, I couldn't find a way to reproduce that behavior with the default pipeline, hence I wrote an inference class myself.
```python
# install the helper library first (in a notebook: !pip install forgebox)
from forgebox.hf.train import NERInference

ner = NERInference.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")
a_df = ner.predict(["text1", "text2"])  # batch prediction over a list of texts
```
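The subword masking described above follows a common pattern for token-classification training; here is a rough sketch (my illustration under assumed details, not the author's actual training code) of how labels can be aligned so that only the first subword of each word keeps its label:

```python
from transformers import AutoTokenizer

# fast tokenizer from the pretrained base model; add_prefix_space is
# required by RoBERTa tokenizers when passing pre-split words
tokenizer = AutoTokenizer.from_pretrained(
    "raynardj/roberta-pubmed", add_prefix_space=True
)

words = ["Aspirin", "reduces", "inflammation"]
word_labels = [1, 0, 0]  # 1 = Chemical, 0 = O

encoding = tokenizer(words, is_split_into_words=True)
aligned, previous = [], None
for word_id in encoding.word_ids():
    if word_id is None:            # special tokens (<s>, </s>)
        aligned.append(-100)
    elif word_id != previous:      # first subword keeps the word's label
        aligned.append(word_labels[word_id])
    else:                          # later subwords are ignored by the loss
        aligned.append(-100)
    previous = word_id
```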

> Check out our NER models for:
* [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
* [chemical substances](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
* [diseases](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)