---
language:
- en
tags:
- ner
- gene
- protein
- rna
- bioinformatics
license: apache-2.0
datasets:
- jnlpba
- tner/bc5cdr
- commanderstrife/jnlpba
- bc2gm_corpus
- drAbreu/bc4chemd_ner
- linnaeus
- ncbi_disease
widget:
- text: "It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains"
---
# NER to find Gene & Gene products
> The model was trained on the jnlpba dataset, starting from this [pubmed-pretrained roberta model](/raynardj/roberta-pubmed).

All the labels, i.e. the possible token classes:
```json
{"label2id": {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1
    }
}
```
Note that we removed the 'B-', 'I-' etc. prefixes from the data labels. 🗡
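
As a minimal sketch, the mapping above can be inverted to turn predicted class ids back into label names. The `predicted_ids` list is made-up illustration data, not real model output.

```python
# Label mapping copied from the model card above.
label2id = {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1,
}

# Invert it so class ids can be translated back into label names.
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical per-token predictions, for illustration only.
predicted_ids = [0, 1, 1, 0, 2]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```

Since the labels carry no 'B-'/'I-' prefix, two neighbouring entities of the same class cannot be told apart from the labels alone.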
## This is the template we suggest for using the model
```python
from transformers import pipeline
PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner",model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
```
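
One way to post-process the results is to keep only confident predictions and read the matched text straight from the character offsets. This is a sketch that assumes each result is a dict carrying at least `score`, `start` and `end`; the helper names `fake_result` and `confident_spans`, the sample sentence, and the scores are hypothetical and hand-written, not produced by the model.

```python
text = "IL-2 gene expression requires NF-kappa B"


def fake_result(phrase, score):
    """Build a hand-written stand-in for one pipeline result (illustration only)."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "score": score}


# Hypothetical results; a real run would come from ner(text, ...).
outputs = [
    fake_result("IL-2 gene", 0.98),
    fake_result("expression", 0.41),
    fake_result("NF-kappa B", 0.95),
]


def confident_spans(text, outputs, min_score=0.5):
    """Return the substrings of text whose result score reaches min_score."""
    return [text[o["start"]:o["end"]] for o in outputs if o["score"] >= min_score]


print(confident_spans(text, outputs))
```

Slicing the original text keeps spacing and casing exactly as written, whatever the tokenizer did to the individual tokens.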
And here is how to merge consecutive tokens in your output into whole entities ⭐️
```python
import pandas as pd
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    """Merge tokens at consecutive positions into single entity strings.

    Assumes every output dict has "index", "word", "start", "end" and "entity" keys.
    """
    results = []
    current = []
    last_idx = 0
    # split the outputs into sub-groups of consecutive token positions
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output, ]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # from tokens to string
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        # empty groups produce an empty string and are skipped
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']  # the group takes the label of its first token
            ))
    return strings


def entity_table(pipeline, **pipeline_kw):
    """Wrap a NER pipeline into a function that returns a DataFrame of entities."""
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "first"

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

# will return a dataframe
entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
```
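
The grouping idea can be tried without loading a model. The sketch below is a simplified variant, not the code above: it groups tokens whose `index` values form an unbroken run, then recovers each entity string by slicing the original text with the offsets instead of calling `tokenizer.convert_tokens_to_string`. The helper names `group_consecutive` and `merge_group` and the token dicts are hypothetical, hand-written for illustration.

```python
def group_consecutive(outputs):
    """Group token dicts whose `index` values form an unbroken run."""
    groups = []
    for output in outputs:
        if groups and output["index"] == groups[-1][-1]["index"] + 1:
            groups[-1].append(output)
        else:
            groups.append([output])
    return groups


def merge_group(text, group):
    """Collapse one group into a single entity, labelled by its first token."""
    start = min(o["start"] for o in group)
    end = max(o["end"] for o in group)
    return dict(word=text[start:end], start=start, end=end, entity=group[0]["entity"])


text = "IL-2 gene expression requires NF-kappa B"

# Hypothetical token-level results with hand-written offsets (not model output).
tokens = [
    dict(index=1, entity="DNA", start=0, end=4),        # "IL-2"
    dict(index=2, entity="DNA", start=5, end=9),        # "gene"
    dict(index=5, entity="protein", start=30, end=38),  # "NF-kappa"
    dict(index=6, entity="protein", start=39, end=40),  # "B"
]

merged = [merge_group(text, g) for g in group_consecutive(tokens)]
for m in merged:
    print(m)
```

Like the function above, this labels a merged group by its first token only, so two entities that sit directly next to each other end up as one entry.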
> Check out our NER models for:
* [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
* [chemical substance](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
* [disease](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)