---
language:
- en
tags:
- ner
- gene
- protein
- rna
- bioinfomatics
license: apache-2.0
datasets:
- jnlpba
- tner/bc5cdr
- commanderstrife/jnlpba
- bc2gm_corpus
- drAbreu/bc4chemd_ner
- linnaeus
- chintagunta85/ncbi_disease
widget:
- text: "It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains"
---

# NER to find Gene & Gene products

> The model was trained on the jnlpba dataset, starting from this [PubMed-pretrained RoBERTa model](/raynardj/roberta-pubmed)

All the labels (the possible token classes):

```json
{"label2id": {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1
  }
}
```

Note that we removed the 'B-', 'I-', etc. prefixes from the data labels. 🗡

## This is the template we suggest for using the model

```python
from transformers import pipeline

PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
```

And here is how to merge consecutive tokens in the output into whole entities ⭐️

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)


def clean_output(outputs):
    results = []
    current = []
    last_idx = 0
    # split the predictions into sub-groups of consecutive token positions
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output, ]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # from tokens to string
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        # empty groups decode to '' and are skipped here
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']
            ))
    return strings


def entity_table(pipeline, **pipeline_kw):
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "first"

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table


# will return a dataframe
entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
```

> Check out our NER models for:
* [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
* [chemical substance](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
* [disease](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)
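
The grouping step inside `clean_output` above can be tried on its own, without downloading the model or tokenizer. The following is a minimal sketch, assuming hand-written mock records that carry the same fields `clean_output` reads (`index`, `entity`, `start`, `end`). The names `group_consecutive` and `mock_outputs` are illustrative only and are not part of this model card's code or of `transformers`; the offsets are made up.

```python
def group_consecutive(outputs):
    """Group token-level predictions whose `index` values are consecutive."""
    groups = []
    current = []
    last_idx = None
    for out in outputs:
        if last_idx is not None and out["index"] == last_idx + 1:
            # token directly follows the previous one: same group
            current.append(out)
        else:
            # gap in token positions: close the previous group, start a new one
            if current:
                groups.append(current)
            current = [out]
        last_idx = out["index"]
    if current:
        groups.append(current)
    return groups


# Hypothetical records, shaped like the fields clean_output reads.
mock_outputs = [
    {"index": 3, "entity": "protein", "start": 10, "end": 14},
    {"index": 4, "entity": "protein", "start": 14, "end": 18},
    {"index": 9, "entity": "DNA", "start": 40, "end": 45},
]

# Collapse each group to one span: first label, widest character range.
spans = [
    {
        "entity": group[0]["entity"],
        "start": min(o["start"] for o in group),
        "end": max(o["end"] for o in group),
    }
    for group in group_consecutive(mock_outputs)
]
print(spans)
```

Tokens 3 and 4 are adjacent and end up in one span, while token 9 starts a new one. As in `clean_output`, the label of a merged span is taken from its first token.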