raynardj committed on
Commit
30dd3ed
1 Parent(s): eebcf34

Update README.md

Files changed (1)
  1. README.md +5 -56
README.md CHANGED
@@ -31,63 +31,12 @@ All the labels, the possible token classes.
  Notice, we removed the 'B-','I-' etc from data label.🗡
 
  ## This is the template we suggest for using the model
  ```python
- from transformers import pipeline
- PRETRAINED = "raynardj/ner-chemical-bionlp-bc5cdr-pubmed"
- ner = pipeline(task="ner",model=PRETRAINED, tokenizer=PRETRAINED)
- ner("Your text", aggregation_strategy="first")
- ```
- And here is to make your output more consecutive ⭐️
- ```python
- import pandas as pd
- from transformers import AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
- def clean_output(outputs):
-     results = []
-     current = []
-     last_idx = 0
-     # make to sub group by position
-     for output in outputs:
-         if output["index"]-1==last_idx:
-             current.append(output)
-         else:
-             results.append(current)
-             current = [output, ]
-         last_idx = output["index"]
-     if len(current)>0:
-         results.append(current)
-
-     # from tokens to string
-     strings = []
-     for c in results:
-         tokens = []
-         starts = []
-         ends = []
-         for o in c:
-             tokens.append(o['word'])
-             starts.append(o['start'])
-             ends.append(o['end'])
-         new_str = tokenizer.convert_tokens_to_string(tokens)
-         if new_str!='':
-             strings.append(dict(
-                 word=new_str,
-                 start = min(starts),
-                 end = max(ends),
-                 entity = c[0]['entity']
-             ))
-     return strings
- def entity_table(pipeline, **pipeline_kw):
-     if "aggregation_strategy" not in pipeline_kw:
-         pipeline_kw["aggregation_strategy"] = "first"
-     def create_table(text):
-         return pd.DataFrame(
-             clean_output(
-                 pipeline(text, **pipeline_kw)
-             )
-         )
-     return create_table
- # will return a dataframe
- entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
  ```
 
  > check our NER model on
 
  Notice, we removed the 'B-','I-' etc from data label.🗡
 
  ## This is the template we suggest for using the model
+ I'm well aware of the `aggregation_strategy` argument offered by Hugging Face, but during training I discarded the loss on trailing subword tokens: only the first subword token of each word has a label other than -100. After a lot of searching, I couldn't find a way to reproduce this with the default pipeline, so I wrote an inference class myself.
  ```python
+ !pip install forgebox
+ from forgebox.hf.train import NERInference
+ ner = NERInference.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")
+ a_df = ner.predict(["text1", "text2"])
  ```
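
To illustrate the labeling scheme mentioned in the note above, here is a minimal sketch, not the actual training or `forgebox` code. It assumes per-token word indices shaped like those returned by a fast tokenizer's `word_ids()` (with `None` for special tokens), and that -100 is the value the loss ignores, as with the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function names, label ids, and example inputs are hypothetical.

```python
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # assumed "skip this token" label, as in the note above


def mask_subword_labels(
    word_ids: Sequence[Optional[int]], word_labels: Sequence[int]
) -> List[int]:
    """Training side: label only the first subword of each word."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(IGNORE_INDEX)          # special token such as [CLS]
        elif word_id != previous:
            labels.append(word_labels[word_id])  # first subword keeps the label
        else:
            labels.append(IGNORE_INDEX)          # trailing subword is masked
        previous = word_id
    return labels


def first_subword_predictions(
    word_ids: Sequence[Optional[int]], token_predictions: Sequence[int]
) -> List[int]:
    """Inference side: keep the prediction made on each word's first subword."""
    word_predictions = []
    previous = None
    for word_id, prediction in zip(word_ids, token_predictions):
        if word_id is not None and word_id != previous:
            word_predictions.append(prediction)
        previous = word_id
    return word_predictions


# Hypothetical input: [CLS] asp ##irin helps [SEP]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 0]  # made-up ids: 1 = chemical, 0 = outside
token_labels = mask_subword_labels(word_ids, word_labels)
print(token_labels)  # [-100, 1, -100, 0, -100]

# Made-up per-token argmax; the value on "##irin" was never trained, so it is dropped.
token_predictions = [0, 1, 0, 0, 0]
print(first_subword_predictions(word_ids, token_predictions))  # [1, 0]
```

Because trailing subwords never contribute to the loss, their predictions are unreliable, which is why this sketch reads each word's label from its first subword only.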
 
  > check our NER model on