Cannot run model with pipelines

#3
by ntwm - opened

Hello, I found this model to be very interesting. However, I am struggling to get it running locally with Hugging Face pipelines. I'm probably doing something wrong, but I feel like I've hit a wall.

Here is my code:

from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("obi/deid_roberta_i2b2")
model = AutoModelForMaskedLM.from_pretrained("obi/deid_roberta_i2b2")

nlp = pipeline(model=model, tokenizer=tokenizer)

example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

And here is the error stack I get:


RuntimeError                              Traceback (most recent call last)
Cell In[22], line 2
      1 from transformers import pipeline
----> 2 nlp = pipeline(model=model, tokenizer=tokenizer)
      4 example = "My name is Wolfgang and I live in Berlin"
      6 ner_results = nlp(example)

File ~/hug/lib/python3.10/site-packages/transformers/pipelines/__init__.py:768, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    766 if task is None and model is not None:
    767     if not isinstance(model, str):
--> 768         raise RuntimeError(
    769             "Inferring the task automatically requires to check the hub with a model_id defined as a str."
    770             f"{model} is not a valid model_id."
    771         )
    772     task = get_task(model, use_auth_token)
    774 # Retrieve the task

RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a str.RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-23): 24 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (lm_head): RobertaLMHead(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=1024, out_features=50265, bias=True)
  )
) is not a valid model_id.

One Brave Idea org

Could you try adding task='token-classification' or task='ner' to the pipeline call and see if that works?
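
For example, a minimal sketch (note this assumes loading the checkpoint with a token-classification head, i.e. AutoModelForTokenClassification rather than the AutoModelForMaskedLM used in the snippet above):

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("obi/deid_roberta_i2b2")
# Load with the token-classification head that matches the NER task,
# not the masked-LM head.
model = AutoModelForTokenClassification.from_pretrained("obi/deid_roberta_i2b2")

# With an in-memory model object the pipeline cannot infer the task
# from the hub, so it has to be passed explicitly.
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer)

print(nlp("My name is Wolfgang and I live in Berlin"))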

This code works!

from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import pipeline
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

tokenizer = AutoTokenizer.from_pretrained("obi/deid_roberta_i2b2")
model = AutoModelForMaskedLM.from_pretrained("obi/deid_roberta_i2b2")

#nlp = pipeline(model=model,tokenizer=ner)

#nlp = pipeline(task="ner", model=model,tokenizer=tokenizer)
nlp = pipeline(task="ner", model="obi/deid_roberta_i2b2")

#nlp = pipeline(model=model)
#nlp = pipeline(model=model, tokenizer=tokenizer)

example = "mobile number 9450 6413"

ner_results = nlp(example)
print(ner_results)

I am new to this domain. The output for the above code is:

[{'entity': 'B-ID', 'score': 0.9996006, 'index': 3, 'word': 'Ġ94', 'start': 14, 'end': 16},
 {'entity': 'B-ID', 'score': 0.6739617, 'index': 4, 'word': '50', 'start': 16, 'end': 18},
 {'entity': 'L-ID', 'score': 0.9968354, 'index': 5, 'word': 'Ġ64', 'start': 19, 'end': 21},
 {'entity': 'L-ID', 'score': 0.5859114, 'index': 6, 'word': '13', 'start': 21, 'end': 23}]

But I want the whole mobile number to be detected as a single entity, i.e. recognized as my mobile phone number rather than split into separate ID pieces.
Please advise on how to do this; any thoughts or questions would be useful. Thanks!

One Brave Idea org

We have a custom tokenization process that doesn't get used when you use pipelines; it's currently seeing the phone number and predicting it as an ID entity.
You can try referring to the GitHub page and following the steps here to run the forward pass: https://github.com/obi-ml-public/ehr_deidentification/blob/main/steps/forward_pass/Forward%20Pass.ipynb
This might be a little more complicated than using pipelines, but you can try it and see if it works for your use case.
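
If whole-word spans are enough for your use case, you could also try the pipeline's built-in aggregation; a minimal sketch (this still bypasses the custom tokenization, so treat the spans as approximate):

from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces such as 'Ġ94'/'50'
# into whole-word spans, returned under an 'entity_group' key.
nlp = pipeline(
    task="ner",
    model="obi/deid_roberta_i2b2",
    aggregation_strategy="simple",
)

print(nlp("mobile number 9450 6413"))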

When I use the forward pass code, I get the output as:
[{'note_id': 'note_3',
'tokens': [{'text': 'There', 'start': 0, 'end': 5, 'label': 'O'},
{'text': 'should', 'start': 6, 'end': 12, 'label': 'O'},
{'text': 'be', 'start': 13, 'end': 15, 'label': 'O'},
{'text': 'no', 'start': 16, 'end': 18, 'label': 'O'},
{'text': 'phi', 'start': 19, 'end': 22, 'label': 'O'},
{'text': 'in', 'start': 23, 'end': 25, 'label': 'O'},
{'text': 'this', 'start': 26, 'end': 30, 'label': 'O'},
{'text': 'note', 'start': 31, 'end': 35, 'label': 'O'}],
'labels': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
'predictions': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}]

How do I get the output in the format below?
[{'entity': 'B-STAFF',
'score': 0.9999329,
'index': 14,
'word': 'ĠRoger',
'start': 61,
'end': 66},
......................]
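
For illustration, a minimal sketch of a hypothetical to_entities helper that reshapes the forward-pass output shown above into pipeline-style dicts (the forward-pass output shown here carries no per-token scores, so the 'score' field is omitted):

def to_entities(note):
    # Reshape one forward-pass record into pipeline-style entity dicts.
    entities = []
    for idx, (token, label) in enumerate(zip(note["tokens"], note["predictions"])):
        if label == "O":  # skip tokens predicted as non-PHI
            continue
        entities.append({
            "entity": label,
            "index": idx,
            "word": token["text"],
            "start": token["start"],
            "end": token["end"],
        })
    return entities

# e.g., assuming `predictions` holds the list of records shown above:
# entities_per_note = [to_entities(note) for note in predictions]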
