
starpii doesn't have any meaningful output

#7
by ruochenwang - opened

Hi,
I tried calling starpii to detect personal information in code, such as names and email addresses.
My code is shown below:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starpii"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data = "Python\nuser_name = 'wrc'\nemail='iuewfn@gmail.com'\ndata=abcdefg\n"
inputs = tokenizer.encode(data, return_tensors="pt").to(device)
outputs = model.generate(inputs,max_length=100)
print(tokenizer.decode(outputs[0], clean_up_tokenization_spaces=False))

I simply called this model without any complex processing.
The output is

Python
user_name = 'wrc'
email='iuewfn@gmail.com'
data=abcdefg
gressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgressgress

It's as if this model hasn't undergone any training, or I used the wrong tokens.

May I ask what I did wrong?

ruochenwang changed discussion title from starpii doesn't have any meanful output to starpii doesn't have any meaningful output

You should use a 'ner' pipeline instead of a causal LM.
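For reference, here is a minimal sketch of what that looks like: starpii is a token-classification (NER) model, so it should be loaded through the `pipeline` API rather than `AutoModelForCausalLM`. The `aggregation_strategy="simple"` setting is an assumption on my part to merge sub-word tokens into whole entities; the exact entity labels depend on the model's config.

```python
from transformers import pipeline

# Load bigcode/starpii as a token-classification (NER) pipeline,
# not as a causal language model.
pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",  # merge sub-word pieces into full entities
)

code = "Python\nuser_name = 'wrc'\nemail='iuewfn@gmail.com'\ndata=abcdefg\n"
results = pii_detector(code)

# Each result is a dict with the detected entity type, the matched text,
# and a confidence score.
for entity in results:
    print(entity["entity_group"], entity["word"], entity["score"])
```

With this setup the model returns labeled spans (e.g. for the email literal in the snippet) instead of generating text, which is why `model.generate` produced gibberish.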
