
Questions and request

#1
by abdar1925 - opened

Hello Urchade,
Thanks for this incredible work. I have two questions and a request.

Thanks

abdar1925 changed discussion title from Code for Data Generation to Questions and request
Owner

Thank you for your interest in GLiNER :)

  1. I think the quality of my dataset is not great, as it is purely synthetic. The one you mentioned should be better.
  2. The model you mentioned should be better, but GLiNER is not limited in the labels it can predict.
  3. I have provided a general example of synthetic data generation here (you can tailor it to PII extraction): https://github.com/urchade/GLiNER/blob/main/examples/synthetic_data_generation.ipynb

You can join the GLiNER discussion server here, as I am not very active on HF: https://discord.gg/Y2yVxpSQnG

Great, thanks. I'll check out the script.

Hi! Where can I find the fine-tuning script? I want to add data in other languages.

Owner

Hi,
Could you share some sample data for training the GLiNER model? I tried using a dataset in JSON format. Here is a sample record. Can you tell me whether it is okay or needs to be modified, and how to use it to fine-tune the model?

{"text": "Aadhaar is 437686033996 PAN is JRNPZ0751P Email is lakshitgulati@example.org Name is Purab Varghese Mobile is 910863034052 Age is 30 Credit Card is 2262854559438311 CVV is 961 Address is 51/138, Rastogi Nagar, Morena, Sikkim",

"entities": [{"entity": "AADHAAR", "start": 0, "end": 7, "value": "Aadhaar"}, {"entity": "AADHAAR_VALUE", "start": 11, "end": 23, "value": "437686033996"}, {"entity": "PAN", "start": 24, "end": 27, "value": "PAN"}, {"entity": "PAN_VALUE", "start": 31, "end": 41, "value": "JRNPZ0751P"}, {"entity": "EMAIL", "start": 42, "end": 47, "value": "Email"}, {"entity": "EMAIL_VALUE", "start": 51, "end": 76, "value": "lakshitgulati@example.org"}, {"entity": "NAME", "start": 77, "end": 81, "value": "Name"}, {"entity": "NAME_VALUE", "start": 85, "end": 99, "value": "Purab Varghese"}, {"entity": "MOBILE", "start": 100, "end": 106, "value": "Mobile"}, {"entity": "MOBILE_VALUE", "start": 110, "end": 122, "value": "910863034052"}, {"entity": "AGE", "start": 123, "end": 126, "value": "Age"}, {"entity": "AGE_VALUE", "start": 130, "end": 132, "value": "30"}, {"entity": "CREDIT CARD", "start": 133, "end": 144, "value": "Credit Card"}, {"entity": "CREDIT CARD_VALUE", "start": 148, "end": 164, "value": "2262854559438311"}, {"entity": "CVV", "start": 165, "end": 168, "value": "CVV"}, {"entity": "CVV_VALUE", "start": 172, "end": 175, "value": "961"}, {"entity": "ADDRESS", "start": 176, "end": 183, "value": "Address"}, {"entity": "ADDRESS_VALUE", "start": 187, "end": 224, "value": "51/138, Rastogi Nagar, Morena, Sikkim"}]}

Hi, it has been trained on https://huggingface.co/datasets/urchade/synthetic-pii-ner-mistral-v1

I suggest using lowercase entity names without "_".
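
A minimal sketch of how this suggestion could be applied to a record shaped like the sample earlier in this thread. It checks that each character span matches its `value`, rewrites the labels in lowercase with spaces instead of "_", and converts character offsets to token indices. The helper names (`normalize_label`, `to_token_spans`), the simple regex tokenizer, and the output layout (`tokenized_text` plus `ner` triples with inclusive token indices, as I understand the example training data in the GLiNER repository) are all assumptions, so please check them against the repository before training.

```python
import re


def normalize_label(label):
    """Lowercase a label and replace underscores with single spaces."""
    return " ".join(label.replace("_", " ").lower().split())


def to_token_spans(record):
    """Convert character-offset entities to token-index spans (hypothetical helper)."""
    text = record["text"]
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    ner = []
    for ent in record["entities"]:
        # Sanity check: the [start, end) slice must equal the stored value.
        if text[ent["start"]:ent["end"]] != ent["value"]:
            raise ValueError(f"offset mismatch for {ent['entity']!r}")
        # Indices of the tokens that lie fully inside the character span.
        inside = [
            i for i, m in enumerate(matches)
            if m.start() >= ent["start"] and m.end() <= ent["end"]
        ]
        if inside:
            ner.append([inside[0], inside[-1], normalize_label(ent["entity"])])
    return {"tokenized_text": [m.group() for m in matches], "ner": ner}


# Shortened record in the same shape as the sample in this thread.
record = {
    "text": "Credit Card is 2262854559438311 CVV is 961",
    "entities": [
        {"entity": "CREDIT CARD_VALUE", "start": 15, "end": 31, "value": "2262854559438311"},
        {"entity": "CVV_VALUE", "start": 39, "end": 42, "value": "961"},
    ],
}

example = to_token_spans(record)
print(example["ner"])
```

For the shortened record, the two spans come out labelled `credit card value` and `cvv value`. Whether to keep separate labels for the field names themselves ("Credit Card", "CVV") is a modelling choice this sketch leaves open.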
