Dataset?

by fcfrank10 - opened Jan 5, 2024

Discussion

fcfrank10

Jan 5, 2024 (edited Jan 5, 2024)

Hi, where can I find the dataset used to fine-tune this model?

osiria

Owner Jan 6, 2024

Hi Francesco, the main dataset I used for fine-tuning is the Italian split of WikiNER, which you can download here: https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500?file=9446344

I then further fine-tuned the model on ~3000 examples that I labelled manually, but that's a custom dataset and I haven't released it anywhere yet.

fcfrank10

Jan 8, 2024

Hi Francesco, sorry, I'm new to ML. I downloaded this file; should I convert it using the Perl script in the collection (system2conll.pl)?

osiria

Owner Jan 8, 2024

It's a compressed .bz2 file. If you extract its content, you'll get a text file with the training examples (paragraphs in which each word is followed by its label). You then just need to convert those training examples into a format that suits your preferred training method. I usually write some Python code to align the token ids with the corresponding labels, and then train with a PyTorch loop. If it's your first time, you can start from this tutorial to get an idea of the procedure: https://huggingface.co/docs/transformers/main/tasks/token_classification
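
To make the parsing and alignment steps concrete, here is a minimal sketch rather than the code actually used for this model. It assumes each token in the extracted file has the form `word|POS|label` (worth checking against a few lines of your download), and that `word_ids` comes from a fast tokenizer's `BatchEncoding.word_ids()`. The function names, the sample sentence, and its POS tags are made up for illustration.

```python
import bz2


def parse_line(line):
    """Split one line into parallel word and label lists.

    Assumption: tokens are space-separated and look like word|POS|label.
    """
    words, labels = [], []
    for item in line.split():
        # rsplit keeps any "|" that is part of the word itself intact
        parts = item.rsplit("|", 2)
        if len(parts) != 3:
            continue  # skip tokens that do not match the assumed format
        word, _pos, label = parts
        words.append(word)
        labels.append(label)
    return words, labels


def read_wikiner(path):
    """Yield (words, labels) for every non-empty line of a .bz2 file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def align_labels(word_ids, labels, label2id, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Only the first subword of each word gets the word's label; special
    tokens (word id None) and later subwords get ignore_index, which
    PyTorch's cross-entropy loss skips by default.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned


# Hypothetical sentence; the word_ids list imitates a tokenizer that
# splits "Roma" into two subwords and adds one special token at each end.
words, labels = parse_line("Roma|SP|I-LOC è|V|O bella|A|O")
label2id = {"O": 0, "I-LOC": 1}
aligned = align_labels([None, 0, 0, 1, 2, None], labels, label2id)
```

With a real tokenizer you would call it with `is_split_into_words=True` on `words` and pass the resulting `word_ids()` to `align_labels`. Labelling only the first subword is one common convention; labelling every subword is also possible.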

fcfrank10

Jan 9, 2024

Thank you so much! One more thing: if I wanted the model to recognize other specific things, such as an address, how should I proceed? Obviously I need to create additional ner_tags, but I feel I'm missing something. Should I re-train the model on the whole dataset plus my own data (merged into a single dataset), or can I train on just my data?

osiria

Owner Jan 9, 2024

In general, you cannot mix two datasets with different NER classes, because the model will be confused by the inconsistencies. If you want to keep the WikiNER tags (Person, Location, Organization and Miscellaneous) and add other tags (or more specific ones), the best thing you can do is to train the model on WikiNER first, and then further fine-tune it on your own custom dataset. When you create your own dataset, just make sure you also label the WikiNER classes that you want to keep; otherwise the model will stop recognizing them.
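
As a rough illustration of that last point, the sketch below builds a label set that keeps the WikiNER classes and adds a new one, then checks which of the original classes never appear in a custom dataset. The tag spellings (`PER`, `LOC`, `ORG`, `MISC`), the `ADDRESS` class, the B-/I- scheme, and the tiny example dataset are all assumptions for illustration, not the labels actually used for this model.

```python
from collections import Counter

# WikiNER classes named above; the short tag spellings are an assumption.
WIKINER_CLASSES = ["PER", "LOC", "ORG", "MISC"]


def build_label_maps(classes):
    """Create label <-> id maps with "O" at id 0 and B-/I- tags per class."""
    labels = ["O"]
    for name in classes:
        labels += [f"B-{name}", f"I-{name}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def class_counts(examples):
    """Count the entity tags of each class across (words, labels) pairs."""
    counts = Counter()
    for _words, labels in examples:
        for label in labels:
            if label != "O":
                counts[label.split("-", 1)[1]] += 1
    return counts


def missing_classes(examples, required):
    """Return the required classes that never occur in the dataset."""
    counts = class_counts(examples)
    return [name for name in required if counts[name] == 0]


# Hypothetical custom dataset with a new ADDRESS class.
custom = [
    (["Via", "Roma", "10"], ["B-ADDRESS", "I-ADDRESS", "I-ADDRESS"]),
    (["Mario", "Rossi"], ["B-PER", "I-PER"]),
]
label2id, id2label = build_label_maps(WIKINER_CLASSES + ["ADDRESS"])
missing = missing_classes(custom, WIKINER_CLASSES)
```

Here `missing` lists the classes (`LOC`, `ORG`, `MISC`) that the custom data never labels, which are the ones the model would be at risk of forgetting during the second fine-tuning stage. Because the number of labels changes, the classification head will most likely need to be re-initialised with the new `label2id` and `id2label` maps when you load the WikiNER-trained checkpoint.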
