Dataset?

#1
by fcfrank10 - opened

Hi, where can I find the dataset used to fine-tune this model?

Owner

Hi Francesco, the main dataset I used for fine-tuning is the Italian split of WikiNER. You can download it here: https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500?file=9446344

Then I further fine-tuned the model on ~3000 examples that I labelled manually, but that's a custom dataset and I haven't released it anywhere yet.

Hi Francesco, sorry, I'm new to ML. I downloaded this file. Should I convert it using the Perl script in the collection (system2conll.pl)?

Owner

It's a compressed .bz2 file. If you extract it, you'll get a text file with the training examples (paragraphs in which each word is followed by its label). Then you just need to convert those examples into a format that suits your preferred training method. I usually write some Python code to align the token ids with the corresponding labels, and then train with a PyTorch loop. If it's your first time, this tutorial will give you an idea of the procedure: https://huggingface.co/docs/transformers/main/tasks/token_classification
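
As a rough illustration of those steps, here is a minimal sketch (not necessarily the code used for this model). It assumes each token in the extracted file is written as `word|POS|LABEL` and that tokens are separated by spaces; check a few lines of your file before relying on that. The function names are made up for this example. `align_labels` expects the list returned by a fast tokenizer's `word_ids()` and marks special tokens and non-initial sub-tokens with -100, the value PyTorch's cross-entropy loss ignores by default.

```python
import bz2


def parse_wikiner_line(line):
    """Split one paragraph into parallel lists of words and labels.

    Assumption: every token looks like word|POS|LABEL.
    """
    words, labels = [], []
    for token in line.split():
        parts = token.rsplit("|", 2)
        if len(parts) != 3:
            continue  # skip tokens that do not match the assumed format
        word, _pos, label = parts
        words.append(word)
        labels.append(label)
    return words, labels


def read_wikiner(path):
    """Read the .bz2 file directly, one (words, labels) pair per paragraph."""
    examples = []
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank separator lines
            words, labels = parse_wikiner_line(line)
            if words:
                examples.append((words, labels))
    return examples


def align_labels(word_ids, label_ids, ignore_index=-100):
    """Map word-level label ids onto sub-token positions.

    word_ids: one entry per sub-token, None for special tokens.
    Only the first sub-token of each word keeps the word's label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)
        elif word_id != previous:
            aligned.append(label_ids[word_id])
        else:
            aligned.append(ignore_index)
        previous = word_id
    return aligned
```

With a fast tokenizer you would typically call it with `is_split_into_words=True` on the word list, take `word_ids()` from the result, and pass that to `align_labels` together with the integer ids of the labels.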

Thank you so much! One more thing: if I wanted the model to recognize other specific things, such as addresses, how should I proceed? Obviously I need to create additional ner_tags, but I feel I'm missing something. Should I re-train the model on the whole dataset plus my own data (as a single dataset), or can I train on my data alone?

Owner

In general, you cannot mix two datasets with different NER classes, because the inconsistent labels will confuse the model. If you want to keep the WikiNER tags (Person, Location, Organization and Miscellaneous) and add other tags (or more specific ones), the best approach is to train the model on WikiNER first and then further fine-tune it on your own custom dataset. When you create your dataset, just make sure you also label the WikiNER classes that you want to keep, otherwise the model will stop recognizing them.
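
A small sketch of what that could look like when preparing the custom dataset. Everything here is an assumption for illustration: the class names `PER`, `LOC`, `ORG`, `MISC` and `ADDRESS`, the `B-`/`I-` tagging scheme, and the helper names. Use whatever scheme your own annotations follow. The first helper builds one label map covering both the old and the new classes; the second reports which of the classes you want to keep never appear in your examples.

```python
# Hypothetical class names: the four WikiNER classes plus one custom class.
WIKINER_CLASSES = ["PER", "LOC", "ORG", "MISC"]
NEW_CLASSES = ["ADDRESS"]


def build_label_maps(classes):
    """Build label2id / id2label for an assumed B-/I- tagging scheme."""
    labels = ["O"]
    for name in classes:
        labels += [f"B-{name}", f"I-{name}"]
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


def missing_classes(examples, required):
    """Return the required classes that never occur in the examples.

    examples: iterable of (words, labels) pairs with string labels.
    """
    seen = set()
    for _words, labels in examples:
        for label in labels:
            if label != "O":
                # Drop the B-/I- prefix to get the bare class name.
                seen.add(label.split("-", 1)[-1])
    return sorted(set(required) - seen)


label2id, id2label = build_label_maps(WIKINER_CLASSES + NEW_CLASSES)
```

If `missing_classes(my_examples, WIKINER_CLASSES)` returns a non-empty list, the custom dataset has no examples for those classes. Also note that a checkpoint trained with a different number of labels will not load into the new label set as-is; in transformers, `from_pretrained` accepts `ignore_mismatched_sizes=True`, which loads the encoder and re-initializes the classification layer.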
