what datasets are in data.json?
From the model card I can see the following datasets:
- urchade/pile-mistral-v0.1
- knowledgator/GLINER-multi-task-synthetic-data
- EmergentMethods/AskNews-NER-v0
I wonder how you trained it on several datasets: were they merged, or was the model fine-tuned on each one in turn?
Thanks
I merged all datasets.
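In case it helps anyone reproducing this, here is a minimal sketch of such a merge, assuming each dataset has already been exported to GLiNER-style JSON records; the file names below are hypothetical:

```python
import json

# Hypothetical local file names; each is assumed to hold a list of
# GLiNER-style records, e.g. {"tokenized_text": [...], "ner": [...]}.
dataset_files = [
    "pile-mistral-v0.1.json",
    "gliner-multi-task-synthetic.json",
    "asknews-ner-v0-train.json",
]

merged = []
for path in dataset_files:
    with open(path, "r", encoding="utf-8") as f:
        merged.extend(json.load(f))  # plain concatenation, no re-weighting

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False)

print(f"merged {len(dataset_files)} files into {len(merged)} examples")
```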
Always learning something new; I hadn't tried this approach yet, so I've just started a new training run with the merged datasets, thanks.
I merged the AskNews "train" split and the rest of the data together:
Table for zero-shot benchmark:

| Benchmark | Score |
| --- | --- |
| CrossNER_AI | 44.7% |
| CrossNER_literature | 51.3% |
| CrossNER_music | 53.8% |
| CrossNER_politics | 67.4% |
| CrossNER_science | 48.3% |
| mit-movie | 16.0% |
| mit-restaurant | 6.9% |
| Average | 41.2% |
It's far worse than training with pile-mistral:
| Benchmark | Score |
| --- | --- |
| CrossNER_AI | 52.1% |
| CrossNER_literature | 55.0% |
| CrossNER_music | 62.9% |
| CrossNER_politics | 65.8% |
| CrossNER_science | 61.1% |
| mit-movie | 32.1% |
| mit-restaurant | 12.3% |
| Average | 48.8% |
or just pilener:
Table for zero-shot benchmark:

| Benchmark | Score |
| --- | --- |
| CrossNER_AI | 57.6% |
| CrossNER_literature | 52.3% |
| CrossNER_music | 62.6% |
| CrossNER_politics | 67.0% |
| CrossNER_science | 55.9% |
| mit-movie | 46.3% |
| mit-restaurant | 31.3% |
| Average | 53.3% |
I'm using transformers 4.41.0 and the gliner_config.json from this repo. I've evaluated your model, and it's far better:
Table for zero-shot benchmark:

| Benchmark | Score |
| --- | --- |
| CrossNER_AI | 57.7% |
| CrossNER_literature | 65.9% |
| CrossNER_music | 65.7% |
| CrossNER_politics | 67.5% |
| CrossNER_science | 66.3% |
| mit-movie | 46.7% |
| mit-restaurant | 32.6% |
| Average | 57.5% |
It looks like just merging these datasets isn't enough to reproduce it. Maybe gliner_config.json is missing something, or were there originally only 6000 steps? BTW, I'm trying to reproduce it just to understand how to create a good dataset for Polish, but I'm still struggling with this base model.
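For anyone else inspecting a checkpoint, here is a minimal zero-shot usage sketch with the gliner library; the model id is only a placeholder, and the benchmark numbers above come from the GLiNER evaluation script rather than this snippet:

```python
from gliner import GLiNER

# Placeholder model id; substitute the checkpoint you want to inspect.
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

text = "Ludwig van Beethoven composed his Ninth Symphony in Vienna."
labels = ["person", "location", "musical work"]  # arbitrary zero-shot label set

# predict_entities returns dicts with "text", "label", "score", "start", "end".
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(f'{entity["text"]} -> {entity["label"]} ({entity["score"]:.2f})')
```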
Yeah, I understand your struggle, because it was initially very hard for me to get good results with decoder models as well. It looks like DeBERTa is very well suited to the GLiNER architecture, but decoder models can work well if you set "embed_ent_token" to false. Also, I fine-tuned the model in two steps: the first on the merged datasets and the second on a high-quality subset of knowledgator/GLINER-multi-task-synthetic-data.
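For anyone following along, a small sketch of flipping that flag in an existing gliner_config.json; the path is whatever your local checkpoint directory uses:

```python
import json

config_path = "gliner_config.json"  # inside your local checkpoint directory

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Disable the entity-token embedding, as suggested above for decoder backbones.
config["embed_ent_token"] = False

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```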
@Ihor belated thanks for this. Could you share what methodology was used to select the second-stage subset from knowledgator/GLINER-multi-task-synthetic-data? I'd like to get back to training, this time with ModernBERT as the base model.