Spaces in tokens
I dug through the GLiNER codebase a while back, and while I'm still not sure, I believe the default WordSplitter is used, and that it doesn't include a space at the start of each word. Since ModernBERT uses an OLMo-style BPE tokenizer, most of the vocabulary entries carry a leading space, so tokenizing bare words maps them to different (and rarer) token IDs than the model saw in pretraining. When I was trying out GLiNER as an eval during training I ended up rolling my own splitter to work around this; it might be worth a look in case fixing it gives even better performance.
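For illustration, here's a minimal sketch of the mismatch (assuming the Hugging Face `answerdotai/ModernBERT-base` checkpoint; the exact token strings are illustrative of GPT-2/OLMo-style BPE behavior, where `Ġ` marks a leading space):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# A bare word and its space-prefixed variant hit different vocab entries,
# and pretraining overwhelmingly saw the space-prefixed form mid-sentence.
print(tok.tokenize("hello"))   # e.g. ['hello']
print(tok.tokenize(" hello"))  # e.g. ['Ġhello']
print(tok.convert_tokens_to_ids(tok.tokenize("hello")) ==
      tok.convert_tokens_to_ids(tok.tokenize(" hello")))  # False
```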
(The model seems to be working well, so perhaps this isn't an issue, but it feels like the kind of thing that could cause mysterious underperformance.)
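In case it's useful, here's a rough sketch of the kind of workaround I mean (the `tokenize_words` helper is hypothetical, just to illustrate prepending a space to every non-initial word before tokenizing):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase

def tokenize_words(words: list[str], tokenizer: PreTrainedTokenizerBase) -> list[list[str]]:
    # Hypothetical helper: prepend a space to every word except the first
    # so each word maps onto the space-prefixed vocab entries.
    return [
        tokenizer.tokenize(word if i == 0 else " " + word)
        for i, word in enumerate(words)
    ]

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
print(tokenize_words(["New", "York", "City"], tok))
# e.g. [['New'], ['ĠYork'], ['ĠCity']]
```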
@johnowhitaker, thank you for pointing out this issue. It could explain why we get poor results with the uni-encoder token-level GLiNER and why, in general, the ModernBERT version requires more data. This bi-encoder GLiNER is span-level, so it may mitigate the issue, but it's worth investigating more deeply.