Spaces in tokens
I dug through the GLiNER codebase a while back, and while I'm still not sure, I believe the default WordSplitter is used, and that it doesn't include a space at the start of each word. Since ModernBERT uses an OLMo-style BPE tokenizer, most of the vocabulary entries carry a leading space, so tokenizing bare words maps them to different (and rarer) token IDs than the model saw in pretraining. When I was trying out GLiNER as an eval during training I ended up rolling my own splitter to work around this; it might be worth a look in case fixing it gives even better performance.
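For illustration, here's a minimal sketch of the mismatch (assuming the Hugging Face `answerdotai/ModernBERT-base` checkpoint; the exact token strings are illustrative of GPT-2/OLMo-style BPE behavior, where `Ġ` marks a leading space):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# A bare word and its space-prefixed variant hit different vocab entries,
# and pretraining overwhelmingly saw the space-prefixed form mid-sentence.
print(tok.tokenize("hello"))   # e.g. ['hello']
print(tok.tokenize(" hello"))  # e.g. ['Ġhello']
print(tok.convert_tokens_to_ids(tok.tokenize("hello")) ==
      tok.convert_tokens_to_ids(tok.tokenize(" hello")))  # False
```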
(The model seems to be working well, so perhaps this isn't an issue, but it feels like the kind of thing that could cause mysterious underperformance.)
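In case it's useful, here's a rough sketch of the kind of workaround I mean (the `tokenize_words` helper is hypothetical, just to illustrate prepending a space to every non-initial word before tokenizing):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase

def tokenize_words(words: list[str], tokenizer: PreTrainedTokenizerBase) -> list[list[str]]:
    # Hypothetical helper: prepend a space to every word except the first
    # so each word maps onto the space-prefixed vocab entries.
    return [
        tokenizer.tokenize(word if i == 0 else " " + word)
        for i, word in enumerate(words)
    ]

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
print(tokenize_words(["New", "York", "City"], tok))
# e.g. [['New'], ['ĠYork'], ['ĠCity']]
```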
@johnowhitaker, thank you for pointing out this issue. It could explain why we get poor results with the uni-encoder token-level GLiNER and why, in general, the ModernBERT version requires more data. This bi-encoder GLiNER is span-level, so it may mitigate the issue, but it's worth investigating more deeply.