I am using spaCy (3.4) for my NER projects, with either the NER, SpanCategorizer, or Transformer components. Intrigued by the ConvBERT description, I am trying to evaluate using conv-bert-base instead of the spaCy default, roberta.
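For reference, swapping the transformer is just a config change. This is a sketch of what my config looks like; the architecture version and the Hugging Face model id (`YituTech/conv-bert-base`) may differ in your setup:

```ini
# Fragment of config.cfg — replace the default roberta-base
# with a ConvBERT checkpoint from the Hugging Face hub.
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "YituTech/conv-bert-base"
```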
It all seems to be working, but on each model save (checkpoint) I get about a thousand warnings:
```
The OrderedVocab you are attempting to save contains a hole for index 1311, your vocabulary could be corrupted !
...
The OrderedVocab you are attempting to save contains a hole for index 30519, your vocabulary could be corrupted !
```
After that, training appears to proceed normally. But while the training log reports 'reasonable' p/r/f scores (~97%), when I try to evaluate the saved model the scores drop to near zero. Apparently the evaluation during training uses the existing in-memory model, while the checkpoint-saved model is corrupt.
Can anybody give me a hint as to where to look for 'OrderedVocab' and how to fix or work around this?
My data comes from OCR, so it contains a lot of OOV 'words' (various garbage), and I am not surprised by an excessive vocabulary. But failure to save the model (vocab) is a show stopper.
The problem is in the checked-in vocab.txt: it contains many replicated tokens. I uploaded my patch and more comments to :
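To illustrate the issue and the workaround: when vocab.txt contains the same token twice, both lines map to a single id in the tokenizer's token→id table, so the second line's index is left unoccupied and the tokenizer reports it as a "hole" on save. A minimal sketch (helper names are mine, not from any library) that finds the duplicates and patches them in place:

```python
from collections import Counter

def find_duplicates(tokens):
    """Return tokens that appear more than once in a WordPiece vocab list."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.items() if n > 1]

def dedupe_vocab(tokens):
    """Keep the first occurrence of each token and rename later
    duplicates to unique placeholders. Simply deleting the duplicate
    lines would shift every following id and break the trained model,
    so each index must stay occupied."""
    seen = set()
    fixed = []
    for i, tok in enumerate(tokens):
        if tok in seen:
            tok = f"[unused_dup_{i}]"  # placeholder keeps index i occupied
        seen.add(tok)
        fixed.append(tok)
    return fixed
```

Reading vocab.txt, running `dedupe_vocab` over its lines, and writing it back silenced the warnings for me; renaming rather than removing keeps all trained embedding rows aligned with their original indices.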