Failing to save OrderedVocab when using conv-bert-base #1

by mbrunecky - opened

I am using spaCy (3.4) for my NER projects, with either the NER, SpanCategorizer, or Transformer components. Intrigued by the ConvBERT description, I am trying to evaluate conv-bert-base instead of the spaCy default, roberta.

It all seems to be working, but on each model save (checkpoint), I am getting about a thousand warnings:

The OrderedVocab you are attempting to save contains a hole for index 1311, your vocabulary could be corrupted !
... through ...
The OrderedVocab you are attempting to save contains a hole for index 30519, your vocabulary could be corrupted !

Following that, training appears to proceed normally. But while the training log reports 'reasonable' P/R/F scores (~97%), when I try to evaluate the saved model the scores drop to near zero. Apparently, during training, evaluation uses the existing in-memory model, but the checkpoint-saved model is corrupt.

Can anybody give me a hint as to where to look for 'OrderedVocab' and how to fix or work around this?

My data comes from OCR, so it contains a lot of OOV 'words' (various garbage), and I am not surprised by an excessive vocabulary. But failure to save the model (vocab) is a show-stopper.

The problem is in the checked-in vocab.txt: it contains many replicated tokens. I uploaded my patch and more comments to:
https://github.com/huggingface/tokenizers/pull/954
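For anyone hitting the same warning: in a WordPiece-style vocab.txt each token's line number is its id, so a token that appears twice collapses to a single entry in the token-to-id map, and the other index is left with no token, i.e. a "hole" in the OrderedVocab when the tokenizer is saved. A minimal sketch for spotting such duplicates (the sample token list is illustrative, not from the actual conv-bert-base vocab):

```python
# Sketch: find duplicated tokens in a WordPiece-style vocab, where the
# position of each token is its id. Duplicates collapse to one id in the
# token -> id map, so the remaining indices become "holes" on save.
from collections import defaultdict

def find_duplicates(tokens):
    positions = defaultdict(list)
    for idx, tok in enumerate(tokens):
        positions[tok].append(idx)
    # keep only tokens that occur at more than one index
    return {tok: locs for tok, locs in positions.items() if len(locs) > 1}

# illustrative vocab with "the" repeated at indices 2 and 4
vocab = ["[PAD]", "[UNK]", "the", "##ing", "the"]
print(find_duplicates(vocab))  # -> {'the': [2, 4]}; index 4 would be a hole
```

To check a real file, read vocab.txt line by line (stripping the trailing newline) and pass the resulting list to the same function; each reported index after the first is one of the holes the warning complains about.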

mbrunecky changed discussion status to closed