Tokenization

#1
Opened by Corran

First, amazing work! Public Resource is an amazing org, and this is an awesome effort and collaboration. One thing I've realised is that the tokenization doesn't look quite right. In your paper you describe that you had to amend the 100_WB vocabulary to include common words ("the", etc.) as tokens; however, this model's tokenizer appears to split them. I'm wondering if this is the non-corrected version?
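For anyone who wants to reproduce what I'm seeing, here is a minimal sketch, assuming the `transformers` library is installed; the model ID below is an assumption on my part, so substitute the checkpoint this discussion is attached to:

```python
from transformers import AutoTokenizer

# Model ID assumed for illustration; replace with the checkpoint under discussion.
tok = AutoTokenizer.from_pretrained("globuslabs/ScholarBERT")

# If the vocabulary was amended as the paper describes, common words like "the"
# should come back as a single token rather than being split into subwords.
for word in ["the", "and", "of"]:
    print(word, "->", tok.tokenize(word))
```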

Thanks again, awesome work!

Hi @Corran! We appreciate your interest in the work.

> I'm wondering if this is the non-corrected version?

That is correct. We observed no measurable difference between the two models, and the results reported in our preprint are from the non-corrected models, so those are the ones we uploaded to HuggingFace. Looking back at the preprint, I don't think we made that very clear.

I might still have the corrected versions if you are interested in those.

Very sorry for the late reply, and thanks for your great response!

This makes sense. I would still very much be interested in the corrected versions, if you have them accessible and it wouldn't be any trouble to upload them.

Hi @Corran, the model with the 64-bit tokenizer is now available at https://huggingface.co/globuslabs/ScholarBERT_100_64bit
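For anyone landing here later, a minimal sketch of how one might verify the corrected tokenizer, assuming the `transformers` library; the model ID is taken from the link above:

```python
from transformers import AutoTokenizer

# Corrected checkpoint, per the link in this thread.
tok = AutoTokenizer.from_pretrained("globuslabs/ScholarBERT_100_64bit")

# A direct check: the amended vocabulary should contain common words outright,
# so they map to a single token instead of being split into subwords.
vocab = tok.get_vocab()
for word in ["the", "and", "of"]:
    print(word, "in vocab:", word in vocab)
```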
