Tokenization

#1
by lucy3 - opened

Might be worthwhile to look further into the vocabulary of this model. Unsure if this is a WordPiece issue or unusual artifacts in the training data:

##therehereatiteleavebesenotelinesstyandoacyndeegreeboneansenstrokemorehighe¯veryterinetainpositivestitisementicludequityerallleveleandicludingamelylesseccedancedelmentibeingasiauddeplineungaltanurchylateworttimeontentwhichsexaintevariatepartiteagueeritiarytheseondonkageamytanolroleunateonfeongeephenreachonsequenceonvoaicijunctiondatoryestheriouslyliagealisationaliseeliegestionfullereforecrollebasedwisehateoinolexiamouhatieldethepartatehalllaihereaonceduoquantitativealiteghum
and : -. of ;,andstrokeverylinessgreellongandithusmentibonee¯thety°waterlatandoongthat¢topgineslevel _telacyhightermhashatreachtttingquityandenglyfulllexiandeeucuthere thatriouslyonuourse buthattyearadh #cityhguo grand particularsnonwhatlamptheaycemicuntithiajingcatayllus orlaihardcht¼nodes /e!cativeoundetoolsbeam jalisationof hingrlylsyhisthema glutamatenumbersberta

Thank you for your feedback. Could you provide more context on the issue? How did you get these two strings? Are they the tokenization results of some text? If so, could you provide the original text and the code you used to tokenize it?

Sorry, it was actually a problem on my end! What I was doing was:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('globuslabs/ScholarBERT', use_fast=True)
# decode() merges the subword pieces back into a single string, which produced the output above
tokenizer.decode([30, 4115, 1909, 18, 17, 18008, 2329, 16, 31, 1006])

when instead I should have been doing:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('globuslabs/ScholarBERT', use_fast=True)
# convert_ids_to_tokens() returns the raw vocabulary entries, '##' prefixes included
tokenizer.convert_ids_to_tokens([30, 4115, 1909, 18, 17, 18008, 2329, 16, 31, 1006])
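
For anyone else who runs into this: decode() detokenizes, i.e. it merges WordPiece pieces and drops the '##' markers, while convert_ids_to_tokens() returns the raw vocabulary entries, so decoding an arbitrary list of IDs yields concatenated strings like the ones above. A minimal sketch of the difference, using the generic bert-base-uncased tokenizer rather than the ScholarBERT vocabulary (the exact split shown in the comments is illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
ids = tok.encode("tokenization", add_special_tokens=False)

# Raw vocabulary entries with subword markers intact, e.g. ['token', '##ization']
print(tok.convert_ids_to_tokens(ids))

# Detokenized string with the '##' markers merged away, e.g. 'tokenization'
print(tok.decode(ids))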

Thanks for the prompt reply though :)

lucy3 changed discussion status to closed
