Tokenization

#1
by lucy3 - opened

Might be worthwhile to look further into the vocabulary of this model. Unsure if this is a WordPiece issue or unusual artifacts in the training data:

##therehereatiteleavebesenotelinesstyandoacyndeegreeboneansenstrokemorehighe¯veryterinetainpositivestitisementicludequityerallleveleandicludingamelylesseccedancedelmentibeingasiauddeplineungaltanurchylateworttimeontentwhichsexaintevariatepartiteagueeritiarytheseondonkageamytanolroleunateonfeongeephenreachonsequenceonvoaicijunctiondatoryestheriouslyliagealisationaliseeliegestionfullereforecrollebasedwisehateoinolexiamouhatieldethepartatehalllaihereaonceduoquantitativealiteghum
and : -. of ;,andstrokeverylinessgreellongandithusmentibonee¯thety°waterlatandoongthat¢topgineslevel _telacyhightermhashatreachtttingquityandenglyfulllexiandeeucuthere thatriouslyonuourse buthattyearadh #cityhguo grand particularsnonwhatlamptheaycemicuntithiajingcatayllus orlaihardcht¼nodes /e!cativeoundetoolsbeam jalisationof hingrlylsyhisthema glutamatenumbersberta

Thank you for your feedback. Could you provide more context on the issue? How did you get these two strings? Are they the tokenization results of some text? If so, could you provide the original text and the code you used to tokenize it?

Sorry, it was actually a problem on my end! What I was doing was:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('globuslabs/ScholarBERT', use_fast=True)
# decode() merges the subword pieces back into a single string, which produced the output above
tokenizer.decode([30, 4115, 1909, 18, 17, 18008, 2329, 16, 31, 1006])

when instead I should have been doing:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('globuslabs/ScholarBERT', use_fast=True)
# convert_ids_to_tokens() returns the raw vocabulary entries, '##' prefixes included
tokenizer.convert_ids_to_tokens([30, 4115, 1909, 18, 17, 18008, 2329, 16, 31, 1006])
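
For anyone else who runs into this: decode() detokenizes, i.e. it merges WordPiece pieces and drops the '##' markers, while convert_ids_to_tokens() returns the raw vocabulary entries, so decoding an arbitrary list of IDs yields concatenated strings like the ones above. A minimal sketch of the difference, using the generic bert-base-uncased tokenizer rather than the ScholarBERT vocabulary (the exact split shown in the comments is illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
ids = tok.encode("tokenization", add_special_tokens=False)

# Raw vocabulary entries with subword markers intact, e.g. ['token', '##ization']
print(tok.convert_ids_to_tokens(ids))

# Detokenized string with the '##' markers merged away, e.g. 'tokenization'
print(tok.decode(ids))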

Thanks for the prompt reply though :)

lucy3 changed discussion status to closed
