
How to get the regex pattern of BLOOM tokenizer?

#212
by arvin-2023 - opened

I found the pattern in tokenizer.json, but it turns out to be wrong.

"pattern": {
  "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
}
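One possible reason the pattern "turns out wrong" is the regex engine used to test it. The sketch below assumes that the tokenizer's own engine treats the inner [...] as a nested character class, so the whole bracket expression is a single negated set, while Python's standard re module has no nested sets and closes the class at the first ]. Read the first way, the pattern is: an optional space, then one or more characters that are not whitespace, not one of the listed punctuation marks, and not a literal (, | or ). FLAT_PATTERN and compiles_in_re are hypothetical names used only for illustration.

```python
import re

# Pattern from tokenizer.json after JSON unescaping (one backslash).
BLOOM_PATTERN = r" ?[^(\s|[.,!?…。,、।۔،])]+"

# Assumption: a flattened equivalent for Python's `re`, listing every
# member of the outer and inner class once in a single negated set.
FLAT_PATTERN = r" ?[^\s()|.,!?…。,、।۔،]+"


def compiles_in_re(pattern):
    """Return True if Python's stdlib `re` accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    # `re` ends the class at the first "]" and then meets a stray ")".
    print(compiles_in_re(BLOOM_PATTERN))
    print(compiles_in_re(FLAT_PATTERN))
    # Each match is an optional leading space plus a run of allowed characters.
    print(re.findall(FLAT_PATTERN, "Hello, world!"))
```

If the flattened form behaves as expected in Python, the original pattern is probably fine for the tokenizer and only needs rewriting when it is reused in a different engine.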

"Regex": " ?[^(\s|[.,!?…。,、।۔،])]+"
....? eh?
And:
? = special token, can cause an immediate start AND end to a message, implying confusion.
[] = an invisible token/code pair for signifying one or more inner, often paired but not a rule, groupings, or aka start of body.
^( . . . ) =: ^ = beginning of a section or dialogue. Compared with ( . . . ), this would imply that the character is to use inner thoughts, OOC, notations of actions, etc. only as a full thought, not as a constant scatterbrained switching from one mode to the other, leaving only one opportunity after actual spoken dialogue for an after-inner-thought. Note: \ is there to make "(" literal dialogue, not code, reinforced by exiting itself with another "" to keep the code tidy and self-contained.
| = separator that further implies that :
[] = another nested pair of group-announcing tokens/code marking where the main body begins, where actual dialogue is most likely. Then you have all the grammar declared... another । (but not a |? Weird, I guess short and thick means it's ending dialogue...?) to signify the end of speaking, leaving room for, again, one after ... speak ... before a potential after-thought. Marked off by:
])] to cap it all off and finish the declaration, and "+", a special token I've noticed some models using... once it passes this, it will do it all over again IF it still has tokens left... or just forever sometimes if the AI is pissed lol.

In conclusion

  1. I pasted the same exact thing as you did so...
  2. Every part is accounted for and has been seen to be correct usage, some by common sense and some less obviously, like "+" (which in regular expressions seems to substitute for the end-of-line marker $ as well...) ...
    So.... what's not right/working/correct here?
    Well, now that I think about it... it could probably go for a slight change... maybe 2: "Regex": " ?![^(\s|[.,!?…。,、।۔،?!])]+"
    But that's being picky on my part, since I like to dupe my characters into giving away that they're about to run their glitchy mouths with !!! ??? or ..., ... but that's just me :P
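For reference, here is a rough sketch of what the proposed change would do under ordinary regex semantics, where " ?!" means an optional space followed by a literal "!". It assumes the same flattening of the nested class into one negated set for Python's re module; PROPOSED_FLAT and find_pieces are hypothetical names, and this is not the tokenizer's own engine.

```python
import re

# Assumption: flattened form of the proposed pattern for Python's `re`.
# The leading " ?!" requires a literal "!" at the start of every match.
PROPOSED_FLAT = r" ?![^\s()|.,!?…。,、।۔،]+"


def find_pieces(text):
    """Return all non-overlapping matches of the proposed pattern."""
    return re.findall(PROPOSED_FLAT, text)


if __name__ == "__main__":
    # No word here starts with "!", so nothing matches.
    print(find_pieces("Hello, world!"))
    # Only runs that directly follow a "!" are matched.
    print(find_pieces("wow!really?"))
```

Under these assumptions, text that does not follow an exclamation mark is no longer matched at all, so the change is considerably more than a cosmetic tweak.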
christopher changed discussion status to closed
