bigscience/bloom · How to specify the language for generating text?

Sep 15, 2022

When I use BLOOM to generate text, the input text is in English, but sometimes the generated text is in Chinese or another language.

Muennighoff

BigScience Workshop org Sep 22, 2022

Two ways come to mind:

1. Prompt: Ideally the prompt makes it clear what the desired language is. You can try to state the language explicitly, e.g. by starting the prompt with "English Text: ...".
2. Logprob selection: You could probably also do some fancy token selection to ensure that the generated text is in the language you'd like (e.g. never pick Chinese tokens), but this wouldn't work well for similar languages and requires significant engineering.
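To make both ideas a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `build_prompt`, `contains_cjk`, `banned_token_ids`, the code point ranges, and the toy vocabulary are all made up, and none of it is BLOOM's actual tooling. It only shows the shape of the approach: prefix the prompt with the language, and collect the ids of tokens whose text looks Chinese so they can be excluded during generation.

```python
from typing import Dict, List

# Assumed code point ranges treated as "Chinese-looking". This is a heuristic,
# not an exhaustive or official definition.
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # halfwidth and fullwidth forms
]


def build_prompt(text: str, language: str = "English") -> str:
    """Hypothetical helper: state the desired language at the start of the prompt."""
    return f"{language} Text: {text}"


def contains_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the assumed ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def banned_token_ids(id_to_text: Dict[int, str]) -> List[int]:
    """Collect ids of tokens whose decoded text contains a CJK character."""
    return sorted(i for i, t in id_to_text.items() if contains_cjk(t))


# Toy vocabulary standing in for a real tokenizer's id -> decoded text mapping.
toy_vocab = {0: "hello", 1: " world", 2: "你好", 3: "，", 4: " the"}

print(build_prompt("The weather today is"))
print(banned_token_ids(toy_vocab))
```

With the `transformers` library, a list like this could in principle be wrapped as `[[i] for i in ids]` and passed to `generate` through its `bad_words_ids` argument. Note that with a byte-level tokenizer some tokens hold only part of a multi-byte character, so a character-based check like this one will miss them; that is part of the "significant engineering" mentioned above.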

TimeRobber

BigScience Workshop org Sep 27, 2022

Also something to be careful about is trailing spaces.

Our tokenization tends to merge the space in front of a word with the word itself, so that instead of ["word1", " ", "word2", " ", "word3"] we get ["word1", " word2", " word3"] (notice the prefix space). This is token efficient: it reduces the number of tokens needed to encode a given sequence. As a result, standalone space tokens are statistically rare in English text, since you would need consecutive spaces within a sentence to produce one. In Chinese, you get a lot more of them. So a prompt that ends in a trailing space ends on a token that is unusual for English, which may nudge the continuation toward another language.
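The merging behaviour is easy to see with a toy splitter. The sketch below is not the BLOOM tokenizer; `toy_pretokenize` and `clean_prompt` are hypothetical helpers, and the regex only imitates the "space sticks to the following word" rule described here. It shows that a standalone space appears only when nothing follows it (a trailing space) or when spaces are doubled, and that stripping the prompt avoids the trailing case.

```python
import re
from typing import List

# An optional single space glued to the following run of non-space characters,
# or else a lone whitespace character that has no word to attach to.
_TOY_PATTERN = re.compile(r" ?[^\s]+|\s")


def toy_pretokenize(text: str) -> List[str]:
    """Toy imitation of prefix-space merging (not the real tokenizer)."""
    return _TOY_PATTERN.findall(text)


def clean_prompt(prompt: str) -> str:
    """Drop trailing whitespace so the prompt does not end on a lone space."""
    return prompt.rstrip()


print(toy_pretokenize("word1 word2 word3"))   # spaces merged into the words
print(toy_pretokenize("word1 word2 "))        # trailing space stays on its own
print(toy_pretokenize("word1  word2"))        # doubled space leaves one lone space
print(toy_pretokenize(clean_prompt("word1 word2 ")))
```

In practice, calling something like `clean_prompt` on the prompt before tokenizing it is usually enough; the model then produces the " word" token with its own leading space.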

luoguanran changed discussion status to closed Sep 29, 2022