How to specify the language for generating text?

#110
by luoguanran - opened

When I use BLOOM to generate text, the input text is English, but the generated text is sometimes in Chinese or another language.

BigScience Workshop org

Two ways come to mind:

  1. Prompt: Ideally, the prompt makes the desired language clear. You can try naming the language explicitly, e.g. English Text: ... etc.
  2. Logprob selection: You could probably also constrain token selection so that the generated text stays in the language you'd like (e.g. never pick Chinese tokens), but this wouldn't work well for closely related languages and requires significant engineering.
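
To make the two options concrete, here is a minimal sketch. The helper names (build_prompt, contains_cjk, banned_token_ids) and the choice of Unicode ranges are assumptions made for illustration; they are not part of BLOOM or of any library.

```python
# Sketch only: every helper name here is hypothetical.

# Unicode blocks treated as "Chinese" for this illustration (an assumption;
# adjust them for whichever scripts you want to exclude).
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x3400, 0x4DBF),  # CJK unified ideographs, extension A
    (0x4E00, 0x9FFF),  # CJK unified ideographs
    (0xFF00, 0xFFEF),  # half-width and full-width forms
)


def build_prompt(text, language="English"):
    """Option 1: name the desired language at the start of the prompt."""
    # rstrip() drops trailing whitespace from the user text.
    return f"{language} Text: {text.rstrip()}"


def contains_cjk(text):
    """Return True if any character falls inside one of CJK_RANGES."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def banned_token_ids(decoded_vocab):
    """Option 2: collect ids of tokens whose decoded text contains CJK.

    decoded_vocab maps token id -> decoded token string. The result is a
    list of single-id lists, one per banned token.
    """
    return [[tid] for tid, tok in sorted(decoded_vocab.items()) if contains_cjk(tok)]


# Toy vocabulary standing in for a real tokenizer's vocabulary.
toy_vocab = {0: "the", 1: "中", 2: " word", 3: "文字"}
print(build_prompt("The weather today is "))
print(banned_token_ids(toy_vocab))
```

With the transformers library, the decoded vocabulary could be built with something like {i: tokenizer.decode([i]) for i in range(len(tokenizer))}, and the resulting list passed to model.generate(..., bad_words_ids=...). Tokens that hold only part of a multi-byte character would slip through this filter, which is part of the engineering effort mentioned above.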
BigScience Workshop org

Another thing to be careful about is trailing spaces in the prompt.

Our tokenizer tends to merge the space in front of a word with the word itself, so instead of ["word1", " ", "word2", " ", "word3"] we get ["word1", " word2", " word3"] (notice the prefix space). This is done for token efficiency: it reduces the number of tokens needed to encode a given sequence. As a consequence, standalone space tokens are statistically rare in English text, since you would need consecutive spaces within a sentence to produce one. In Chinese, you get a lot more of them, so a prompt that ends with a space may nudge the model away from English.
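
The effect can be illustrated with a toy pre-tokenizer. This is a simplified regex-based stand-in (a hypothetical toy_pretokenize function), not BLOOM's actual tokenizer; it only mimics the behaviour of attaching one leading space to the following word.

```python
import re

# Toy rule, assumed for illustration:
#   " ?\S+"      a word, optionally with one leading space attached
#   "\s+(?!\S)"  whitespace that is not directly followed by a word
#   "\s+"        any remaining whitespace
_PATTERN = re.compile(r" ?\S+|\s+(?!\S)|\s+")


def toy_pretokenize(text):
    """Split text so that a single space is merged into the next word."""
    return _PATTERN.findall(text)


# Spaces between words are absorbed into the following word.
print(toy_pretokenize("word1 word2 word3"))

# A trailing space has no word to attach to, so it becomes its own piece.
print(toy_pretokenize("word1 word2 "))

# Two consecutive spaces: one stays alone, the other joins the next word.
print(toy_pretokenize("a  b"))
```

In practice, calling .rstrip() on the prompt before tokenizing avoids ending it on a standalone space. To check what the real tokenizer does, inspecting tokenizer.tokenize(prompt) from the transformers library should show whether the last token is a bare space.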

luoguanran changed discussion status to closed
