
In the tokenizer of the moss-moon-003-base model, the eos token is <|endoftext|>. When training the SFT model, this token needs to be designated as the <eom> token.

SFT stage

  • <eoh>: end of human
  • <eot>: end of thoughts
  • <eoc>: end of commands
  • <eom>: end of moss
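
To make the role of the four end markers concrete, here is a minimal sketch that splits a string into the segments they close. The helper name split_segments and the sample text are hypothetical and exist only for illustration; this is not MOSS's own parsing code, and the real prompt layout may differ.

```python
import re

# End-of-segment markers from the list above, mapped to the segment they close.
END_MARKERS = {
    "<eoh>": "human",
    "<eot>": "thoughts",
    "<eoc>": "commands",
    "<eom>": "moss",
}

# A capturing group makes re.split keep the markers in its output.
_MARKER_RE = re.compile("(" + "|".join(re.escape(m) for m in END_MARKERS) + ")")


def split_segments(text):
    """Return (segment_name, content) pairs for every segment closed by a marker."""
    parts = _MARKER_RE.split(text)
    # parts alternates: content, marker, content, marker, ..., trailing text.
    # zip stops at the last marker, so unterminated trailing text is ignored.
    return [
        (END_MARKERS[marker], content.strip())
        for content, marker in zip(parts[0::2], parts[1::2])
    ]


# Hypothetical sample, invented for this sketch.
sample = "hello<eoh> greet the user<eot> None<eoc> Hi there<eom>"
segments = split_segments(sample)
```

Text after the final marker is dropped on purpose, since a segment without its end marker is incomplete.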

Note

MOSS's convert_tokens_to_string:

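    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (strings) into a single string."""
        # Concatenate the byte-level token strings before decoding.
        text = "".join(tokens)
        # Map each character back to its raw byte, then decode the bytes as UTF-8.
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text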
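
The method above can be tried outside the tokenizer class with a small standalone sketch. It assumes that byte_decoder is the inverse of the GPT-2 style byte-to-unicode table and that errors is "replace"; neither is stated here, so treat both as assumptions. The token split is hypothetical, chosen so that one token holds only part of a multi-byte character.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping every byte value (0..255) to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def convert_tokens_to_string(tokens, errors="replace"):
    """Standalone version of the method above (errors="replace" is an assumption)."""
    text = "".join(tokens)
    return bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors=errors)


# "你好" is 6 bytes in UTF-8 (3 per character). Hypothetical split: the first
# token carries only 2 of the 3 bytes of "你".
mapped = "".join(byte_encoder[b] for b in "你好".encode("utf-8"))
tokens = [mapped[:2], mapped[2:]]

whole = convert_tokens_to_string(tokens)        # all tokens decoded together
partial = convert_tokens_to_string(tokens[:1])  # first token decoded alone
```

Decoding all tokens together recovers the original text, while decoding the first token alone yields a replacement character (U+FFFD) because its bytes are an incomplete UTF-8 sequence. Under these assumptions, token-by-token decoding of Chinese text needs to buffer tokens until a complete character is available.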