
In the tokenizer of the moss-moon-003-base model, the eos token is `<|endoftext|>`. When training an SFT model, the eos token needs to be set to the `<eom>` token instead.

SFT stage

- `<eoh>`: end of human
- `<eot>`: end of thoughts
- `<eoc>`: end of commands
- `<eom>`: end of moss
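
To make the role of these markers concrete, here is a minimal sketch that splits a piece of text at the four end markers listed above. The function name `split_segments`, the segment labels, and the sample string are hypothetical and only for illustration; real MOSS SFT data may carry additional role prefixes that this sketch does not handle.

```python
import re

# End-of-segment markers listed in this README, mapped to illustrative labels.
END_MARKERS = {
    "<eoh>": "human",
    "<eot>": "thoughts",
    "<eoc>": "commands",
    "<eom>": "moss",
}

# One regex that matches any of the markers literally.
_MARKER_RE = re.compile("|".join(re.escape(m) for m in END_MARKERS))


def split_segments(text):
    """Return (label, content) pairs, one per end marker found in text."""
    segments = []
    start = 0
    for match in _MARKER_RE.finditer(text):
        content = text[start:match.start()].strip()
        segments.append((END_MARKERS[match.group()], content))
        start = match.end()
    # Text after the last marker is ignored: it belongs to no closed segment.
    return segments


# Hypothetical sample, not taken from the MOSS training data.
sample = "Hi<eoh>\nGreet the user<eot>\nNone<eoc>\nHello!<eom>"
for label, content in split_segments(sample):
    print(f"{label}: {content}")
```

Under this reading, `<eom>` closes the final segment of a turn, which is presumably why it takes over the role of the eos token during SFT.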

Note

The MOSS tokenizer's `convert_tokens_to_string`:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
        return text
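
The method above joins all tokens first and only then maps characters back to bytes and decodes them as UTF-8. The standalone sketch below shows why that order matters. It assumes the tokenizer uses the GPT-2 style byte-to-unicode table (an assumption, since the README does not show how `byte_decoder` is built), and the helper `to_byte_level` and the default `errors="replace"` are introduced here purely for illustration.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte (0-255) to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted above code point 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def to_byte_level(text):
    """Hypothetical helper: encode text to UTF-8 and map each byte to its table character."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def convert_tokens_to_string(tokens, errors="replace"):
    """Same logic as the method above, written as a free function."""
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)


# "你好" is 6 UTF-8 bytes; cut the byte-level string in the middle of the first character.
encoded = to_byte_level("你好")
tokens = [encoded[:2], encoded[2:]]

print(convert_tokens_to_string(tokens))      # 你好
print(convert_tokens_to_string(tokens[:1]))  # incomplete bytes decode to U+FFFD
```

A single token can end in the middle of a multi-byte character, so decoding tokens one at a time may produce replacement characters, while decoding the joined sequence recovers the original text.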