Potential paradigm shift in LLMs: new paper by Meta claims that we can get rid of tokenizers! 🥳
Current LLMs process text by first splitting it into tokens. They use a module called a "tokenizer" that -spl-it-s- th-e- te-xt- in-to- arbitrary tokens according to a fixed dictionary.
On the Hub you can find this dictionary in a model's files under tokenizer.json.
➡️ The most common version of this process is BPE (byte-pair encoding) tokenization. It is suboptimal, and everyone says so: it breaks text into predefined chunks that often fail to capture the nuance of language. But it has been a necessary evil in language models since their inception.
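To make the idea concrete, here is a minimal sketch of how a BPE-style tokenizer applies its fixed dictionary. The merge list is invented for illustration (real merge tables are learned from a corpus and are far larger), and `bpe_split` is a hypothetical helper, not a function from any library.

```python
# Toy BPE-style splitting. MERGES is an invented, hand-written merge list;
# real tokenizers learn theirs from a large corpus.
MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("t", "o"), ("in", "to")]


def bpe_split(word: str) -> list[str]:
    """Split a word by applying the merge rules in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in MERGES:
        i = 0
        merged = []
        while i < len(symbols):
            # Fuse an adjacent pair when it matches the current merge rule.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_split("the"))    # ['the']
print(bpe_split("into"))   # ['into']
print(bpe_split("token"))  # ['to', 'k', 'e', 'n']
```

Note how "token" ends up in four pieces only because this toy dictionary has no better entry for it: the split depends on the dictionary, not on the meaning of the word.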
🔥 In Byte Latent Transformer (BLT), Meta researchers propose an elegant solution by eliminating tokenization entirely, working directly with raw bytes while maintaining efficiency through dynamic "patches."
Byte-level approaches had been tried before, but this is the first time an architecture of this type has been shown to scale as well as models built on BPE tokenization. And it could mean a real paradigm shift!
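For a sense of what "raw bytes" means as model input, here is a tiny illustration in plain Python. Nothing in it is specific to BLT; it only shows that UTF-8 text maps to integers between 0 and 255 without any dictionary file.

```python
text = "café"
byte_ids = list(text.encode("utf-8"))  # UTF-8 bytes as integers in 0..255
print(byte_ids)       # [99, 97, 102, 195, 169]
print(len(byte_ids))  # 5 bytes for 4 characters, since "é" takes two bytes

# Decoding is lossless and needs no vocabulary lookup.
assert bytes(byte_ids).decode("utf-8") == text
```

The price is longer sequences, since one byte carries less text than a typical token.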
Architecture:
Instead of a lightweight tokenizer, BLT has a lightweight encoder that processes raw bytes into patches. The patches are then processed by the main heavy-duty transformer as usual (but operating on patches of bytes instead of tokens), before being converted back to bytes.
🧩 Dynamic Patching:
Instead of fixed tokens, BLT groups bytes based on their predictability (measured by entropy) - using more compute for complex sequences and efficiently handling simple ones. This allows efficient processing while maintaining byte-level understanding.
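Here is a rough sketch of entropy-based patching, under loud assumptions. The entropy estimate comes from a bigram table counted on the input itself, as a stand-in for a trained byte-level model. The boundary rule is a single global threshold. `next_byte_entropies` and `entropy_patches` are hypothetical names for illustration, not code from the paper.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy (in bits) of each byte given the byte before it.

    Assumption: a bigram table counted on `data` itself stands in for
    a trained byte-level language model.
    """
    following = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        following[prev][nxt] += 1

    entropies = [8.0]  # first byte has no context: treat it as maximally uncertain
    for prev in data[:-1]:
        counts = following[prev]
        total = sum(counts.values())
        entropies.append(-sum(c / total * math.log2(c / total) for c in counts.values()))
    return entropies


def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch wherever the upcoming byte is hard to predict."""
    patches, start = [], 0
    for i, h in enumerate(next_byte_entropies(data)):
        if i > 0 and h > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


sample = b"the cat sat on the mat. the cat sat on the mat."
patches = entropy_patches(sample, threshold=2.0)
print(patches)
```

On this sample the boundaries fall right after each space, where the next byte is hardest to guess, so the patches come out roughly word-shaped. Predictable stretches stay inside one patch, so the heavy model spends fewer steps on them.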
I hope this breakthrough is confirmed so we can get rid of all the tokenizer stuff: it would make model handling easier!
Read their paper here: https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf