Google researchers were unhappy with the way decoding currently works: every token goes through the same layers, and thus requires exactly the same compute.
Whereas in reality, completing the answer to a difficult math problem, for instance, should be more computationally intensive than completing the text of the Declaration of Independence: Not all tokens are created equal!
➡️ They had this genius idea: 💡 having a token go through a block should be optional. A token can go through the block (thus undergoing the expensive self-attention computation) or bypass it via a skip connection. The routing decision is made at the block level: each block selects from the whole sequence the top-k tokens that will go through it, and the other tokens skip it. This lets you choose the exact capacity of a block, i.e. the proportion of tokens that go through it, which directly influences the computational intensity of the forward pass.
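To make the routing concrete, here is a minimal PyTorch sketch of per-block top-k routing in the spirit of Mixture-of-Depths. It is an illustration under assumptions, not the paper's implementation: the `MoDBlock` wrapper, the single-linear router, and the sigmoid gating are illustrative choices, and `block` is assumed to return the residual update (attention + MLP output) for the tokens it receives.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of a Mixture-of-Depths wrapper (illustrative, not the paper's code).

    `block` is assumed to map (batch, k, dim) -> (batch, k, dim) and to return
    the residual update (self-attention + MLP output) for the tokens it sees.
    `capacity` is the fraction of tokens routed through the block; the rest
    take the skip connection unchanged.
    """

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.125):
        super().__init__()
        self.block = block
        self.router = nn.Linear(dim, 1)  # one scalar routing score per token
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape
        k = max(1, int(self.capacity * seq_len))

        scores = self.router(x).squeeze(-1)        # (batch, seq_len)
        top_idx = scores.topk(k, dim=-1).indices   # (batch, k): tokens that go through

        # Gather only the selected tokens and run the expensive block on them.
        # (The sketch ignores positional/causal handling for the gathered subsequence.)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, dim)
        selected = x.gather(1, gather_idx)         # (batch, k, dim)
        update = self.block(selected)

        # Gate the update by the router score (kept differentiable via sigmoid,
        # a simplification of the paper's weighting), then scatter back;
        # unselected tokens are returned unchanged.
        gate = torch.sigmoid(scores.gather(1, top_idx)).unsqueeze(-1)
        return x.scatter(1, gather_idx, selected + gate * update)
```

In a full model, a wrapper like this would sit around some or all of the transformer blocks, which is what allows the alternating-block, low-capacity setup described below.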
This yields Mixture-of-Depths (MoD), with spectacular results.
✨ Results: Capacity can be tuned all the way down to 12.5% for every second block: thus 87.5% of tokens just skip the block! For the same training time and performance, >60% inference speedup! 🤝 Can be combined with Mixture-of-Experts for further improvements.
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
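As a concrete illustration of what "every weight is ternary" means, here is a small PyTorch sketch of absmean quantization in the style described for BitNet b1.58: scale the weight matrix by its mean absolute value, round to the nearest integer, and clip to {-1, 0, 1}. The function name and the usage snippet are illustrative, not the paper's code.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization sketch: map weights to {-1, 0, +1}.

    Scales the weight matrix by its mean absolute value, rounds each entry to
    the nearest integer, and clips to [-1, 1]. Returns the ternary weights and
    the scale needed to approximately reconstruct w.
    """
    gamma = w.abs().mean()                              # per-tensor scale (absmean)
    w_ternary = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_ternary, gamma

# Usage sketch: quantize a random weight matrix and inspect the error.
w = torch.randn(4, 4)
w_q, gamma = ternary_quantize(w)
print(w_q)                               # entries in {-1., 0., 1.}
print((w_q * gamma - w).abs().mean())    # mean quantization error
```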