arxiv:2310.11453

BitNet: Scaling 1-bit Transformers for Large Language Models

Published on Oct 17, 2023
Featured in Daily Papers on Oct 18, 2023

Abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
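As a rough illustration of the BitLinear idea described above, here is a minimal PyTorch sketch: weights binarized to ±1 and scaled by their mean absolute value, activations absmax-quantized to 8 bits, and straight-through estimators so the latent full-precision weights can still be trained from scratch. This is a simplified approximation for intuition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BitLinearSketch(nn.Linear):
    """Sketch of a 1-bit linear layer in the spirit of BitNet's BitLinear.

    Weights are binarized to {-1, +1} and scaled by their mean absolute
    value; activations are absmax-quantized to 8 bits. Straight-through
    estimators let gradients reach the latent full-precision weights.
    Simplified illustration only, not the authors' implementation.
    """

    def forward(self, x):
        w = self.weight
        # Binarize weights to +/- mean(|w|); straight-through estimator keeps w trainable.
        alpha = w.abs().mean()
        w_bin = (torch.sign(w) * alpha - w).detach() + w
        # Absmax-quantize activations to 8 bits, also with a straight-through estimator.
        scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
        x_q = ((x * scale).round().clamp(-127, 127) / scale - x).detach() + x
        return nn.functional.linear(x_q, w_bin, self.bias)

# Drop-in usage: swap nn.Linear(512, 512) for BitLinearSketch(512, 512).
layer = BitLinearSketch(512, 512)
out = layer(torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 512])
```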

Community

Holy. Mother. Of. God.

This changes everything.

If this scales, we are looking at 180B models on a 3090

Or a 40B model on an iPhone

What's next? Multiple parameters per bit...? Sounds impossible, but we do it with JPG.

Can you simplify any further????

Normally we use a 32-bit datatype for each parameter. That precision allows for about 4 billion possible values per parameter.

They have now squeezed that down to 1 bit per parameter, which allows only two possible values (in BitNet, -1 or +1).

So each weight can take 2 billion times fewer values, and yet the model retains nearly all of its accuracy.

The weights are 32 times smaller, and memory and energy costs shrink roughly in proportion.

If it used to cost $320,000 to train, it would now cost on the order of $10,000.

If it used to require 32 GPUs, it would now require 1 GPU.
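For a rough sanity check of those numbers, here is the back-of-the-envelope weight-memory arithmetic (illustrative only, using the 180B figure from the comment above; real deployments keep activations and some layers in higher precision, so actual savings are smaller):

```python
# Back-of-the-envelope memory footprint of the weights alone.
params = 180e9                    # a hypothetical 180B-parameter model

fp32_gib = params * 4 / 2**30     # 4 bytes per weight  -> ~671 GiB
fp16_gib = params * 2 / 2**30     # 2 bytes per weight  -> ~335 GiB
onebit_gib = params / 8 / 2**30   # 1 bit per weight    -> ~21 GiB

print(f"FP32:  {fp32_gib:6.0f} GiB")
print(f"FP16:  {fp16_gib:6.0f} GiB")
print(f"1-bit: {onebit_gib:6.0f} GiB")  # within reach of a 24 GB RTX 3090
```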

I really appreciate the explanation! It's very detailed and helpful. In that statement I was mostly referring to how you can get something with less precision than 1 bit per parameter. How low can the precision get!?! I would be kind of sad if 1 bit is the limit. I would assume there is some compression to make it even less.

Honestly it's way too technical for me; I categorize compression as "black magic".

But using statistics and heuristics they are able to compress data to less than a single bit per value. This is how JPG works. Whether it's possible for neural networks remains to be seen, but I wouldn't be surprised if someone cracks it.

JPG can compress by about 90%, so 0.1 bits per parameter would probably be the limit

With that level of compression you could fit GPT-4 on a consumer GPU
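If you're curious why sub-1-bit storage is possible at all: when the weight values aren't uniformly distributed, an entropy coder (the same idea behind JPG's Huffman/arithmetic stage) can store them in fewer bits on average than their raw width. A tiny sketch with a made-up weight distribution:

```python
import math

# Hypothetical, skewed distribution over 1-bit weights {-1, +1}.
# The Shannon entropy is the lower bound on the average bits per weight
# achievable by lossless entropy coding.
probs = {-1: 0.9, +1: 0.1}

entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
print(f"{entropy:.3f} bits per weight")  # ~0.469 for this made-up distribution
```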

Wow, this changes everything!! Well-written paper. GPT-4+ level performance and giant models using a regular CPU and RAM. Training costs 40-50x less, opening up a whole new paradigm. What a time to be alive! Excited to see the open-source community adopt it; we could be seeing quantised models as early as next week.

Surprising to see MSFT Research is involved, as this could jeopardize their/OpenAI's business models and control of AI safety! Where is the source code? We should safeguard it from controlling (government) hands.

This is incredibly insane.

Need an ablation study on the increase in time complexity due to on-the-fly quantisation and de-quantisation.

Can someone explain to me why a GeLU activation is used right after a BitLinear layer? Wouldn't both the input and the weights be quantized? How does a ReLU / GeLU non-linearity even affect a layer with {0, 1} output?

I believe another overlooked benefit of these -1, 0 or 1 weight values is that Google's AlphaTensor has found improved methods of multiplying such matrices. Depending on the matrix sizes, it needs 2 fewer multiplications (if I remember correctly, the best algorithm prior to this discovery took 49 steps versus AlphaTensor's 47), and for some sizes as many as 4 fewer steps (4% to 8% fewer multiplications). Combined with the multiplications not being floating-point calculations, I'm sure there is an expected speed-up here.

Doesn't it look like analog-to-digital conversion of the weights?
And the quantization reminds me of the sampling theorem.
And yes, it resembles a lot of what Claude Shannon said!!

