arxiv:2402.17764

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Published on Feb 27
· Featured in Daily Papers on Feb 28

Abstract

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

Community

Very nice paper that introduces a new paradigm for LLM quantization: ternary weights {-1, 0, 1} for the linear layers, which removes the need for multiplications in the matmuls, plus int8 activations.
It seems the method cannot be used as a post-training quantization method; rather, a 1.58-bit model has to be trained from scratch. I believe the code will be shared here: https://github.com/microsoft/unilm/tree/master/bitnet - I would be curious to see whether the authors will share the quantized models on the Hub!
I also wonder whether the lm_head is quantized as well, since not quantizing the lm_head helps preserve generation quality for quantized language models.

·
Paper author

We would definitely be happy to open-source the models for future research. Please stay tuned!

The lm_head is not quantized because the language model has to use high-precision probabilities to perform sampling, and it accounts for only a very small proportion of the cost, especially when the model is large.

This is incredible! Like the other commenter here, one of my first thoughts goes immediately to existing LLMs and whether they can somehow be converted to 1.58-bit LLMs. @shumingma Did you conduct any experiments in this area? Either via some finetuning method or even distillation?

·
Paper author

Unfortunately, the conversion or post-training quantization from existing LLMs doesn't help. This is why we train the models from scratch.

Amazing work!
This method is likely compatible with PowerInfer (as long as the activation function is replaced by ReLU or squared ReLU), which would make it even faster on a mixed setup with, for example, 64GB RAM + 24GB VRAM (which would then support a 400B model at decent speeds).

It would also be interesting to see this combined with some of these papers (I think they are all compatible with each other):

SwitchHead
Fast Feedforward
Pause tokens
EAGLE
KIVI

·

Fast feed forward doesn’t replicate. Worked on that for a few weeks.

Hi, very exciting work!
I have a few questions on the zero-shot performance on the language tasks.
Did you also run the evaluation with "BitNet b1.58 70B"? I'm very curious about those results. I'm referring to something like Table 3.

·
Paper author

We haven't finished training the models beyond 3B, as it requires far more resources. However, we're optimistic about the results, because we have verified that BitNet follows a performance-parameter scaling law similar to that of full-precision LLMs. We'll update the results on larger models once they're ready.

Really interesting work! Are there any major drawbacks or are we all just starting over using this?

Hi. Great work! I wanted to ask how long the 1B or 700M parameter variants took to train? I couldn't see it in the paper.

·

I would also be interested if you have a sense of whether it is more efficient to train models using this method vs. a more traditional model.

The trend of perplexity becoming better with a larger parameter count compared to the 700m and 1.3b is... perplexing.
Did you guys study how it impacted very small parameter-count models (e.g., 100M)?
Is it reasonable to conclude that "under-parameterized" Transformers tend to use the full precision to better represent individual neurons, but that this property seems to fade with scale, which makes the technique more effective for large models?

·

From what I understand, as models become larger, sparsity emerges, e.g. https://openreview.net/forum?id=TJ2nxciYCk-

This is great news! Could you share the training code so we can experiment with pre-training smaller models?

I think the name "Ternary LLM" makes more sense than "BitNet b1.58"

·

Or "TritNet" if they prefer to keep with their existing naming scheme.

I wonder if you could quantize layers one by one, with calibration, down to 1 bit. I know that's not the point of this, as the models were all trained from scratch, but it would be pretty interesting. Something similar to LASER.

I know purely quantizing existing models does not work, but are there plans to try some distillation procedure or possibly slow-walk existing model parameters into this quantized state?

·

Well, it's not quantized.
It's built with 1.58-bit tensors instead of FP16 in mind.

Very interesting approach! One question I still have is what the integer layout is for storing the third state (0). Since it is not 1 bit, I am guessing that's where the 1.58 comes from, but I am unclear on what the representation looks like in binary form. Do you use one bit for the sign and another one for the value?

This research direction is starting to remind me of Hyperdimensional Computing / Vector Symbolic Architectures, which also typically use 1-bit or ternary representations, but take the approach of building explicit knowledge structures by combining concept vectors using a set of basic operations.
I wonder if both HDC/VSA and LLMs end up doing ultimately the same things at their core. It would be really cool if they turned out to be special cases of a single unified framework that combined the former's interpretability with the latter's trainability/scalability :-)

Missed an opportunity to title it "ternary weights is all you need".

This paper is very surprising to me. I would have thought that a model with {-1, 0, 1} weights could only match the capability of an FP model by being significantly larger than it. You would be making up for the loss of "descriptiveness" of FP by increasing the number of less descriptive weights. However, if I am following correctly, you've found that you actually don't need to scale up the number of weights at all. Do you have any ideas as to why that might be? It kind of shatters my understanding of what weights were even doing in the first place.

·

I agree with this. I'd be interested if someone has an intuition to offer here. Is it perhaps that at these high dimensions the added precision of the weights isn't so valuable (ostensibly just another dimension)?

The memory savings and throughput results in the paper are for inference, right? Are you seeing the same or similar gains during training, or are training gains different?

·

I believe during training the model keeps full-precision master weights, and the low-bit weights are used for the forward and backward computation.
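For anyone wondering what that looks like in code, here is a minimal sketch of the idea (my own illustration, not the authors' implementation; the names ternary_quantize and TernaryLinear are mine): FP32 master weights are quantized on the fly with absmean scaling, and the straight-through estimator lets gradients flow back to the masters.

import tensorflow as tf

def ternary_quantize(w, eps=1e-5):
    # Absmean scaling: gamma = mean(|W|), then round/clip to {-1, 0, +1} and rescale.
    gamma = tf.reduce_mean(tf.abs(w)) + eps
    return tf.clip_by_value(tf.round(w / gamma), -1.0, 1.0) * gamma

class TernaryLinear(tf.keras.layers.Layer):
    # Keeps FP32 master weights; the forward pass only ever sees ternary values.
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="glorot_normal", trainable=True)

    def call(self, x):
        # Straight-through estimator: the forward pass uses the quantized weights,
        # the backward pass sends the gradient straight to the FP32 master weights.
        w_q = self.w + tf.stop_gradient(ternary_quantize(self.w) - self.w)
        return tf.matmul(x, w_q)

Multiplying back by gamma here just keeps the magnitudes comparable for the sketch; an actual inference kernel would store the bare {-1, 0, 1} values and fold the scale into the output.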

This kind of feels too good to be true. Please prove me wrong, I'd be happy if you do so and prove the results are true.

My main concerns:

  1. Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily compared on the OpenLLM benchmark? It's easy to show that a 7B model performs well in a setting where it's trained on only 100B tokens, as there is a potential maximum information capacity which is far below that of an fp16 alternative.

  2. What is the StableLM 3B trained on 2T tokens you are talking about? I could not find such a model. Stability has a StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens. The benchmarks of both of these models don't correspond to the numbers you provide, and are better.

·

My main concerns:

  1. Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily comparable on OpenLLM benchmark?

They said in a previous thread that they hadn't finished training the models larger than 3.9B yet, because of the compute involved. I think the numbers for those in the paper, like 70B, are inferred from the current trends, but it sounds like they do plan to train them.

  2. What is the StableLM 3B trained on 2T tokens you are talking about? I could not find such a model. Stability has StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens.

It might be a mistake, assuming that the 3B had the same number of tokens as the 1.6B. There are also the Zephyr versions of each, though I'm not sure how many more tokens were used for those fine-tunes.


Please open-source the model weights for further research @shumingma

Cool work! I want to know why the model after ternary QAT optimization is only less than 4x smaller. Shouldn't it be at least 8x smaller compared to FP16?

If it is only about 4x smaller, it looks more like a 4-bit quantized model, and as we all know 4-bit is almost lossless for current LLMs. @shumingma

·

I think the following explains it. At smaller sizes, the full-precision embedding takes up more of the model. They estimate that at 70B it will take 1/7 the VRAM of a normal 70B model.

"We further scaled up the model size to 7B, 13B, and 70B and evaluated the
cost. Figure 2 illustrates the trends of latency and memory, showing that the speed-up increases as the
model size scales. In particular, BitNet b1.58 70B is 4.1 times faster than the LLaMA LLM baseline.
This is because the time cost for nn.Linear grows with the model size. The memory consumption
follows a similar trend, as the embedding remains full precision and its memory proportion is smaller
for larger models. Both latency and memory were measured with a 2-bit kernel, so there is still room
for optimization to further reduce the cost."

Though keep in mind those are extrapolations, since they haven't actually trained above 3.9B yet.

Interesting work, but doesn't the improvement in PPL of the quantized models vs. their fp16 counterparts signal that they (the fp16 models) were not properly trained to begin with? (Intuitively, it should be impossible for the 1-bit model to find a point in weight space with a lower loss than the point found by fp16, right?)

·

Exactly that: these models are under-trained for the number of parameters they have.

How are you able to represent 3 states using 1.58 bits? Don't you need at least 2 bits to represent more than 2 states?

·

Technically they are BCT-encoding the ternary values anyway, so it's actually stored as 2 bits; the 1.58 is the average information content per weight.
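To make that concrete: log2(3) ≈ 1.585 bits of information per ternary weight, and since 3^5 = 243 fits in one byte you could in principle pack five weights into 8 bits (1.6 bits/weight) at the cost of a small decode step. A toy illustration of such a packing (my own sketch, not something the paper specifies):

import math

print(math.log2(3))  # ~1.585 bits of information per ternary weight

def pack5(trits):
    # Pack five values from {-1, 0, 1} into a single byte via base-3 encoding.
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    byte = 0
    for t in trits:
        byte = byte * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
    return byte  # ranges over 0..242, so it fits in a uint8

def unpack5(byte):
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits[::-1]

assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]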

Great work! Can you expand on the quantization details?

  1. What's the granularity for weights and activations? (per-tensor, per-channel, per-token, etc.)
  2. Are activation scales calculated statically or dynamically?
·

It's not a quantization method.

Would not 2 bits (and quaternary instead of ternary) be more efficient when implemented on a binary processor?

·

The performance optimization is in the math. When you are doing matrix multiplication with ternary weights, it turns into non-multiplication, i.e. -1 x anything = sign flip, 0 x anything = 0, and 1 x anything = the value itself.
In all cases, the answer is almost instantaneous, even without any specialized hardware. It will be great to see how well this runs on regular CPUs.
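As a toy illustration of that point (plain Python, nothing like a real kernel), a matrix-vector product with ternary weights reduces to additions, subtractions and skips over the integer activations:

import numpy as np

def ternary_matvec(W, x):
    # y = W @ x where every W[i, j] is in {-1, 0, +1}: no multiplications needed.
    y = np.zeros(W.shape[0], dtype=np.int64)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]        # add
            elif W[i, j] == -1:
                y[i] -= x[j]        # subtract (sign flip)
            # W[i, j] == 0: skip the term entirely
    return y

W = np.random.choice([-1, 0, 1], size=(4, 8))
x = np.random.randint(-128, 128, size=8)   # int8-range activations
assert np.array_equal(ternary_matvec(W, x), W @ x)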

What does this ternary representation look like? Is it an int2 where the first bit is the sign?

·

Good point - interested in this

If you do choose to continue training the larger models, could you use the data used to train Phi-2? I imagine it would scale significantly better than standard data. And potentially 5GB of the deduped StarCoder dataset and 5GB of SlimPajama 🙏🙏 just some hopeful requests!

Also, is there really currently no way to quantize the models down to 1.58 bits and use a recovery LoRA, kind of like in y'all's "transformer compression" paper?

I'm particularly curious about how the model size is kept consistent in the table. So, how is the model size of the b1.58 model calculated? From my understanding, if the model size remains consistent, does it imply more parameters, especially compared to quantization? In particular, I noticed that in the paper, the 1-bit BitNet compares models with different numbers of bits while keeping the model size consistent. Personally, I believe this approach is less promising than quantization because it does not reduce the model size.

·

By β€œmodel size” they just mean the number of parameters in the model, not the physical size on disk. Generally the memory limitation is from loading all the data of the model into memory, so that is more representative of the size in the sense you mean.

And for that, they didn't make the embeddings smaller, so it makes a bigger difference the larger the model. You can see that by the time it gets up to 70B params: they estimate 1/7 the RAM, so the file size would be around that much smaller (depending on how the trits are actually encoded into bits).
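A rough back-of-the-envelope version of that, with assumed (not paper-provided) numbers: a LLaMA-70B-like layout with hidden size 8192 and vocab 32000, ternary weights stored at 2 bits, embeddings and lm_head kept in FP16.

# Illustrative memory estimate only; the shapes below are assumptions, not paper numbers.
total_params = 70e9
embed_params = 2 * 8192 * 32000                            # input embedding + lm_head (assumed)
linear_params = total_params - embed_params

fp16_bytes    = total_params * 2                           # 16 bits per weight
ternary_bytes = linear_params * 2 / 8 + embed_params * 2   # 2 bits packed + FP16 embeddings

print(f"FP16:    {fp16_bytes / 1e9:.0f} GB")               # ~140 GB
print(f"ternary: {ternary_bytes / 1e9:.0f} GB")            # ~18 GB, roughly 7-8x smaller

The ratio creeps toward 8x as the embeddings become a smaller fraction of the total, which matches the trend described in the paper excerpt quoted above.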

Will the training code be made public? That would actually be awesome, and then we "GPU poor" will be able to have a true mixture of experts with tens of models trained on trillions of tokens, and hence AGI. Also, have you thought about doing this for pictures and videos, to train models in a similar fashion?


I was expecting to see the original 1-bit BitNet in the perplexity table. I was curious just how much adding that zero weight improved the model.

·

Yes, that would be helpful. I think this paper needs work: great initial result, but lots of loose ends.

Hey I'm just a curious newb. But I'm wondering could we have a 1 byte mamba? Also spiking neural networks are binary-like and capable of real time learning (that's why they are sometimes called liquid neural nets right?) and ternary is just binary with negatives... so... might there be a way to record the activation of neurons in response to a prompt and do that 3 times with a different seed each, and use a graph pruning algorithm to help it learn? And likewise use some kind of associative reinforcement algorithm to make new graph connections between concepts that get brought up together in context?

Could we also use this system in a just-bytes/encoderless multimodal model?

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers
import numpy as np

class BitNet(tf.keras.Model):

def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
    super().__init__()

    self.embeddings = tf.keras.layers.Embedding(vocab_size, hidden_size)

    self.blocks = [  # renamed from "layers" to avoid clashing with the read-only keras Model.layers property
        BitLinearBlock(hidden_size, num_heads)
        for _ in range(num_layers)
    ]

    self.ln = LayerNormalization(hidden_size)
    self.lm_head = tf.keras.layers.Dense(vocab_size, dtype=tf.float32)  # Use higher precision for lm_head

def call(self, inputs, training=True):
    x = self.embeddings(inputs)

    for block in self.blocks:
        x = block(x, training=training)

    x = self.ln(x)
    return self.lm_head(x)

class BitLinearBlock(tf.keras.layers.Layer):

def __init__(self, hidden_size, num_heads):
    super().__init__()
    self.atten = BitAttention(hidden_size, num_heads)
    # Assuming the implementation of FeedForward is complete
    self.mlp = FeedForward(hidden_size)

def call(self, inputs, training):
    att = self.atten(inputs, training)
    return self.mlp(att)

class BitAttention(tf.keras.layers.Layer):

def __init__(self, hidden_size, num_heads):
    super().__init__()
    self.num_heads = num_heads
    self.hidden_size = hidden_size

def build(self, input_shape):
    # Initialize 1-bit weights etc
    self.q_weight = self.add_weight(
        shape=(input_shape[-1], self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32
    )

    self.kv_weight = self.add_weight(
        shape=(input_shape[-1], 2 * self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32
    )

    # Convert weights to ternary representation
    self.q_weight = tf.sign(self.q_weight)
    self.kv_weight = tf.sign(self.kv_weight)

    # Centralize weights
    self.q_weight_mean = tf.reduce_mean(self.q_weight)
    self.q_weight -= self.q_weight_mean

    self.kv_weight_mean = tf.reduce_mean(self.kv_weight)
    self.kv_weight -= self.kv_weight_mean

    # Scale factor
    self.q_scale = 1 / tf.reduce_sum(
        tf.cast(tf.abs(self.q_weight), tf.float32))

    self.kv_scale = 1 / tf.reduce_sum(
        tf.cast(tf.abs(self.kv_weight), tf.float32))

def call(self, inputs, training):

    # Absmax quantize activations
    inputs = quantize(inputs)

    # Multi-head attention
    queries = tf.matmul(inputs, self.q_weight * self.q_scale)
    keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size] * self.kv_scale)
    values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:] * self.kv_scale)

    qk_aproduct = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
    attn_weights = tf.nn.softmax(qk_aproduct)

    attn_out = tf.matmul(attn_weights, values)

    # Residual connection
    output = inputs + attn_out

    # Layer normalization
    output = self.layer_norm(output)

    return output

def backward(self, grad):
    # Sign grad
    grad_queries = tf.matmul(grad, attn_weights, transpose_a=True)

    # Backprop queries
    grad_queries = quantize(grad_queries)
    grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True) * self.q_scale

    # Backprop keys
    grad_keys = tf.matmul(attn_weights, grad, transpose_a=True)
    grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size] * self.kv_scale

    # Backprop values
    grad_values = tf.matmul(attn_weights, grad, transpose_b=True)
    grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                               axis=1) * self.kv_scale

    return grad

class FeedForward(tf.keras.layers.Layer):

def __init__(self, hidden_size):
    super().__init__()
    # Create the sub-layers once here instead of rebuilding them on every call
    self.dense1 = tf.keras.layers.Dense(units=hidden_size, activation=tf.nn.relu)
    self.dense2 = tf.keras.layers.Dense(units=hidden_size)

def call(self, inputs):
    x = self.dense1(inputs)
    return self.dense2(x)

class LayerNormalization(layers.Layer):

def __init__(self, hidden_size, epsilon=1e-6):
    super().__init__()
    self.gamma = self.add_weight(shape=(hidden_size,), initializer='ones', trainable=True)
    self.beta = self.add_weight(shape=(hidden_size,), initializer='zeros', trainable=True)
    self.epsilon = epsilon

def call(self, x):
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
    normalized = (x - mean) * tf.math.rsqrt(variance + self.epsilon)
    return self.gamma * normalized + self.beta

# Placeholder for the quantize function (per-tensor absmax scaling of activations)

def quantize(x):
    abs_max = tf.math.reduce_max(tf.math.abs(x))
    quantized = x / abs_max
    return tf.clip_by_value(quantized, -1, 1)

# Assuming ce (cross-entropy) and lr (learning rate) are defined elsewhere

ce = tf.keras.losses.CategoricalCrossentropy()
lr = 0.001

# Instantiate the model and compile

model = BitNet(
num_layers=12,
hidden_size=768,
num_heads=12,
vocab_size=30000
)
model.compile(optimizer=Adam(lr), loss=ce)

@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        outs = model(inputs, training=True)
        loss = ce(labels, outs)
    grads = tape.gradient(loss, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

·

In your BitAttention layer you use fp32 for the weights. When do these weights get converted into the ternary representation (-1, 0, 1)? I might be blind, but I just can't see it.

Updated BitAttention
Maintain both high-precision master weights and quantized low-bit weights.
For the forward pass, use the low-bit weights for efficiency.
For the backward pass, calculate gradients with respect to the low-bit weights.
Then apply the straight-through estimator: directly accumulate those gradients onto the high-precision master weights, bypassing the non-differentiable quantization.

class BitAttention(tf.keras.layers.Layer):
def __init__(self, hidden_size, num_heads, quantization_bits=1):
    super().__init__()
    self.num_heads = num_heads
    self.hidden_size = hidden_size
    self.quantization_bits = quantization_bits

def build(self, input_shape):
    # Initialize high-precision master weights
    self.q_weight_master = self.add_weight(
        shape=(input_shape[-1], self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32,
        name='q_weight_master'
    )

    self.kv_weight_master = self.add_weight(
        shape=(input_shape[-1], 2 * self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32,
        name='kv_weight_master'
    )

    # Initialize low-bit quantized weights
    self.q_weight = self.add_weight(
        shape=(input_shape[-1], self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32,
        trainable=False,
        name='q_weight'
    )

    self.kv_weight = self.add_weight(
        shape=(input_shape[-1], 2 * self.hidden_size),
        initializer=tf.keras.initializers.GlorotNormal,
        dtype=tf.float32,
        trainable=False,
        name='kv_weight'
    )

def call(self, inputs, training):
    # Use low-bit weights for forward pass
    queries = tf.matmul(inputs, self.q_weight)
    keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size])
    values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:])

    qk_aproduct = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
    attn_weights = tf.nn.softmax(qk_aproduct)

    attn_out = tf.matmul(attn_weights, values)

    # Residual connection
    output = inputs + attn_out

    # Layer normalization
    output = self.layer_norm(output)

    return output

def backward(self, grad):
    # Sign grad
    grad_queries = tf.matmul(grad, self.attn_weights, transpose_a=True)

    # Backprop queries
    grad_queries = quantize(grad_queries)
    grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True)

    # Backprop keys
    grad_keys = tf.matmul(self.attn_weights, grad, transpose_a=True)
    grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size]

    # Backprop values
    grad_values = tf.matmul(self.attn_weights, grad, transpose_b=True)
    grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                               axis=1)

    # Use straight-through estimator
    self.q_weight_master.assign_add(grad_q_weight)
    self.kv_weight_master.assign_add(grad_kv_weight)

    # Sync quantized weights from masters periodically
    if training and self.quantization_bits < 32:
        if tf.equal(tf.math.mod(tf.train.get_global_step(), SYNC_INTERVAL), 0):
            self.q_weight.assign(quantize(self.q_weight_master))
            self.kv_weight.assign(quantize(self.kv_weight_master))

    return grad

Is this really 1.58 bits, or is it 2 bits with some waste?

Unless future hardware has ternary memory, it's still going to be stored in binary. The simplest encoding would be 2 bits (maybe sign -1/1 & magnitude 0/1), but that's pretty far from 1.58 bits. You could encode 5 ternary digits in 8 binary bits for storage (1.6 bits/weight), but then you need some decoder (like a lookup table), and I'm not sure whether that was factored into the efficiency/power graphs.

So if we assume it's actually 2-bit storage, it raises the question of why not quantize to all 4 values instead of just 3. At first glance it may seem that using only 3 is required to avoid the multiplication, but if I understood correctly the activations are int8, so the 4th weight value could have been 0.5 and the hardware could simply right-shift instead of multiply, which is just as "free" as the other 3 values (-1, 0, 1).

Am I missing something here, @shumingma?

·

I noticed a post-training quantization work that seems similar to this: https://huggingface.co/papers/2402.11960

@brandf It doesn't address the packing question, but now that you mention it, with practically free bit shifts one could avoid multiplication up to (-2, -1, 0, 1, 2) with evenly spaced weights, and even (-4, -2, -1, 0, 1, 2, 4) doesn't look too bad.

·

Any weight that is 1/2^x can also be done with a shift. It doesn't even have to be symmetric, so for example with 3-bit quantization you get 8 values and you could map them to (-1, -0.5, 0, 0.25, 0.5, 1, 2, 4).

When signed integers are represented in the standard two's complement way, the right shifts need to preserve the high-order bit, but again that's free in hardware.

This shift trick doesn't work unless the activations are integers, though; however, there are similar bit-level tricks that can be done to avoid a full multiply.
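A quick toy version of that in Python (the weight set below is just an example, not one proposed in the paper; "free" here means a hardware shift instead of a multiplier):

def mul_by_coded_weight(x: int, w: float) -> int:
    # Multiply an integer activation by a weight from {-1, -0.5, 0, 0.5, 1, 2}
    # using only negation and shifts, since every nonzero magnitude is a power of two.
    if w == 0:
        return 0
    shift = {0.5: -1, 1: 0, 2: 1}[abs(w)]
    y = x >> -shift if shift < 0 else x << shift   # arithmetic shift preserves the sign
    return -y if w < 0 else y

assert mul_by_coded_weight(64, 0.5) == 32
assert mul_by_coded_weight(-64, 2) == -128
assert mul_by_coded_weight(100, -0.5) == -50

As the comment above notes, this only works because the activations themselves are integers.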

I'm a beginner college student. When I first saw ReLU, I was like, "Was there such a simple way?" and this time I feel similar. This weighting scheme feels like a W with ReLU applied.

I hope there's code that I can experiment with or use to recreate this.

Does this mean that ternary computers are making a comeback?

·

More like ternary accelerator :)

Paper author

Thank you so much for your interest in our work! I'm delighted to see such insightful discussions taking place around our 1-bit LLMs. We truly appreciate the engagement from the community.

I'm excited to share that we will be releasing a detailed note paper this week, which will provide in-depth coverage of the implementation details and experiments discussed in the initial paper. Additionally, we plan to address the questions and comments raised here within the note paper itself.

The note paper is expected to be published this week, hopefully as early as tomorrow. We can't wait to continue the discussions and receive further feedback from all of you once the paper is out.

Stay tuned for the upcoming release, and please feel free to keep the insightful questions and comments coming!

·

I hope there will be good results!!

Paper author

A new paper providing training details, code, and FAQ is available at https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
(It's not on arXiv for some inexplicable reason.)

We welcome any questions or comments you may have regarding this paper and the information it covers. Feel free to share your thoughts and inquiries!

·

Will the toy models we see trained in the paper (the 3b variants especially) be released on HuggingFace so that llama.cpp and other software can add support for the modified arch? It would be interesting to see how the community optimizes / takes advantage of this on current hardware too.

Someone wrote a critical blog post (saw it on HN), but I'm not experienced enough to know whether the criticisms have merit or not: https://huggingface.co/blog/joey00072/experiments-with-bitnet-1-5

·

The paper says that the discrepancy with FP16 gets reduced when the models are larger.

In the blog, the models are only 15M parameters, so I don't think it proves anything.

But that said, we still don't know what happens when a 70B ternary model is trained on a very large dataset with 4-8T tokens. Perhaps the ternary model's loss will saturate a lot earlier than the FP16 model.

We have successfully reproduced the results shown in the paper! All models are trained with 100B tokens on RedPajama. The weights can be quantized to ternary values offline. We have released the 700M, 1.3B, and 3B models and the evaluation results at https://huggingface.co/1bitLLM

·

That’s awesome, can you share some info on the training compute requirements?

Hi all, first of all, what an exciting result @shumingma ! Very excited to see your followup work, plus of course model weights and code. I wrote a blog post about the paper(s) here: https://learning-exhaust.hashnode.dev/are-all-large-language-models-really-in-158-bits

I hope this helps people pick apart the details and understand what may be going on under the hood. @shumingma I would love to hear your feedback on the blog.

·

Thanks, this was a very nice writeup!

