
The fused expert parameters mean load_in_4bit doesn't work properly, nor does LoRA

#10
by tdrussell - opened

I noticed that in the modeling code the expert parameters are "fused": you call .view() on the parameters, index the chosen expert, then do the matmul() manually, as opposed to something like the Transformers Mixtral code, where each expert is just a normal nn.Linear layer.

This has a few unfortunate downsides. Trying to load the model with bitsandbytes 4 bit quantization doesn't work as it should, in the sense that the expert parameters aren't quantized, since they aren't simple nn.Linear layers. I confirmed this myself. You should be able to load this model in 4 bit with 96GB of VRAM, but it's not even close if the expert parameters aren't quantized.
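For anyone who wants to verify this, here is a rough sketch (standard transformers / bitsandbytes APIs; it uses the original repo, so it needs a lot of memory precisely because the fused experts stay unquantized) that lists which modules actually become 4-bit linears versus the fused expert tensors that stay in bf16:

import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    'databricks/dbrx-instruct',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map='auto',
    trust_remote_code=True,
)

# Modules replaced by bitsandbytes 4-bit linears (attention and router projections, etc.).
quantized = [n for n, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit)]
# Fused expert tensors: plain nn.Parameter named ...mlp.w1 / .v1 / .w2, left unquantized.
fused = [n for n, p in model.named_parameters() if n.endswith(('.w1', '.v1', '.w2'))]
print(len(quantized), 'Linear4bit modules;', len(fused), 'unquantized fused expert tensors')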

I also think that the PEFT code for training LoRA or QLoRA won't work, though I haven't tried it (as I can't even load the model). This is because PEFT is based around adapting nn.Linear layers.

Is there any chance of adding some kind of option for structuring the expert layers like Mixtral does, i.e. a separate nn.Linear layer for each expert? This would make a lot of things easier, namely loading the model on limited VRAM and fine-tuning it with LoRA. I don't know if this would require the actual weights on disk to change, so it might complicate things.

Tried both bitsandbytes 4-bit & 8-bit quantization; neither worked. Memory cost is still the same as bf16.

How can I load this model in 8-bit?

I should have read this here first! https://github.com/TimDettmers/bitsandbytes/issues/1155

I changed the model format using this script (note that the w2 tensors are transposed so each per-expert slice matches the nn.Linear(ffn_hidden_size, hidden_size) weight layout):

import json
from safetensors import safe_open
from safetensors.torch import save_file
from pathlib import Path

model_dir = Path('your_model_dir')
output_dir = Path('your_output_dir')

NUM_EXPERTS = 16
HIDDEN_SIZE = 6144
FFN_HIDDEN_SIZE = 10752

def change_tensor(tensor, reverse=False):
    # Split the fused (NUM_EXPERTS * FFN_HIDDEN_SIZE, HIDDEN_SIZE) tensor into one
    # (FFN_HIDDEN_SIZE, HIDDEN_SIZE) slice per expert; transpose w2 (reverse=True) so
    # the slice matches the nn.Linear(ffn_hidden_size, hidden_size) weight layout.
    output = [x.contiguous() if not reverse else x.t().contiguous()
              for x in tensor.reshape(NUM_EXPERTS, FFN_HIDDEN_SIZE, HIDDEN_SIZE)]
    return output

def change_mlp(tensors):
    # Replace each fused expert tensor with per-expert '<prefix>.<i>.<name>.weight' entries.
    keys = list(tensors.keys())
    for k in keys:
        if any(x in k for x in ['w1', 'v1', 'w2']):
            prefix, name = k.rsplit('.', 1)
            tensor = tensors.pop(k)
            output_tensor = change_tensor(tensor, name == 'w2')
            for i, t in enumerate(output_tensor):
                tensors[f'{prefix}.{i}.{name}.weight'] = t
    return tensors

# Rewrite each safetensors shard, splitting the fused expert tensors.
for file in model_dir.glob('*.safetensors'):
    print(file)
    tensors = {}
    with safe_open(file, 'pt') as f:
        metadata = f.metadata()
        for k in f.keys():
            tensors[k] = f.get_tensor(k)
    tensors = change_mlp(tensors)
    save_file(tensors, (output_dir / file.name).as_posix(), metadata)

# Update the weight index so the new per-expert keys point at the same shard files.
with open(model_dir / 'model.safetensors.index.json') as f:
    weight_map = json.load(f)

weight_keys = list(weight_map['weight_map'])
for k in weight_keys:
    if any([x in k for x in ['w1', 'v1', 'w2']]):
        prefix,dtype = k.rsplit('.', 1)
        value = weight_map['weight_map'].pop(k)
        for i in range(NUM_EXPERTS):
            weight_map['weight_map'][f'{prefix}.{i}.{dtype}.weight'] = value

sorted_map = sorted(list(weight_map['weight_map'].items()))
weight_map['weight_map'] = dict(sorted_map)

with open(output_dir / 'model.safetensors.index.json', 'w') as f:
    json.dump(weight_map, f, indent=4)
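As a quick sanity check on the converted shards (a sketch reusing the constants above, and assuming the converted files sit in output_dir), each per-expert weight should already have the nn.Linear layout:

from pathlib import Path
from safetensors import safe_open

output_dir = Path('your_output_dir')
HIDDEN_SIZE, FFN_HIDDEN_SIZE = 6144, 10752

for file in sorted(output_dir.glob('*.safetensors')):
    with safe_open(file, 'pt') as f:
        for k in f.keys():
            if k.endswith(('.w1.weight', '.v1.weight')):
                # nn.Linear(hidden_size, ffn_hidden_size).weight
                assert f.get_tensor(k).shape == (FFN_HIDDEN_SIZE, HIDDEN_SIZE), k
            elif k.endswith('.w2.weight'):
                # nn.Linear(ffn_hidden_size, hidden_size).weight, hence the transpose above
                assert f.get_tensor(k).shape == (HIDDEN_SIZE, FFN_HIDDEN_SIZE), k
print('converted expert shapes look correct')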

Then, inside modeling_dbrx.py, I changed the following:

from this

class DbrxExpertGLU(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 moe_num_experts: int, ffn_act_fn: dict):
        super().__init__()
        self.hidden_size = hidden_size
        self.ffn_hidden_size = ffn_hidden_size
        self.moe_num_experts = moe_num_experts

        self.w1 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.v1 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.w2 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.activation_fn = resolve_ffn_act_fn(ffn_act_fn)

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        expert_w1 = self.w1.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]
        expert_v1 = self.v1.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]
        expert_w2 = self.w2.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]

        x1 = x.matmul(expert_w1.t())
        x2 = x.matmul(expert_v1.t())
        x1 = self.activation_fn(x1)
        x1 = x1 * x2
        x1 = x1.matmul(expert_w2)
        return x1


class DbrxExperts(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 moe_num_experts: int, ffn_act_fn: dict):
        super().__init__()
        self.moe_num_experts = moe_num_experts
        self.mlp = DbrxExpertGLU(hidden_size=hidden_size,
                                 ffn_hidden_size=ffn_hidden_size,
                                 moe_num_experts=moe_num_experts,
                                 ffn_act_fn=ffn_act_fn)

    def forward(self, x: torch.Tensor, weights: torch.Tensor,
                top_weights: torch.Tensor,
                top_experts: torch.LongTensor) -> torch.Tensor:
        bsz, q_len, hidden_size = x.shape
        x = x.view(-1, hidden_size)
        out = torch.zeros_like(x)

        expert_mask = nn.functional.one_hot(
            top_experts, num_classes=self.moe_num_experts).permute(2, 1, 0)
        for expert_idx in range(0, self.moe_num_experts):
            topk_idx, token_idx = torch.where(expert_mask[expert_idx])
            if token_idx.shape[0] == 0:
                continue

            token_list = token_idx.tolist()
            topk_list = topk_idx.tolist()

            expert_tokens = x[None, token_list].reshape(-1, hidden_size)
            expert_out = self.mlp(
                expert_tokens, expert_idx) * top_weights[token_list, topk_list,
                                                         None]

            out.index_add_(0, token_idx, expert_out)

        out = out.reshape(bsz, q_len, hidden_size)
        return out

to this

class DbrxMLP(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int, ffn_act_fn: dict):
        super().__init__()

        self.w1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.v1 = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.w2 = nn.Linear(ffn_hidden_size, hidden_size, bias=False)
        self.activation_fn = resolve_ffn_act_fn(ffn_act_fn)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:

        return self.w2(self.activation_fn(self.w1(x)) * self.v1(x))


class DbrxExperts(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 moe_num_experts: int, ffn_act_fn: dict):
        super().__init__()
        self.moe_num_experts = moe_num_experts
        self.mlp = nn.ModuleList([DbrxMLP(hidden_size, ffn_hidden_size, ffn_act_fn) for _ in range(moe_num_experts)])

    def forward(self, x: torch.Tensor, weights: torch.Tensor,
                top_weights: torch.Tensor,
                top_experts: torch.LongTensor) -> torch.Tensor:
        bsz, q_len, hidden_size = x.shape
        x = x.view(-1, hidden_size)
        out = torch.zeros_like(x)

        expert_mask = nn.functional.one_hot(
            top_experts, num_classes=self.moe_num_experts).permute(2, 1, 0)
        for expert_idx in range(0, self.moe_num_experts):
            topk_idx, token_idx = torch.where(expert_mask[expert_idx])
            if token_idx.shape[0] == 0:
                continue

            token_list = token_idx.tolist()
            topk_list = topk_idx.tolist()

            expert_tokens = x[None, token_list].reshape(-1, hidden_size)
            expert_out = self.mlp[expert_idx](expert_tokens) * top_weights[token_list, topk_list, None]

            out.index_add_(0, token_idx, expert_out)

        out = out.reshape(bsz, q_len, hidden_size)
        return out

And from this

class DbrxPreTrainedModel(PreTrainedModel):
    config_class = DbrxConfig
    base_model_prefix = 'transformer'
    supports_gradient_checkpointing = True
    _no_split_modules = ['DbrxBlock']
    _skip_keys_device_placement = ['past_key_values']
    _supports_flash_attn_2 = True
    _supports_sdpa = False
    _supports_cache_class = True

    def _init_weights(self, module: nn.Module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, DbrxExpertGLU):
            module.w1.data.normal_(mean=0.0, std=std)
            module.v1.data.normal_(mean=0.0, std=std)
            module.w2.data.normal_(mean=0.0, std=std)

To this

class DbrxPreTrainedModel(PreTrainedModel):
    config_class = DbrxConfig
    base_model_prefix = 'transformer'
    supports_gradient_checkpointing = True
    _no_split_modules = ['DbrxBlock']
    _skip_keys_device_placement = ['past_key_values']
    _supports_flash_attn_2 = True
    _supports_sdpa = False
    _supports_cache_class = True

    def _init_weights(self, module: nn.Module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()

I tried something very similar to the above last night on my own, and can confirm it works. Once the experts are just normal nn.Linear layers they can be loaded with 4-bit quantization. Well, the model "works" in the sense that when you load it in 4 bit it's mostly okay, but it will frequently misspell words, double up commas, output random garbage tokens, etc. So something is still off (it might just be unusually sensitive to load_in_4bit quantization).

It would be nice if we could get some official way to do this. Otherwise, I imagine someone will upload a model with these changes and lots of people will just use that version, since you can train LoRA on it, etc. Alternatively, bitsandbytes and PEFT would have to add explicit support for the current fused weight architecture.
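For what it's worth, once the experts are plain nn.Linear layers, a standard QLoRA setup should be able to target them directly. A rough sketch (the repo id and hyperparameters are just placeholders; the per-expert module names w1/v1/w2 come from the converted modeling code above, and attention module names differ between the conversions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    'LnL-AI/dbrx-base-converted',  # placeholder: any checkpoint with per-expert nn.Linear layers
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map='auto',
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['w1', 'v1', 'w2'],  # the per-expert linears from DbrxMLP
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()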

I uploaded an adjusted version based on the code from fahadh4ilyas, so people don't need to make the changes manually. Thank you so much for the scripts and the adjustments. Here is the model: SinclairSchneider/dbrx-instruct-quantization-fixed

@SinclairSchneider can you please upload the 4bit quantized version of the model? Thanks!

Thank you! So this is an 8-bit version?

Also, can you please share how the 4-bit/8-bit quantized model is performing? Thanks!

Based on the file size, it is the original model with bf16.

It's not quantized but you can use "load_in_4bit=True" or "load_in_8bit=True" without running into out of memory issues like when using the original model.

Thanks! Understood

@SinclairSchneider is it possible for you to quantize the model using that option? Since you are loading the model in 4-bit mode, I think the script is converting the model to 4 bits on the fly, which means you could do the same to generate the quantized model files.
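In case it helps, something along these lines should work for producing pre-quantized files, assuming a transformers / bitsandbytes version recent enough to serialize 4-bit weights (the output directory name is just an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'SinclairSchneider/dbrx-instruct-quantization-fixed'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map='auto',
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Newer versions can serialize the quantized weights directly.
model.save_pretrained('dbrx-instruct-bnb-4bit')
tokenizer.save_pretrained('dbrx-instruct-bnb-4bit')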

In my case, the converted model works quite well. I literally only changed nn.Parameter into nn.Linear. Even when I load it on CPU without quantization and generate with it, the output is no different from this repo.

@fahadh4ilyas You are awesome! For anyone interested, the converted dbrx-instruct model is currently being uploaded to HF (33% complete):

https://huggingface.co/LnL-AI/dbrx-instruct-converted/discussions/2

and a work-in-progress AutoGPTQ quantization run using the converted code/model is at: https://github.com/AutoGPTQ/AutoGPTQ/pull/625

Hi @fahadh4ilyas, I have tried your modification but get this error:

 File "modeling_dbrx.py", line 642, in __init__
    self.norm_1 = nn.LayerNorm(hidden_size, bias=False)
TypeError: __init__() got an unexpected keyword argument 'bias'

I think you have to update your torch version, because that error implies nn.LayerNorm has no bias parameter in its init method. But from the torch documentation, bias clearly exists there in newer releases.
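A quick way to check, in case anyone else hits this (just a sketch; the bias keyword only exists in newer PyTorch releases):

import torch
from torch import nn

print(torch.__version__)
try:
    nn.LayerNorm(8, bias=False)
    print('LayerNorm(bias=...) supported')
except TypeError:
    print('LayerNorm(bias=...) not supported -- upgrade torch')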

@fahadh4ilyas Thanks for the clarification! I think I can now load the model in 4-bit, taking ~71 GB of RAM in total, but loading is very slow. The generation also contains many random garbage tokens.

dbrx-base-converted now on hf: https://huggingface.co/LnL-AI/dbrx-base-converted

(pending upload; it should be complete in about 10 minutes). I have cancelled the broken upload of the instruct-converted model now that base is here.

bnb 4 bit version can be found here: https://huggingface.co/PrunaAI/dbrx-base-bnb-4bit

dbrx-base-converted-v2 is now on HF: https://huggingface.co/LnL-AI/dbrx-base-converted-v2 with Wqkv split into q, k, v for potentially better quant compatibility. The latest exllama v2 code also does this split. Testing and quantization have not been verified yet.

Again thanks to @fahadh4ilyas for the v2 changes.

Is there a bnb 4-bit version for dbrx-instruct? The base model is not good at instruction following. Thanks!

@MLDataScientist A 4-bit quant is available for Apple MLX (but you need something like an M Ultra), and exllama v2 has one too. But if you want the base model to follow instructions, it won't do that; it is base vs. instruct for a good reason.

Yes, you can find it here: https://huggingface.co/PrunaAI/dbrx-instruct-bnb-4bit

Amazing, thank you!

I know, right! That is why I needed the instruct bnb 4-bit. Fortunately, we already have it, as @johnrachwanpruna mentioned above.

@johnrachwanpruna, is this dbrx-instruct converted using the method described by @fahadh4ilyas?

Yes exactly :)

@fahadh4ilyas @Qubitium Looks like the converted models don't play nice with 4bit-qlora+fsdp:

https://gist.github.com/winglian/348f792e62386007bc589667f01d2cae

Which model are you using? v1 or v2?

@fahadh4ilyas @winglian was using v2 for that crash
