
BagelMix-8x7B - main branch 2g16-4g64-HQQ

Under 20 GB. By Undi95

(This readme was written by a sleepy person. The link above takes you to the original model, the link below to the Mixtral HQQ reference. The rest is rambling.)


The main branch uses the same quant config as last time, the reference one from mobius here.

The label I've chosen refers to 2-bit linear (expert) layers with a group size of 16 (with the group scale/zero themselves stored at 8 bits), and 4 bits in groups of 64 for the attention layers. The actual bpw is therefore higher than 2, in no small part because the per-group metadata adds roughly another byte for every 4 bytes of packed 2-bit weights (I think??).

From what I can gather of HQQ's source code, the gate ('expert' selection) network isn't quantised at all, because it's tiny and very important. That is also the reason we quantise the attention layers at 4 bits: in a MoE they are small (shared between all the 'experts'), so squeezing them would hurt roughly like quantising a dense Mistral to 2 bpw. This reasoning has led me to experiment with taking more bits away from the expert/linear layers and putting them into the attention layers.
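Concretely, the main-branch label maps onto HQQ's BaseQuantizeConfig roughly like this (a sketch following the mobius Mixtral reference config; the variable names are mine, and the gate/router simply gets no entry in the per-layer mapping, so it stays in full precision):

from hqq.core.quantize import BaseQuantizeConfig

# main branch: 4-bit attention in groups of 64, 2-bit experts in groups of 16
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

# per-layer mapping; the gate/router has no entry here, so it is left unquantised
quant_config = {
    'self_attn.q_proj': attn_params,
    'self_attn.k_proj': attn_params,
    'self_attn.v_proj': attn_params,
    'self_attn.o_proj': attn_params,
    'block_sparse_moe.experts.w1': experts_params,
    'block_sparse_moe.experts.w2': experts_params,
    'block_sparse_moe.experts.w3': experts_params,
}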

I've already got a slightly heavier model quantised with 2g16 experts and 8g512 attention (not really sure how meaningful groups of 512 are, but w/e). Its config looks like this; note this is not the model on the main branch:

from hqq.core.quantize import BaseQuantizeConfig

attn_prams     = BaseQuantizeConfig(nbits=8, group_size=512, quant_zero=True, quant_scale=True) # MAIN BRANCH IS nbits=4 group_size=64 !!!
attn_prams['scale_quant_params']['group_size'] = 512 # was 256 in the reference config; group size used when quantising the scales themselves
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

Again, this is not what you're downloading if you grab the main branch right now: I want to see if I can actually keep the bpw down. These variants will be uploaded as alternate branches of this repo if they seem worth doing. I might also fiddle with 2g32 or even 3g128 or such for the experts, or try to stop HQQ from casting BF16 to FP16 for no reason.

You could also use the included/linked Python script (and a big swap partition) to make them yourself.

For Mixtral, using hqq 0.1.2.post:
You will need >180 GB of physically addressable memory, but it doesn't need to be RAM; set yourself up with a ~160 GB swap partition.
The VRAM requirement starts at zero and never grows much beyond the emerging quantised model, so you can make any quant you can run.
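I won't reproduce the linked script here, but the general shape of such a script with hqq's Hugging Face engine is roughly the following (class and method names as in the hqq 0.1.x examples; the model id and save path are placeholders, and the actual linked script may differ):

from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = 'Undi95/BagelMix-8x7B'   # placeholder: the unquantised merge you're starting from

attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {
    'self_attn.q_proj': attn_params, 'self_attn.k_proj': attn_params,
    'self_attn.v_proj': attn_params, 'self_attn.o_proj': attn_params,
    'block_sparse_moe.experts.w1': experts_params,
    'block_sparse_moe.experts.w2': experts_params,
    'block_sparse_moe.experts.w3': experts_params,
}

# the full-precision model is loaded into (swap-backed) system memory and quantised
# layer by layer, which is why VRAM use only ever tracks the emerging quantised model
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)
model.save_quantized('BagelMix-8x7B-2g16-4g64-HQQ')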

The quantisation itself takes about 10 minutes with the current optimizer; uploading an ~18 GiB file takes me all day.

ps read Sleeper Agents (2024/01) :-)


BagelMix

This is a merge of pre-trained language models created using mergekit.

Merge Details

Merge Method

This model was merged using the DARE TIES merge method, with jondurbin/bagel-dpo-8x7b-v0.2 as the base.

Models Merged

The following models were included in the merge:

jondurbin/bagel-dpo-8x7b-v0.2 (base)
Doctor-Shotgun/Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss
NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss

Configuration

The following YAML configuration was used to produce this model:

models:
  - model: jondurbin/bagel-dpo-8x7b-v0.2
    parameters:
      density: 1.0
      weight: 1.0
  - model: Doctor-Shotgun/Mixtral-8x7B-Instruct-v0.1-LimaRP-ZLoss
    parameters:
      density: 0.5
      weight: [0.33, 0.4, 0.33]
  - model: NeverSleep/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss
    parameters:
      density: [0.33, 0.45, 0.66]
      weight: 0.66
merge_method: dare_ties
base_model: jondurbin/bagel-dpo-8x7b-v0.2
parameters:
  normalize: true
  int8_mask: true
dtype: bfloat16
tokenizer_source: union
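For intuition, here is a rough single-tensor sketch of what dare_ties does (illustration only, not mergekit's actual implementation): DARE randomly drops a (1 - density) fraction of each model's delta from the base and rescales the survivors, then TIES elects a per-parameter sign and sums only the weighted deltas that agree with it. As I understand it, list values such as weight: [0.33, 0.4, 0.33] are a mergekit gradient, interpolated across the layer stack.

import torch

def dare_ties(base, tuned, densities, weights, normalize=True):
    # Toy single-tensor sketch of dare_ties (illustration only, not mergekit's code).
    deltas = []
    for ft, density, w in zip(tuned, densities, weights):
        delta = ft - base
        # DARE: randomly drop a (1 - density) fraction of the delta, rescale survivors
        keep = torch.rand_like(delta) < density
        delta = torch.where(keep, delta / density, torch.zeros_like(delta))
        deltas.append(w * delta)
    stacked = torch.stack(deltas)                     # [num_models, ...]
    # TIES: elect a per-parameter sign from the weighted deltas, drop deltas that disagree
    sign = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == sign
    merged = (stacked * agree).sum(dim=0)
    if normalize:
        # normalize: true -> divide by the total weight that actually contributed per position
        w_t = torch.tensor(weights, dtype=stacked.dtype).view(-1, *([1] * base.dim()))
        merged = merged / (w_t * agree).sum(dim=0).clamp(min=1e-8)
    return base + merged

# e.g. one weight matrix, two finetunes merged on top of the bagel base (numbers illustrative):
# merged_W = dare_ties(base_W, [limarp_W, noromaid_W], densities=[0.5, 0.45], weights=[0.4, 0.66])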

If you want to support me, you can here.
