E60/EQ60 means the embeddings are in ggml_type q6_0, and thus only compatible with IK_Llama.CPP and Croco.CPP until this q6_0 quant eventually reaches mainline. E50/EQ50, or no mention, means the embeddings are in mainline ggml_types (usually iq4_xs or Q5_0).

For the IQ3 145L quantized model, performance is close to IQ4_XS, with a few gigabytes shaved off. It is best suited for 24+24+16 GB GPU configs at 24-32k context with KV cache in q6_0/q5_0, or for more context, of course.
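A setup like that could be launched roughly as follows (a hedged sketch: the model filename is a placeholder, and the q6_0 KV cache type requires IK_Llama.CPP or Croco.CPP rather than mainline llama.cpp; `-ctk`/`-ctv`, `-ngl`, and `--tensor-split` are the usual llama.cpp-family flags):

```shell
# Hypothetical llama-server launch on a 24+24+16 GB triple-GPU box:
# full offload, 32k context, quantized KV cache.
./llama-server \
  -m ./model-IQ3_145L.gguf \
  -ngl 99 \
  --tensor-split 24,24,16 \
  -c 32768 \
  -ctk q6_0 \
  -ctv q5_0
```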

For the IQ3_XSSL quantized model, performance is probably akin to IQ3_XS.

These quants are made for my own use, and I decided to share them. Nothing special about them, except that they suit my needs.

Basically, my quant strategies follow a few rules that diverge from mainline:

  • I often drop attn_q by one quant level (as mainline does for iq3_xxs, which it quantizes in iq2_s), and often attn_output as well.
  • I often raise attn_k and attn_v by one quant level; mainline usually neglects those tensors too much in the GQA era.
  • I bump the embeddings, because they are not offloaded to the GPU (except for Bitnet and maybe Gemma).
  • I sometimes bump a whole FFN_down by one quant level, or lower some layers of FFN_up and FFN_gate by one.
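
The rules above can be sketched as a toy per-tensor override function. The quant "ladder", the base type, and the tensor-name matching here are illustrative assumptions, not the exact recipe used for these files:

```python
# Illustrative quant ladder, ordered from smallest to largest type.
# (Hypothetical subset; real ggml has many more types.)
LADDER = ["iq2_s", "iq3_xxs", "iq3_xs", "iq3_s", "iq4_xs", "q5_0", "q6_0"]

def shift(qtype: str, steps: int) -> str:
    """Move a quant type up (+) or down (-) the ladder, clamping at the ends."""
    i = LADDER.index(qtype) + steps
    return LADDER[max(0, min(i, len(LADDER) - 1))]

def tensor_type(name: str, base: str = "iq3_xs") -> str:
    """Toy version of the per-tensor rules described above."""
    if "attn_q" in name or "attn_output" in name:
        return shift(base, -1)   # drop attn_q / attn_output by one level
    if "attn_k" in name or "attn_v" in name:
        return shift(base, +1)   # raise attn_k / attn_v by one level
    if "token_embd" in name:
        return "q6_0"            # bump embeddings: they stay on CPU
    if "ffn_down" in name:
        return shift(base, +1)   # sometimes bump the whole FFN_down
    return base

print(tensor_type("blk.0.attn_q.weight"))   # -> iq3_xxs
print(tensor_type("blk.0.attn_v.weight"))   # -> iq3_s
```

This is only a mental model of the strategy; the actual overrides are baked into the quantization runs that produced these GGUF files.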
GGUF · 123B params · llama architecture · 3-bit / 16-bit tensors