
GGUF importance matrix (imatrix) quants for https://huggingface.co/LargeWorldModel/LWM-Text-Chat-128K
The importance matrix was trained for ~100K tokens (200 batches of 512 tokens) using wiki.train.raw.

  • The imatrix Q4-K quant fits with 32K context in 24GB of VRAM and gives me ~100 t/s inference on a 3090.
  • With IQ3_XXS it seems to fit ~37K context in 24GB (and it is even faster than Q4-K).
  • With either quant on a 3090, prompt processing (decoding the context) seems to run at well over 2000 t/s.
  • Using a Q8 K-cache (instead of F16) you can fit up to 43-44K context, but inference speed goes down a little.
  • Also, for some reason I need to use a penalty of 1.0 to avoid the response being cut off.
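To get a feel for why the context limits above land where they do, here is a rough back-of-the-envelope sketch of KV-cache size. The architecture numbers are assumptions on my part (a Llama-2-7B-style layout with 32 layers and a 4096-wide K and V per layer, no grouped-query attention), and `kv_cache_gib` is just a hypothetical helper, not part of any library. It ignores the model weights and compute buffers, which also need to fit in VRAM.

```python
# Rough KV-cache size estimate (a sketch, not a measurement).
# ASSUMPTION: Llama-2-7B-style architecture, 32 layers, 4096-wide K and V
# per layer, no grouped-query attention.

F16_BYTES_PER_ELEM = 2.0
# q8_0 stores blocks of 32 int8 values plus one f16 scale: 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEM = 34 / 32


def kv_cache_gib(n_ctx, n_layers=32, kv_dim=4096,
                 k_bytes=F16_BYTES_PER_ELEM, v_bytes=F16_BYTES_PER_ELEM):
    """Return the approximate KV-cache size in GiB for n_ctx tokens."""
    bytes_per_token = n_layers * kv_dim * (k_bytes + v_bytes)
    return n_ctx * bytes_per_token / 2**30


# F16 K and V at 32K context.
f16_32k = kv_cache_gib(32 * 1024)
# Q8 K-cache (V stays F16) at 43K context.
q8k_43k = kv_cache_gib(43 * 1024, k_bytes=Q8_0_BYTES_PER_ELEM)

print(f"F16 cache, 32K ctx:  {f16_32k:.2f} GiB")
print(f"Q8 K-cache, 43K ctx: {q8k_43k:.2f} GiB")
```

Under these assumptions the F16 cache alone takes about 16 GiB at 32K context, and quantizing only the K half to Q8 lets roughly a third more context fit in a similar budget, which is in the same ballpark as the numbers I see.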
Layers Context Template
You are a helpful assistant.
Don't give information outside the document or repeat your findings. Keep your response short and direct.