RuntimeError: CUDA error: invalid configuration argument - how to tackle that?

#20
by guentert - opened

I am trying to get Mixtral-8x7B-Instruct-v0.1-GPTQ running in an Ubuntu 22.04 container prebuilt for CUDA 12 with cuDNN, attached to an NVIDIA RTX 6000 Ada Generation GPU.

When using the initialization routine

import os
import torch
from auto_gptq import exllama_set_max_input_length
from transformers import MixtralForCausalLM, AutoTokenizer, pipeline

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
revision = "main"  # the main revision is currently 4-bit

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, use_fast=True)

model = MixtralForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",
    use_safetensors=True,
    trust_remote_code=False,
)

I get the warning

Some weights of the model checkpoint at TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ were not used when initializing MixtralForCausalLM

followed by 896 layer names, though nvidia-smi tells me that 20 GB have been uploaded to the card.
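If I understand the warning correctly (this is my guess, not verified), the GPTQ checkpoint stores each linear layer as quantized tensors (`qweight`, `qzeros`, `scales`, `g_idx`) instead of a plain float `weight`, so a `MixtralForCausalLM` loaded without the GPTQ path has no parameters matching those names — which would explain getting one warning per quantized tensor. A toy sketch of the name mismatch (illustrative key names only, not read from the actual checkpoint):

```python
# GPTQ checkpoints store quantized tensors under names like these for
# each linear layer; a float model only expects "<layer>.weight".
checkpoint_keys = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.qzeros",
    "model.layers.0.self_attn.q_proj.scales",
    "model.layers.0.self_attn.q_proj.g_idx",
]
expected_keys = ["model.layers.0.self_attn.q_proj.weight"]

# Every quantized tensor goes unused -> one warning line per key.
unused = [k for k in checkpoint_keys if k not in expected_keys]
print(len(unused))
```

If that guess is right, the checkpoint weights were never actually applied to the model, which might also explain why the model on the card is in a broken state.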

When I try to run a simple pipeline on the model:

pipe = pipeline(
    "text-generation",
    tokenizer=tokenizer,
    model=model,
    max_new_tokens=512
)

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # removing this line does not change the outcome
result = pipe("[INST] write a poem [/INST]")
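One detail I am unsure about: as far as I know, CUDA_LAUNCH_BLOCKING is only read when the CUDA context is initialized, so setting it after the model is already on the GPU (as above) probably has no effect. A minimal sketch of where it would need to go, assuming a fresh process:

```python
import os

# Must be set before the first CUDA call, i.e. before importing torch
# (or at least before any tensor touches the GPU), so that kernel
# launches become synchronous and the error surfaces at the failing launch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # torch import and model setup would follow here
```

With the variable set this early, the stack trace should point closer to the kernel that actually produces the invalid configuration argument.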

I get the following error:

RuntimeError: CUDA error: invalid configuration argument

How should I proceed to tackle the problem?

Thanks in advance
Guenter
