Error loading model: wrong number of tensors; expected 256, got 255

#2 · opened by prarthanagarwal

I am trying to load this in llama-cpp-python. Any help is appreciated. I downloaded the Q4_K_M.

import gc
import os

import torch
from llama_cpp import Llama

# `config` and `logger` come from the surrounding project (model paths, generation settings, logging).


class LLMHandler:
    def __init__(self):
        try:
            model_path = str(config.models.LLAMA_PATH)
            if not os.path.exists(model_path):
                raise FileNotFoundError(f"LLaMA model not found at: {model_path}")

            if torch.cuda.is_available():
                logger.info(f"CUDA available: {torch.cuda.get_device_name()}")
                logger.info(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
                torch.cuda.empty_cache()
                gc.collect()

            logger.info("Initializing LLaMA with Q4_K_M specific settings")

            # Settings specifically for Q4_K_M
            self.model = Llama(
                model_path=model_path,
                n_ctx=2048,
                n_batch=512,
                n_threads=4,
                n_gpu_layers=20,        # Partial GPU offload
                seed=42,
                use_mmap=True,
                use_mlock=False,
                main_gpu=0,
                tensor_split=None,
                vocab_only=False,
                use_float16=True,       # Enable float16 for Q4_K
                rope_freq_base=500000,  # From model metadata
                rope_freq_scale=1.0,
                n_gqa=8,                # From model metadata
                rms_norm_eps=1e-5,      # From model metadata
                verbose=True
            )

            logger.info("Model initialized, testing...")
            test = self.model.create_completion("Test", max_tokens=1)
            logger.info("Model test successful")

        except Exception as e:
            import traceback
            detailed_error = f"Failed to initialize LLaMA: {str(e)}\n"
            detailed_error += f"Traceback: {traceback.format_exc()}"
            logger.error(detailed_error)
            raise RuntimeError(detailed_error)

    async def generate(self, prompt: str) -> str:
        try:
            response = self.model.create_completion(
                prompt,
                max_tokens=config.system.max_tokens,
                temperature=config.system.temperature,
                top_p=config.system.top_p,
                stop=["Human:", "Assistant:"],
                stream=True
            )

            full_response = ""
            for chunk in response:
                # create_completion(stream=True) yields dicts, not objects
                if chunk["choices"][0]["text"]:
                    full_response += chunk["choices"][0]["text"]

            return full_response.strip()
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            raise
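
For reference, use_float16, n_gqa, and rms_norm_eps do not appear among the documented Llama() constructor arguments in recent llama-cpp-python releases; RoPE and GQA settings are normally picked up from the GGUF metadata by the loader itself. A stripped-down load along these lines (a sketch; the model path is a placeholder) is usually enough to isolate a loader problem:

# Minimal sketch with only widely documented Llama() arguments; the path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model-Q4_K_M.gguf",  # placeholder
    n_ctx=2048,
    n_threads=4,
    n_gpu_layers=20,   # partial GPU offload; -1 offloads every layer
    seed=42,
    verbose=True,      # prints the tensor and metadata summary during load
)

# One-token smoke test: the "wrong number of tensors" error is raised inside Llama() above,
# not during generation.
print(llm.create_completion("Test", max_tokens=1)["choices"][0]["text"])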

Make sure your llama-cpp-python is up to date; this error usually indicates an old install.

Sir, I have tried every single version of llama-cpp-python, both with built-in CUDA support and without. I've been trying to debug this for the last 4 hours; I also tried the pre-built wheel and building from source, but no luck!

Hmm, that's very odd, because that error is very explicitly caused by older versions of llama.cpp that didn't know where to find the RoPE tensor...

can you do:

import llama_cpp

print(llama_cpp.__version__)

?


I have no stake in this model, but since it was small, I ran it as a llamafile just to see. It worked fine, and it imported into Ollama OK too. So the model itself is fine. (I didn't feel like running it directly; this was the 6_K version.)

True, there is no error with the model itself. I checked with a Python script and the current version is 0.2.19+cu118 (I tried both upgrading and downgrading).
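
If you want to rule out the file itself without a full load, the standalone gguf package (pip install gguf) can count the tensors directly. A sketch, assuming the path is adjusted to the downloaded file:

# Sketch: inspect the GGUF independently of llama-cpp-python (requires the gguf package).
from gguf import GGUFReader

reader = GGUFReader("path/to/model-Q4_K_M.gguf")  # placeholder path
print(f"tensors in file: {len(reader.tensors)}")

# List a few tensor names as a spot check; the loader error above means
# llama.cpp expected one more tensor than it managed to find.
for t in reader.tensors[:5]:
    print(t.name, list(t.shape))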

Yeah, so that release is a full year old...

https://pypi.org/project/llama-cpp-python/0.2.19/

It has been 17 hours since your last reply. I installed Visual Studio Build Tools and the latest CUDA Toolkit, and tried llama-cpp-python 0.2.77 and other versions, but no luck.

I don't know how, but the combination of CUDA 12.4 + llama-cpp-python 0.3.0 WORKED!!!!! 😭

It wasn't working because you needed a significantly newer version of the Python package. They're on 0.3.2 at this point; the fix for this error only came out a few months ago, and you were on a release from a whole year ago.
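
In short, the file was fine and the loader was the problem. A small version guard makes that failure mode obvious up front; a sketch, with the 0.3.0 floor taken from the combination that finally worked in this thread:

# Sketch: fail fast if the installed llama-cpp-python predates the loader fix.
# The 0.3.0 floor is an assumption based on the version that resolved this thread.
from importlib.metadata import version

MIN_VERSION = (0, 3, 0)

raw = version("llama-cpp-python")        # e.g. "0.2.19+cu118"
numeric = raw.split("+")[0]              # drop the local build tag
installed = tuple(int(part) for part in numeric.split(".")[:3])

if installed < MIN_VERSION:
    raise RuntimeError(
        f"llama-cpp-python {raw} is too old for this GGUF; "
        f"upgrade to at least {'.'.join(str(p) for p in MIN_VERSION)}"
    )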
