Error loading model

#1
by smchapman54 - opened

Hello,

I've tried loading the q8_0 quant, and I get this error using the Windows text-generation-webui:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'deepseek2'
llama_load_model_from_file: failed to load model
19:48:02-513121 ERROR Failed to load the model.

text-generation-webui's llama-cpp needs an update.

Turn off flash attention. This seems to be a known bug.

I would think that's a different error than 'unknown model architecture', but I may be wrong.

Loading some layers to the GPU (-ngl) with the latest llama.cpp returned "llama_init_from_gpt_params: error: failed to load model".
Using only the CPU solved this for me (as mentioned here https://github.com/ggerganov/llama.cpp/pull/7519).
Using flash attention (-fa) gave the error: "GGML_ASSERT: ggml.c:5716: ggml_nelements(a) == ne0*ne1".

@wrtn2 You have to disable flash attention for this model to use the GPU.
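The workarounds above can be sketched as llama.cpp command lines (a sketch only: the binary name and model path are assumptions here, adjust them to your build and files):

```
# CPU-only load (works around the -ngl load failure mentioned above)
./main -m ./deepseek-v2-q8_0.gguf -ngl 0 -p "Hello"

# GPU offload with flash attention left off (do not pass -fa),
# since -fa triggered the GGML_ASSERT above
./main -m ./deepseek-v2-q8_0.gguf -ngl 35 -p "Hello"
```

In text-generation-webui the equivalent is unchecking the flash_attn option on the Model tab before loading.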
