Models Not Loading

#2
by oneCode - opened

Hi there, this model is not loading. It never loads into RAM. LM Studio says it is loading, but it isn't, and it eventually fails to load. Please advise. Thanks!!

Which quant type are you testing?

Q6_K.gguf

I'm going to try the base one now instead of the instruct.

I just double checked the Q6_K model in llama.cpp and confirmed it works:

llm_load_tensors: ggml ctx size = 0.21 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 26087.51 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/65 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 0.25
llama_new_context_with_model: kv self size = 124.00 MB
llama_build_graph: non-view tensors processed: 1430/1430
llama_new_context_with_model: compute buffer total size = 110.63 MB
llama_new_context_with_model: VRAM scratch buffer: 104.00 MB
llama_new_context_with_model: total VRAM used: 104.00 MB (model: 0.00 MB, context: 104.00 MB)

system_info: n_threads = 56 / 112 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

Instruction:

write a quick sort algorithm in python.

Response:

Sure, here is a simple implementation of the Quick Sort algorithm in Python:

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Test the function
print(quicksort([3,6,8,10,1,2,1]))

This implementation uses list comprehensions to create a new array of elements less than the pivot, equal to the pivot, and greater than the pivot. The // operator is used for integer division (rounding down) when finding the pivot index.

The quicksort() function is then recursively called on the "left" and "right" arrays until they are sorted. Finally, the "left", "middle", and "right" arrays are concaten
llama_print_timings: load time = 61131.05 ms
llama_print_timings: sample time = 136.82 ms / 256 runs ( 0.53 ms per token, 1871.11 tokens per second)
llama_print_timings: prompt eval time = 6226.46 ms / 75 tokens ( 83.02 ms per token, 12.05 tokens per second)
llama_print_timings: eval time = 196632.25 ms / 255 runs ( 771.11 ms per token, 1.30 tokens per second)
llama_print_timings: total time = 203109.18 ms
Log end

Please report the issue to LM Studio; it might be an LM Studio-specific issue.

Thanks for checking. The base model is having the same issue in LM Studio. I was so psyched to test these. I'll reach out to LM Studio. Thanks for everything you do!

Confirming this fails in LM Studio and oobabooga webui.

Yes, it fails on oobabooga llama.cpp (latest update) -> the issue is "AttributeError: 'LlamaCppModel' object has no attribute 'model'"; other GGUFs are working without issue.
It can be loaded with ctransformers, but then it fails to generate an answer on a "question/answer" instruct prompt.
I'm using Q6_K on a 4090.
The error seems to be related to "ERROR: byte not found in vocab: ' '"; I got the same error when loading with ctransformers.

Got the same error on LLamaSharp (llama.cpp used via C# interop), latest git.

llm_load_print_meta: model ftype = mostly Q6_K
llm_load_print_meta: model size = 33.34 B
llm_load_print_meta: general.name = deepseek-ai_deepseek-coder-33b-instruct
llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<｜end▁of▁sentence｜>'
llm_load_print_meta: LF token = 0 '!'
llm_load_tensors: ggml ctx size = 0.17 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 13639.64 MB (+ 992.00 MB per state)
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/65 layers to GPU
llm_load_tensors: VRAM used: 12448 MB
....................................................................................................
llama_new_context_with_model: kv self size = 992.00 MB
llama_new_context_with_model: compute buffer total size = 491.41 MB
llama_new_context_with_model: VRAM scratch buffer: 490.00 MB

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

Instruction:ERROR: byte not found in vocab: ' '
ERROR: byte not found in vocab: ' '
ERROR: byte not found in vocab: ' '
ERROR: byte not found in vocab: ' '
ERROR: byte not found in vocab: ' '
ERROR: byte not found in vocab: ' '

Also failing to start on text-generation-webui using:
python server.py --model "deepseek-coder-33b-instruct.Q8_0.gguf" --threads 24 --n-gpu-layers 2 --n_ctx 16384 --listen --listen-port=8888
I'm getting ERROR: byte not found in vocab: followed by a Segmentation fault.

Update 1:
Also updated llama-cpp-python to latest version
pip uninstall -y llama-cpp-python &>/dev/null
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.14 --no-cache-dir &>/dev/null
same error.

Update 2:
Tried to load using the CPU and it started loading, but it gave a warning while loading: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 237/32256 ).

I'm not sure if this is valuable additional information, but the same thing happens for Q4_K_M on ooba.

Unfortunately these GGUFs are currently only supported by llama.cpp

EDIT: unless the llama-cpp-python release yesterday added support? If it did, based on what @yehiaserag said, then support should come to text-generation-webui very soon, as it uses that for GGUF model support.

Original message:
Downstream clients like text-generation-webui, GPT4All, llama-cpp-python, and others, have not yet implemented support for BPE vocabulary, which is required for this model and CausalLM. The DeepSeek Coder models did not provide a tokenizer.model file, so I had to convert them using the HF Vocab tokenizer.json, and this results in a different vocab format.

Hopefully these other clients will add BPE support soon and then they'll work.

In the meantime, either use llama.cpp directly on the command line or using its server mode, or, if you can, try the AWQ or GPTQs I've made of DeepSeek. Otherwise you'll need to use another model for now, until support is added. Nothing I can do I'm afraid.
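In case it helps anyone stuck on the other front ends, here is a minimal sketch of calling llama.cpp's server mode from Python, using only the standard library. It assumes you started ./server yourself with one of these GGUFs on the default host and port (127.0.0.1:8080), and the prompt below is just a placeholder; the ### Instruction / ### Response framing is my assumption of the DeepSeek instruct template, so check the model card for the exact format.

# Sketch: query a running llama.cpp server (e.g. ./server -m deepseek-coder-33b-instruct.Q6_K.gguf)
# Host, port, and prompt are placeholders.
import json
import urllib.request

prompt = "### Instruction:\nwrite a quick sort algorithm in python.\n### Response:\n"
payload = {"prompt": prompt, "n_predict": 256, "temperature": 0.8}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# The server returns the generated text in the "content" field.
print(result["content"])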

Tried to load using the CPU and it started loading, but it gave a warning while loading: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 237/32256 ).

This is not an error, just an info message which can be ignored. The same message is printed by llama.cpp and it has no impact that I've noticed.

FYI, this worked for me on Mac, but only in CPU mode - I didn't get the vocabulary error.
So it appears something changed between 0.2.13 and now?

pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python==0.2.13 --no-cache-dir
pip install 'llama-cpp-python[server]==0.2.13'

I've not tested the NEWEST version mentioned by @TheBloke

Actually just tested the latest llama-cpp-python and it's now working with GPU in Oobabooga on Mac M2 Max
Running the 8 bit quantization

Name: llama_cpp_python
Version: 0.2.14

Great to hear!

Upgraded llama_cpp_python to 0.2.14, and now it is working with the CPU in Oobabooga on Linux:

./cmd_linux.sh
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

Actually just tested the latest llama-cpp-python and it's now working with GPU in Oobabooga on Mac M2 Max
Running the 8 bit quantization

Name: llama_cpp_python
Version: 0.2.14

@gmacgmac
How? I'm still only able to run on CPU, not GPU, using the latest llama-cpp-python==0.2.14

Upgraded llama_cpp_python to 0.2.14, and now it is working with the CPU in Oobabooga on Linux:

./cmd_linux.sh
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

This installed 0.2.18 for me, and the "ERROR: byte not found in vocab: ' '" issue still happens for deepseek-coder-6.7b-instruct.Q8_0.gguf on the GPU, but it loads correctly on the CPU.

that "deepseek-coder-6.7b-instruct.Q8_0.gguf" in my experience was very strange behaving.
i would get responses that would continue with unpredictable behaviour, adding characters, emojis, etc.. i spent a lot of time on it and the prompt template however when I tried other quanitsations like 6 then it was fine - so something weird going on

@NotTooSpooky the same is happening with me

If you want to upgrade to llama-cpp-python==0.2.14 and get this error:
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects,
then you can replace every 0.2.11 with 0.2.14 in requirements.txt and run pip install -r requirements.txt again.
It's best to first uninstall the old version with pip uninstall llama-cpp-python -y

If you want to upgrade to llama-cpp-python==0.2.14 and get this error:
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects,
then you can replace every 0.2.11 with 0.2.14 in requirements.txt and run pip install -r requirements.txt again.
It's best to first uninstall the old version with pip uninstall llama-cpp-python -y

You need to activate the environment first with:
./cmd_linux.sh

Q4_K_M with llama-cpp-python 0.2.18 fully offloaded to GPU is working for me on ooba. Q8 has two reports of being broken, so I won't try that yet.

Edit:
Q6_K and Q8_0 (with 4k context) are working for me as well for at least one response.

Just confirming the bug: ctransformers gives the same error upon model = AutoModelForCausalLM.from_pretrained(path, hf=True):

ERROR: byte not found in vocab: '
'
Segmentation fault (core dumped)

Unfortunately ctransformers is not currently compatible with DeepSeek models, or other models that use BPE vocab. I'm hoping ctransformers will get an update soon.

You can use llama-cpp-python instead, which is fully compatible.
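For anyone who wants to test outside of a UI, a minimal llama-cpp-python sketch looks roughly like this. The model path, context size, layer count, and prompt below are placeholders (not a tested configuration), and the prompt format is my assumption of the instruct template.

# Sketch: load a DeepSeek Coder GGUF with llama-cpp-python and run one completion.
from llama_cpp import Llama

llm = Llama(
    model_path="./deepseek-coder-33b-instruct.Q6_K.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=30,   # set to 0 for CPU-only
)

out = llm(
    "### Instruction:\nwrite a quick sort algorithm in python.\n### Response:\n",
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["text"])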

@artificialgenerations4gsdfg did it work for you using llama.cpp?

Yes it works on 0.2.19

llama_cpp_python 0.2.19+cpuavx2
llama_cpp_python_cuda 0.2.19+cu121

I'm using it with the following parameters:

temperature: 1.31
top_p: 0.14
top_k: 49
repetition_penalty: 1 (seems important; otherwise it adds new questions/answers by itself that are unrelated to the subject)

I get 14.32 tokens/s on a 4090 with 64 layers offloaded to GPU.
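In case it's useful, here is roughly how those settings map onto a direct llama-cpp-python call. This is just a sketch: the model path and prompt are placeholders, and I'm assuming ooba's repetition_penalty corresponds to llama-cpp-python's repeat_penalty.

from llama_cpp import Llama

llm = Llama(
    model_path="./deepseek-coder-33b-instruct.Q6_K.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=64,   # 64 layers offloaded, as above
)

# repeat_penalty=1.0 effectively disables the repetition penalty, which is what
# keeps the model from inventing unrelated extra question/answer turns.
out = llm(
    "### Instruction:\nwrite a quick sort algorithm in python.\n### Response:\n",
    max_tokens=256,
    temperature=1.31,
    top_p=0.14,
    top_k=49,
    repeat_penalty=1.0,
)
print(out["choices"][0]["text"])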
