Do they work with ollama? How was the conversion done for 128K? llama.cpp/convert.py complains about ROPE.

#2 by BigDeeper - opened

(Pythogora) developer@ai:~/PROJECTS/autogen$ ~/ollama/ollama run phi-3-mini-128k-instruct.Q6_K
Error: llama runner process no longer running: 1 error:failed to create context with model '/home/developer/.ollama/models/blobs/sha256-78f928e77e2470c7c09b151ff978bc348ba18ccde0991d03fe34f16fb9471460'

This is what the Modelfile looks like.

FROM /opt/data/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/Phi-3-mini-128k-instruct.Q6_K.gguf

TEMPLATE """
{{- if .First}}
<|system|>
{{ .System}}<|end|>
{{- end}}
<|user|>
{{ .Prompt}}<|end|>
<|assistant|>
"""

PARAMETER num_ctx 128000
PARAMETER temperature 0.2
PARAMETER num_gpu 100

PARAMETER stop <|end|>
PARAMETER stop <|endoftext|>

SYSTEM """You are a helpful AI which can plan, program, and test, analyze and debug."""

Pruna AI org

We have not tested the model on ollama, but please make sure you are using the latest versions of both ollama and llama.cpp :)

I built both from source. I was able to load someone else's GGUF into VRAM with llama.cpp and it was kind of responding, but ollama fails consistently.

Ollama can import a gguf model but not run it.
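
For reference, this is roughly the sequence I use (the model name matches the Modelfile shown above; adjust paths to your setup):

# importing the GGUF plus Modelfile into ollama works fine
ollama create phi-3-mini-128k-instruct.Q6_K -f Modelfile
# running it is the step that dies with the "failed to create context" error
ollama run phi-3-mini-128k-instruct.Q6_K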

Pruna AI org

We have had users successfully run these quants with llama.cpp, but no one has mentioned ollama. It could be that ollama does not support Phi-3 models yet.

A temporary work-around is to set the context to 60000. Not as good as 128K, but better than 4K.

This appears to work for ollama with my four (4) 12.2 GiB Titan GPUs. Others may have to adjust their context sizes to match their hardware.
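
Concretely, the only change from the Modelfile I posted above is the context parameter; re-create the model afterwards so ollama picks it up (tune the value to your VRAM):

# was: PARAMETER num_ctx 128000
PARAMETER num_ctx 60000

ollama create phi-3-mini-128k-instruct.Q6_K -f Modelfile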

@BigDeeper: use convert-hf-to-gguf.py instead of convert.py.
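
Something along these lines should do it; exact flags depend on your llama.cpp revision, and the paths and filenames here are just placeholders:

# convert the HF checkpoint to an f16 GGUF
python convert-hf-to-gguf.py /path/to/Phi-3-mini-128k-instruct --outfile phi-3-mini-128k-instruct.f16.gguf --outtype f16
# then quantize (the binary is called llama-quantize on newer builds)
./quantize phi-3-mini-128k-instruct.f16.gguf phi-3-mini-128k-instruct.Q6_K.gguf Q6_K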

Pruna AI org

Nice, thanks a lot for coming back and letting everyone know! I will keep the discussion open.

I also hit this exact problem on far less hardware. There must be some Ollama problem with num_ctx above 60000 or so.

The 60k trick does work with command-line ollama; however, for some reason I can't figure out, it does not work with ChatOllama or Ollama from LangChain.
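
For completeness, this is the kind of call I'm making. It's a minimal sketch assuming langchain_community's ChatOllama and the 60000-context model created above; through this path the runner still fails for me even though the same model works from the CLI:

from langchain_community.chat_models import ChatOllama

# model name as created with `ollama create`; num_ctx mirrors the Modelfile work-around
llm = ChatOllama(model="phi-3-mini-128k-instruct.Q6_K", num_ctx=60000, temperature=0.2)

# simple smoke test: fine via `ollama run`, fails when called through LangChain for me
print(llm.invoke("Say hello in one short sentence.").content)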
