Do they work with ollama? How was the conversion done for 128K? llama.cpp/convert.py complains about ROPE.

#2 by BigDeeper - opened

(Pythogora) developer@ai:~/PROJECTS/autogen$ ~/ollama/ollama run phi-3-mini-128k-instruct.Q6_K
Error: llama runner process no longer running: 1 error:failed to create context with model '/home/developer/.ollama/models/blobs/sha256-78f928e77e2470c7c09b151ff978bc348ba18ccde0991d03fe34f16fb9471460'

This is what the Modelfile looks like.

FROM /opt/data/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/Phi-3-mini-128k-instruct.Q6_K.gguf

TEMPLATE """
{{- if .First}}
<|system|>
{{ .System}}<|end|>
{{- end}}
<|user|>
{{ .Prompt}}<|end|>
<|assistant|>
"""

PARAMETER num_ctx 128000
PARAMETER temperature 0.2
PARAMETER num_gpu 100

PARAMETER stop <|end|>
PARAMETER stop <|endoftext|>

SYSTEM """You are a helpful AI which can plan, program, and test, analyze and debug."""

Pruna AI org

We have not tested the model on ollama, but please make sure you are using the latest versions of both ollama and llama.cpp :)

I built both from source. I was able to load someone else's GGUF into VRAM with llama.cpp and it was kind of responding, but ollama fails consistently.

Ollama can import a gguf model but not run it.
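
For reference, this is roughly the sequence I use (the model name matches the Modelfile shown above; adjust paths to your setup):

# importing the GGUF plus Modelfile into ollama works fine
ollama create phi-3-mini-128k-instruct.Q6_K -f Modelfile
# running it is the step that dies with the "failed to create context" error
ollama run phi-3-mini-128k-instruct.Q6_K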

Pruna AI org

We have had users successfully run these quants with llama.cpp, but no one has mentioned ollama. It could be that ollama does not support Phi-3 models yet.

A temporary work-around is to set the context to 60000. Not as good as 128K, but better than 4K.

This appears to work for ollama with my four (4) 12.2 GiB Titan GPUs. Others may have to adjust their context sizes to match their hardware.
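
Concretely, the only change from the Modelfile I posted above is the context parameter; re-create the model afterwards so ollama picks it up (tune the value to your VRAM):

# was: PARAMETER num_ctx 128000
PARAMETER num_ctx 60000

ollama create phi-3-mini-128k-instruct.Q6_K -f Modelfile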

@BigDeeper: use convert-hf-to-gguf.py instead of convert.py.
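
Something along these lines should do it; exact flags depend on your llama.cpp revision, and the paths and filenames here are just placeholders:

# convert the HF checkpoint to an f16 GGUF
python convert-hf-to-gguf.py /path/to/Phi-3-mini-128k-instruct --outfile phi-3-mini-128k-instruct.f16.gguf --outtype f16
# then quantize (the binary is called llama-quantize on newer builds)
./quantize phi-3-mini-128k-instruct.f16.gguf phi-3-mini-128k-instruct.Q6_K.gguf Q6_K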

Pruna AI org

Nice, thanks a lot for coming back and letting everyone know! I will keep the discussion open.

I also hit this exact problem on far less hardware. There must be some Ollama problem with num_ctx above 60000 or so.

The 60k trick does work with command-line ollama; however, for some reason I can't figure out, it does not work with ChatOllama or Ollama from LangChain.
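
For completeness, this is the kind of call I'm making. It's a minimal sketch assuming langchain_community's ChatOllama and the 60000-context model created above; through this path the runner still fails for me even though the same model works from the CLI:

from langchain_community.chat_models import ChatOllama

# model name as created with `ollama create`; num_ctx mirrors the Modelfile work-around
llm = ChatOllama(model="phi-3-mini-128k-instruct.Q6_K", num_ctx=60000, temperature=0.2)

# simple smoke test: fine via `ollama run`, fails when called through LangChain for me
print(llm.invoke("Say hello in one short sentence.").content)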
