ollama q4_0 model starts generating garbage for longer generations

#4
by eugeneware - opened

It generates fine for short passages, but on longer generations it starts producing garbage and doesn't stop, as if it isn't stopping on an end token.

Details:

$ sha256sum Phi-3-mini-4k-instruct-q4.gguf
4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e  Phi-3-mini-4k-instruct-q4.gguf

$ ollama --version
ollama version is 0.1.32

$ ollama show phi3mini --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3mini:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"

Here’s an example of what it generates when it goes crazy:


Let's extend the previous code to include an endpoint with simulated batching every 500ms:

```python
import asyncio
from datetime import datetime, timedelta
from fastapi import FastAPI, HTTPException
# (other imports remain unchanged)

app = FastAPI()
request_queue: asyncio.Queue[List[str]] = asyncio.Queue(maxsize=10)  # Adjust size based on expected load.



@app

	.post("/predict")
async def make_prediction(data: List[str]):
    request_queue.put_nowait(data)

# Simulated periodic processing every 500ms, adjust the batching interval as needed.
async def process_requests():
    while True:
        # Get and process a batch of requests in an asynchronous manner using `asyncio` tasks.
        async with websockets.connect("ws://localhost:8789/batch-predict") as websocket:  # This should be replaced by your real WS endpoint setup.
            start_time = datetime.now()
            while True:
                batch_data = [request_queue.get_nowait() for _ in range(min(10, request_queue.qsize())]  # Adjust the number of requests processed count
with `asyncio`.
                    results = await asyncio.sleep(500ms: this implementation could be implemented using aiopsy.gathering task executor (forces to terminate
the code snippet of Llamasyncio.  #python in python code for request processing and logging into an async requests, use oflaglio.
python asyncio. Usefulness andltwo using PyTensor(synchronously execute aio androboticsoloctionary models. Forward-like this model loading model serve
itermainlylog, yout0starter, butler toyfake 500s APIrioting requests. Assumeley processing of Llammingtonskilling serves with async functioning Python that
is serving a server to usecase `asyncecklinglingways, with the model operations every andltwoeither to run asynchronous, buttelenco forge anfake using
FastLlama2 andpy andlsusible models. Use of python server in-like service. You'll200sizeset modeling thatpsyolo use case youtio Python code:
tokens, butter to deployments and import a module loadable (simulate functionallys
requests,pytorch requests for a single modelinglaterkilling operation usages of python asyntockets using functions. The fast-likewheelitelemplates withs
atlas andlambatches andapipeckakes topsend0pinglingly andlatency to performers andkills, theks to beacon for a to usekautrial models andlsaying
Modeltockets, forgeams.pylatercs tops attengure asyntofakelemplexes,mymodel orsnakexilexlinglysight �languages ands20pinglexeskylexaxhing andphasesetypes
andheis_model andsane asserowley andorikaero to beckeryd2lambaker tobifacessimalyticskiles,coclampiveh-apiarubilasticlysomerpiplecoskillingly-
 h-requests 0 `tocargoilis you -0
pift, pyrosting and the model
modellary toloud in thehing a new json


 requests an pre-

 a quickerndreaming to
 to beck to


mika server
  customs
you to

request -subronning
p larime, tolless no model �ary to the model the,l to b forked as an
 toclaiming its requests a request and0
 datheques

d20oric:s r-
  server:
 pyck as itoi
 to
toches, for the
 to serve

 the model to

 `API

requests
 to �responses
 custom
 your' ayou



py `mops and batch (10ero
  0-binary or no-
to
:
```

Also happens with the official phi3 ollama model at https://ollama.com/library/phi3

In case it's relevant, I'm running inference on 2x 3090s on Ubuntu 20.04.6.

Same issue with LlamaEdge.

It also happens with the fp16 weights.

Fixed with:

python3 gguf-py/scripts/gguf-set-metadata.py Phi-3-mini-4k-instruct-q4.gguf tokenizer.ggml.eos_token_id 32007

gguf-py is part of ggerganov/llama.cpp.

Adapted from: https://www.reddit.com/r/LocalLLaMA/comments/1cb6cuu/comment/l0we43q
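
For anyone double-checking the edit: token id 32007 is <|end|> in Phi-3's vocabulary. Here is a minimal sketch to read the metadata back afterwards, assuming the gguf Python package from llama.cpp's gguf-py is installed and the file is in the current directory:

```python
# Minimal sketch: read tokenizer.ggml.eos_token_id back out of the GGUF file
# to confirm the metadata edit took effect.
# Assumes the gguf-py package from llama.cpp is installed (e.g. pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("Phi-3-mini-4k-instruct-q4.gguf")
field = reader.fields["tokenizer.ggml.eos_token_id"]

# For scalar metadata, the value is stored in the last part of the field.
print("eos_token_id =", field.parts[-1][0])  # expect 32007 (<|end|>) after the fix
```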

That seems to help a little, but I'm still seeing corruption in generations.

Hi there, I'm so sorry this happened. It's an issue seen with a few models that occurs when hitting the context limit. We're working on a longer term fix for better handling cases where the context limit is hit.

In the meantime, I've updated some of the runtime parameters here: https://ollama.com/library/phi3 (you can re-pull with ollama pull phi3; it should be fast).

For those using the Modelfile, adding a num_keep parameter will help (num_keep sets how many tokens from the start of the prompt are retained when the context window fills up and older tokens are discarded):

PARAMETER num_keep 16
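
For reference, a sketch of the Modelfile from the top of this thread with that parameter added (the FROM path and template are copied from the earlier post; adjust to your setup):

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER num_keep 16
PARAMETER stop "<|end|>"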

Thanks. I'm still seeing the issue when I hit the context limit with the latest Modelfile, but I guess that's because the last few messages are still overflowing the context. Thanks for the hard work. Love your work!

Microsoft org

I am wondering if you can try this:

FROM ./Phi-3-mini-4k-instruct-q4.gguf
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER stop <|end|>
PARAMETER stop <|endoftext|>
PARAMETER num_ctx 4096

The current phi3 ollama Modelfile is:

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
PARAMETER num_keep 4
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|system|>"
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
nguyenbh changed discussion status to closed
