ollama q4_0 model starts generating garbage for longer generations

#4
by eugeneware - opened

It generates fine for short passages, but on longer generations it starts producing garbage and doesn't stop, as if it isn't stopping on an end token.

Details:

$ sha256sum Phi-3-mini-4k-instruct-q4.gguf
4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e  Phi-3-mini-4k-instruct-q4.gguf

$ ollama --version
ollama version is 0.1.32

$ ollama show phi3mini --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3mini:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"

Here’s an example of what it generates when it goes crazy:


Let's extend the previous code to include an endpoint with simulated batching every 500ms:

```python
import asyncio
from datetime import datetime, timedelta
from fastapi import FastAPI, HTTPException
# (other imports remain unchanged)

app = FastAPI()
request_queue: asyncio.Queue[List[str]] = asyncio.Queue(maxsize=10)  # Adjust size based on expected load.



@app

	.post("/predict")
async def make_prediction(data: List[str]):
    request_queue.put_nowait(data)

# Simulated periodic processing every 500ms, adjust the batching interval as needed.
async def process_requests():
    while True:
        # Get and process a batch of requests in an asynchronous manner using `asyncio` tasks.
        async with websockets.connect("ws://localhost:8789/batch-predict") as websocket:  # This should be replaced by your real WS endpoint setup.
            start_time = datetime.now()
            while True:
                batch_data = [request_queue.get_nowait() for _ in range(min(10, request_queue.qsize())]  # Adjust the number of requests processed count
with `asyncio`.
                    results = await asyncio.sleep(500ms: this implementation could be implemented using aiopsy.gathering task executor (forces to terminate
the code snippet of Llamasyncio.  #python in python code for request processing and logging into an async requests, use oflaglio.
python asyncio. Usefulness andltwo using PyTensor(synchronously execute aio androboticsoloctionary models. Forward-like this model loading model serve
itermainlylog, yout0starter, butler toyfake 500s APIrioting requests. Assumeley processing of Llammingtonskilling serves with async functioning Python that
is serving a server to usecase `asyncecklinglingways, with the model operations every andltwoeither to run asynchronous, buttelenco forge anfake using
FastLlama2 andpy andlsusible models. Use of python server in-like service. You'll200sizeset modeling thatpsyolo use case youtio Python code:
tokens, butter to deployments and import a module loadable (simulate functionallys
requests,pytorch requests for a single modelinglaterkilling operation usages of python asyntockets using functions. The fast-likewheelitelemplates withs
atlas andlambatches andapipeckakes topsend0pinglingly andlatency to performers andkills, theks to beacon for a to usekautrial models andlsaying
Modeltockets, forgeams.pylatercs tops attengure asyntofakelemplexes,mymodel orsnakexilexlinglysight �languages ands20pinglexeskylexaxhing andphasesetypes
andheis_model andsane asserowley andorikaero to beckeryd2lambaker tobifacessimalyticskiles,coclampiveh-apiarubilasticlysomerpiplecoskillingly-
 h-requests 0 `tocargoilis you -0
pift, pyrosting and the model
modellary toloud in thehing a new json


 requests an pre-

 a quickerndreaming to
 to beck to


mika server
  customs
you to

request -subronning
p larime, tolless no model �ary to the model the,l to b forked as an
 toclaiming its requests a request and0
 datheques

d20oric:s r-
  server:
 pyck as itoi
 to
toches, for the
 to serve

 the model to

 `API

requests
 to �responses
 custom
 your' ayou



py `mops and batch (10ero
  0-binary or no-
to
:
```

Also happens with the official phi3 ollama model at https://ollama.com/library/phi3

In case it's relevant, I'm running inference on 2x 3090s on Ubuntu 20.04.6.

Same issue with LlamaEdge.

It also happens with the fp16 weights.

Fixed with:

python3 gguf-py/scripts/gguf-set-metadata.py Phi-3-mini-4k-instruct-q4.gguf tokenizer.ggml.eos_token_id 32007

gguf-py is part of ggerganov/llama.cpp.

Adapted from: https://www.reddit.com/r/LocalLLaMA/comments/1cb6cuu/comment/l0we43q
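
For anyone double-checking the edit: token id 32007 is <|end|> in Phi-3's vocabulary. Here is a minimal sketch to read the metadata back afterwards, assuming the gguf Python package from llama.cpp's gguf-py is installed and the file is in the current directory:

```python
# Minimal sketch: read tokenizer.ggml.eos_token_id back out of the GGUF file
# to confirm the metadata edit took effect.
# Assumes the gguf-py package from llama.cpp is installed (e.g. pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("Phi-3-mini-4k-instruct-q4.gguf")
field = reader.fields["tokenizer.ggml.eos_token_id"]

# For scalar metadata, the value is stored in the last part of the field.
print("eos_token_id =", field.parts[-1][0])  # expect 32007 (<|end|>) after the fix
```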

That seems to help a little, but I'm still seeing corruption in generations.

Hi there, I'm so sorry this happened. It's an issue seen with a few models that occurs when hitting the context limit. We're working on a longer term fix for better handling cases where the context limit is hit.

In the meantime, I've updated some of the runtime parameters here: https://ollama.com/library/phi3 (you can re-pull with ollama pull phi3; it should be fast).

For those using the Modelfile, adding a num_keep parameter will help (num_keep sets how many tokens from the start of the prompt are retained when the context window fills up and older tokens are discarded):

PARAMETER num_keep 16
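
For reference, a sketch of the Modelfile from the top of this thread with that parameter added (the FROM path and template are copied from the earlier post; adjust to your setup):

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER num_keep 16
PARAMETER stop "<|end|>"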

Thanks. I'm still seeing the issue when I hit the context limit with the latest Modelfile, but I guess that's because the last few messages are still overflowing the context. Thanks for the hard work. Love your work!

Microsoft org

I am wondering if you can try this:

FROM ./Phi-3-mini-4k-instruct-q4.gguf
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER stop <|end|>
PARAMETER stop <|endoftext|>
PARAMETER num_ctx 4096

The current phi3 ollama Modelfile is:

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
PARAMETER num_keep 4
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|system|>"
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
nguyenbh changed discussion status to closed
