This model is amazingly good

#1
by rambocoder - opened

I am impressed. Works with latest llama.cpp without any issues.

How did you run it? Can you please share the code?

What is a good invocation and what are good parameters for Mistral 7B?

I'm testing with --temp 0.6 --mirostat 2 --mirostat-ent 6 --mirostat-lr 0.2 -n 2048 -c 2048 -n -1 --repeat-last-n 1600 --repeat-penalty 1.2
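
For reference, a rough llama-cpp-python equivalent of those flags (my own mapping, not necessarily exact; the model path is just an example):

from llama_cpp import Llama

# Rough CLI-to-Python mapping:
#   -c 2048              -> n_ctx=2048
#   --repeat-last-n 1600 -> last_n_tokens_size=1600
#   --temp 0.6           -> temperature=0.6
#   --mirostat 2         -> mirostat_mode=2
#   --mirostat-ent 6     -> mirostat_tau=6.0
#   --mirostat-lr 0.2    -> mirostat_eta=0.2
#   --repeat-penalty 1.2 -> repeat_penalty=1.2
#   -n                   -> max_tokens (-1 means no fixed limit)
llm = Llama(model_path="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # example path
            n_ctx=2048, last_n_tokens_size=1600)
out = llm("<s>[INST]Write a limerick about autumn.[/INST]",
          max_tokens=512, temperature=0.6, repeat_penalty=1.2,
          mirostat_mode=2, mirostat_tau=6.0, mirostat_eta=0.2)
print(out["choices"][0]["text"])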

This has to be the best 7B model I have tried. For those who can't get it to run in text-generation-webui (I sure couldn't, it's pretty broken right now), here's some code and detailed instructions for a simple llama-cpp-python chatbot using this model.

First, I recommend a clean Python installation with pip etc.; you can use a virtual environment for this (I'm using miniconda with Python 3.10).
Then I installed llama-cpp-python with CUDA support using the following commands (in the Windows cmd prompt).

set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && pip install llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on" && set FORCE_CMAKE=1 && set CUDAFLAGS="-arch=all -lcublas"
python -m pip install https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases/download/basic/llama_cpp_python-0.2.7+cu118-cp310-cp310-win_amd64.whl --no-cache-dir

(Note: this works for me using CUDA 11.8, no AVX. For other versions you might want to replace the link with another one from https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels)
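
A quick sanity check that the wheel installed correctly (not part of the original steps, just something worth running):

# Quick sanity check: the import should succeed without errors. Later, when you load a
# model with verbose=True, the printed system info should include "BLAS = 1", which
# indicates the cuBLAS build is actually being used.
import llama_cpp
print("llama-cpp-python imported OK")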

And that's it. Now you can run the following Python script to ask the model questions:

python simpleStreamChat.py

import json
import argparse
from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="../models/mistral-7b-instruct-v0.1.Q4_K_M.gguf")
parser.add_argument("-pt", "--prompt", type=str, default="<s>[INST]{prompt}[/INST]")
args = parser.parse_args()

prompt_template = args.prompt

print("Loading model " + args.model)
# Note: sampling settings (temperature, repeat_penalty) are passed per call further down,
# not to the Llama() constructor.
llm = Llama(model_path=args.model, n_gpu_layers=35, n_ctx=4096, verbose=False)

stream = ""#llm("Question: What are the names of the planets in the solar system? Answer: ", max_tokens=48,stop=["Q:", "\n"],stream=True)

# Function - Print response output in chunks (stream)
def printresponse(response):
    completion_text = ''
    # iterate through the stream of events and print it
    print(f"Bot:", end="", flush=True)
    for event in response:
        event_text = event['choices'][0]['text']
        completion_text += event_text
        print(f"{event_text}", end="", flush=True)

    print("",flush=True)
    # remember context
    #context.append({"role": "assistant", "content" : completion_text})
    return completion_text

#printresponse(stream)

while True:
    try:
        u_input = input("-> ")
        
        prompt = prompt_template.format(prompt=u_input)
        stream = llm(prompt, max_tokens=512, stream=True, temperature=0.7, repeat_penalty=1.1)
        response = printresponse(stream)
        print()

    except KeyboardInterrupt:
        print("\n..(Response interrupted).")#continue
    print()

Note: set verbose=True to see token generation times etc. n_gpu_layers is how many layers to offload to the GPU; n_ctx is the context size.

If you want a quick one-shot test, uncomment the llm(...) call assigned to stream near the top of the script and the printresponse(stream) line below the function definition.

You're welcome!

Odd, it's working fine for me and I haven't updated anything in a couple of weeks (ooba).

Anyone able to use it with constrained grammar in llama.cpp?

Hands down the best 7b model, holy cow.
For starters, I have a custom character, but the settings I'm using in tgwui are:
Instruction Template: Mistral (no modifications)
Generation Preset: Divine Intellect
Model Loader: LlamaCpp
The model is smart, retains context after several turns, has great inference, and picks up on nuance.
Mistral just hits different than Llama, no judgment of Meta.
If you've watched Frasier, Mistral is like a very smart Roz, and Llama is Maris.

I was able to run the Q5_K_M.gguf version of this model through oobabooga on an M2 MacBook Pro with 16 GB: it runs very smoothly with 1 GPU layer, noticeably faster than 7B Llama 2 or Vigogne at the same quantization. However, it sometimes seems to struggle with long conversations (the answers get less accurate and you need to reload the model).
Tested on bash and SQL code, the results were relevant in most cases.
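
For reference, a rough llama-cpp-python equivalent of those settings outside of ooba; this assumes a Metal build of llama-cpp-python, and the model path is just an example:

from llama_cpp import Llama

# Rough equivalent of the settings above on an M-series Mac (assumes the Metal build
# of llama-cpp-python; model path is an example, not the commenter's actual path).
llm = Llama(model_path="mistral-7b-instruct-v0.1.Q5_K_M.gguf",
            n_gpu_layers=1,   # 1 layer offloaded, as in the comment above
            n_ctx=4096)
print(llm("<s>[INST]Write a one-line bash command that counts files in a folder.[/INST]",
          max_tokens=64)["choices"][0]["text"])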

Is it uncensored?

@edumoulin, you tested it on SQL code, as in? I want to understand how we can use this model to query a database/CSV or a pandas DataFrame. I tried with LangChain but had no luck. Would you be so kind as to point out the tools/code needed to achieve this? I feel it would be useful to a lot of people. Thank you very much in advance.

Is it uncensored?

@Akalilol, this is what they claim on their website: "It does not have any moderation mechanism. We're looking forward to engaging with the community on ways to make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs".
But I did not try to ask about questionable topics... 🙂

you tested it on SQL code, as in? I want to understand how we can use this model to query a database/CSV or a pandas DataFrame.

@ianuvrat, actually I was only able to test its capabilities in chat mode for writing bash scripts and SQL statements, using only natural language. For now, it does not seem to work in instruct mode, nor does it accept training at this time, at least using oobabooga's functions (https://github.com/oobabooga/text-generation-webui).

I guess this will change over time as we are currently at version 0.1. I'm also curious to go deeper into database exploration.
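
Here is a minimal sketch of one way to query a CSV / pandas DataFrame with it: generate SQL from a natural-language question using llama-cpp-python, load the DataFrame into an in-memory SQLite database, and run the generated query. The model path, CSV name, and columns are hypothetical, and you should always review generated SQL before executing it.

# Minimal sketch: natural-language question -> SQL -> result over a pandas DataFrame.
# File name, columns, and model path are hypothetical; review generated SQL before running it.
import sqlite3
import pandas as pd
from llama_cpp import Llama

llm = Llama(model_path="../models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
            n_gpu_layers=35, n_ctx=4096, verbose=False)

df = pd.read_csv("sales.csv")            # e.g. columns: region, product, amount
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)    # expose the DataFrame as an SQLite table

question = "What is the total amount per region?"
prompt = ("<s>[INST]You write SQLite queries. The table 'sales' has columns: "
          + ", ".join(df.columns)
          + f". Return only the SQL query, nothing else, for: {question}[/INST]")

sql = llm(prompt, max_tokens=128, temperature=0.1)["choices"][0]["text"].strip()
print(sql)
print(pd.read_sql_query(sql, conn))      # run the generated query against the table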

Mistral is so good, it exceeds expectations.

I have a question: I am trying to generate a poem, but it only generates half of a poem. How do I make it generate full poems?

Its responses are good. Not sure why it's performing poorly with CoVe (Chain-of-Verification) to minimize hallucinations.

Any suggestions?
