Edit model card

Model Name: Qwen2 orca_mini_v7_7b-AWQ

orca_mini_v7_7b-AWQ is AWQ Quantize version of orca_mini_v7_7b model.

Passionate about Generative AI? I help companies to privately train and deploy custom LLM/MLLM affordably. For startups, I can even assist with securing GPU grants to get you started. Let's chat!

https://www.linkedin.com/in/pankajam Looking forward to connecting!


Example Usage

Here is the ChatML prompt format

<|im_start|>system
You are Orca Mini, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello Orca Mini, what can you do for me?<|im_end|>
<|im_start|>assistant

Below shows a code example on how to use this model

from transformers import AutoModel, AutoTokenizer
model_slug = "pankajmathur/orca_mini_v7_7b-AWQ"
model = AutoModel.from_pretrained(model_slug)
tokenizer = AutoTokenizer.from_pretrained(model_slug)
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
gen_input = tokenizer.apply_chat_template(messages, return_tensors="pt")
model.generate(**gen_input)

Processing Long Texts (Based upon Qwen2-7B-Instruct suggestions at https://huggingface.co/Qwen/Qwen2-7B-Instruct)

To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

  1. Install vLLM: You can install vLLM by running the following command.
pip install "vllm>=0.4.3"

Or you can install vLLM from source.

  1. Configure Model Settings: After downloading the model weights, modify the config.json file by including the below snippet:

        {
            "architectures": [
                "Qwen2ForCausalLM"
            ],
            // ...
            "vocab_size": 152064,
            // adding the following snippets
            "rope_scaling": {
                "factor": 4.0,
                "original_max_position_embeddings": 32768,
                "type": "yarn"
            }
        }
    

    This snippet enable YARN to support longer contexts.

  2. Model Deployment: Utilize vLLM to deploy your model. For instance, you can set up an openAI-like server using the command:

    python -u -m vllm.entrypoints.openai.api_server --model pankajmathur/orca_mini_v7_7b-AWQ --quantization awq
    

    Then you can access the Chat API by:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "pankajmathur/orca_mini_v7_7b-AWQ",
        "messages": [
          {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
          {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
        ]
        }'
    

Note: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

Downloads last month
1
Safetensors
Model size
1.96B params
Tensor type
I32
·
FP16
·
Inference API
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.