Qwen2.5-14B-Instruct-1M-GGUF
This is a GGUF conversion of Qwen2.5-14B-Instruct-1M, converted from Safetensors to a mixed-precision (FP16/F32) format.
Model Information
This model is a GGUF conversion of the Qwen2.5-14B-Instruct-1M model, optimized for efficient inference on consumer hardware. The quantization process uses a mixed precision format combining FP16 (half-precision floating point) and F32 (single-precision floating point) to balance performance and accuracy.
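If you want to check which tensors were kept in F32 and which were converted to FP16, the gguf Python package (maintained in the llama.cpp repository) provides a dump utility; a minimal sketch, assuming the GGUF file is in the current directory:
# Install the gguf tooling, then list the file's metadata and tensor types
pip install gguf
gguf-dump Qwen2.5-14B-Instruct-1M.gguf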
Key Features
- Base Model: Qwen2.5-14B-Instruct-1M
- Quantization Format: Mixed precision (FP16 + F32)
- File Format: GGUF (GPT-Generated Unified Format)
- Context Length: up to 1M tokens (the "1M" in the model name refers to this extended context window)
- Tuning: instruction-tuned chat model (Instruct variant of Qwen2.5-14B)
- Languages: primarily English and Chinese, with support for many additional languages
Usage
The GGUF file is optimized for efficient local inference and can be used with llama.cpp, text-generation-webui, or Ollama.
Running with llama.cpp
./llama-cli -m Qwen2.5-14B-Instruct-1M.gguf -c 8192 -n 512 -p "User: How does photosynthesis work?\nAssistant:"
(In older llama.cpp builds the binary is named ./main.)
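llama.cpp also ships a server binary if you prefer an HTTP endpoint over a one-shot CLI run. A minimal sketch, assuming a current llama.cpp build (adjust -c to your memory budget):
# Serve the model over HTTP on port 8080 with an 8K context window
./llama-server -m Qwen2.5-14B-Instruct-1M.gguf -c 8192 --port 8080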
Deployment on Ollama
Ollama provides a simple way to run this model locally. Follow these steps to deploy Qwen2.5-14B-Instruct-1M-GGUF on Ollama:
1. Install Ollama
If you haven't installed Ollama yet, download and install it from ollama.ai.
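On Linux, for example, the one-line install script from the Ollama site can be used:
# Download and run the official Ollama install script
curl -fsSL https://ollama.ai/install.sh | sh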
2. Create a Modelfile
Create a file named Modelfile with the following content:
FROM ./Qwen2.5-14B-Instruct-1M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER stop "User:"
PARAMETER stop "Assistant:"
PARAMETER repeat_penalty 1.1
SYSTEM """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make sense, or is not factually coherent, explain why instead of answering something incorrect. If you don't know the answer to a question, please don't share false information."""
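Note that Qwen2.5 models natively use the ChatML prompt format. Without a TEMPLATE directive, Ollama falls back to a generic default, so you may want to add one to the Modelfile; a minimal sketch using Ollama's standard Go-template fields:
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
If you add this template, swapping the User:/Assistant: stop parameters for PARAMETER stop "<|im_end|>" matches the model's native end-of-turn token.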
3. Create and Run the Model
Navigate to the directory containing your Modelfile and the GGUF file, then run:
# Create the model
ollama create qwen2.5-14b-1M -f Modelfile
# Run the model
ollama run qwen2.5-14b-1M
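You can also pass a one-shot prompt on the command line instead of starting an interactive session:
# Ask a single question without entering the interactive prompt
ollama run qwen2.5-14b-1M "How does photosynthesis work?"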
4. API Usage
You can also use the model via Ollama's API:
curl -X POST http://localhost:11434/api/generate -d '{
"model": "qwen2.5-14b-1M",
"prompt": "Explain quantum computing in simple terms",
"stream": false
}'
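For multi-turn conversations, Ollama also exposes a chat endpoint that accepts a list of messages rather than a single prompt string:
curl -X POST http://localhost:11434/api/chat -d '{
"model": "qwen2.5-14b-1M",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"stream": false
}'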
Performance Considerations
- RAM: a 14B-parameter model stored mostly in FP16 occupies roughly 28GB, so at least 32GB of system RAM is recommended; 16GB is only enough for more aggressively quantized variants
- GPU: a card with at least 8GB VRAM can offload a portion of the layers (see the sketch after this list); fitting the full FP16 model requires roughly 24GB or more
- CPU-only inference is possible but will be significantly slower, and long contexts further increase memory use through the KV cache
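With llama.cpp, partial GPU offloading is controlled by the -ngl (number of GPU layers) flag; the layer count below is illustrative, so raise or lower it to fit your VRAM:
# Offload 20 transformer layers to the GPU and keep the rest on the CPU
./llama-cli -m Qwen2.5-14B-Instruct-1M.gguf -ngl 20 -c 8192 -n 512 -p "User: How does photosynthesis work?\nAssistant:"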
License
This model is released under the Apache 2.0 license.
Acknowledgements
- Original model by Qwen team at Alibaba Cloud
- Quantization by RekklesAI