Qwen2.5-14B-Instruct-1M-GGUF
This is a GGUF conversion of Qwen2.5-14B-Instruct-1M, converted from Safetensors to a mixed-precision (FP16/F32) format.
Model Information
This model is a GGUF conversion of the Qwen2.5-14B-Instruct-1M model, optimized for efficient inference on consumer hardware. The quantization process uses a mixed precision format combining FP16 (half-precision floating point) and F32 (single-precision floating point) to balance performance and accuracy.
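If you want to check which tensors were kept in F32 and which were converted to FP16, the gguf Python package (maintained in the llama.cpp repository) provides a dump utility; a minimal sketch, assuming the GGUF file is in the current directory:
# Install the gguf tooling, then list the file's metadata and tensor types
pip install gguf
gguf-dump Qwen2.5-14B-Instruct-1M.gguf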
Key Features
- Base Model: Qwen2.5-14B-Instruct-1M
- Quantization Format: Mixed precision (FP16 + F32)
- File Format: GGUF (GPT-Generated Unified Format)
- Context Length: up to 1M tokens (the "1M" in the model name refers to this extended context window)
- Tuning: instruction-tuned chat model (Instruct variant of Qwen2.5-14B)
- Languages: primarily English and Chinese, with support for many additional languages
Usage
The GGUF file is optimized for efficient local inference and can be used with llama.cpp, text-generation-webui, or Ollama.
Running with llama.cpp
./llama-cli -m Qwen2.5-14B-Instruct-1M.gguf -c 8192 -n 512 -p "User: How does photosynthesis work?\nAssistant:"
(In older llama.cpp builds the binary is named ./main.)
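llama.cpp also ships a server binary if you prefer an HTTP endpoint over a one-shot CLI run. A minimal sketch, assuming a current llama.cpp build (adjust -c to your memory budget):
# Serve the model over HTTP on port 8080 with an 8K context window
./llama-server -m Qwen2.5-14B-Instruct-1M.gguf -c 8192 --port 8080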
Deployment on Ollama
Ollama provides a simple way to run this model locally. Follow these steps to deploy Qwen2.5-14B-Instruct-1M-GGUF on Ollama:
1. Install Ollama
If you haven't installed Ollama yet, download and install it from ollama.ai.
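On Linux, for example, the one-line install script from the Ollama site can be used:
# Download and run the official Ollama install script
curl -fsSL https://ollama.ai/install.sh | sh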
2. Create a Modelfile
Create a file named Modelfile with the following content:
FROM ./Qwen2.5-14B-Instruct-1M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER stop "User:"
PARAMETER stop "Assistant:"
PARAMETER repeat_penalty 1.1
SYSTEM """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make sense, or is not factually coherent, explain why instead of answering something incorrect. If you don't know the answer to a question, please don't share false information."""
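Note that Qwen2.5 models natively use the ChatML prompt format. Without a TEMPLATE directive, Ollama falls back to a generic default, so you may want to add one to the Modelfile; a minimal sketch using Ollama's standard Go-template fields:
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
If you add this template, swapping the User:/Assistant: stop parameters for PARAMETER stop "<|im_end|>" matches the model's native end-of-turn token.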
3. Create and Run the Model
Navigate to the directory containing your Modelfile and the GGUF file, then run:
# Create the model
ollama create qwen2.5-14b-1M -f Modelfile
# Run the model
ollama run qwen2.5-14b-1M
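You can also pass a one-shot prompt on the command line instead of starting an interactive session:
# Ask a single question without entering the interactive prompt
ollama run qwen2.5-14b-1M "How does photosynthesis work?"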
4. API Usage
You can also use the model via Ollama's API:
curl -X POST http://localhost:11434/api/generate -d '{
"model": "qwen2.5-14b-1M",
"prompt": "Explain quantum computing in simple terms",
"stream": false
}'
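For multi-turn conversations, Ollama also exposes a chat endpoint that accepts a list of messages rather than a single prompt string:
curl -X POST http://localhost:11434/api/chat -d '{
"model": "qwen2.5-14b-1M",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"stream": false
}'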
Performance Considerations
- RAM: a 14B-parameter model stored mostly in FP16 occupies roughly 28GB, so at least 32GB of system RAM is recommended; 16GB is only enough for more aggressively quantized variants
- GPU: a card with at least 8GB VRAM can offload a portion of the layers (see the sketch after this list); fitting the full FP16 model requires roughly 24GB or more
- CPU-only inference is possible but will be significantly slower, and long contexts further increase memory use through the KV cache
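With llama.cpp, partial GPU offloading is controlled by the -ngl (number of GPU layers) flag; the layer count below is illustrative, so raise or lower it to fit your VRAM:
# Offload 20 transformer layers to the GPU and keep the rest on the CPU
./llama-cli -m Qwen2.5-14B-Instruct-1M.gguf -ngl 20 -c 8192 -n 512 -p "User: How does photosynthesis work?\nAssistant:"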
License
This model is released under the Apache 2.0 license.
Acknowledgements
- Original model by Qwen team at Alibaba Cloud
- Quantization by RekklesAI