Instructions to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with libraries, notebooks, and local apps.
- Libraries
- Transformers
How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
model = AutoModelForCausalLM.from_pretrained("JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Apply the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback
```
- SGLang
How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with Docker Model Runner:
```shell
docker model run hf.co/JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback
```
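The curl calls above can also be made from Python. The following is a minimal sketch using only the standard library, assuming a server started with one of the commands above is reachable at the given base URL. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM or SGLang.

```python
import json
import urllib.request

MODEL_ID = "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback"


def build_chat_request(base_url: str, prompt: str, model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the text of the first assistant message."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server from above running, `ask("http://localhost:8000", "What is the capital of France?")` should return the reply text; for the SGLang commands the base URL would be `http://localhost:30000`.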
Qwen3 4B Thinking 2507 Heretic CodeFeedback
This is a merged code-focused fine-tune based on:
JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic
LoRA adapters were trained on Python and code instruction datasets with the base model loaded in 4-bit (QLoRA), then merged back into the base model.
This repository contains the full merged safetensors model, not only a LoRA adapter.
Base model
| Item | Value |
|---|---|
| Base model | JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic |
| Architecture family | Qwen3 |
| Parameter count | 4B |
| Format | Hugging Face Transformers / safetensors |
| Tensor type | F16 |
| Fine-tuning method | QLoRA / LoRA |
| Final state | Merged model |
Training datasets
| Dataset | Samples used | Notes |
|---|---|---|
| iamtarun/python_code_instructions_18k_alpaca | 5,000 | Python instruction/code examples |
| m-a-p/CodeFeedback-Filtered-Instruction | 5,000 | Code instruction and feedback examples |
A SWE-smith trajectory experiment was tested separately, but it was not used in this final merged version.
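The two datasets use different column layouts, so their rows need to be brought into one chat format before training. The card does not say how this was done or which 5,000 rows were selected, so the sketch below is only one plausible approach. It assumes the Alpaca-style columns `instruction`, `input`, `output` for the Python dataset and `query`, `answer` for CodeFeedback; the function name `to_messages` is hypothetical.

```python
def to_messages(record: dict) -> list[dict]:
    """Convert one dataset row into a two-turn chat conversation."""
    if "query" in record and "answer" in record:
        # CodeFeedback-style row (assumed columns: query, answer)
        user, assistant = record["query"], record["answer"]
    else:
        # Alpaca-style row (assumed columns: instruction, input, output)
        user = record["instruction"]
        if record.get("input"):
            # Append the optional input field to the instruction
            user = f"{user}\n\n{record['input']}"
        assistant = record["output"]
    return [
        {"role": "user", "content": user.strip()},
        {"role": "assistant", "content": assistant.strip()},
    ]
```

If the subsets were simply the first rows of each training split, something like `load_dataset(name, split="train[:5000]")` from the `datasets` library would reproduce them, but that selection rule is an assumption.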
LoRA configuration
| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Sequence length | 2048 |
| Epochs per stage | 1 |
| Quantized loading | 4-bit NF4 |
| Trainable parameters | ~33M |
| Trainable percentage | ~0.81% |
Target modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
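The "~33M" trainable-parameter figure can be checked by hand: a rank-r LoRA adapter on a linear layer with input size d_in and output size d_out adds r × (d_in + d_out) weights. The sketch below applies this to the seven target modules. The layer dimensions are not stated in this card; they are assumed values for the Qwen3-4B architecture and should be confirmed against the model's `config.json`.

```python
# Assumed Qwen3-4B dimensions (not stated in this card; check config.json).
HIDDEN = 2560          # hidden_size
LAYERS = 36            # num_hidden_layers
Q_OUT = 32 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 9728    # intermediate_size
RANK = 16              # LoRA rank from the table above

# (d_in, d_out) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(shapes: dict, rank: int, layers: int) -> int:
    """Count LoRA weights: each adapter has A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers


total = lora_params(SHAPES, RANK, LAYERS)
print(f"{total:,}")  # 33,030,144
```

Under these assumptions the count is about 33.0M, which agrees with the table. With alpha 32 and rank 16, the adapter update is scaled by alpha / rank = 2.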
Training stages
| Stage | Input adapter | Dataset | Output adapter |
|---|---|---|---|
| 1 | Base model | Python instructions 5k | heretic_F_lora_python_5000 |
| 2 | heretic_F_lora_python_5000 | CodeFeedback 5k | heretic_F_lora_python5000_codefeedback5000 |
| Final | Base model + final adapter | Merge | Full safetensors model |
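The final row, merging the adapter into the base model, might look roughly like the sketch below. This is not the author's script. It assumes the `peft` and `transformers` packages are installed and that the adapter directory is the stage 2 output; the function name `merge_adapter` and the output path are hypothetical.

```python
def merge_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a trained LoRA adapter into its base model and save the result."""
    # Heavy dependencies are imported lazily, only when a merge is requested.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base in half precision (not 4-bit) so the merged weights are F16.
    # Older Transformers releases spell this keyword `torch_dtype`.
    base = AutoModelForCausalLM.from_pretrained(base_id, dtype=torch.float16)

    # Attach the adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Save the merged model and the tokenizer side by side.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
```

A call such as `merge_adapter("JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic", "heretic_F_lora_python5000_codefeedback5000", "merged-model")` would then write a standalone model directory.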
Training environment
| Component | Version |
|---|---|
| Python | 3.11 |
| PyTorch | 2.11.0+cu128 |
| CUDA | 12.8 |
| Transformers | 5.10.2 |
| Datasets | 5.0.0 |
| Accelerate | 1.13.0 |
| PEFT | 0.19.1 |
| bitsandbytes | 0.49.2 |
| sentencepiece | 0.2.1 |
| tiktoken | 0.13.0 |
| protobuf | 7.35.0 |
| pandas | 3.0.3 |
| pyarrow | 24.0.0 |
Training GPU:
- NVIDIA GeForce RTX 3080 Ti 12 GB
Intended use
This model is intended for local experimentation with:
- Python code generation
- code explanation
- simple debugging
- instruction-following tests
- downstream conversion to GGUF, AWQ, GPTQ, or OpenVINO formats
Notes
This is an experimental model. It may produce incorrect code, unsafe suggestions, or hallucinated explanations. Outputs should be reviewed before use in production or security-sensitive environments.
Hardware compatibility estimate
This table is an approximate guide for the current merged F16 safetensors version.
| Hardware / VRAM | Status | Notes |
|---|---|---|
| 6 GB VRAM | 🔴 Unlikely | F16 weights are too large without heavy offload or quantization. |
| 8 GB VRAM | 🔴 Very tight | May fail or require CPU offload. Use GGUF/AWQ/INT4 instead. |
| 10 GB VRAM | 🟡 Possible | May run with low context and careful memory settings. |
| 12 GB VRAM | 🟢 Likely | Tested training/inference workflow on RTX 3080 Ti 12 GB with 4-bit loading. |
| 16 GB VRAM | 🟢 Good | Comfortable for normal local inference. |
| 24 GB VRAM | 🟢 Very good | Recommended for larger context, conversion, quantization, and experiments. |
| 32 GB+ RAM CPU-only | 🟡 Possible | Slow. Better with GGUF quantized versions. |
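The estimates in the table follow from simple arithmetic on the weight size. The sketch below is a rough estimate that treats the parameter count as exactly 4 billion and ignores the KV cache, activations, and framework overhead, all of which add to the real footprint.

```python
PARAMS = 4.0e9  # "4B" from the card; the exact parameter count differs slightly

# Approximate storage cost per weight for each format
BYTES_PER_PARAM = {"F16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gib(params: float, bytes_per_param: float) -> float:
    """Return the size of the weights alone, in GiB."""
    return params * bytes_per_param / 2**30


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gib(PARAMS, size):.1f} GiB")
```

The F16 weights alone come to about 7.5 GiB, which is why 8 GB cards are marked as very tight and 4-bit formats (about 1.9 GiB of weights) are suggested for low-VRAM systems.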
Quantized versions
Planned/recommended export formats:
| Format | Status | Expected use |
|---|---|---|
| F16 safetensors | 🟢 Current | Full merged model, best source for conversion. |
| AWQ 4-bit | 🟡 Planned | Better for GPU/server inference, mainly CUDA/Linux or compatible runtimes. |
| OpenVINO INT4 / AWQ-style compression | 🟢 Planned for Intel Arc | Recommended path for Intel Arc/OpenVINO. |
| GGUF Q5_K_M / Q6_K / Q8_0 | 🟡 Planned | Recommended for LM Studio, llama.cpp, Ollama, CPU/GPU mixed inference. |
Practical recommendation
For this repository, use the current F16 safetensors model as the master model.
For actual local use:
- RTX 3080 Ti 12 GB or better: F16 may work, but quantized versions are preferred.
- RTX 3090 24 GB: F16 and quantization workflows are much more comfortable.
- Intel Arc: convert this model to OpenVINO INT4 instead of using CUDA-focused AWQ.
- Low VRAM systems: wait for GGUF or INT4 builds.
Model tree for JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback
Base model
Qwen/Qwen3-4B-Thinking-2507