Instructions for using JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with libraries, inference servers, and local apps.

Libraries

How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the reply is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
model = AutoModelForCausalLM.from_pretrained("JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

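# Thinking models emit a reasoning trace before the answer, so a 40-token
# budget usually cuts the reply short; 1024 is an illustrative value, raise it as memory allows.
outputs = model.generate(**inputs, max_new_tokens=1024)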
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
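
The upstream Qwen3 Thinking models write a reasoning trace that ends with `</think>` before the final answer. The following is a minimal sketch for separating the two parts of a decoded string, assuming this fine-tune keeps that format. The helper name `split_thinking` and the list of end markers are illustrative assumptions, not part of this repository.

```python
# Hypothetical helper: split decoded model output into (reasoning, answer).
# ASSUMPTION: the reasoning trace ends with "</think>", as in upstream Qwen3 Thinking models.

END_MARKERS = ("<|im_end|>", "<|endoftext|>")  # Qwen end-of-turn / end-of-text tokens


def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Return (reasoning, answer).

    If `close_tag` is missing, the output is treated as an unfinished
    reasoning trace (for example, generation stopped at the token limit)
    and the answer is returned as an empty string.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return text.strip(), ""
    # Drop an opening tag if the decoded text happens to include one.
    reasoning = head.replace("<think>", "", 1).strip()
    for marker in END_MARKERS:
        tail = tail.replace(marker, "")
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>\nplan\n</think>\n\nHello<|im_end|>")
print(answer)
```

An empty answer is a hint that `max_new_tokens` was too small for the model to finish reasoning.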

Local Apps

vLLM

How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
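
The same request can be built from Python with only the standard library. This is a sketch under the assumption that a server is listening on the address used in the curl call; `build_chat_request` and `send` are hypothetical helper names, and the response parsing assumes the usual OpenAI-compatible `choices[0].message.content` layout.

```python
import json
import urllib.request

MODEL_ID = "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback"


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant message text.

    Requires a running server; not called at import time.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
# print(send(request))  # uncomment once the server is up
```

Passing `base_url="http://localhost:30000"` targets a server started on port 30000 instead.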

Use Docker

docker model run hf.co/JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback

SGLang

How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback with Docker Model Runner:
```
docker model run hf.co/JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Qwen3 4B Thinking 2507 Heretic CodeFeedback

This is a merged code-focused fine-tune based on:

JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic

The model was trained with QLoRA/LoRA on Python and code instruction datasets, then merged back into the base model.

This repository contains the full merged safetensors model, not only a LoRA adapter.

Base model

Item	Value
Base model	`JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic`
Architecture family	Qwen3
Parameter count	4B
Format	Hugging Face Transformers / safetensors
Tensor type	F16
Fine-tuning method	QLoRA / LoRA
Final state	Merged model

Training datasets

Dataset	Samples used	Notes
`iamtarun/python_code_instructions_18k_alpaca`	5,000	Python instruction/code examples
`m-a-p/CodeFeedback-Filtered-Instruction`	5,000	Code instruction and feedback examples

A SWE-smith trajectory experiment was tested separately, but it was not used in this final merged version.
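
The card does not say how the 5,000 rows per dataset were selected or formatted. The sketch below shows one plausible way to do it: a seeded shuffle followed by conversion to chat messages. The column names (`instruction`/`input`/`output` and `query`/`answer`), the `train` split, and the seed are assumptions based on the public dataset cards, and `load_subset` and `to_messages` are hypothetical helper names.

```python
def load_subset(name: str, n: int = 5000, seed: int = 42):
    """Return `n` shuffled rows of a Hub dataset (ASSUMED sampling strategy)."""
    from datasets import load_dataset  # imported lazily; needs the `datasets` package

    dataset = load_dataset(name, split="train")
    return dataset.shuffle(seed=seed).select(range(n))


def to_messages(example: dict) -> list[dict]:
    """Convert one row into a user/assistant message pair."""
    if "query" in example:
        # ASSUMED CodeFeedback-style columns
        user, assistant = example["query"], example["answer"]
    else:
        # ASSUMED Alpaca-style columns; the optional input is appended to the instruction
        user = example["instruction"]
        if example.get("input"):
            user = f"{user}\n\n{example['input']}"
        assistant = example["output"]
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]


row = {"instruction": "Sum a list", "input": "[1, 2]", "output": "sum([1, 2])"}
print(to_messages(row))
```

The resulting message lists can be rendered with the tokenizer's chat template before tokenization, which keeps training prompts aligned with the format used at inference time.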

LoRA configuration

Parameter	Value
LoRA rank	16
LoRA alpha	32
LoRA dropout	0.05
Sequence length	2048
Epochs per stage	1
Quantized loading	4-bit NF4
Trainable parameters	~33M
Trainable percentage	~0.81%

Target modules:

q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj
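
As a sanity check on the trainable-parameter figure, the count can be reproduced from the rank and the target modules. A LoRA adapter on a linear layer with shape `in x out` adds `rank * (in + out)` parameters. The layer shapes and the base parameter count below are assumptions taken from the upstream Qwen3-4B configuration, not from this repository. The `lora_kwargs` dictionary mirrors the table and could be passed to `peft.LoraConfig(**lora_kwargs)`; the `task_type` entry is also an assumption.

```python
# Values from the LoRA configuration table.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": TARGET_MODULES,
    "task_type": "CAUSAL_LM",  # ASSUMPTION: not stated in the card
}

# ASSUMPTION: shapes from the upstream Qwen3-4B config.
HIDDEN = 2560        # hidden_size
LAYERS = 36          # num_hidden_layers
Q_OUT = 32 * 128     # num_attention_heads * head_dim
KV_OUT = 8 * 128     # num_key_value_heads * head_dim
MLP = 9728           # intermediate_size
BASE_PARAMS = 4.02e9 # approximate base parameter count

# (in_features, out_features) for every targeted projection in one layer.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank: int, shapes: dict, layers: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers


trainable = lora_param_count(lora_kwargs["r"], SHAPES, LAYERS)
fraction = trainable / (BASE_PARAMS + trainable)
print(f"{trainable:,}")   # 33,030,144
print(f"{fraction:.2%}")  # 0.81%
```

Under these assumptions the result agrees with the ~33M and ~0.81% figures in the table.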

Training stages

Stage	Input adapter	Dataset	Output adapter
1	Base model	Python instructions 5k	`heretic_F_lora_python_5000`
2	`heretic_F_lora_python_5000`	CodeFeedback 5k	`heretic_F_lora_python5000_codefeedback5000`
Final	Base model + final adapter	Merge	Full safetensors model
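
The staged flow in the table can be written down as data, with the final merge as a separate step. This is a sketch only: the training loop is omitted, the dataset labels are shorthand, and `merge_final_adapter` is a hypothetical helper that assumes the adapter is merged into a half-precision copy of the base model with PEFT's `merge_and_unload()`. The card does not describe the exact merge script.

```python
BASE_ID = "JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic"

# One entry per training stage; `input_adapter=None` means "start from the base model".
STAGES = [
    {"input_adapter": None,
     "dataset": "python_instructions_5k",
     "output_adapter": "heretic_F_lora_python_5000"},
    {"input_adapter": "heretic_F_lora_python_5000",
     "dataset": "codefeedback_5k",
     "output_adapter": "heretic_F_lora_python5000_codefeedback5000"},
]


def check_chain(stages: list[dict]) -> bool:
    """Verify that every stage resumes from the previous stage's adapter."""
    previous = None
    for stage in stages:
        if stage["input_adapter"] != previous:
            return False
        previous = stage["output_adapter"]
    return True


def merge_final_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Merge a LoRA adapter into the base weights and save a full model.

    Heavy imports are kept inside the function so the sketch loads without them.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load unquantized F16 weights for merging; older Transformers
    # releases name this argument `torch_dtype` instead of `dtype`.
    base = AutoModelForCausalLM.from_pretrained(base_id, dtype=torch.float16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


print(check_chain(STAGES))
# merge_final_adapter(BASE_ID, STAGES[-1]["output_adapter"], "merged-model")
```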

Training environment

Component	Version
Python	3.11
PyTorch	2.11.0+cu128
CUDA	12.8
Transformers	5.10.2
Datasets	5.0.0
Accelerate	1.13.0
PEFT	0.19.1
bitsandbytes	0.49.2
sentencepiece	0.2.1
tiktoken	0.13.0
protobuf	7.35.0
pandas	3.0.3
pyarrow	24.0.0

Training GPU:

NVIDIA GeForce RTX 3080 Ti 12 GB

Intended use

This model is intended for local experimentation with:

Python code generation
code explanation
simple debugging
instruction-following tests
downstream conversion to GGUF, AWQ, GPTQ, or OpenVINO formats

Notes

This is an experimental model. It may produce incorrect code, unsafe suggestions, or hallucinated explanations. Outputs should be reviewed before use in production or security-sensitive environments.

Hardware compatibility estimate

This table is an approximate guide for the current merged F16 safetensors version.

Hardware / VRAM	Status	Notes
6 GB VRAM	🔴 Unlikely	F16 weights are too large without heavy offload or quantization.
8 GB VRAM	🔴 Very tight	May fail or require CPU offload. Use GGUF/AWQ/INT4 instead.
10 GB VRAM	🟡 Possible	May run with low context and careful memory settings.
12 GB VRAM	🟢 Likely	Tested training/inference workflow on RTX 3080 Ti 12 GB with 4-bit loading.
16 GB VRAM	🟢 Good	Comfortable for normal local inference.
24 GB VRAM	🟢 Very good	Recommended for larger context, conversion, quantization, and experiments.
32 GB+ RAM CPU-only	🟡 Possible	Slow. Better with GGUF quantized versions.
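
The rows above follow from simple arithmetic on the weight size. Here is a rough estimate, assuming a nominal 4.0e9 parameters and counting weights only. The KV cache, activations, and quantization metadata add to these numbers, so treat them as lower bounds.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 4.0e9  # nominal "4B" from the card

for label, bits in [("F16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gib(N_PARAMS, bits):.2f} GiB")
# F16: 7.45 GiB
# INT8: 3.73 GiB
# 4-bit: 1.86 GiB
```

At roughly 7.5 GiB for F16 weights, an 8 GB card has almost no room left for context, which is why the table marks it as very tight.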

Quantized versions

Planned/recommended export formats:

Format	Status	Expected use
F16 safetensors	🟢 Current	Full merged model, best source for conversion.
AWQ 4-bit	🟡 Planned	Better for GPU/server inference, mainly CUDA/Linux or compatible runtimes.
OpenVINO INT4 / AWQ-style compression	🟢 Planned for Intel Arc	Recommended path for Intel Arc/OpenVINO.
GGUF Q5_K_M / Q6_K / Q8_0	🟡 Planned	Recommended for LM Studio, llama.cpp, Ollama, CPU/GPU mixed inference.

Practical recommendation

For this repository, use the current F16 safetensors model as the master model.

For actual local use:

RTX 3080 Ti 12 GB or better: F16 may work, but quantized versions are preferred.
RTX 3090 24 GB: F16 and quantization workflows are much more comfortable.
Intel Arc: convert this model to OpenVINO INT4 instead of using CUDA-focused AWQ.
Low VRAM systems: wait for GGUF or INT4 builds.


Model tree for JoaoZaokk/Qwen3-4B-Thinking-2507-Heretic-CodeFeedback

Base model: Qwen/Qwen3-4B-Thinking-2507
Finetuned: unsloth/Qwen3-4B-Thinking-2507
Finetuned: JoaoZaokk/Qwen3-4B-Thinking-2507-MiniMax-M2.1-Distill-heretic
Finetuned (1): this model
Adapters: 2 models
Quantizations: 2 models
