Qwen3-LABD-GRPO Series (Self-Correcting Coding Agents)
This model card covers the series of models trained for the Loop-Driven Agentic Behavior Distillation (LABD) graduation project. These models are specifically fine-tuned to function as autonomous coding agents capable of iterative self-correction using execution feedback.
Model Summary
Qwen3-4B-LABD-GRPO is part of a scaling sweep (0.6B to 8B) designed to bridge the "Reasoning Cliff" in Small Language Models (SLMs). Standard models often fail to recover after an initial incorrect code generation; this model is trained to treat execution errors as signals for repair.
Key Capabilities
- Closed-Loop Reasoning: Structures output using <think>, <execute>, and <feedback> tags.
- Autonomous Repair: Analyzes tracebacks and logical assertion failures to generate revised code.
- Scaling Efficiency: Leverages pre-learned agentic structures to improve recovery rates.
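To make the tagged output format concrete, here is a minimal sketch of how a caller might split a response into its segments. The tag names come from the list above; the use of matching closing tags (`</think>` and so on), the `parse_segments` helper, and the sample text are assumptions made for illustration, not the model's documented format.

```python
import re

# Tag names are taken from the model card; the closing-tag form is an assumption.
TAG_PATTERN = re.compile(r"<(think|execute|feedback)>(.*?)</\1>", re.DOTALL)


def parse_segments(text):
    """Return (tag, content) pairs in the order they appear in the text."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(text)]


# Hand-written sample, not real model output.
sample = (
    "<think>Sum the list with the built-in.</think>\n"
    "<execute>print(sum([1, 2, 3]))</execute>\n"
    "<feedback>6</feedback>"
)

segments = parse_segments(sample)
for tag, content in segments:
    print(f"{tag}: {content}")
```

A caller would typically run the content of each `execute` segment and pass the interpreter output back to the model as the next `feedback` segment.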
Training Procedure
The training of this series followed a rigorous two-stage post-training recipe:
Stage 1: Loop-Driven Agentic Behavior Distillation (LABD)
We first initialized the model with the structure of self-correction. Using Failure-Induced Trajectory Generation, we distilled trajectories in which a weak student model failed and a strong teacher repaired the code. This taught the model how to behave in a loop (Plan → Execute → Observe → Recover) rather than only what the final answer should be.
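As an illustration of what one failure-induced trajectory could look like as a training sample, the sketch below joins a failed student attempt, the execution feedback, and the teacher's repair into a single tagged string. The `RepairTrajectory` fields, the `to_training_text` helper, and the exact layout are hypothetical; the actual dataset schema used in the project is not described in this card.

```python
from dataclasses import dataclass


@dataclass
class RepairTrajectory:
    """Hypothetical record for one student-fails, teacher-repairs episode."""
    task: str
    student_code: str    # failed attempt from the weak student
    error_feedback: str  # what the interpreter reported for that attempt
    teacher_plan: str    # teacher's reasoning about the failure
    teacher_code: str    # repaired code from the strong teacher


def to_training_text(t):
    """Lay the episode out as Plan -> Execute -> Observe -> Recover."""
    return "\n".join([
        f"<think>Attempt the task: {t.task}</think>",
        f"<execute>{t.student_code}</execute>",
        f"<feedback>{t.error_feedback}</feedback>",
        f"<think>{t.teacher_plan}</think>",
        f"<execute>{t.teacher_code}</execute>",
    ])


# Made-up example episode.
example = RepairTrajectory(
    task="Return the last element of a list.",
    student_code="def last(xs): return xs[len(xs)]",
    error_feedback="IndexError: list index out of range",
    teacher_plan="The index is off by one; use -1 instead.",
    teacher_code="def last(xs): return xs[-1]",
)
training_text = to_training_text(example)
print(training_text)
```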
Stage 2: Group Relative Policy Optimization (GRPO)
To ground the behavioral structure in functional correctness, we applied GRPO. Unlike PPO-based RLHF, which depends on a learned value model, GRPO normalizes rewards within a group of outputs sampled for the same prompt.
- Verifiable Rewards: The model received a reward of +3.0 for passing unit tests, a penalty of -1.0 for malformed code, and a penalty of -2.0 for hallucinated feedback.
- Optimization: Training was performed using LoRA on a single GPU (NVIDIA L4 or L40S).
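The reward values and the group normalization described above can be sketched in a few lines. The three constants are the values listed in this card; how the terms combine (a plain sum here), the `score` and `group_relative_advantages` helpers, and the example group are assumptions. The normalization follows the usual GRPO form, where each reward is centered on the group mean and divided by the group standard deviation.

```python
import statistics

# Reward values as listed in the model card.
REWARD_PASS = 3.0
PENALTY_MALFORMED = -1.0
PENALTY_HALLUCINATED = -2.0


def score(passed_tests, malformed, hallucinated_feedback):
    """Sum the terms that apply to one sampled output (summing is an assumption)."""
    total = 0.0
    if passed_tests:
        total += REWARD_PASS
    if malformed:
        total += PENALTY_MALFORMED
    if hallucinated_feedback:
        total += PENALTY_HALLUCINATED
    return total


def group_relative_advantages(rewards, eps=1e-6):
    """Center rewards on the group mean and scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One made-up group of four outputs sampled for the same prompt.
rewards = [
    score(True, False, False),   # passes the unit tests
    score(False, True, False),   # malformed code
    score(False, False, True),   # hallucinated feedback
    score(True, False, False),   # passes the unit tests
]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to judge a sample.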
Intended Use
- Agentic Workflows: Best suited for environments where the model can interact with a Python interpreter.
- Research: Ideal for studying self-correction, reinforcement learning, and the scaling laws of agentic behavior.
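A minimal sketch of the kind of interpreter loop this model is meant to sit inside is shown below. The `run_python` and `repair_loop` helpers are hypothetical, and `stub_generate` is a stand-in for a real call to the model so that the example runs without downloading weights. Code is executed in a fresh interpreter through `subprocess.run` with no sandboxing, which is an assumption made for brevity; untrusted model output should be run in an isolated environment.

```python
import subprocess
import sys


def run_python(code, timeout=10):
    """Execute code in a fresh interpreter and return (ok, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"Timed out after {timeout} s"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def repair_loop(generate, task, max_iters=3):
    """Ask for code, run it, and feed errors back until it succeeds."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = generate(task, feedback)
        ok, output = run_python(code)
        if ok:
            return attempt, code
        feedback = output  # the traceback becomes the next observation
    return None, None


def stub_generate(task, feedback):
    """Hypothetical stand-in for the model: wrong first, fixed after feedback."""
    if feedback is None:
        return "assert 1 + 1 == 3"
    return "assert 1 + 1 == 2"


attempts, final_code = repair_loop(stub_generate, "check addition")
print(attempts, final_code)
```

In a real deployment `generate` would format the task and the previous feedback into a chat prompt and return the code extracted from the model's response.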
Limitations and Bias
- Capacity Threshold: Models below 4B parameters may show the correct "behavior" (trying to fix code) but may lack the raw algorithmic knowledge to succeed in the final repair.
- Python-Centric: Optimization was focused on Python; performance in other languages is not guaranteed.
Performance: Qwen3-4B
The 4B model marks the "Phase Transition" where agentic loops become a net positive over single-pass base models.
- MBPP Iter-3: 72.40%
- HumanEval Iter-3: 82.32% (+20.3 percentage points over base Qwen3-4B)
- Observation: At 4B parameters and above, the model has sufficient representational capacity to fully exploit the LABD training.
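The card does not define "Iter-3". One plausible reading, assumed here, is the percentage of problems solved within three generate-and-repair iterations. Under that assumption the metric could be computed as follows; the `iter_k_rate` helper and the sample records are made up for illustration and are not the project's evaluation code or data.

```python
def iter_k_rate(first_success_iteration, k):
    """Percentage of problems solved within k iterations.

    Each entry is the 1-based iteration at which the problem first passed
    its tests, or None if it never passed.
    """
    solved = sum(
        1 for it in first_success_iteration if it is not None and it <= k
    )
    return 100.0 * solved / len(first_success_iteration)


# Made-up records for five problems.
records = [1, 3, None, 2, 4]
rate = iter_k_rate(records, 3)
print(f"Iter-3: {rate:.2f}%")
```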
Citation
@article{eldegwy2026labd,
title={Loop-Driven Agentic Behavior Distillation for Self-Correcting Code Generation},
author={Moaz Eldegwy},
year={2026},
journal={Graduation Project: Self-Correction Agent in Coding}
}
Base model: moazeldegwy/Qwen3-4B-LABD