Vision critique trigger: Use exactly "Critique this UI design." when sending a screenshot. The model learned this specific phrase during training; other phrasings may not reliably activate the critique behavior.
The Problem
Base models are RLHF-tuned to be immediately helpful: they start building right away, however vague the request is. You can't fix this with a system prompt. It has to be trained into the weights.
1/10 → 9/10 on qualifying questions. Of 10 vague test prompts, 9 triggered clarifying questions from the fine-tuned 4B model, versus only 1 from the base model.
Before / After
Left: base Qwen3-VL-8B ignores the brand name and defaults to blue. Right: fine-tuned model applies FitTrack branding and green accent across every interactive element.
Base model scores 5/10 → fine-tuned scores 6/10 (+1.0). Notice the improved typography hierarchy, gradient background, and "MOST POPULAR" badge treatment.
Live demo & full dataset explorer: qwen.data-analytics.space
4B Lite vs 8B Expert
| | 4B Lite | 8B Expert |
|---|---|---|
| GPU requirement | 8GB | 12GB |
| Q4_K_M size | 2.4 GB | 4.7 GB |
| Qualifying questions | 9/10 | 10/10 |
| Token accuracy | 92.5% | 98.1% |
| Complex layouts | May truncate | Handles cleanly |
| Speed | Faster | Slightly slower |
Choose 4B if you have an 8GB GPU or want faster inference. Choose 8B for maximum accuracy on complex multi-component layouts.
Training Pipeline
Same dataset and methodology as the 8B version. Teacher-student distillation:
- Qwen3.6-27B generates HTML components from natural language prompts
- Playwright renders each to desktop (1280×900) and mobile (390×844) screenshots
- GPT-5.4 critiques and rewrites with expert improvements: WCAG contrast, hover states, color consistency
- Training pairs: [screenshot + original HTML + critique] → [expert improved HTML]
BF16 (not 4-bit) was used for 4B training because, with fewer parameters, the model needs cleaner gradients to absorb the signal.
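One way to picture a single training record from the pipeline above is as a small data structure. This is only a sketch: the class and field names (`TrainingPair`, `screenshot_path`, and so on) are hypothetical and are not taken from the published dataset's schema, and the sample values are placeholders.

```python
from dataclasses import dataclass

# Viewport sizes quoted in the pipeline description (width, height).
DESKTOP_VIEWPORT = (1280, 900)
MOBILE_VIEWPORT = (390, 844)


@dataclass
class TrainingPair:
    """One distillation example. All field names are hypothetical."""

    screenshot_path: str  # rendered screenshot of the original component
    original_html: str    # HTML produced by the generator model
    critique: str         # teacher's critique of the rendered result
    improved_html: str    # expert rewrite, used as the training target

    def model_input(self) -> dict:
        """What the student sees: screenshot + original HTML + critique."""
        return {
            "image": self.screenshot_path,
            "text": f"{self.original_html}\n\n{self.critique}",
        }

    def model_target(self) -> str:
        """What the student is trained to produce."""
        return self.improved_html


# Placeholder record, for illustration only.
pair = TrainingPair(
    screenshot_path="navbar_desktop.png",
    original_html="<nav>...</nav>",
    critique="Link contrast is too low against the background.",
    improved_html='<nav style="color:#1a1a1a">...</nav>',
)
```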
Validated Behaviors
| Test | Base 8B | Fine-tuned 4B | Fine-tuned 8B |
|---|---|---|---|
| Qualifying questions (10 vague) | 1/10 | 9/10 | 10/10 |
| Vision critique | Vague | px + contrast | px + hex + WCAG |
| Clean HTML output | Verbose | 0 wrapper chars | 0 wrapper chars |
| GPU requirement | 12GB | 8GB | 12GB |
| Model size (Q4) | 4.7GB | 2.4GB | 4.7GB |
Head-to-Head Design Quality (8B reference)
Head-to-head test run on the 8B Expert model: base Qwen3-VL-8B vs fine-tuned, same 10 prompts, same hardware (RTX 3080 Ti 12GB), GPT-5.4 judge using the same critique rubric as training. Both models share the same training dataset and approach.
| Component | Category | Base | Fine-tuned 8B | Delta |
|---|---|---|---|---|
| Login form (dark) | Form | 5 | 6.5 | +1.5 |
| Checkout form (light) | Form | 5 | 5 | 0 |
| Pricing card (dark) | Card | 5 | 6 | +1 |
| Product card (light) | Card | 5 | 5 | 0 |
| Top navbar (light) | Navbar | 4 | 4 | 0 |
| Sidebar nav (dark) | Navbar | 4 | 3 | -1 |
| Mobile bottom sheet (dark) | Mobile | 1 | 6 | +5 |
| Transaction list (light) | Mobile | 5 | 6.5 | +1.5 |
| CTA section (dark) | Marketing | 6 | 6.5 | +0.5 |
| Invoice table (light) | Data | 5 | 6.5 | +1.5 |
| Average | | 4.50 | 5.50 | +1.00 |
- Fine-tuned wins: 6/10 components
- Tied: 3/10
- Base wins: 1/10 (dark sidebar nav only)
- Biggest improvement: mobile dark bottom sheet +5 (base scored 1, fine-tuned scored 6)
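The averages and the win/tie/loss tally can be re-derived from the per-component scores. The short script below is a sketch of that arithmetic, with the scores copied from the table above; the function and variable names are illustrative.

```python
# (component, base score, fine-tuned 8B score), copied from the table above.
SCORES = [
    ("Login form (dark)", 5, 6.5),
    ("Checkout form (light)", 5, 5),
    ("Pricing card (dark)", 5, 6),
    ("Product card (light)", 5, 5),
    ("Top navbar (light)", 4, 4),
    ("Sidebar nav (dark)", 4, 3),
    ("Mobile bottom sheet (dark)", 1, 6),
    ("Transaction list (light)", 5, 6.5),
    ("CTA section (dark)", 6, 6.5),
    ("Invoice table (light)", 5, 6.5),
]


def summarize(scores):
    """Return averages, the average delta, and win/tie/loss counts."""
    n = len(scores)
    base_avg = sum(base for _, base, _ in scores) / n
    tuned_avg = sum(tuned for _, _, tuned in scores) / n
    return {
        "base_avg": base_avg,
        "tuned_avg": tuned_avg,
        "delta": tuned_avg - base_avg,
        "wins": sum(1 for _, base, tuned in scores if tuned > base),
        "ties": sum(1 for _, base, tuned in scores if tuned == base),
        "losses": sum(1 for _, base, tuned in scores if tuned < base),
    }


summary = summarize(SCORES)
print(summary)
```

The result matches the figures reported above: averages of 4.50 and 5.50, with 6 wins, 3 ties, and 1 loss for the fine-tuned model.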
Note: Scores reflect first-pass generation without the improvement step. The model was trained on critique+improvement pairs, so ask it to critique and improve its own output for higher-quality results.
Thinking mode: Always disable thinking mode in your inference server. Add `"chat_template_kwargs": {"enable_thinking": false}` to API requests, or use the `--no-think` flag with llama-server.
Quick Start
Text-only (Ollama)
ollama pull stefans71/frontend-design-lite-4b
ollama run stefans71/frontend-design-lite-4b \
"make me a navbar for my bakery called Sunrise Breads, warm colors, light theme"
Vision + Text (llama-server)
Ollama does not currently support separate mmproj files for vision. Use llama-server:
llama-server \
-m frontend-design-lite-Q4_K_M.gguf \
--mmproj mmproj-Qwen3VL-4B-Instruct-F16.gguf \
-c 8192 \
--host 0.0.0.0 \
--port 8080
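With the server above running, a critique request could be built along the following lines. This is a minimal sketch, not a reference client. It assumes a PNG screenshot, the OpenAI-compatible `/v1/chat/completions` route on port 8080, and that the server honors the `chat_template_kwargs` field from the thinking-mode note. The helper names and the `max_tokens` default are illustrative.

```python
import base64
import json
import urllib.request

TRIGGER = "Critique this UI design."  # exact phrase from this model card


def build_critique_payload(image_bytes: bytes, max_tokens: int = 2048) -> dict:
    """Build a chat-completions body: trigger phrase + screenshot, thinking off."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": TRIGGER},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": max_tokens,  # assumed default; raise it for long outputs
        "chat_template_kwargs": {"enable_thinking": False},
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST the payload to the local server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To use it, read a screenshot with `open("screenshot.png", "rb")`, pass the bytes to `build_critique_payload`, send the result with `post_chat`, and read the critique from `reply["choices"][0]["message"]["content"]`.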
Vision critique trigger: Use exactly "Critique this UI design." since the model learned this phrase during training. Other phrasings may not reliably activate the behavior.
Files
| File | Size | Use |
|---|---|---|
| frontend-design-lite-Q4_K_M.gguf | 2.4 GB | Primary, for 8GB GPUs |
| frontend-design-lite-Q3_K_M.gguf | 2.0 GB | Tight 8GB setups, leaves more room for KV cache |
| mmproj-Qwen3VL-4B-Instruct-F16.gguf | 0.8 GB | Vision encoder, required for screenshot input |
Total for vision inference: ~3.2GB, which leaves ~4.8GB for KV cache on an 8GB GPU.
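The memory budget is simple addition, sketched below for both quantizations. File sizes come from the table above, and the helper name is illustrative. The calculation deliberately ignores runtime overhead such as compute buffers, so real headroom will be somewhat smaller.

```python
# File sizes in GB, taken from the Files table.
MODEL_FILES_GB = {
    "frontend-design-lite-Q4_K_M.gguf": 2.4,
    "frontend-design-lite-Q3_K_M.gguf": 2.0,
}
MMPROJ_GB = 0.8  # mmproj-Qwen3VL-4B-Instruct-F16.gguf


def vision_budget_gb(model_file: str, gpu_gb: float = 8.0) -> tuple:
    """Return (weights total, remaining headroom) in GB, rounded to 0.1."""
    weights = MODEL_FILES_GB[model_file] + MMPROJ_GB
    return round(weights, 1), round(gpu_gb - weights, 1)


for name in MODEL_FILES_GB:
    total, headroom = vision_budget_gb(name)
    print(f"{name}: {total} GB weights, {headroom} GB left on an 8 GB GPU")
```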
Training Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-4B-Instruct |
| Method | BF16 LoRA (rank 32), no quantization during training |
| Dataset | 3,090 records (stefans71/frontend-design-dataset) |
| Hardware | NVIDIA RTX 5090 (32GB) |
| Training time | 53 minutes |
| Final loss | 0.325 |
| Token accuracy | 92.5% |
| Epochs | 2 |
Limitations
- Vision critique requires the exact phrase "Critique this UI design."; other phrasings may trigger thinking-mode EOS
- Ollama does not currently support separate mmproj files; use llama-server for vision tasks
- May truncate complex HTML outputs more than 8B; increase `max_tokens` for full-page builds
- Generated HTML uses inline CSS only (no Tailwind CDN), which is intentional for offline compatibility
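For the truncation limitation, a client can check whether a reply was cut off before deciding to retry with a larger `max_tokens`. The sketch below assumes an OpenAI-compatible response, where `finish_reason` is `"length"` when the token limit was hit. Both helper names are illustrative, and the `</html>` check is a rough heuristic that only applies to full-page output.

```python
def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    return response["choices"][0].get("finish_reason") == "length"


def looks_complete(html: str) -> bool:
    """Rough heuristic: a full page should end with a closing </html> tag."""
    return html.rstrip().lower().endswith("</html>")


# Hand-written example replies, for illustration only.
cut_off = {"choices": [{"finish_reason": "length",
                        "message": {"content": "<html><body><div>"}}]}
finished = {"choices": [{"finish_reason": "stop",
                         "message": {"content": "<html><body></body></html>\n"}}]}

for reply in (cut_off, finished):
    text = reply["choices"][0]["message"]["content"]
    print(was_truncated(reply), looks_complete(text))
```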
Related
- stefans71/frontend-design-expert-8b: 8B Expert for 12GB GPUs
- stefans71/frontend-design-dataset: training pipeline
- Base model: Qwen/Qwen3-VL-4B-Instruct