How to use reach-vb/Qwen3-0.6B with libraries and local apps.

Libraries

How to use reach-vb/Qwen3-0.6B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="reach-vb/Qwen3-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("reach-vb/Qwen3-0.6B")
model = AutoModelForCausalLM.from_pretrained("reach-vb/Qwen3-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use reach-vb/Qwen3-0.6B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "reach-vb/Qwen3-0.6B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reach-vb/Qwen3-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
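
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is reachable at `http://localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the vLLM server from the commands above is listening locally.
BASE_URL = "http://localhost:8000/v1"
MODEL = "reach-vb/Qwen3-0.6B"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` should print the model's answer. Building the request separately from sending it keeps the payload easy to inspect without a live server.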

Use Docker

docker model run hf.co/reach-vb/Qwen3-0.6B

SGLang

How to use reach-vb/Qwen3-0.6B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "reach-vb/Qwen3-0.6B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reach-vb/Qwen3-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "reach-vb/Qwen3-0.6B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reach-vb/Qwen3-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use reach-vb/Qwen3-0.6B with Docker Model Runner:
```
docker model run hf.co/reach-vb/Qwen3-0.6B
```

Model Summary and Vibe Checks!

by reach-vb - opened Jul 17, 2025

Qwen3-0.6B

Overview

Type: Causal Language Model
Parameters: 0.6B (0.44B non-embedding)
Layers: 28
Attention Heads (GQA): 16 (Q), 8 (KV)
Context Length: 32,768
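
To make the GQA figures concrete, here is a rough back-of-the-envelope sketch of the KV cache size at full context. The layer count, head counts, and context length come from the list above; the per-head dimension (128) and the 16-bit cache width are assumptions not stated in this post, so check them against the model's config.json before relying on the numbers.

```python
# Values from the Overview above.
NUM_LAYERS = 28
NUM_Q_HEADS = 16
NUM_KV_HEADS = 8
CONTEXT_LENGTH = 32_768

# Assumptions (not stated above): per-head dimension and cache dtype width.
HEAD_DIM = 128         # assumed; verify in the model's config.json
BYTES_PER_VALUE = 2    # assumed 16-bit cache, e.g. bfloat16


def kv_cache_bytes(num_tokens, kv_heads=NUM_KV_HEADS):
    """Bytes needed to cache keys and values for num_tokens of one sequence."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


queries_per_kv_head = NUM_Q_HEADS // NUM_KV_HEADS
gqa_gib = kv_cache_bytes(CONTEXT_LENGTH) / 2**30
mha_gib = kv_cache_bytes(CONTEXT_LENGTH, kv_heads=NUM_Q_HEADS) / 2**30

print(f"query heads per KV head: {queries_per_kv_head}")
print(f"KV cache at full context with GQA: {gqa_gib:.2f} GiB")
print(f"KV cache if every query head had its own KV head: {mha_gib:.2f} GiB")
```

Under these assumptions, sharing each KV head between two query heads halves the cache relative to one KV head per query head.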

Key Features

Dual Modes: Seamlessly switches between thinking (complex reasoning, math, coding) and non-thinking (efficient dialogue) modes.
Enhanced Reasoning: Surpasses previous Qwen models in mathematics, code generation, and logical reasoning.
Multilingual Support: Supports 100+ languages with strong multilingual instruction following and translation capabilities.
Agent Capabilities: Precise integration with external tools in both thinking and non-thinking modes.

Comparisons

In reasoning tasks, outperforms QwQ (in thinking mode) and the Qwen2.5 instruct models (in non-thinking mode).
Superior human preference alignment compared to previous models.

Quickstart

Installation:
Ensure transformers>=4.51.0 is installed.

Code Snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Upstream checkpoint; pass "reach-vb/Qwen3-0.6B" instead to load this repository.
model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Build the chat prompt; enable_thinking=True lets the model reason before it answers.
prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then keep only the newly generated tokens (drop the prompt tokens).
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# 151668 is the token ID of </think>; split just after its last occurrence.
# If it is absent, treat the whole output as the final answer.
index = len(output_ids) - output_ids[::-1].index(151668) if 151668 in output_ids else 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
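
The index arithmetic in the snippet above can be isolated into a small helper that is easy to check without loading the model. This is a sketch only: `split_thinking` is a hypothetical name, and the token ID 151668 is taken from the snippet rather than looked up from the tokenizer.

```python
END_THINK_ID = 151668  # token ID the snippet above uses to mark the end of the thinking block


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Return (thinking_ids, answer_ids), splitting just after the last end marker."""
    try:
        # Search from the end so an earlier stray marker does not cut the block short.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0  # no marker found: treat the whole output as the answer
    return output_ids[:index], output_ids[index:]


thinking_ids, answer_ids = split_thinking([11, 12, END_THINK_ID, 21, 22])
print(thinking_ids)  # [11, 12, 151668]
print(answer_ids)    # [21, 22]
```

Each half can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the original snippet.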

Deployment:

SGLang: python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
vLLM: vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1

Local Use: Supported by Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers.

Thinking vs. Non-Thinking Modes

Thinking Mode (enable_thinking=True): Default mode for reasoning tasks. Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
Non-Thinking Mode (enable_thinking=False): Efficient for general-purpose dialogue. Use Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
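
One way to keep the two presets straight is to store them in a lookup table and expand them into keyword arguments for `model.generate`. The sketch below assumes the Transformers sampling arguments `temperature`, `top_p`, `top_k`, `min_p`, and `do_sample`; the helper name `generation_kwargs` is illustrative.

```python
# Recommended sampling settings from the list above, keyed by enable_thinking.
SAMPLING_PRESETS = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def generation_kwargs(enable_thinking, max_new_tokens=32768):
    """Return keyword arguments for model.generate matching the chosen mode."""
    kwargs = dict(SAMPLING_PRESETS[bool(enable_thinking)])  # copy, do not mutate the preset
    kwargs["do_sample"] = True  # the sampling parameters only take effect when sampling
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs


print(generation_kwargs(enable_thinking=True))
```

In the Quickstart snippet this would be used as `model.generate(**model_inputs, **generation_kwargs(True))`, with the same `enable_thinking` value passed to `apply_chat_template`.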

Switching Modes:
Add /think or /no_think to user prompts for dynamic mode switching in multi-turn conversations.
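
As a minimal illustration of the per-turn switch, the hypothetical helper below appends the tag to a user prompt before it is added to the message list. Placing the tag at the end of the prompt is an assumption made for this sketch.

```python
def with_mode_switch(prompt, thinking):
    """Append /think or /no_think to a user prompt to request a mode for this turn."""
    tag = "/think" if thinking else "/no_think"
    return f"{prompt} {tag}"


messages = [
    {"role": "user", "content": with_mode_switch("What is the capital of France?", thinking=False)},
]
print(messages[0]["content"])  # What is the capital of France? /no_think
```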

Agentic Use

Qwen-Agent is recommended for tool integration. Example:

from qwen_agent.agents import Assistant

llm_cfg = {'model': 'Qwen3-0.6B', 'model_server': 'http://localhost:8000/v1', 'api_key': 'EMPTY'}
tools = [{'mcpServers': {'time': {'command': 'uvx', 'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']}}}, 'code_interpreter']
bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user', 'content': 'Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Best Practices

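Sampling Parameters: Use the settings listed under Thinking vs. Non-Thinking Modes for whichever mode is active.
Output Length: Allow 32,768 output tokens for most queries and 38,912 for complex problems.
Standardize Output: For math problems, add a prompt such as "Please reason step by step, and put your final answer within \boxed{}."; for multiple-choice questions, ask the model to return only the choice letter in a fixed field, for example "answer": "C".
History Management: In multi-turn conversations, keep only the final answers in the history and exclude the thinking content.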
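
The History Management item can be sketched as a small filter applied before an assistant reply is stored. This assumes the reply is plain text in which the reasoning is wrapped in `<think>...</think>` tags; the helper names are hypothetical.

```python
import re

# Assumption: thinking content appears in the decoded text as <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text):
    """Remove any <think>...</think> block and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def append_assistant_turn(history, raw_reply):
    """Return a new history with the reply added, minus its thinking content."""
    return history + [{"role": "assistant", "content": strip_thinking(raw_reply)}]


history = [{"role": "user", "content": "What is the capital of France?"}]
history = append_assistant_turn(history, "<think>\nEasy question.\n</think>\n\nParis.")
print(history[-1]["content"])  # Paris.
```

Returning a new list rather than mutating the input keeps earlier snapshots of the conversation intact, which can help when retrying a turn.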

Resources

Blog: Qwen3 Blog
GitHub: Qwen3 GitHub
Documentation: Qwen Documentation
