Model Summary and Vibe Checks!

#4
by reach-vb - opened

Qwen3-0.6B

Overview

  • Type: Causal Language Model
  • Parameters: 0.6B (0.44B non-embedding)
  • Layers: 28
  • Attention Heads (GQA): 16 (Q), 8 (KV)
  • Context Length: 32,768

Key Features

  • Dual Modes: Seamlessly switches between thinking (complex reasoning, math, coding) and non-thinking (efficient dialogue) modes.
  • Enhanced Reasoning: Surpasses previous Qwen models in mathematics, code generation, and logical reasoning.
  • Multilingual Support: Supports 100+ languages with strong multilingual instruction following and translation capabilities.
  • Agent Capabilities: Precise integration with external tools in both thinking and non-thinking modes.

Comparisons

  • Outperforms QwQ and Qwen2.5 instruct models in reasoning tasks.
  • Superior human preference alignment compared to previous models.

Quickstart

Installation:
Ensure transformers>=4.51.0 is installed.

Code Snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then keep only the newly generated tokens (drop the prompt tokens).
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split just after the last occurrence of token 151668 (</think>); index stays 0 if no thinking block was produced.
index = len(output_ids) - output_ids[::-1].index(151668) if 151668 in output_ids else 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
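
If you want to reuse the split between thinking and answer tokens, it can be pulled into a small helper. This is only a sketch: `split_thinking` and `END_THINK_ID` are names made up for illustration, and the id 151668 is taken from the snippet above, where it marks the end of the thinking block.

```python
# Token id the snippet above uses to find the end of the thinking block.
END_THINK_ID = 151668


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Return (thinking_ids, answer_ids), split just after the last end-of-thinking token."""
    if end_think_id not in output_ids:
        # No thinking block was generated, so everything is the answer.
        return [], list(output_ids)
    # Position just past the last occurrence of the marker.
    split_at = len(output_ids) - output_ids[::-1].index(end_think_id)
    return list(output_ids[:split_at]), list(output_ids[split_at:])
```

Each half can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` exactly as in the snippet.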

Deployment:

  • SGLang: python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
  • vLLM: vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
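
Both servers expose an OpenAI-compatible HTTP API, so a request can be sketched with the standard library alone. Treat this as a rough example rather than the official client: the base URL is an assumption (use whatever host and port your server actually listens on), and `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server is reachable here; adjust host and port to your setup.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=1024):
    """Build the JSON payload for a chat completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send_chat_request(build_chat_request("Hello"))` should return the usual chat completion JSON.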

Local Use: Supported by Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers.

Thinking vs. Non-Thinking Modes

  • Thinking Mode (enable_thinking=True): The default; suited to reasoning tasks. Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0.
  • Non-Thinking Mode (enable_thinking=False): More efficient for general-purpose dialogue. Use Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
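
One way to keep the two sets of sampling settings straight is a small lookup keyed by mode. A minimal sketch, assuming you call `model.generate` from transformers as in the Quickstart; `SAMPLING_PARAMS` and `generation_kwargs` are illustrative names, not part of any library.

```python
# Recommended sampling settings from the list above, keyed by mode.
SAMPLING_PARAMS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}


def generation_kwargs(enable_thinking, max_new_tokens=32768):
    """Return keyword arguments for model.generate() matching the chosen mode."""
    mode = "thinking" if enable_thinking else "non_thinking"
    # do_sample=True is needed for temperature/top_p/top_k/min_p to take effect.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **SAMPLING_PARAMS[mode]}
```

Usage would look like `model.generate(**model_inputs, **generation_kwargs(enable_thinking=True))`, with the same `enable_thinking` value passed to `apply_chat_template`.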

Switching Modes:
Add /think or /no_think to user prompts for dynamic mode switching in multi-turn conversations.
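
As a rough illustration, the switch can be appended to a user turn with a tiny helper. `with_mode_switch` is a made-up name, and putting the tag at the end of the message is just one reasonable placement.

```python
def with_mode_switch(user_text, thinking):
    """Append /think or /no_think to a user message."""
    tag = "/think" if thinking else "/no_think"
    return f"{user_text.rstrip()} {tag}"


# Example multi-turn history that turns thinking off for the second question.
messages = [
    {"role": "user", "content": with_mode_switch("How many r's are in 'strawberry'?", thinking=True)},
    {"role": "assistant", "content": "There are three."},
    {"role": "user", "content": with_mode_switch("And in 'blueberry'?", thinking=False)},
]
```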

Agentic Use

Qwen-Agent is recommended for tool integration. Example:

from qwen_agent.agents import Assistant

llm_cfg = {'model': 'Qwen3-0.6B', 'model_server': 'http://localhost:8000/v1', 'api_key': 'EMPTY'}
tools = [{'mcpServers': {'time': {'command': 'uvx', 'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']}}}, 'code_interpreter']
bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{'role': 'user', 'content': 'Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)

Best Practices

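  1. Sampling Parameters: Use the mode-specific settings listed above (thinking vs. non-thinking).
  2. Output Length: Allow up to 32,768 output tokens for most queries, and 38,912 for highly complex problems such as competition-level math and programming.
  3. Standardize Output: For math problems, add "Please reason step by step, and put your final answer within \boxed{}." to the prompt. For multiple-choice questions, ask for the choice letter only in an "answer" field, e.g. "answer": "C".
  4. History Management: In multi-turn conversations, keep only the final answer in the history and leave out the thinking content.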
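
For point 4, here is a minimal sketch of stripping the thinking block from a reply before storing it in the history. It assumes the decoded reply wraps its reasoning in `<think>...</think>` tags; `strip_thinking` and `append_assistant_turn` are hypothetical helper names.

```python
import re

# Assumption: thinking content is wrapped in <think>...</think> in the decoded text.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text):
    """Remove any <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


def append_assistant_turn(history, raw_reply):
    """Append the assistant reply to the history without its thinking content."""
    history.append({"role": "assistant", "content": strip_thinking(raw_reply)})
    return history
```

If you already split thinking and answer at the token level, as in the Quickstart snippet, storing only `content` in the history achieves the same thing.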
