Qwen3-1.6B-A0.9B
This is a tiny version of Qwen/Qwen3-30B-A3B created for testing and development.
Model Details
- Base Model: Qwen/Qwen3-30B-A3B
- Architecture: qwen3_moe (Mixture of Experts)
- Total Parameters: 1.57B
- Activated Parameters: ~0.9B (50% MoE activation)
Configuration Changes
Only the layer count and the expert count were reduced from the original model; the other parameters are listed for reference and are unchanged:
| Parameter | Original | Tiny |
|---|---|---|
| num_hidden_layers | 48 | 10 |
| num_experts | 128 | 16 |
| num_experts_per_tok | 8 | 8 |
| hidden_size | 2048 | 2048 |
| intermediate_size | 6144 | 6144 |
| moe_intermediate_size | 768 | 768 |
| num_attention_heads | 32 | 32 |
| num_key_value_heads | 4 | 4 |
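As a rough cross-check of the parameter figures, the sketch below counts weights from the Tiny column. Several inputs are assumptions that do not appear in the table: a vocabulary of 151,936 tokens, a head_dim of 128, per-head q/k norms, untied input and output embeddings, and bias-free projections. The "active" figure also leaves out the input embedding lookup, which is only a guess at how ~0.9B was derived.

```python
def count_params(layers=10, experts=16, experts_per_tok=8, hidden=2048,
                 moe_inter=768, heads=32, kv_heads=4,
                 head_dim=128, vocab=151_936):
    # head_dim and vocab are assumptions, not values from the table.
    attn = hidden * heads * head_dim          # q_proj
    attn += 2 * hidden * kv_heads * head_dim  # k_proj and v_proj
    attn += heads * head_dim * hidden         # o_proj
    norms = 2 * hidden + 2 * head_dim         # two layer norms + q/k norms (assumed)
    router = hidden * experts                 # MoE routing gate
    expert = 3 * hidden * moe_inter           # gate_proj, up_proj, down_proj
    shared = layers * (attn + norms + router) + hidden  # + final norm
    embed = vocab * hidden
    # Total counts every expert plus input embeddings and an untied lm_head.
    total = shared + layers * experts * expert + 2 * embed
    # Active counts only the routed experts and skips the input embedding lookup.
    active = shared + layers * experts_per_tok * expert + embed
    return total, active

total, active = count_params()
print(f"total:  {total / 1e9:.2f}B")
print(f"active: {active / 1e9:.2f}B")
```

Under these assumptions the total comes to about 1.57B and the active count to about 0.88B. Counting the input embedding table as active would raise the second figure to roughly 1.19B.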
Checkpoint Structure
The checkpoint is stored as a single model.safetensors file with individual expert weights matching the original Qwen3 structure. Each layer has 16 experts with separate gate_proj, up_proj, and down_proj weights per expert.
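To make the per-expert layout concrete, here is a minimal sketch that enumerates the expert tensor names one would expect in such a checkpoint. The key pattern `model.layers.<i>.mlp.experts.<j>.<proj>.weight` is an assumption based on the usual Hugging Face Qwen3 MoE naming and is not confirmed by this card; `expected_expert_keys` is a hypothetical helper.

```python
def expected_expert_keys(num_layers=10, num_experts=16):
    # Key pattern is an assumption, not taken from the checkpoint itself.
    projections = ("gate_proj", "up_proj", "down_proj")
    return [
        f"model.layers.{layer}.mlp.experts.{expert}.{proj}.weight"
        for layer in range(num_layers)
        for expert in range(num_experts)
        for proj in projections
    ]

keys = expected_expert_keys()
print(len(keys))   # 10 layers * 16 experts * 3 projections
print(keys[0])
```

The resulting list could be compared against the tensor names in model.safetensors, for example those returned by `keys()` on a file opened with `safetensors.safe_open`.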
Validation
The model was fine-tuned on a toy copypasta dataset and achieves:
- Perplexity: 1.0 (on validation text)
- Generation: Successfully generates coherent continuations
Example generation:
Input: "According to all known laws"
Output: "According to all known laws of aviation, there is no way a bee should be able to fly."
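For context on the perplexity figure, perplexity is the exponential of the mean negative log-likelihood per token, so a value of 1.0 means every scored token was predicted with probability close to 1. The sketch below illustrates the formula on hand-written log-probabilities; it is not the evaluation script used for this model, and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood over the scored tokens
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Every token predicted with probability 1 (log-prob 0).
print(perplexity([0.0, 0.0, 0.0]))

# Every token predicted with probability 0.5.
print(perplexity([math.log(0.5)] * 4))
```

The first call gives a perplexity of 1, the situation reported above; the second gives approximately 2, as expected for a coin-flip prediction at every step.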
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("inference-optimization/Qwen3-1.6B-A0.9B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("inference-optimization/Qwen3-1.6B-A0.9B")
input_ids = tokenizer("According to all known laws", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
Creation Process
This model was created using the llm-compressor create-tiny-model Claude skill:
- Configuration: Created with 10 layers and 16 experts (8 activated per token) to achieve ~1.6B total parameters with 50% MoE activation
- Initialization: Randomly initialized weights using transformers init_weights()
- Fine-tuning: Trained on famous internet copypastas until perplexity < 3.0
- Checkpoint Conversion: Converted from batched expert format to individual expert format to match original Qwen3 checkpoint structure
- Validation: Confirmed perplexity ~1.0 and successful text generation
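The checkpoint conversion step can be pictured with the sketch below, which splits a stacked expert tensor into one entry per expert. The batched key layout (`<prefix>.experts.<proj>` with experts stacked along the first axis) and the helper `split_batched_experts` are hypothetical; the actual batched format produced by transformers may differ, for example by fusing projections.

```python
def split_batched_experts(state_dict, prefix, num_experts,
                          projections=("gate_proj", "up_proj", "down_proj")):
    # Hypothetical batched layout: "<prefix>.experts.<proj>" holds a tensor
    # stacked along its first axis, one slice per expert.
    out = {}
    for name, value in state_dict.items():
        proj = name.rsplit(".", 1)[-1]
        if name.startswith(f"{prefix}.experts.") and proj in projections:
            for e in range(num_experts):
                out[f"{prefix}.experts.{e}.{proj}.weight"] = value[e]
        else:
            out[name] = value  # non-expert tensors pass through unchanged
    return out

# Toy demo with nested lists standing in for tensors.
batched = {
    "model.layers.0.mlp.experts.gate_proj": [[[1.0]], [[2.0]]],
    "model.layers.0.mlp.gate.weight": [[0.5]],
}
split = split_batched_experts(batched, "model.layers.0.mlp", num_experts=2)
print(sorted(split))
```

Because the function only indexes along the first axis, the same code would work on any tensor type that supports `value[e]`.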
Notes
- This model keeps the original's MoE architecture, including Grouped Query Attention (GQA) with 32 query heads and 4 key/value heads
- The checkpoint format exactly matches the original Qwen3-30B-A3B structure for compatibility
- This model is intended for testing and development only, not production use