Llama-4-Scout-1.7B-0.4B-Instruct
This is a tiny version of meta-llama/Llama-4-Scout-17B-16E-Instruct created for testing and development.
Model Details
- Base Model: meta-llama/Llama-4-Scout-17B-16E-Instruct
- Architecture: llama4 (multimodal vision-language with MoE)
- Total Parameters: 1.72B
- Activated Parameters: ~0.43B (1 expert activated per token out of 4)
Configuration Changes
The following parameters were reduced from the original model (num_experts_per_tok is unchanged and is listed for reference):
| Parameter | Original | Tiny |
|---|---|---|
| Text Model | | |
| num_hidden_layers | 48 | 8 |
| num_local_experts | 16 | 4 |
| num_experts_per_tok | 1 | 1 |
| hidden_size | 5120 | 2048 |
| intermediate_size | 8192 | 3072 |
| intermediate_size_mlp | 16384 | 6144 |
| num_attention_heads | 40 | 16 |
| num_key_value_heads | 8 | 4 |
| layer_types | 48 layers (chunked/full pattern) | 8 layers (maintains 3:1 pattern) |
| Vision Model | | |
| num_hidden_layers | 34 | 6 |
| hidden_size | 1408 | 768 |
| intermediate_size | 5632 | 3072 |
| num_attention_heads | 16 | 12 |
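As a quick sanity check on the table, the sketch below derives two shapes from the listed values: the per-head width and the number of query heads per key/value head. The dictionary names and the `derived_shapes` helper are illustrative and not part of any library. Computing the per-head width as `hidden_size / num_attention_heads` is an assumption, since the real config may set a head dimension explicitly.

```python
# Values copied from the table above; all names here are hypothetical.
ORIGINAL_TEXT = {"hidden_size": 5120, "num_attention_heads": 40, "num_key_value_heads": 8}
TINY_TEXT = {"hidden_size": 2048, "num_attention_heads": 16, "num_key_value_heads": 4}
TINY_VISION = {"hidden_size": 768, "num_attention_heads": 12}


def derived_shapes(cfg):
    """Return the implied per-head width and, if KV heads are listed, queries per KV head."""
    heads = cfg["num_attention_heads"]
    if cfg["hidden_size"] % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    shapes = {"head_width": cfg["hidden_size"] // heads}

    kv_heads = cfg.get("num_key_value_heads")
    if kv_heads is not None:
        if heads % kv_heads != 0:
            raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
        shapes["queries_per_kv_head"] = heads // kv_heads
    return shapes


for name, cfg in [("original text", ORIGINAL_TEXT), ("tiny text", TINY_TEXT), ("tiny vision", TINY_VISION)]:
    print(name, derived_shapes(cfg))
```

Under these assumptions the text model keeps the same implied per-head width after the reduction, while the grouped-query ratio changes from 5 to 4.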
Architecture Preservation
The tiny model maintains the original Llama-4-Scout architecture patterns:
- MoE Structure: Retained mixture-of-experts with shared expert
- Attention Pattern: Maintains the chunked_attention/full_attention pattern (every 4th layer is full_attention)
- No-RoPE Layers: Preserved the interleaved pattern in which 3 out of every 4 layers apply RoPE and every 4th layer (the full_attention layer) omits rotary position embeddings
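The attention pattern described above can be sketched as a small generator for a `layer_types`-style list. The function name `build_layer_types` and its `full_attention_interval` parameter are hypothetical. Placing the full-attention layer last in each group of four is an assumption based on the phrase "every 4th layer".

```python
def build_layer_types(num_hidden_layers, full_attention_interval=4):
    """Mark every `full_attention_interval`-th layer as full attention, the rest as chunked."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "chunked_attention"
        for i in range(num_hidden_layers)
    ]


# 8 layers for the tiny text model, 48 for the original (from the table above).
tiny_layer_types = build_layer_types(8)
original_layer_types = build_layer_types(48)

print(tiny_layer_types)
print(original_layer_types.count("chunked_attention"), original_layer_types.count("full_attention"))
```

With 8 layers this gives two groups of three chunked layers followed by one full-attention layer, which matches the 3:1 ratio of the 48-layer original.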
Checkpoint Structure
The model is saved as a single safetensors file following the original checkpoint structure:
- `language_model.model.layers.{X}.feed_forward.experts.*`
- `language_model.model.layers.{X}.feed_forward.shared_expert.*`
- `vision_model.model.layers.{X}.*`
This structure is compatible with transformers' Llama4ForConditionalGeneration.
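When inspecting a checkpoint, it can help to bucket tensor names by the prefixes listed above. The following is a minimal sketch using only regular expressions. The helper names and the example key suffixes (such as `gate_up_proj` or `mlp.fc1.weight`) are hypothetical and only serve to exercise the patterns. In practice the keys would come from the safetensors file itself.

```python
import re
from collections import Counter

# Patterns mirror the three prefixes listed above; {X} is the layer index.
PATTERNS = {
    "routed_experts": re.compile(r"^language_model\.model\.layers\.(\d+)\.feed_forward\.experts\."),
    "shared_expert": re.compile(r"^language_model\.model\.layers\.(\d+)\.feed_forward\.shared_expert\."),
    "vision_layers": re.compile(r"^vision_model\.model\.layers\.(\d+)\."),
}


def classify_key(key):
    """Return the name of the first matching group, or "other" if none match."""
    for group, pattern in PATTERNS.items():
        if pattern.match(key):
            return group
    return "other"


def summarize_keys(keys):
    """Count how many tensor names fall into each group."""
    return Counter(classify_key(key) for key in keys)


# Hypothetical tensor names, for illustration only.
example_keys = [
    "language_model.model.layers.0.feed_forward.experts.gate_up_proj",
    "language_model.model.layers.0.feed_forward.shared_expert.down_proj.weight",
    "vision_model.model.layers.5.mlp.fc1.weight",
    "language_model.model.embed_tokens.weight",
]
print(summarize_keys(example_keys))
```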
Usage
```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"

# Load the tiny checkpoint and its processor
model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text-only input
text = "Hello, world!"
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)

# The weights are random, so the decoded text is not meaningful
print(processor.tokenizer.decode(outputs[0]))
```
Creation Process
This model was created using the llm-compressor create-tiny-model skill:
- Config Modification: Reduced layers, experts, and hidden dimensions while preserving architectural patterns
- Weight Initialization: Randomly initialized weights using the model's init_weights() method
- Fine-tuning Attempt: Attempted text-only fine-tuning on a small corpus. The multimodal architecture made standard text-only fine-tuning ineffective, so the weights remain effectively random, but the model structure is valid
- Validation: Verified model loads correctly and can perform inference
Notes
Important: This is a tiny model with randomly initialized weights intended for testing and development purposes only. It is not trained and will not produce meaningful outputs. The vision tower is completely untrained.
Use Cases
- Testing model loading and inference pipelines
- Validating quantization and compression workflows
- Debugging multimodal model handling
- CI/CD pipeline testing with realistic model sizes
- Memory profiling and optimization experiments
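For memory profiling, a rough lower bound on the weight footprint follows from the parameter count stated in Model Details. The sketch below is back-of-the-envelope arithmetic only. It counts weights and ignores activations, KV cache, and framework overhead. The dtype table and function name are illustrative.

```python
TOTAL_PARAMS = 1.72e9  # total parameter count from Model Details

# Bytes per parameter for a few common storage dtypes (illustrative selection).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(TOTAL_PARAMS, dtype):.2f} GiB")
```

Measured usage will be higher than these figures, so they are best treated as a floor when sizing a CI runner or a test GPU.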
Limitations
- Randomly initialized weights (not trained)
- Will generate nonsensical outputs
- Vision capabilities are non-functional
- Not suitable for any production use or evaluation benchmarks
Technical Warnings
When loading this model, you may see the warning:
[transformers] `rope_parameters`'s high_freq_factor field must be greater than low_freq_factor
This is a known issue with the Llama-4 config and can be safely ignored.
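If the message clutters test logs, one option is a logging filter that drops it. The sketch below assumes the warning is emitted through Python's standard `logging` module, which this card does not confirm. The class and function names are hypothetical, and the demonstration uses a stand-in logger rather than the real library logger.

```python
import io
import logging

NOISY = "high_freq_factor field must be greater than low_freq_factor"


class DropMessageFilter(logging.Filter):
    """Drop log records whose rendered message contains a given substring."""

    def __init__(self, substring):
        super().__init__()
        self.substring = substring

    def filter(self, record):
        return self.substring not in record.getMessage()


def add_filter_to_handlers(logger_name, substring):
    """Attach the filter to each handler of the named logger so propagated records are covered."""
    logger = logging.getLogger(logger_name)
    for handler in logger.handlers:
        handler.addFilter(DropMessageFilter(substring))
    return len(logger.handlers)


# Demonstration on a stand-in logger; for the real case, the logger name
# "transformers" would be an assumption to verify locally.
stream = io.StringIO()
demo_logger = logging.getLogger("demo_tiny_model")
demo_logger.setLevel(logging.WARNING)
demo_logger.propagate = False
demo_logger.addHandler(logging.StreamHandler(stream))

add_filter_to_handlers("demo_tiny_model", NOISY)
demo_logger.warning("`rope_parameters`'s high_freq_factor field must be greater than low_freq_factor")
demo_logger.warning("some other warning")
print(stream.getvalue())
```

The filter is attached to handlers rather than to the logger because logger-level filters do not apply to records propagated from child loggers.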