
Libraries

How to use inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct")
model = AutoModelForMultimodalLM.from_pretrained("inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
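
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_payload` and `chat` are illustrative and not part of vLLM, and the base URL assumes the server started above is listening on its default port 8000.

```python
import json
import urllib.request

MODEL_ID = "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"
IMAGE_URL = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"


def build_payload(prompt, image_url, model=MODEL_ID):
    """Build the same chat-completions body that the curl command sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload("Describe this image in one sentence.", IMAGE_URL)
# With the server running, the call would look like this:
# reply = chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

Because the weights are random, any reply is useful only for checking that the request path works, not for its content.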

Use Docker

docker model run hf.co/inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct

SGLang

How to use inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl command shown above for the pip-based SGLang server (port 30000).

Docker Model Runner
How to use inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct with Docker Model Runner:
```
docker model run hf.co/inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct
```

Llama-4-Scout-1.7B-0.4B-Instruct

This is a tiny version of meta-llama/Llama-4-Scout-17B-16E-Instruct created for testing and development.

Model Details

Base Model: meta-llama/Llama-4-Scout-17B-16E-Instruct
Architecture: llama4 (multimodal vision-language with MoE)
Total Parameters: 1.72B
Activated Parameters: ~0.43B (1 expert activated per token out of 4)

Configuration Changes

The following parameters were reduced from the original model:

Parameter	Original	Tiny
Text Model
num_hidden_layers	48	8
num_local_experts	16	4
num_experts_per_tok	1	1
hidden_size	5120	2048
intermediate_size	8192	3072
intermediate_size_mlp	16384	6144
num_attention_heads	40	16
num_key_value_heads	8	4
layer_types	48 layers (chunked/full pattern)	8 layers (maintains 3:1 pattern)
Vision Model
num_hidden_layers	34	6
hidden_size	1408	768
intermediate_size	5632	3072
num_attention_heads	16	12
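
As a rough cross-check of the table, the per-layer weight count of the tiny text model can be estimated from these values. This is a back-of-the-envelope sketch, not the method used to produce the card's totals: it assumes `head_dim = hidden_size / num_attention_heads`, assumes each expert (routed or shared) is a gated MLP with three projections of width `intermediate_size`, and ignores embeddings, norms, and the vision tower.

```python
# Tiny text-model values from the table above.
hidden_size = 2048
intermediate_size = 3072
num_hidden_layers = 8
num_local_experts = 4
num_experts_per_tok = 1
num_attention_heads = 16
num_key_value_heads = 4

# Assumption: head_dim is hidden_size / num_attention_heads (not stated in the card).
head_dim = hidden_size // num_attention_heads

# Attention: Q and O projections span all heads, K and V only the key/value heads.
attn = (2 * hidden_size * num_attention_heads * head_dim
        + 2 * hidden_size * num_key_value_heads * head_dim)

# Assumption: each expert is a gated MLP with gate, up, and down projections.
expert = 3 * hidden_size * intermediate_size
router = hidden_size * num_local_experts

# "+ 1" accounts for the shared expert, which every token passes through.
stored_per_layer = attn + router + (num_local_experts + 1) * expert
active_per_layer = attn + router + (num_experts_per_tok + 1) * expert

print(f"stored per layer: {stored_per_layer / 1e6:.1f}M")
print(f"active per layer: {active_per_layer / 1e6:.1f}M")
print(f"text layers, stored: {num_hidden_layers * stored_per_layer / 1e9:.2f}B")
print(f"text layers, active: {num_hidden_layers * active_per_layer / 1e9:.2f}B")
```

Since embeddings and the vision model are left out, these figures are not expected to reproduce the 1.72B and ~0.43B totals exactly; they only show how routing one of four experts shrinks the per-token compute relative to the stored weights.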

Architecture Preservation

The tiny model maintains the original Llama-4-Scout architecture patterns:

MoE Structure: Retained mixture-of-experts with shared expert
Attention Pattern: Maintains the chunked_attention/full_attention pattern (every 4th layer is full_attention)
No-RoPE Layers: Preserved the pattern where 3 out of every 4 layers apply RoPE and every 4th layer (the full_attention layer) omits it
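
The attention pattern described above can be sketched as a simple rule over layer indices. The function name `layer_types` and its `full_every` parameter are illustrative; the real value lives in the model's config.

```python
def layer_types(num_hidden_layers, full_every=4):
    """Return one attention type per layer; every `full_every`-th layer is full attention."""
    return [
        "full_attention" if (index + 1) % full_every == 0 else "chunked_attention"
        for index in range(num_hidden_layers)
    ]


tiny = layer_types(8)
for index, kind in enumerate(tiny):
    print(index, kind)
```

With 8 layers this yields two full-attention layers (indices 3 and 7) and six chunked-attention layers, the same 3:1 ratio as the 48-layer original.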

Checkpoint Structure

The model is saved as a single safetensors file following the original checkpoint structure:

language_model.model.layers.{X}.feed_forward.experts.*
language_model.model.layers.{X}.feed_forward.shared_expert.*
vision_model.model.layers.{X}.*

This structure is compatible with transformers' Llama4ForConditionalGeneration.
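
When debugging loading or quantization code, it can help to bucket tensor names by the prefixes listed above. The sketch below does this on a handful of hypothetical names: the prefixes follow this card, but the suffixes (for example `gate_up_proj` or `mlp.fc1.weight`) are invented for illustration and may not match the real checkpoint.

```python
from collections import Counter


def component_of(key):
    """Map a checkpoint tensor name to the component it belongs to."""
    if key.startswith("vision_model."):
        return "vision_model"
    if ".feed_forward.experts." in key:
        return "routed_experts"
    if ".feed_forward.shared_expert." in key:
        return "shared_expert"
    if key.startswith("language_model."):
        return "language_model_other"
    return "other"


def summarize(keys):
    """Count tensors per component."""
    return Counter(component_of(key) for key in keys)


# Hypothetical tensor names: prefixes follow the card, suffixes are made up.
example_keys = [
    "language_model.model.layers.0.feed_forward.experts.gate_up_proj",
    "language_model.model.layers.0.feed_forward.experts.down_proj",
    "language_model.model.layers.0.feed_forward.shared_expert.up_proj.weight",
    "language_model.model.layers.0.self_attn.q_proj.weight",
    "vision_model.model.layers.0.mlp.fc1.weight",
]
for component, count in sorted(summarize(example_keys).items()):
    print(component, count)
```

To run this against the actual file, the real names could be read with `safetensors.safe_open(path, framework="pt")` and its `keys()` method, where `path` is the local path of the downloaded safetensors file.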

Usage

from transformers import Llama4ForConditionalGeneration, AutoProcessor

model = Llama4ForConditionalGeneration.from_pretrained(
    "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct")

# Text-only input
text = "Hello, world!"
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(processor.tokenizer.decode(outputs[0]))

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

Config Modification: Reduced layers, experts, and hidden dimensions while preserving architectural patterns
Weight Initialization: Randomly initialized weights using the model's init_weights() method
Fine-tuning Attempt: Attempted text-only fine-tuning on a small corpus. The multimodal architecture made standard text-only fine-tuning ineffective, so the weights remain effectively untrained, but the model structure is valid.
Validation: Verified model loads correctly and can perform inference
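
The config-modification step can be illustrated with plain dictionaries. This is only a sketch of the idea, not the llm-compressor skill's actual code: the `shrink` helper is hypothetical, the `text_config` / `vision_config` nesting is an assumption about the config layout, and only a few of the values from the Configuration Changes table are shown.

```python
import copy

# Original values taken from the Configuration Changes table.
ORIGINAL = {
    "text_config": {
        "num_hidden_layers": 48,
        "num_local_experts": 16,
        "num_experts_per_tok": 1,
        "hidden_size": 5120,
    },
    "vision_config": {
        "num_hidden_layers": 34,
        "hidden_size": 1408,
    },
}

# Tiny values from the same table.
TINY_OVERRIDES = {
    "text_config": {"num_hidden_layers": 8, "num_local_experts": 4, "hidden_size": 2048},
    "vision_config": {"num_hidden_layers": 6, "hidden_size": 768},
}


def shrink(config, overrides):
    """Return a copy of `config` with `overrides` applied, rejecting unknown keys."""
    tiny = copy.deepcopy(config)
    for section, values in overrides.items():
        unknown = set(values) - set(tiny[section])
        if unknown:
            raise KeyError(f"unknown keys in {section}: {sorted(unknown)}")
        tiny[section].update(values)
    return tiny


tiny_config = shrink(ORIGINAL, TINY_OVERRIDES)
print(tiny_config["text_config"])
print(tiny_config["vision_config"])
```

Rejecting unknown keys guards against typos that would otherwise silently leave a dimension at its original size. Values that are not overridden, such as `num_experts_per_tok`, carry over unchanged.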

Notes

Important: This is a tiny model with randomly initialized weights intended for testing and development purposes only. It is not trained and will not produce meaningful outputs. The vision tower is completely untrained.

Use Cases

Testing model loading and inference pipelines
Validating quantization and compression workflows
Debugging multimodal model handling
CI/CD pipeline testing with realistic model sizes
Memory profiling and optimization experiments
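
For memory profiling, a first estimate of the weight footprint follows from the parameter count and the bytes per parameter. The sketch below uses the 1.72B total stated in this card and the F32 tensor type listed for the checkpoint; the other dtypes are shown only for comparison, and activations, KV cache, and framework overhead are not included.

```python
# Bytes per parameter for a few common dtypes.
BYTES_PER_PARAM = {"F32": 4, "BF16": 2, "F16": 2, "INT8": 1}

TOTAL_PARAMS = 1_720_000_000  # 1.72B, as stated in Model Details


def weight_bytes(num_params, dtype):
    """Approximate size of the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in ("F32", "BF16", "INT8"):
    size_gb = weight_bytes(TOTAL_PARAMS, dtype) / 1e9
    print(f"{dtype}: {size_gb:.2f} GB")
```

Measured usage will be higher than these figures, so they are best treated as a lower bound when sizing a CI runner or test GPU.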

Limitations

Randomly initialized weights (not trained)
Will generate nonsensical outputs
Vision capabilities are non-functional
Not suitable for any production use or evaluation benchmarks

Technical Warnings

When loading this model, you may see the warning:

[transformers] `rope_parameters`'s high_freq_factor field must be greater than low_freq_factor

This is a known issue with the Llama-4 config and can be safely ignored.


Safetensors

Model size: 2B params
Tensor type: F32

Model tree for inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct

Base model: meta-llama/Llama-4-Scout-17B-16E
Finetuned: meta-llama/Llama-4-Scout-17B-16E-Instruct
Finetuned (37): this model