Llama-4-Scout-1.7B-0.4B-Instruct

This is a tiny version of meta-llama/Llama-4-Scout-17B-16E-Instruct created for testing and development.

Model Details

  • Base Model: meta-llama/Llama-4-Scout-17B-16E-Instruct
  • Architecture: llama4 (multimodal vision-language with MoE)
  • Total Parameters: 1.72B
  • Activated Parameters: ~0.43B (1 expert activated per token out of 4)

Configuration Changes

The following parameters were reduced from the original model (num_experts_per_tok is unchanged and listed for reference):

Parameter               Original                           Tiny

Text Model
num_hidden_layers       48                                 8
num_local_experts       16                                 4
num_experts_per_tok     1                                  1
hidden_size             5120                               2048
intermediate_size       8192                               3072
intermediate_size_mlp   16384                              6144
num_attention_heads     40                                 16
num_key_value_heads     8                                  4
layer_types             48 layers (chunked/full pattern)   8 layers (maintains 3:1 pattern)

Vision Model
num_hidden_layers       34                                 6
hidden_size             1408                               768
intermediate_size       5632                               3072
num_attention_heads     16                                 12
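
One way to sanity-check the reduced sizes is to recompute the per-head dimension and the grouped-query ratio from the table. The sketch below does this with plain Python; the dictionary layout and the helper names head_dim and queries_per_kv_head are illustrative, and it assumes the head dimension is hidden_size divided evenly by num_attention_heads rather than being set explicitly in the config.

```python
# Values copied from the table above; the dict layout is illustrative only.
TEXT = {
    "original": {"hidden_size": 5120, "num_attention_heads": 40, "num_key_value_heads": 8},
    "tiny": {"hidden_size": 2048, "num_attention_heads": 16, "num_key_value_heads": 4},
}
VISION = {
    "original": {"hidden_size": 1408, "num_attention_heads": 16},
    "tiny": {"hidden_size": 768, "num_attention_heads": 12},
}


def head_dim(cfg):
    """Per-head width, assuming hidden_size is split evenly across heads."""
    size, heads = cfg["hidden_size"], cfg["num_attention_heads"]
    if size % heads != 0:
        raise ValueError(f"hidden_size {size} is not divisible by {heads} heads")
    return size // heads


def queries_per_kv_head(cfg):
    """Number of query heads that share one key/value head (grouped-query attention)."""
    heads, kv_heads = cfg["num_attention_heads"], cfg["num_key_value_heads"]
    if heads % kv_heads != 0:
        raise ValueError(f"{heads} query heads cannot be grouped over {kv_heads} KV heads")
    return heads // kv_heads


for variant, cfg in TEXT.items():
    print("text", variant, "head_dim =", head_dim(cfg), "q_per_kv =", queries_per_kv_head(cfg))
for variant, cfg in VISION.items():
    print("vision", variant, "head_dim =", head_dim(cfg))
```

Under that assumption the text model keeps a head dimension of 128, the vision model drops from 88 to 64, and the number of query heads per key/value head in the text model changes from 5 to 4.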

Architecture Preservation

The tiny model maintains the original Llama-4-Scout architecture patterns:

  • MoE Structure: Retained mixture-of-experts with shared expert
  • Attention Pattern: Maintains the chunked_attention/full_attention pattern (every 4th layer is full_attention)
  • No-RoPE Layers: Preserved the interleaved pattern in which 3 out of every 4 layers apply RoPE and every 4th layer (the full_attention layer) uses no rotary position encoding
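
The attention pattern described above can be sketched as a small generator. The function name build_layer_types and its full_every parameter are hypothetical, chosen for illustration; only the two type strings and the "every 4th layer" rule come from this card.

```python
def build_layer_types(num_hidden_layers, full_every=4):
    """Return one attention type per layer; every `full_every`-th layer is full attention."""
    return [
        "full_attention" if (index + 1) % full_every == 0 else "chunked_attention"
        for index in range(num_hidden_layers)
    ]


tiny_layer_types = build_layer_types(8)    # tiny model: 8 text layers
original_layer_types = build_layer_types(48)  # original model: 48 text layers

for index, layer_type in enumerate(tiny_layer_types):
    print(index, layer_type)
```

With 8 layers this yields six chunked_attention layers and two full_attention layers (indices 3 and 7), which is the same 3:1 ratio as the 48-layer original.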

Checkpoint Structure

The model is saved as a single safetensors file following the original checkpoint structure:

  • language_model.model.layers.{X}.feed_forward.experts.*
  • language_model.model.layers.{X}.feed_forward.shared_expert.*
  • vision_model.model.layers.{X}.*

This structure is compatible with transformers' Llama4ForConditionalGeneration.
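
To check that a checkpoint follows this layout, the tensor names can be grouped by the three prefixes listed above. The following is a minimal sketch: the helper names classify and summarize are made up for this example, and the sample key names are hypothetical stand-ins for whatever the real checkpoint contains.

```python
import re
from collections import Counter

# One pattern per key family listed in this section; {X} is the layer index.
PATTERNS = {
    "routed_experts": re.compile(r"^language_model\.model\.layers\.\d+\.feed_forward\.experts\."),
    "shared_expert": re.compile(r"^language_model\.model\.layers\.\d+\.feed_forward\.shared_expert\."),
    "vision_layers": re.compile(r"^vision_model\.model\.layers\.\d+\."),
}


def classify(key):
    """Return the name of the first matching key family, or 'other'."""
    for group, pattern in PATTERNS.items():
        if pattern.match(key):
            return group
    return "other"


def summarize(keys):
    """Count how many tensor names fall into each key family."""
    return Counter(classify(key) for key in keys)


# Hypothetical tensor names for illustration; real names come from the checkpoint.
sample_keys = [
    "language_model.model.layers.0.feed_forward.experts.gate_up_proj",
    "language_model.model.layers.0.feed_forward.experts.down_proj",
    "language_model.model.layers.0.feed_forward.shared_expert.up_proj.weight",
    "vision_model.model.layers.5.self_attn.q_proj.weight",
    "language_model.model.embed_tokens.weight",
]
summary = summarize(sample_keys)
print(dict(summary))
```

With the safetensors package installed, the real tensor names can be listed by opening the checkpoint file with safetensors.safe_open(path, framework="pt") and calling keys() on the result, then passing that list to summarize.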

Usage

from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"

# Load the model and its processor (tokenizer plus image processor)
model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Text-only input; the weights are random, so expect nonsensical output
text = "Hello, world!"
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

  1. Config Modification: Reduced layers, experts, and hidden dimensions while preserving architectural patterns
  2. Weight Initialization: Randomly initialized weights using the model's init_weights() method
  3. Fine-tuning Attempt: Attempted text-only fine-tuning on a small corpus. Standard text-only fine-tuning proved ineffective on the multimodal architecture, so the weights remain effectively untrained; the model structure is still valid
  4. Validation: Verified model loads correctly and can perform inference
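
Step 1 can be pictured as merging a small set of overrides into the original configuration. This is a rough sketch of the idea, not the create-tiny-model implementation: apply_overrides is a hypothetical helper, the dictionaries hold only a few values from the table in this card, and the text_config / vision_config nesting is assumed to match the layout of the model's config.json.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with `overrides` merged in, recursing into nested dicts."""
    result = copy.deepcopy(config)
    for key, value in overrides.items():
        if key not in result:
            # Fail loudly on typos instead of silently adding a new field.
            raise KeyError(f"unknown config key: {key}")
        if isinstance(value, dict) and isinstance(result[key], dict):
            result[key] = apply_overrides(result[key], value)
        else:
            result[key] = value
    return result


# Subset of the original values, taken from the table in this card.
original = {
    "text_config": {
        "num_hidden_layers": 48,
        "num_local_experts": 16,
        "num_experts_per_tok": 1,
        "hidden_size": 5120,
    },
    "vision_config": {"num_hidden_layers": 34, "hidden_size": 1408},
}

# Only the fields that shrink are listed; everything else is preserved.
tiny_overrides = {
    "text_config": {"num_hidden_layers": 8, "num_local_experts": 4, "hidden_size": 2048},
    "vision_config": {"num_hidden_layers": 6, "hidden_size": 768},
}

tiny = apply_overrides(original, tiny_overrides)
print(tiny)
```

Listing only the reduced fields keeps untouched settings such as num_experts_per_tok identical to the original, which is what "preserving architectural patterns" amounts to at the config level.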

Notes

Important: This is a tiny model with randomly initialized weights intended for testing and development purposes only. It is not trained and will not produce meaningful outputs. The vision tower is completely untrained.

Use Cases

  • Testing model loading and inference pipelines
  • Validating quantization and compression workflows
  • Debugging multimodal model handling
  • CI/CD pipeline testing with realistic model sizes
  • Memory profiling and optimization experiments

Limitations

  • Randomly initialized weights (not trained)
  • Will generate nonsensical outputs
  • Vision capabilities are non-functional
  • Not suitable for any production use or evaluation benchmarks

Technical Warnings

When loading this model, you may see the warning:

[transformers] `rope_parameters`'s high_freq_factor field must be greater than low_freq_factor

This is a known issue with the Llama-4 config and can be safely ignored.
