Liquid AI

LFM2.5-VL-1.6B-Extract

LFM2.5-VL-1.6B-Extract extracts user-defined fields from images and returns them as JSON. It is Liquid AI's first vision model in the Liquid Nanos collection—compact, task-specific models built for production workflows—and extends the Extract family alongside LFM2-1.2B-Extract for text documents.

⚙️ How it works

You specify what to extract as a YAML field list in the system prompt, and the model returns a JSON object with those fields. Structured outputs integrate cleanly with rule-based systems and downstream pipelines. Use it out of the box or fine-tune for domain-specific extraction.

  • System prompt:
wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface 
wood_pattern: The pattern types visible on the wood surface
  • User prompt: the image to analyze (image only, no accompanying text)

  • Output:

{
  "wood_color": "light tan to beige with darker brown streaks",
  "wood_texture": "smooth with visible grain patterns",
  "wood_pattern": "wavy, linear, irregular"
}
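As a minimal sketch of the prompt format above, the field list can be assembled from a plain mapping of field names to descriptions. The helper name `build_system_prompt` is hypothetical; the wrapper sentences are the ones used in the How to run section. The sketch assumes descriptions are single-line strings and does no YAML quoting.

```python
def build_system_prompt(fields: dict[str, str]) -> str:
    """Render a name -> description mapping as a YAML-style field list
    wrapped in the extraction instructions (hypothetical helper)."""
    # One "name: description" line per field, in insertion order.
    fields_yaml = "\n".join(f"{name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following from the image:\n\n"
        f"{fields_yaml}\n\n"
        "Respond with only a JSON object. Do not include any text outside the JSON."
    )


fields = {
    "wood_color": "The overall coloration of the wood surface",
    "wood_texture": "The tactile quality of the wood surface",
    "wood_pattern": "The pattern types visible on the wood surface",
}
system_prompt = build_system_prompt(fields)
print(system_prompt)
```

Keeping the schema in a mapping makes it easy to reuse the same field names later when checking the returned JSON.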

The model also supports enum-style fields: list the allowed values in the field description, as shown below, and the model returns one of the listed values as its answer.

  • System prompt:
wood_color: The overall coloration of the wood surface, such as blue, red, or light tan
wood_texture: The tactile quality of the wood surface, select from smooth, rough, or grainy
wood_pattern: The pattern types visible on the wood surface, e.g., straight, wavy, or curly
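A downstream pipeline may still want a defensive check that each enum-style field came back with one of its listed values. The sketch below is one way to do that, assuming a hypothetical `SCHEMA` structure that stores the allowed choices next to each description. The names `field_line` and `enum_violations` are illustrative and not part of any Liquid AI API.

```python
# Hypothetical schema: each field has a description and, optionally,
# a closed list of allowed values.
SCHEMA = {
    "wood_color": {
        "description": "The overall coloration of the wood surface",
        "choices": None,  # free-form field
    },
    "wood_texture": {
        "description": "The tactile quality of the wood surface",
        "choices": ["smooth", "rough", "grainy"],
    },
}


def join_choices(choices: list[str]) -> str:
    """Format choices as 'a', 'a or b', or 'a, b, or c'."""
    if len(choices) <= 2:
        return " or ".join(choices)
    return ", ".join(choices[:-1]) + f", or {choices[-1]}"


def field_line(name: str, spec: dict) -> str:
    """Build one line of the field list, appending the choices if any."""
    line = f"{name}: {spec['description']}"
    if spec["choices"]:
        line += ", select from " + join_choices(spec["choices"])
    return line


def enum_violations(output: dict, schema: dict) -> dict:
    """Return {field: value} for every enum field whose value is not allowed."""
    bad = {}
    for name, spec in schema.items():
        choices = spec["choices"]
        if choices and output.get(name) not in choices:
            bad[name] = output.get(name)
    return bad


print(field_line("wood_texture", SCHEMA["wood_texture"]))
print(enum_violations({"wood_color": "light tan", "wood_texture": "silky"}, SCHEMA))
```

A missing enum field is reported with the value `None`, so the same check also catches omitted fields.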

🌟 Use cases

  • Detecting safety-critical events in images (e.g. fallen person, fire, leakage) to trigger automated safety systems.
  • Collecting statistical information about objects across video frames for analytics pipelines.
  • Auto-tagging product images with structured attributes for retail and e-commerce.
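The first two use cases reduce to simple post-processing once each frame's JSON has been parsed. The sketch below assumes a list of already-parsed per-frame outputs. The field names (`person_fallen`, `vehicle_type`) and their values are made up for illustration and would come from your own schema.

```python
from collections import Counter

# Hypothetical per-frame outputs for a schema with two enum-style fields.
frame_outputs = [
    {"person_fallen": "no", "vehicle_type": "car"},
    {"person_fallen": "no", "vehicle_type": "truck"},
    {"person_fallen": "yes", "vehicle_type": "car"},
]


def count_values(outputs: list[dict], field: str) -> Counter:
    """Tally how often each value of `field` appears across frames."""
    return Counter(o[field] for o in outputs if field in o)


def frames_matching(outputs: list[dict], field: str, value: str) -> list[int]:
    """Return the indices of frames where `field` equals `value`."""
    return [i for i, o in enumerate(outputs) if o.get(field) == value]


vehicle_counts = count_values(frame_outputs, "vehicle_type")
alert_frames = frames_matching(frame_outputs, "person_fallen", "yes")

print(vehicle_counts)
print(alert_frames)
```

In a real deployment the list of matching frames would feed whatever alerting or analytics system sits downstream; that integration is outside the scope of this sketch.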

📄 Model details

| Property | Detail |
| --- | --- |
| Parameters (LM only) | 1.2B |
| Vision encoder | SigLIP2 (~400M, SigLIP-2 paper) |
| Backbone layers | hybrid conv+attention |
| Image input | Single image, dynamic resolution |
| Context | 128,000 tokens |
| Vocab size | 65,536 (text) |
| Precision | bfloat16 |
| License | LFM Open License v1.0 |

📊 Performance

We evaluated LFM2.5-VL-1.6B-Extract on a 2,000-sample benchmark of (image, schema, JSON) triples, with reference labels generated by an ensemble of frontier multimodal models. Predictions are scored on the following three dimensions:

  • JSON Validity — share of samples producing strict-parseable JSON
  • Schema Consistency F1 Score — set-level F1 over predicted vs requested field names, macro-averaged across samples
  • VLM Judge Score — match against the image directly, judged by a separate vision model (Qwen/Qwen3.5-35B-A3B)
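To make the first two metrics concrete, here is a minimal sketch of how they could be computed from raw model outputs. It is not the evaluation code bundled under model_eval/, which remains the authoritative implementation. Two assumptions are made here: a response counts as valid only if it parses to a JSON object, and a sample with invalid JSON contributes an F1 of zero. The VLM judge score needs a separate vision model and is not sketched.

```python
import json


def parse_strict(text: str):
    """Return the parsed dict if `text` is JSON encoding an object, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def field_f1(predicted, requested) -> float:
    """Set-level F1 between predicted and requested field names."""
    predicted, requested = set(predicted), set(requested)
    tp = len(predicted & requested)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(requested)
    return 2 * precision * recall / (precision + recall)


def score(samples) -> dict:
    """samples: list of (raw_model_output, requested_field_names) pairs."""
    valid, f1s = 0, []
    for raw, requested in samples:
        obj = parse_strict(raw)
        if obj is None:
            f1s.append(0.0)  # assumption: invalid JSON scores zero
            continue
        valid += 1
        f1s.append(field_f1(obj.keys(), requested))
    n = len(samples)
    # Macro-average: every sample carries equal weight.
    return {"json_validity": 100 * valid / n, "schema_f1": 100 * sum(f1s) / n}


samples = [
    ('{"a": 1, "b": 2}', ["a", "b"]),  # all requested fields present
    ('{"a": 1, "c": 2}', ["a", "b"]),  # one field missing, one unexpected
    ("not json", ["a", "b"]),          # unparseable output
    ('{"x": "y"}', ["x"]),             # single-field schema, correct
]
result = score(samples)
print(result)
```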

| Model | Params | JSON Validity | Schema Consistency F1 | VLM Judge Score |
| --- | --- | --- | --- | --- |
| LFM2.5-VL-1.6B-Extract | 1.6B | 99.6 | 99.6 | 90.6 |
| LFM2.5-VL-1.6B | 1.6B | 91.8 | 75.8 | 66.0 |
| FastVLM-1.5B | 1.91B | 87.3 | 80.3 | 50.9 |
| SmolVLM2-2.2B-Instruct | 2.25B | 84.4 | 82.9 | 64.8 |
| Qwen3.5-2B | 2.27B | 97.9 | 97.7 | 89.7 |
| gemma-4-E2B-it | 2.3B | 97.4 | 97.1 | 84.4 |
| InternVL3_5-2B | 2.35B | 99.6 | 99.2 | 87.7 |
| (ref) Qwen3-VL-4B-Instruct | 4.44B | 99.8 | 99.7 | 92.0 |
| (ref) InternVL3_5-4B | 4.73B | 99.5 | 99.4 | 90.2 |

LFM2.5-VL-1.6B-Extract matches or exceeds every similarly sized (~2B) open-source VLM in the table on all three metrics, and it is competitive with the 4B-class reference models, which have more than twice as many parameters.

Reproducing these numbers: The full evaluation pipeline, which includes extraction, VLM judging, and metric aggregation, is bundled in this repository under model_eval/. Setup, configuration, and run instructions are in the folder's README.

Scope: These numbers characterize the model on the input/output form it is designed for: a single input image, a YAML field list as the schema, and a flat JSON object as the output. Performance is not expected to transfer to vastly different tasks, e.g. multi-image reasoning or free-form VQA.

🏃 How to run

You can run LFM2.5-VL-1.6B-Extract with Hugging Face transformers v5.1 or newer:

pip install transformers pillow
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2.5-VL-1.6B-Extract"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = load_image("https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-Extract/resolve/main/sample_image.png")

fields_yaml = """wood_color: The overall coloration of the wood surface
wood_texture: The tactile quality of the wood surface
wood_pattern: The pattern types visible on the wood surface"""

system_prompt = f"""Extract the following from the image:

{fields_yaml}

Respond with only a JSON object. Do not include any text outside the JSON."""

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": [{"type": "image", "image": image}]},
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(response)
# {
#   "wood_color": "light tan to beige with darker brown streaks",
#   "wood_texture": "smooth with visible grain patterns",
#   "wood_pattern": "wavy, linear, irregular"
# }

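The model is intended for single-turn conversations. We recommend greedy decoding: do_sample=False in transformers, as in the example above, or temperature=0 in runtimes that expose a temperature setting.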
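The decoded response is a string, so a pipeline typically parses it before use. The following is a minimal sketch of one way to do that, assuming you want to fail loudly on malformed output or on a mismatch with the requested field names. The helper name `parse_extraction` is hypothetical.

```python
import json


def parse_extraction(response: str, expected_fields: list[str]) -> dict:
    """Parse the model's response and check it against the requested fields.

    Raises ValueError if the response is not a JSON object or if its keys
    differ from `expected_fields`.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is valid JSON but not an object")
    missing = [f for f in expected_fields if f not in data]
    unexpected = [k for k in data if k not in expected_fields]
    if missing or unexpected:
        raise ValueError(f"field mismatch: missing={missing}, unexpected={unexpected}")
    return data


# The sample response from the example above, as a string.
response = """{
  "wood_color": "light tan to beige with darker brown streaks",
  "wood_texture": "smooth with visible grain patterns",
  "wood_pattern": "wavy, linear, irregular"
}"""
parsed = parse_extraction(response, ["wood_color", "wood_texture", "wood_pattern"])
print(parsed["wood_pattern"])
```

Whether to raise, retry, or fall back to a default on a mismatch depends on the application; raising is simply the most conservative choice for a sketch.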

📬 Contact

Citation

@article{liquidai2025lfm2,
 title={LFM2 Technical Report},
 author={Liquid AI},
 journal={arXiv preprint arXiv:2511.23404},
 year={2025}
}