Pelikan-2.1 - Promptable Mask Generation Model
Pelikan-2.1 is an in-house promptable mask-generation model developed by ZundTeam, a German AI lab. It is designed for image segmentation workflows where a user or system provides a visual prompt, such as a bounding box and foreground point, and the model predicts an object or region mask directly in image space.
Pelikan-2.1 continues the Pelikan promptable segmentation line with the same SAM-style interaction pattern while remaining a custom architecture. The model encodes an image, embeds visual prompts, and uses a learned transformer mask decoder to produce candidate masks and quality estimates. It is not a wrapper around SAM, SAM2, SAM3, or any other released segmentation model.
Unlike fixed-category semantic segmentation systems, Pelikan-2.1 is centered on visual intent rather than a closed label set. A prompt tells the model what region matters, and the model returns a mask for that region. This makes the model useful for interactive segmentation, visual editing tools, object isolation, dataset annotation, and experiments around promptable vision models.
Pelikan-2.1 is intended to feel familiar to users who know promptable segmentation systems: select a region, pass the prompt into the model, receive a mask, then refine or use that mask downstream. The important difference is that Pelikan-2.1 is built as its own model line, with its own encoder, prompt pathway, decoder, mask heads, and PyTorch/safetensors release format.
Visual Prompting
Pelikan-2.1 follows a simple promptable segmentation flow:
- An RGB image is converted into visual tokens.
- A box prompt and foreground point are embedded as target hints.
- The prompt information is merged with the image representation.
- Learned mask tokens query the image features.
- The decoder produces candidate masks and a mask-quality estimate.
- A downstream system selects or post-processes the final mask.
This design gives Pelikan-2.1 the structure expected from modern promptable segmentation systems while keeping the implementation direct enough for research and custom integration. The visual prompt is deliberately simple: a box gives the model a coarse region of interest, while a foreground point gives it a direct hint about which part of that region should be treated as the target.
Pelikan-2.1 can also be used in automatic pipelines. A detector can generate candidate boxes, a saliency or tracking component can choose foreground points, and Pelikan-2.1 can convert those cues into masks. The model does not require text prompts, class names, or category labels at inference time.
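As a rough illustration of such an automatic pipeline, the sketch below turns detector boxes into box-plus-point prompts. The `(x_min, y_min, x_max, y_max)` box layout, the dictionary keys, and the choice of the box centre as the foreground point are assumptions made for this example; they are not a documented Pelikan-2.1 prompt format, and a saliency or tracking component would normally supply a better point.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def prompts_from_boxes(boxes: List[Box]) -> List[Dict[str, tuple]]:
    """Pair each candidate box with a foreground point at the box centre."""
    prompts = []
    for x0, y0, x1, y1 in boxes:
        if x1 <= x0 or y1 <= y0:
            continue  # skip degenerate boxes with no area
        centre = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        prompts.append({"box": (x0, y0, x1, y1), "point": centre})
    return prompts
```

The box centre is only a stand-in: for ring-shaped or strongly concave objects it can fall on the background, which is why a dedicated point selector is preferable.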
Model Overview
The released Pelikan-2.1 model has approximately 764M parameters. The architecture is custom and identified in the configuration as Pelikan2ForPromptableMaskGeneration.
| Property | Value |
|---|---|
| Model name | Pelikan-2.1 |
| Developer | ZundTeam |
| Model type | Promptable mask generation |
| Architecture family | Pelikan Vision Transformer |
| Parameter count | ~764M |
| Image size | 512px |
| Patch size | 16 |
| Hidden size | 1280 |
| Encoder depth | 33 layers |
| Decoder type | Promptable transformer mask decoder |
| Output | Candidate masks with IoU-quality head |
| Format | PyTorch / safetensors |
Pelikan-2.1 is designed for image-space segmentation. Images are split into visual patches, encoded by a transformer, and decoded into dense masks. Prompt embeddings steer the decoder toward the selected object or region.
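The table values imply the size of the visual token grid. The arithmetic below assumes a square input, non-overlapping patches, and no extra tokens in the grid; the model card does not state these details, so treat the result as an estimate.

```python
image_size = 512    # from the table above
patch_size = 16     # from the table above
hidden_size = 1280  # from the table above

grid_side = image_size // patch_size   # patches per side
num_tokens = grid_side * grid_side     # visual tokens per image
token_shape = (num_tokens, hidden_size)

print(grid_side, num_tokens, token_shape)  # prints: 32 1024 (1024, 1280)
```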
The model includes a quality head that predicts an IoU-style confidence value for each mask token. This allows downstream systems to rank candidate masks, pick the best result, or expose multiple possible masks to a user.
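A minimal sketch of ranking candidates by the quality head, using NumPy arrays as stand-ins for model outputs. The shapes `(K, H, W)` for mask logits and `(K,)` for quality scores are assumptions for this example, not a documented output layout.

```python
import numpy as np


def select_best_mask(mask_logits, quality):
    """Return the candidate with the highest predicted quality and its index.

    mask_logits: array of shape (K, H, W), one logit map per candidate (assumed).
    quality:     array of shape (K,), one IoU-style score per candidate (assumed).
    """
    mask_logits = np.asarray(mask_logits)
    quality = np.asarray(quality)
    if mask_logits.shape[0] != quality.shape[0]:
        raise ValueError("one quality score is required per candidate mask")
    best = int(np.argmax(quality))
    return mask_logits[best], best


def rank_masks(quality):
    """Return candidate indices ordered from highest to lowest quality."""
    return [int(i) for i in np.argsort(-np.asarray(quality))]
```

`rank_masks` suits a user interface that shows several candidates, while `select_best_mask` suits an automatic pipeline that needs a single result.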
Pelikan-2.1 emits raw mask logits instead of only hard binary masks. This gives applications control over thresholding. A stricter threshold can produce cleaner regions with fewer uncertain pixels, while a softer threshold can preserve more of the target shape around difficult boundaries.
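The effect of the threshold can be seen in a small NumPy sketch. The function names and the example thresholds are illustrative choices, not recommended settings for Pelikan-2.1.

```python
import numpy as np


def sigmoid(x):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def binarize(mask_logits, threshold=0.5):
    """Convert a logit map into a boolean mask at the given probability threshold."""
    probs = sigmoid(np.asarray(mask_logits, dtype=np.float64))
    return probs > threshold


example_logits = np.array([[-2.0, 0.2], [1.0, 3.0]])
soft = binarize(example_logits, threshold=0.5)    # keeps 3 of the 4 pixels
strict = binarize(example_logits, threshold=0.8)  # keeps only the most confident pixel
```

Thresholding the probability at `t` is equivalent to thresholding the logit at `log(t / (1 - t))`, so the sigmoid can be skipped when only the binary mask is needed.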
Architecture
Pelikan-2.1 uses a custom transformer-based promptable segmentation architecture with four main parts:
- Image encoder: converts an RGB image into a dense grid of visual tokens.
- Prompt encoder: turns a bounding box and foreground point into dense prompt embeddings.
- Mask decoder: lets learned mask tokens attend over image features and prompt-conditioned representations.
- Mask heads: project decoder outputs into image-space segmentation masks.
The image encoder uses patch embedding followed by stacked self-attention and feed-forward layers. This gives the model global scene context while preserving a spatial token grid for downstream mask prediction.
The prompt encoder gives the model a direct representation of user intent. Box coordinates and foreground point coordinates are embedded into the same hidden space as the visual features. These prompt signals guide the decoder toward the requested region.
The mask decoder uses learned mask tokens and an IoU token. The mask tokens attend to encoded visual features, then small per-mask heads convert the decoded token features into segmentation masks. A lightweight upsampling path restores spatial detail from the transformer grid to the mask output.
This design keeps Pelikan-2.1 focused on direct visual segmentation. It is not an image classifier, object detector, captioning system, text-to-image model, or general multimodal assistant. It is a dedicated promptable mask-generation model.
Real-Image Demo
The assets folder contains a real-image demonstration generated from the local Pelikan-2.1 weights. The demo uses image-and-prompt inputs, stores raw probability maps, creates prompt/mask panels, and renders a short MP4 showing the prompt and mask overlay sequence.
Demo video:
assets/pelikan21_real_image_segmentation_demo.mp4
The real-image demo was generated on CPU from the local safetensors checkpoint. Because the model outputs mask logits rather than hard binary masks, the demo converts them to probability maps and applies an adaptive threshold inside the prompt box for visualization. Applications can choose their own thresholding and post-processing strategy depending on the target use case.
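The model card does not specify which adaptive rule the demo uses. The sketch below shows one possible rule, assumed here purely for illustration: given a probability map, take the midpoint between the lowest and highest probability inside the prompt box as the threshold and discard everything outside the box. The function name and box layout are hypothetical.

```python
import numpy as np


def adaptive_box_threshold(probs, box):
    """Threshold a probability map using statistics from inside the prompt box.

    probs: array of shape (H, W) with values in [0, 1].
    box:   (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive (assumed).
    """
    probs = np.asarray(probs, dtype=np.float64)
    x0, y0, x1, y1 = (int(v) for v in box)
    region = probs[y0:y1, x0:x1]
    if region.size == 0:
        raise ValueError("prompt box does not overlap the probability map")
    # Assumed rule: midpoint between the extremes observed inside the box.
    threshold = 0.5 * (float(region.min()) + float(region.max()))
    mask = np.zeros(probs.shape, dtype=bool)
    mask[y0:y1, x0:x1] = region > threshold  # pixels outside the box stay False
    return mask, threshold
```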
Input and Output Behavior
Pelikan-2.1 expects an RGB image and prompt coordinates in the image coordinate space used by the loader. A typical application normalizes the image, resizes it to the configured input size, converts box and point prompts into the same coordinate frame, and passes all tensors into the model.
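Keeping prompts in the same coordinate frame as the resized image is mostly bookkeeping. The sketch below assumes the image is resized directly to a square input without aspect-preserving padding and that coordinates are pixel values in `(x, y)` order; a loader that pads or normalizes coordinates would need a different mapping.

```python
def scale_prompt(box, point, orig_size, input_size=512):
    """Map a box and point from original pixel coordinates to the resized frame.

    box:       (x0, y0, x1, y1) in the original image (assumed layout).
    point:     (x, y) in the original image.
    orig_size: (width, height) of the original image.
    """
    orig_w, orig_h = orig_size
    sx = input_size / orig_w  # horizontal scale factor
    sy = input_size / orig_h  # vertical scale factor
    x0, y0, x1, y1 = box
    px, py = point
    scaled_box = (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
    scaled_point = (px * sx, py * sy)
    return scaled_box, scaled_point
```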
The main image input is a batch of RGB tensors. The prompt input contains a rectangular box and at least one foreground point. The output contains candidate mask logits and a mask-quality prediction. Applications can select the highest-quality candidate, expose all candidates to a user, or combine candidate masks with their own heuristics.
The output mask is best treated as a continuous spatial field rather than a final binary mask. Downstream code can apply a sigmoid to obtain per-pixel probabilities, choose a threshold, remove tiny disconnected regions, fill small holes, smooth contours, or map the final result back into the original image coordinate system.
Data
Pelikan-2.1 was trained on a mixed segmentation data pool containing scene parsing, foreground object masks, and object-level segmentation examples. The training format converts segmentation annotations into promptable mask-generation samples.
Each training example contains:
- an RGB image
- a target mask
- a box prompt derived from the target region
- a foreground point sampled from the target mask
- a dense mask target for image-space supervision
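The conversion from a segmentation annotation to a promptable sample can be sketched as follows. This is an illustration of the idea rather than the actual training code: the tight bounding box, the uniform point sampling, and the dictionary keys are assumptions.

```python
import numpy as np


def make_prompt_sample(mask, rng):
    """Derive a box prompt and a foreground point from a binary target mask.

    mask: array of shape (H, W); nonzero pixels belong to the target.
    rng:  a numpy.random.Generator used to sample the foreground point.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("target mask is empty")
    # Tight box around the target, with exclusive upper bounds (assumed convention).
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    # Uniformly sample one pixel that lies on the target.
    idx = int(rng.integers(ys.size))
    point = (int(xs[idx]), int(ys[idx]))
    return {"box": box, "point": point, "mask": mask}
```

A training pipeline might also jitter the box or bias the point away from the boundary; the model card does not say whether Pelikan-2.1 training did either.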
The promptable format is important. Instead of only learning that a pixel belongs to a named semantic category, the model learns a conditional task: given this image and this prompt, recover the selected region. This better matches annotation, editing, and object-selection workflows where the user intent changes from one prompt to the next.
Intended Use
Pelikan-2.1 is intended for:
- promptable image segmentation
- mask-generation research
- visual annotation tools
- object and region selection workflows
- interactive segmentation systems
- segmentation-assisted image editing
- dataset labeling and mask bootstrapping
- transformer-based vision experiments
- SAM-style visual prompting research
Pelikan-2.1 can be used wherever a project needs a learned model that receives an image-space prompt and returns a segmentation mask. It is especially relevant for pipelines where another component proposes a box or point and Pelikan-2.1 refines that signal into a usable mask.
Example application areas include interactive image editors, semi-automatic dataset labeling, object extraction tools, visual search interfaces, robotics perception research, scene understanding experiments, and segmentation-assisted media workflows.
Usage
This repository contains the model weights and configuration for Pelikan-2.1. A compatible implementation should construct Pelikan2ForPromptableMaskGeneration using config.json, then load model.safetensors.
from safetensors.torch import load_file
state_dict = load_file("model.safetensors")
At a high level, a Pelikan-2.1 inference pipeline should:
- Prepare an RGB image at the expected input size.
- Normalize and batch the image tensor.
- Provide a visual prompt such as a bounding box and foreground point.
- Run the Pelikan image encoder.
- Run the promptable mask decoder.
- Select the best candidate mask using the quality head.
- Resize or post-process the selected mask for the application.
The exact preprocessing, threshold, prompt coordinate format, and post-processing depend on the surrounding system.
For repeated prompts on the same image, a serving implementation can cache the image encoder output and run only the prompt encoder and mask decoder for each new prompt. This follows the same basic interaction pattern used by promptable segmentation systems and can make interactive usage more responsive.
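The caching pattern described above can be sketched independently of the model code. `encode_image` and `decode_prompt` are hypothetical callables standing in for the image encoder and for the prompt encoder plus mask decoder; they are not names from the Pelikan-2.1 release.

```python
class CachedSegmenter:
    """Run the expensive image encoder once per image and reuse its output."""

    def __init__(self, encode_image, decode_prompt):
        self._encode_image = encode_image    # hypothetical: image -> features
        self._decode_prompt = decode_prompt  # hypothetical: (features, prompt) -> masks
        self._cache = {}

    def segment(self, image_id, image, prompt):
        """Return masks for a prompt, encoding the image only on first use."""
        if image_id not in self._cache:
            self._cache[image_id] = self._encode_image(image)
        return self._decode_prompt(self._cache[image_id], prompt)

    def forget(self, image_id):
        """Drop cached features, for example when the user closes the image."""
        self._cache.pop(image_id, None)
```

A serving implementation would usually bound the cache size, since encoder features for many images can consume substantial memory.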
Post-Processing Suggestions
Pelikan-2.1 outputs mask logits so applications can decide how strict the final mask should be. A simple pipeline can apply sigmoid, threshold the mask, and resize it to the original image resolution.
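For the resizing step, a binary mask can be mapped to the original resolution with nearest-neighbour sampling. The sketch below uses plain NumPy indexing and assumes a direct resize without padding; interpolating the logits before thresholding is an alternative that gives smoother boundaries.

```python
import numpy as np


def resize_mask_nearest(mask, out_h, out_w):
    """Resize a 2-D mask to (out_h, out_w) using nearest-neighbour sampling."""
    mask = np.asarray(mask)
    in_h, in_w = mask.shape
    # For each output row/column, pick the source row/column it falls into.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]
```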
For cleaner visual results, applications may add lightweight post-processing. Common steps include removing tiny disconnected components, filling small holes, smoothing boundaries, keeping only the largest connected component inside the prompted box, or snapping the mask to a crop around the target region.
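One of these steps, keeping only the largest connected component, can be sketched with a breadth-first search over a boolean mask. The use of 4-connectivity is an assumption for this example; libraries such as SciPy (`scipy.ndimage.label`) provide faster labelling for production use.

```python
from collections import deque

import numpy as np


def largest_component(mask):
    """Keep only the largest 4-connected foreground region of a 2-D boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Flood-fill one component starting from this unvisited pixel.
        seen[sy, sx] = True
        queue = deque([(int(sy), int(sx))])
        component = []
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(component) > len(best):
            best = component
    out = np.zeros_like(mask)
    for y, x in best:
        out[y, x] = True
    return out
```

To restrict the result to the prompted box, the mask can be zeroed outside the box before calling the function.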
For annotation workflows, it is often better to preserve uncertainty rather than aggressively clean the mask. A softer threshold can give a human annotator more boundary information to correct. For automated object extraction, a stricter threshold may be preferred.
Practical Prompting Tips
Pelikan-2.1 works best when the prompt clearly identifies the target region. A box should cover the object or region without including too much unrelated context. A point should be placed inside the desired foreground region rather than near the boundary.
If the first mask is too broad, use a tighter box. If the mask selects the wrong region inside the box, move the foreground point closer to the center of the intended object. If the target object is small, crop around the region before resizing to the model input size.
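Cropping around a small target involves two pieces of bookkeeping: choosing a padded crop window and re-expressing the prompt in the crop's coordinate frame. The sketch below shows both; the `pad_ratio` default and the function names are illustrative assumptions, not tuned or documented values.

```python
import math


def padded_crop(box, image_size, pad_ratio=0.5):
    """Return an integer crop window around a box, clamped to the image bounds.

    box:        (x0, y0, x1, y1) in pixel coordinates (assumed layout).
    image_size: (width, height) of the full image.
    pad_ratio:  padding on each side as a fraction of the box size.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pad_x = (x1 - x0) * pad_ratio
    pad_y = (y1 - y0) * pad_ratio
    cx0 = max(0, math.floor(x0 - pad_x))
    cy0 = max(0, math.floor(y0 - pad_y))
    cx1 = min(width, math.ceil(x1 + pad_x))
    cy1 = min(height, math.ceil(y1 + pad_y))
    return cx0, cy0, cx1, cy1


def to_crop_frame(box, point, crop):
    """Shift a box and point so they are relative to the crop's top-left corner."""
    cx0, cy0, _, _ = crop
    x0, y0, x1, y1 = box
    px, py = point
    return (x0 - cx0, y0 - cy0, x1 - cx0, y1 - cy0), (px - cx0, py - cy0)
```

After segmentation, the mask has to be resized to the crop size and pasted back at the crop offset to return to full-image coordinates.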
For objects with thin parts, reflective surfaces, or complex boundaries, post-processing and human review may be useful. The model can provide a starting mask even when the final production mask needs cleanup.
Model Line
Pelikan-2.1 is an incremental release in the Pelikan promptable segmentation line. The model keeps the same custom architecture family, prompt format, image-space mask objective, and PyTorch/safetensors release structure.
Future Pelikan releases can build on the same design with longer training, larger data mixtures, higher-resolution adaptation, stronger decoders, improved prompt handling, video-aware encoders, or more mask candidates.
Scope
Pelikan-2.1 is one part of a larger segmentation system. It does not include a graphical annotation interface, web server, prompt editor, dataset browser, model-serving wrapper, or production moderation layer.
The model is best suited for researchers and developers who are comfortable integrating raw PyTorch/safetensors weights into custom computer-vision systems. It can also serve as a reference point for experiments around promptable segmentation, custom mask decoders, and transformer-based visual prompting.
Limitations
- Pelikan-2.1 requires a compatible custom loader implementation.
- Output quality depends on prompt quality, image distribution, thresholding, and post-processing.
- Ambiguous prompts can produce ambiguous masks.
- Thin structures, transparent objects, tiny objects, and unusual boundaries may require additional refinement.
- The model should be tested on target data before deployment.
- It is not intended for safety-critical medical, legal, biometric, or identity-sensitive segmentation without independent expert review.
Citation
@misc{zundteam_2026_pelikan21,
title = {Pelikan-2.1: Promptable Mask Generation Model},
author = {ZundTeam},
year = {2026},
url = {https://huggingface.co/ZundTeam/Pelikan-2.1}
}