FlexTok: Resampling Images into 1D Token Sequences of Flexible Length


Official implementation and pre-trained models for:
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length, arXiv 2025
Roman Bachmann*, Jesse Allardice*, David Mizrahi*, Enrico Fini, Oğuzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Zamir, Afshin Dehghan

Installation

For installation instructions, see https://github.com/apple/ml-flextok.

Usage

To load the FlexTok d18-d28 ImageNet-1k model directly from HuggingFace Hub, call:

from flextok.flextok_wrapper import FlexTokFromHub
model = FlexTokFromHub.from_pretrained('EPFL-VILAB/flextok_d18_d28_in1k').eval()

Alternatively, download the model.safetensors checkpoint from this repository manually and load it with our helper functions:

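from hydra.utils import instantiate
from flextok.utils.checkpoint import load_safetensors

# load_safetensors returns the state dict and the model config
ckpt, config = load_safetensors('/path/to/model.safetensors')
# Build the model from the config, then load the weights into it
model = instantiate(config).eval()
model.load_state_dict(ckpt)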

After loading a FlexTok model, image batches can be encoded using:

from flextok.utils.demo import imgs_from_urls
# Load example images of shape (B, 3, 256, 256), normalized to [-1,1]
imgs = imgs_from_urls(urls=['https://storage.googleapis.com/flextok_site/nb_demo_images/0.png'])

# tokens_list is a list of [1, 256] discrete token sequences
tokens_list = model.tokenize(imgs)

The list of token sequences can be truncated in a nested fashion:

k_keep = 64 # For example, only keep the first 64 out of 256 tokens
tokens_list = [t[:,:k_keep] for t in tokens_list]
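
To make the nesting explicit, the following is a minimal sketch that sweeps over several prefix lengths. It does not call the model: random integer tensors stand in for the output of model.tokenize, the vocabulary size of 1000 is an arbitrary placeholder, and truncate_tokens is a hypothetical helper that is not part of the flextok package.

```python
import torch

# Stand-in for model.tokenize(imgs): one [1, 256] tensor of token ids per image.
# The vocabulary size (1000) is a placeholder, not the real codebook size.
torch.manual_seed(0)
tokens_list = [torch.randint(0, 1000, (1, 256)) for _ in range(2)]


def truncate_tokens(tokens_list, k_keep):
    """Hypothetical helper: keep only the first k_keep tokens of every sequence."""
    if k_keep < 1:
        raise ValueError("k_keep must be at least 1")
    return [t[:, :k_keep] for t in tokens_list]


# Build prefixes for a few example lengths.
prefixes = {k: truncate_tokens(tokens_list, k) for k in (1, 4, 16, 64, 256)}

for k, seqs in prefixes.items():
    print(k, [tuple(s.shape) for s in seqs])

# Nesting: cutting the 64-token prefix down to 16 tokens gives the 16-token prefix.
assert torch.equal(prefixes[64][0][:, :16], prefixes[16][0])
```

Each entry of prefixes has the same list-of-tensors layout as tokens_list, so any of them could be passed on to the decoder in place of the full sequences.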

To decode the tokens with FlexTok's rectified flow decoder, call:

# tokens_list is a list of [1, l] discrete token sequences, with l <= 256
# reconst is a [B, 3, 256, 256] tensor, normalized to [-1,1]
reconst = model.detokenize(
    tokens_list,
    timesteps=20, # Number of denoising steps
    guidance_scale=7.5, # Classifier-free guidance scale
    perform_norm_guidance=True, # See https://arxiv.org/abs/2410.02416
)
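
Since reconst is normalized to [-1,1], it has to be mapped back to 8-bit pixel values before it can be viewed or saved. The sketch below shows one way to do this in plain PyTorch. The helper to_uint8_images is hypothetical and not part of the flextok package, and a random tensor stands in for the decoder output.

```python
import torch


def to_uint8_images(x):
    """Hypothetical helper: map [B, 3, H, W] floats in [-1, 1] to [B, H, W, 3] uint8."""
    x = x.detach().float().clamp(-1.0, 1.0)    # guard against slight overshoot
    x = (x + 1.0) / 2.0                        # [-1, 1] -> [0, 1]
    x = (x * 255.0).round().to(torch.uint8)    # [0, 1] -> {0, ..., 255}
    return x.permute(0, 2, 3, 1).cpu()         # channels-last for image libraries


# Stand-in for the output of model.detokenize(...): values in [-1, 1]
reconst = torch.rand(2, 3, 256, 256) * 2.0 - 1.0
images = to_uint8_images(reconst)
print(images.shape, images.dtype)
```

Assuming Pillow is installed, a single image can then be saved with Image.fromarray(images[0].numpy()).save('reconst.png').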

Citation

If you find this repository helpful, please consider citing our work:

@article{flextok,
    title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
    author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
    journal={arXiv 2025},
    year={2025},
}

License

The model weights in this repository are released under the Apple Model License for Research.
