30 GB
133 files
Updated 4 days ago
| Name | Size |
|---|---|
| `.gitattributes` | 1.52 kB |
| `143068.jpg` | 89.6 kB |
| `6-6月-上午9時00.slop` | 14.2 kB |
| `Active_separatist_movements_in_Europe.svg.png` | 1.87 MB |
| `Brazil.entity` | 807 Bytes |
| `Canada–Brazil War.entity` | 101 Bytes |
| `Disability_symbols.svg.png` | 652 kB |
| `DopplegangerTowerV0.3.zip` | 32.5 MB |
| `Egypt_Sudan_claims.svg.png` | 436 kB |
| `Europe_subregion_map_world_factbook.svg.png` | 1.05 MB |
| `Farmland (1).entity` | 121 Bytes |
| `Farmland.entity` | 161 Bytes |
| `IPC_logo_(2019).svg.png` | 166 kB |
| `Inglehart_Values_Map.svg.png` | 1.11 MB |
| `KPRF_Flag.svg.png` | 318 kB |
| `KarteWEUStaaten.png` | 41.4 kB |
| `Ledra_Street.jpg` | 1.39 MB |
| `Linguistics_stub.svg.png` | 106 kB |
| `Logo_of_the_Communist_Party_of_the_Russian_Federation.svg.png` | 224 kB |
| `Logowvs.jpg` | 6.3 kB |
| `Microsoft.Services.Store.winmd` | 4.61 kB |
| `Mount Ah.entity` | 104 Bytes |
| `N._America_separatism.svg.png` | 1.27 MB |
| `PARADE_DES_CHAMPIONS_PARIS_2024_CHAMPS_ELYSEES_(53997937113).jpg` | 1.07 MB |
| `Political_Compass_standard_model.svg.png` | 107 kB |
| `Political_spectrum_Eysenck.svg.png` | 193 kB |
| `PolyTrack-0.6.0-win32-x64.zip` | 144 MB |
| `Quebec_referendum,_1995_-_Results_By_Riding.svg.png` | 907 kB |
| `README.md` | 6.79 kB |
| `Red_Bull.svg.png` | 258 kB |
| `Retour_des_medaillés_de_Tokyo_2020_au_trocadero_(51367935546).jpg` | 1.53 MB |
| `Separatismos_na_Espanha.svg.png` | 727 kB |
| `Separatist_movements_in_Africa.png` | 578 kB |
| `Speaker_Icon.svg.png` | 121 kB |
| `Spotify - Music and Podcasts Installer.exe` | 1.29 MB |
| `SpotifySetup.exe` | 1.3 MB |
| `Stadium (1).entity` | 180 Bytes |
| `Stadium.entity` | 163 Bytes |
| `Sweden.entity` | 694 Bytes |
| `Trucudgeh.entity` | 168 Bytes |
| `Tussol e II.entity` | 122 Bytes |
| `Tussol e-104.planet` | 43.3 kB |
| `Tussol e-251.planet` | 51.3 kB |
| `Tussol e-351.planet` | 57.6 kB |
| `Tussol e-364.png` | 45 kB |
| `Tussol.entity` | 96 Bytes |
| `UNpeacekeeping.svg.png` | 622 kB |
| `Vantablack_02.jpeg` | 1.68 MB |
| `West Land.entity` | 567 Bytes |
| `World_ocean_map.gif` | 69.3 kB |
| `ae.safetensors` | 335 MB |
| `alberville_japanese_ink_20260531_123035.png` | 7.86 MB |
| `baden-baden_japanese_ink_20260531_121931.png` | 12.4 MB |
| `bonsai-2026-05-30T05-35-31-137Z.png` | 329 kB |
| `bonsai-2026-05-30T05-36-04-776Z.png` | 368 kB |
| `bonsai-2026-05-30T05-36-31-002Z.png` | 376 kB |
| `bonsai-2026-05-30T05-37-00-430Z.png` | 373 kB |
| `bonsai-2026-05-30T05-37-29-443Z.png` | 418 kB |
| `busan_japanese_ink_20260531_120720.png` | 7.89 MB |
| `chamonix_japanese_ink_20260531_122531.png` | 4.96 MB |
| `config.json` | 1.44 kB |
| `doha_japanese_ink_20260531_114557.png` | 7.6 MB |
| `ema.safetensors` | 29.2 GB |
| `generation_config.json` | 243 Bytes |
| `geneva_blueprint_20260531_115240.png` | 14.3 MB |
| `hong_kong_neon_cyberpunk_20260531_115953.png` | 4.53 MB |
| `htu_autobackup_20260530_incremental.tsv` | 29.3 kB |
| `htu_backup_20260530_141732.tsv` | 26.7 MB |
| `i2v_20260525_060405_1779681845.mp4` | 1.14 MB |
| `i2v_20260525_061150_1779682310.mp4` | 377 kB |
| `i2v_20260528_113107_1779960667.mp4` | 810 kB |
| `i2v_20260529_130131_1780052491.mp4` | 608 kB |
| `i2v_20260530_154451_1780148691.mp4` | 755 kB |
| `i2v_20260606_035257_1780710777.mp4` | 723 kB |
| `image (1).jpg` | 126 kB |
| `image (10).jpg` | 47.1 kB |
| `image (10).png` | 631 kB |
| `image (11).jpg` | 123 kB |
| `image (11).png` | 885 kB |
| `image (12).jpg` | 144 kB |
| `image (12).png` | 626 kB |
| `image (13).jpg` | 21 kB |
| `image (13).png` | 744 kB |
| `image (14).jpg` | 66.2 kB |
| `image (14).png` | 3.45 MB |
| `image (15).png` | 3.55 MB |
| `image (16).png` | 3.77 MB |
| `image (17).png` | 3.96 MB |
| `image (18).png` | 3.98 MB |
| `image (19).png` | 4.29 MB |
| `image (2).jpg` | 109 kB |
| `image (20).png` | 3.98 MB |
| `image (21).png` | 3.67 MB |
| `image (22).png` | 3.29 MB |
| `image (23).png` | 3.55 MB |
| `image (24).png` | 4.12 MB |
| `image (25).png` | 3.99 MB |
| `image (3).jpg` | 14.2 kB |
| `image (4).jpg` | 43.7 kB |
| `image (4).png` | 3.48 MB |
README.md


🥯 BAGEL • Unified Model for Multimodal Understanding and Generation


We present BAGEL, an open-source multimodal foundation model with 7B active parameters (14B total), trained on large-scale interleaved multimodal data. BAGEL outperforms current top-tier open-source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards and delivers text-to-image quality that is competitive with strong specialist generators such as SD3. BAGEL also demonstrates better qualitative results in classical image-editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute "world-modeling" tasks beyond the scope of previous image-editing models.

This repository hosts the model weights for BAGEL. For installation, usage instructions, and further documentation, please visit our GitHub repository.
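As a minimal sketch of fetching only the weight and config files rather than every file in the repository, the snippet below filters file names with glob patterns. The pattern list, the helper names `select_files` and `download_weights`, and the `repo_id` argument are assumptions for illustration and are not taken from this card. `snapshot_download` and its `allow_patterns` argument come from the separately installed `huggingface_hub` library.

```python
from fnmatch import fnmatchcase

# Assumed patterns: model weights, JSON configs, and the README.
ALLOW_PATTERNS = ["*.safetensors", "*.json", "README.md"]


def select_files(names, patterns=ALLOW_PATTERNS):
    """Return the names that match at least one glob pattern (case-sensitive)."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


def download_weights(repo_id, local_dir):
    """Download only the matching files.

    Needs network access and the huggingface_hub package; repo_id is
    supplied by the caller because this card does not state it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )


if __name__ == "__main__":
    # A few names from the file listing, used to preview the filter offline.
    listing = [
        "ae.safetensors",
        "ema.safetensors",
        "config.json",
        "generation_config.json",
        "SpotifySetup.exe",
        "image (1).jpg",
        "README.md",
    ]
    print(select_files(listing))
```

Previewing the filter with `select_files` before calling `download_weights` helps avoid pulling unrelated files, since `ema.safetensors` alone accounts for 29.2 GB of the listing.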

🧠 Method

BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.
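The next-group prediction idea can be illustrated with a toy data-preparation sketch. This is not BAGEL's implementation: the grouping rule used here (each text token is its own group, each image's tokens form one group) and the names `to_groups` and `training_pairs` are assumptions chosen only to show how an interleaved sequence might be split into context and target-group pairs.

```python
from typing import List, Tuple

Segment = Tuple[str, List[int]]  # (modality, token ids)


def to_groups(segments: List[Segment]) -> List[Segment]:
    """Split interleaved segments into prediction groups.

    Assumption: every text token is a group of one, and all tokens of a
    non-text segment (for example an image) form a single group.
    """
    groups: List[Segment] = []
    for modality, tokens in segments:
        if modality == "text":
            groups.extend(("text", [t]) for t in tokens)
        else:
            groups.append((modality, list(tokens)))
    return groups


def training_pairs(groups: List[Segment]) -> List[Tuple[List[int], Segment]]:
    """Pair the flattened tokens of all earlier groups with the next group."""
    pairs = []
    context: List[int] = []
    for modality, tokens in groups:
        if context:  # the first group has no context to condition on
            pairs.append((list(context), (modality, tokens)))
        context.extend(tokens)
    return pairs


if __name__ == "__main__":
    segments = [("text", [1, 2]), ("image", [10, 11, 12]), ("text", [3])]
    for context, target in training_pairs(to_groups(segments)):
        print(context, "->", target)
```

In this toy setup the image tokens `[10, 11, 12]` are predicted together from the text context `[1, 2]`, whereas text is still predicted one token at a time.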

BAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.

🌱 Emerging Properties

As we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages—multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and further supporting its role in the emergence of advanced capabilities.

📊 Benchmarks

1. Visual Understanding

| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|---|---|---|---|---|---|
| Janus-Pro-7B | - | 79.2 | 41.0 | 50.0 | - |
| Qwen2.5-VL-7B | 2347 | 83.5 | 58.6 | 67.1 | 68.2 |
| BAGEL | 2388 | 85.0 | 55.3 | 67.2 | 73.1 |
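To make the comparison with Qwen2.5-VL-7B concrete, the short sketch below computes the per-benchmark margin from the scores in the table above. The dictionary layout and the `margins` helper are illustrative assumptions, and note that MME is on a different scale from the other four benchmarks.

```python
# Scores copied from the Visual Understanding table.
SCORES = {
    "Qwen2.5-VL-7B": {"MME": 2347, "MMBench": 83.5, "MMMU": 58.6, "MM-Vet": 67.1, "MathVista": 68.2},
    "BAGEL": {"MME": 2388, "MMBench": 85.0, "MMMU": 55.3, "MM-Vet": 67.2, "MathVista": 73.1},
}


def margins(model, baseline, scores=SCORES):
    """Return model score minus baseline score per benchmark, rounded to one decimal."""
    return {
        bench: round(scores[model][bench] - scores[baseline][bench], 1)
        for bench in scores[model]
    }


if __name__ == "__main__":
    for bench, delta in margins("BAGEL", "Qwen2.5-VL-7B").items():
        print(f"{bench}: {delta:+}")
```

On these numbers BAGEL leads on MME, MMBench, MM-Vet, and MathVista, and trails Qwen2.5-VL-7B on MMMU by 3.3 points.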

2. Text-to-Image Generation · GenEval

| Model | Overall ↑ |
|---|---|
| FLUX-1-dev | 0.82 |
| SD3-Medium | 0.74 |
| Janus-Pro-7B | 0.80 |
| BAGEL | 0.88 |

3. Image Editing

| Model | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
|---|---|---|---|---|
| Step1X-Edit | 7.09 | 6.76 | 6.70 | 14.9 |
| Gemini-2-exp. | 6.73 | 6.61 | 6.32 | 57.6 |
| BAGEL | 7.36 | 6.83 | 6.52 | 44.0 |
| BAGEL+CoT | - | - | - | 55.3 |

License

BAGEL is licensed under the Apache 2.0 license. It is finetuned from the Qwen2.5-7B-Instruct and siglip-so400m-14-384-flash-attn2 models and uses the FLUX.1-schnell VAE model, all of which are released under Apache 2.0.

✍️ Citation

@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
Last updated: Jun 6