"The Greatest Path is the Simplest"
中文 | English
- This project aims to train a super-small multimodal vision-language model, MiniMind-V, from scratch, at a cost of just 1.3 RMB and 1 hour of work!
- The smallest version of MiniMind-V is only about $\frac{1}{2600}$ the size of GPT-3, designed to enable fast inference and even training on personal GPUs.
- MiniMind-V is an extension of the visual capabilities of the MiniMind pure language model.
- The project includes full code for the minimalist structure of large VLM models, dataset cleaning, pretraining, and supervised fine-tuning (SFT).
- This is not only the smallest implementation of an open-source VLM model but also a concise tutorial for beginners in vision-language models.
- The hope is that this project can provide a useful example to inspire others and share the joy of creation, helping to drive progress in the wider AI community!
To avoid misunderstandings, the "1 hour" is based on testing (1 epoch) on a single NVIDIA 3090 GPU, and the "1.3 RMB" refers to GPU server rental costs.
📌 Introduction
“Building a plane with Legos is much more exciting than flying in first class!” Is it really as complex as imagined to build a VLM-based multimodal large model? How is the code implementation done? Is the training process difficult? Now, let's explore the answers and feel the joy of creation together!
(As of 2026-02-15) The MiniMind-V series has completed the training of the following model versions, with the smallest requiring only 67M (0.067B) parameters, capable of both image recognition and conversation!
| Model (Size) | Inference Memory | Release |
|---|---|---|
| minimind-3v-moe (201M-A67M) | 1.0 GB | 2026.04.01 |
| minimind-3v (67M) | 0.5 GB | 2026.04.01 |
| MiniMind2-V (104M) | 1.1 GB | 2025.02.20 |
| MiniMind2-Small-V (26M) | 0.6 GB | 2025.02.20 |
| minimind-v-v1-small (27M) | 0.6 GB | 2024.10.04 |
| minimind-v-v1 (109M) | 1.1 GB | 2024.10.04 |
👉Recent Updates
2026-02-15
- Added minimind-3v (67M) and minimind-3v-moe (201M-A67M) models
- Unified 768+8 architecture, supporting both dense and moe modes
- Dataset format updated to parquet, added LLaVA-SFT-665K data source
- Updated tokenizer, image placeholder changed to `<|image_pad|>`
2025-10-24
- Bug fix: model weights mismatch
- Adapted to "minimind-1024 update"
- Code refactoring: training and evaluation scripts standardized
- Added complete checkpoint resumption support
2025-04-27
- Compatibility updates
- Adapted to the new feature in the "minimind" repository
- Standardized parts of the code
More...
2025-02-20
- MiniMind2-V updated alongside MiniMind2
- Significant reduction of all redundant code, standardized code format
- Major simplification of the model's redundant structure
- Updated dataset format, expanded with new SFT datasets
- Better performance than the previous VLM version!
2024-10-05
- MiniMind-V released on schedule, first open-source release
📌 Quick Start
Sharing my hardware and software configuration (for reference only)
- CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
- RAM: 128 GB
- GPU: NVIDIA GeForce RTX 3090(24GB) * 8
- Ubuntu==20.04
- CUDA==12.2
- Python==3.10.16
- requirements.txt
Step 0
# Clone the code repository
git clone https://github.com/jingyaogong/minimind-v
# Download the siglip2 model to the ./model directory
git clone https://huggingface.co/jingyaogong/siglip2-base-p16-ve
# or
git clone https://modelscope.cn/models/gongjy/siglip2-base-p16-ve
# Download the minimind language model to the ./out directory (as the base language model for training VLM):
# HuggingFace
https://huggingface.co/jingyaogong/minimind-3v-pytorch/blob/main/llm_768.pth
# Domestic source
https://modelscope.cn/models/gongjy/minimind-3v-pytorch/resolve/master/llm_768.pth
Ⅰ Test an existing model's performance
1' Environment Preparation
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
2' Download the model
git clone https://huggingface.co/jingyaogong/minimind-3v
3' Command-line Q&A
# load_from='model': load native PyTorch weights, load_from='other path': load transformers format
python eval_vlm.py --load_from model --weight sft_vlm
# Or use transformers format model
python eval_vlm.py --load_from minimind-3v
4' Or start the WebUI
# ⚠️ You must first copy the transformers model folder to the ./scripts/ directory (e.g.: cp -r minimind-3v ./scripts/minimind-3v). The web_demo_vlm script will automatically scan subdirectories containing weight files; it will report an error if none are found.
cd scripts && python web_demo_vlm.py
Ⅱ Train from scratch
1' Environment Preparation
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
Note: Test if Torch can use CUDA
import torch
print(torch.cuda.is_available())
If unavailable, download the whl file from torch_stable for installation. Refer to this link for help.
2' Download Data
Download the required content from the dataset link
and place it under ./dataset.
Note: Dataset Details
[Note 1] Previously, extracting 500k fragmented image files could be very slow. Since 2025-12-27, the dataset format has been unified to Parquet with integrated image-text storage: smaller size, no decompression step, faster loading.
[Note 2] Parquet is a columnar storage format that supports efficient compression and fast reads. To preview the data content, run `python lm_dataset.py` in the dataset/ directory to visualize the first 5 image-text pairs.
Pretrain data:
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/pretrain_i2t.parquet
SFT data:
wget https://hf-mirror.com/datasets/jingyaogong/minimind-v_dataset/resolve/main/sft_i2t.parquet
Please reserve about 2 GB of space for the dataset. If space is insufficient for the pretrain data, you can skip the pretraining step and proceed directly to SFT training.
3' Start Training
3.1 Pretraining (Learning image description)
# Basic training command (start from LLM weights, train vision_proj only)
python train_pretrain_vlm.py --epochs 4 --from_weight llm
Run pretraining to get `pretrain_vlm_*.pth` as the pretrained model's output weights (* represents the model dimension, default is 768).
3.2 Supervised Fine-Tuning (Learning image-caption dialogue style)
# Basic training command (start from pretrain weights, full parameter fine-tuning)
python train_sft_vlm.py --epochs 2 --from_weight pretrain_vlm
Perform supervised fine-tuning to get `sft_vlm_*.pth` as the output weights for the fine-tuned model.
Note: Training Details
Training Features:
- Checkpoint resumption: add the `--from_resume 1` parameter to continue from the last interruption
- GPU-count changes: steps are converted automatically when the GPU count changes during resumption
- Atomic saving: a temporary-file + replace mechanism prevents weight corruption if training is interrupted
- Each save generates `out/**.pth` (model weights) and `checkpoints/**_resume.pth` (training state) files
# To resume training after interruption, use the same command and add --from_resume 1
python train_sft_vlm.py --epochs 4 --from_resume 1
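The atomic-saving mechanism from the training features can be sketched as follows. This is a minimal illustration of the temp-file + `os.replace` pattern, not the repo's actual helper (`atomic_save` is a hypothetical name); with PyTorch you would pass the checkpoint dict to `torch.save(state, tmp_path)` instead of writing raw bytes.

```python
import os

def atomic_save(state_bytes: bytes, path: str) -> None:
    """Write to a temp file in the same directory, then atomically swap it in.

    A crash mid-write leaves the previous checkpoint intact, because the
    original file is only touched by os.replace, which is atomic on both
    POSIX and Windows. (Hypothetical helper, not the repo's API.)
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(state_bytes)
        f.flush()
        os.fsync(f.fileno())    # ensure bytes hit disk before the swap
    os.replace(tmp_path, path)  # atomic rename over the old file

atomic_save(b"fake-checkpoint", "sft_vlm_768.pth")
```

The same pattern covers both the `out/**.pth` weights and the `checkpoints/**_resume.pth` training state.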
Parameter Description:
- `--from_weight`: base weight name (llm, pretrain_vlm, none, etc.)
- `--save_weight`: prefix name for saved weights
- `--from_resume`: whether to resume training (0 = start from scratch, 1 = continue from checkpoint)
- `--freeze_llm`: whether to freeze LLM parameters (pretrain only)
- More details can be found in the code
4' Test the Model's Performance
Ensure that the model *.pth file you want to test is located in the ./out/ directory.
You can also directly download the pre-trained *.pth file
from here.
# Test SFT model (default)
python eval_vlm.py --weight sft_vlm
# Test Pretrain model
python eval_vlm.py --weight pretrain_vlm
The training scripts are based on PyTorch's native framework and support multi-card acceleration. If your device has N (N>1) GPUs:
Single-machine N-card training method (DDP, supports multi-machine multi-card cluster)
torchrun --nproc_per_node N train_xxx.py
Note: Other Details
Single-machine N-card training (DeepSpeed)
deepspeed --master_port 29500 --num_gpus=N train_xxx.py
You can enable wandb logging during training:
# You need to log in: wandb login
torchrun --nproc_per_node N train_xxx.py --use_wandb
# and
python train_xxx.py --use_wandb
By adding the --use_wandb parameter, you can log the training process, and after training is complete, you can view
the process on the wandb website. You can specify the project name and run name by modifying the wandb_project
and wandb_run_name parameters.
[Note]: Since June 2025, networks in mainland China cannot connect directly to WandB, so the MiniMind project switches by default to SwanLab as the training-visualization tool (fully compatible with the WandB API): just change `import wandb` to `import swanlab as wandb`; no other changes are needed.
📌 VLM Detail
The base language model of MiniMind-V (VLM), MiniMind (LLM), comes from the twin project minimind. For detailed information on the model structure, training specifics, principles, and testing results, please refer to the minimind project. To reduce redundancy, the discussion on LLM-related topics is omitted here, assuming you have a basic understanding of MiniMind (LLM).
Even if you are not familiar with the details of LLMs, you can still train a MiniMind-V by following the "Quick Start" guide; that path is unaffected, as the repository focuses on the lowest-cost out-of-the-box experience!
MiniMind-V's structure adds two submodules, a Visual Encoder and a feature projection, forming a modality-mixing branch that supports multimodal input:

[Important] Some Interesting Thoughts
Let's take a moment to think about two questions:
- What is a Large Language Model (LLM)?
- What is a multimodal model?
This article perfectly aligns with my thoughts:
Although the name "large language model" (LLM) contains the word "language," LLMs are actually not closely tied to language; that is largely a historical accident. A more accurate name would be "autoregressive Transformer" or something similar. LLMs are a general statistical modeling technique that mainly uses an autoregressive Transformer to model token streams. These tokens can represent text, images, audio, action choices, even molecules: anything, really. Therefore, any problem that can be converted into modeling a sequence of discrete tokens can in theory be solved with an LLM. In fact, as large language model technology matures, we may see more and more problems fall under this modeling paradigm. In other words, the task is always the same: use the LLM to "predict the next token"; only the role and meaning of the tokens differ across domains.
ZJU-LiXi has also mentioned a similar viewpoint (roughly stated below):
Text, video, audio, actions, etc., are considered "multimodal" signals in human perception, but the term "modality" is essentially just a classification concept based on how humans store information. Just like .txt and .png files: though they differ in visual presentation and higher-level form, they are fundamentally the same. The concept of "multimodal" arose simply because humans need to categorize these signals by sensory dimension. For machines, however, regardless of a signal's "modality," it is ultimately presented as a sequence of binary "monomodal" numbers. Machines do not differentiate the origin of these signals; they just process and analyze the information contained in the sequences.
Personally, I think Generative Pretrained Transformer (GPT) is a more fitting term than **Large Language Model (LLM)**, and I prefer to use "GPT" to represent the LLM/VLM/GPT-like architecture series rather than riding on OpenAI's coattails.
To summarize what GPTs do in one sentence:
A GPT model predicts the next, next-next, next-next-next token, etc., based on the current token... until the model outputs the end token; here, the "token" doesn’t necessarily have to be text!
> For an LLM model, if we need to understand an "image," we just treat the "image" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
> For an LLM model, if we need to understand "audio," we just treat "audio" as a special "foreign language" that has never been encountered before, and translate it into the "LLM language" via a "foreign language dictionary."
> ...
To obtain MiniMind-V, we only need to do these 2 things:
- Use the "foreign language dictionary" that is good at translating images to translate the image from the "foreign language" into a model-understandable "LLM language."
- Fine-tune the LLM so that it and the "foreign language dictionary" go through a period of adaptation, thereby better understanding images.
The "foreign language dictionary" is referred to as the Visual Encoder model.
Like LLaVA, Qwen-VL, and other vision-language models, MiniMind-V uses the open-source SigLIP2 series of models as its Visual Encoder.
Specifically, we use siglip2-base-p16-ve, a Visual
Encoder based on the ViT-B/16 architecture for describing image-text information.
The current SigLIP2 NaFlex vision encoder takes up to 256 patch tokens from the processor output as input to the encoder layers; a pooling head then reduces the encoder output to a single 1×768 embedding used to compute the loss against text. We don't need that final pooled embedding, so we take only the encoder-layer output, i.e. the output features of the core ViT backbone.
It receives 256×768 features from the previous layer, which are then reshaped by concatenating every 4 adjacent tokens into 1 (256×768 → 64×3072), then projected to the LLM's hidden dimension via a 2-layer MLP (Linear→GELU→Linear), resulting in 64 visual tokens input into MiniMind-V.
After obtaining the image encoder features, integration with the LLM requires aligning the visual features to the LLM's text-token dimension and mapping them into the same space as the text embeddings. In other words, image features cannot be treated the same as native text tokens; they require cross-modal feature alignment. LLaVA-1 achieves good alignment with a simple linear transformation; LLaVA-1.5 upgrades to a 2-layer MLP. MiniMind-V adopts the same MLP projection approach as LLaVA-1.5, combined with a reshape for token compression.
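The reshape + 2-layer-MLP projection described above can be sketched in a few lines of PyTorch. The class and argument names here are my own for illustration, not the repo's actual identifiers; the sizes follow the text (768-d SigLIP2 patches, 768-d LLM hidden dimension).

```python
import torch
import torch.nn as nn

class VisionProj(nn.Module):
    """Sketch: compress 256 patch tokens to 64 and project to the LLM space.

    Hypothetical class name; dimensions follow the text
    (256x768 -> 64x3072 -> 64x768 via Linear-GELU-Linear).
    """
    def __init__(self, ve_dim: int = 768, lm_dim: int = 768, group: int = 4):
        super().__init__()
        self.group = group
        self.proj = nn.Sequential(
            nn.Linear(ve_dim * group, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                                   # (B, 256, 768)
        x = x.reshape(b, n // self.group, d * self.group)   # (B, 64, 3072)
        return self.proj(x)                                 # (B, 64, 768)

feats = torch.randn(1, 256, 768)        # one image's ViT patch features
tokens = VisionProj()(feats)
print(tokens.shape)                     # torch.Size([1, 64, 768])
```

Concatenating every 4 adjacent tokens before the MLP quarters the visual sequence length, which keeps the LLM's context budget small.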
With that, the internal structural changes of MiniMind-V are now fully presented.
Next, let's briefly discuss the changes in the external input and output of MiniMind-V.
The input to the VLM is still a segment of text containing special <image> placeholders.
After computing the text embedding, the vector generated by the image encoder can be projected onto the corresponding
embedding part of the placeholder, replacing the original placeholder embedding.
For example:
<image>\nWhat is in this image?
In minimind-v, the image is represented by 64 `<|image_pad|>` tokens as placeholders (the 256 SigLIP2 patch features are compressed to 64 tokens via reshape+MLP),
thus the minimind-v prompt becomes:
<|image_pad|><|image_pad|>...<|image_pad|>(×64)\nWhat is this image describing?
After calculating the embedding and projection, the vision features replace the corresponding placeholder embeddings, and the rest of the computation is identical to the LLM part.
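The placeholder-replacement step above amounts to a boolean-mask scatter over the embedding sequence. This is a minimal sketch with toy shapes and a made-up pad token id (the real prompt uses 64 pads and the tokenizer's actual `<|image_pad|>` id):

```python
import torch

# Toy setup: a 10-token prompt in which positions 2..5 are image pads.
IMAGE_PAD_ID = 7                                 # illustrative id only
input_ids = torch.tensor([[1, 5, 7, 7, 7, 7, 9, 3, 4, 2]])
text_emb = torch.randn(1, 10, 768)               # LLM token embeddings
vision_tokens = torch.randn(1, 4, 768)           # projected visual tokens

mask = (input_ids == IMAGE_PAD_ID)               # where to splice vision in
fused = text_emb.clone()
fused[mask] = vision_tokens.reshape(-1, 768)     # replace pad embeddings
# `fused` now feeds the transformer exactly like a normal embedding sequence
```

After the splice, nothing downstream needs to know an image was involved: attention, the MLP blocks, and the loss all operate on `fused` as an ordinary sequence.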
At this point, all the details of MiniMind-V have been presented. The VLM model subclass inherits from MiniMind with only minimal changes: the core algorithmic modifications are under 50 lines, so the migration cost is very low. The specific implementation may differ from LLaVA and similar models, but the overall idea is consistent.
📌 Experiment
Ⅰ Dataset
Original Source:
- Chinese-LLaVA-Vision: contains approximately 570K pretraining images from CC-3M and COCO 2014
- llava-en-zh-300k: Contains 300k instruction fine-tuning data and 150k images
- LLaVA-SFT-665K: Contains 665k instruction fine-tuning data
The dataset contains both Chinese and English data. The Q&A content has been translated for better Chinese support, and the images have been further organized and resized (pretrain resolution 128×128, SFT resolution 160×160).
(pretrain_i2t.parquet) Pre-training dataset format:
Columns: conversations (json string), image_bytes (binary), image_names (string)
conversations example:
[
{"role": "user", "content": "Provide a brief description of the given image.\n<image>"},
{"role": "assistant", "content": "Olive oil is a healthy ingredient for free use."}
]
image_bytes: <binary image data>
(sft_i2t.parquet) Single image instruction fine-tuning dataset format:
Columns: conversations (json string), image_bytes (binary), image_names (string)
conversations example:
[
{"role": "user", "content": "What impact does the location of the alarm clock have on sleep quality?<image>"},
{"role": "assistant", "content": "Place the digital alarm clock on the nightstand..."}
]
image_bytes: <binary image data>
Note: sft_i2t.parquet contains ~580K samples in total, of which ~236K are image-text conversations (i2t) and ~346K are pure text conversations (t2t). The latter is used to preserve the model's base language capabilities.
Dataset download link: (ModelScope | HuggingFace)
Ⅱ Training
Training is divided into two stages, both freezing the Visual Encoder gradients and only training the Projection and LLM parts. Training is initialized from LLM pre-trained weights, with support for DDP multi-GPU training, mixed precision (bfloat16), torch.compile acceleration, and swanlab logging.
train_pretrain_vlm
The pre-training stage learns general image knowledge from ~1.13M image-text description pairs (e.g., a deer is a deer, a dog is a dog). This stage uses a higher learning rate (1e-4), max sequence length of 360, freezes the LLM main parameters, and only sets the Projection and LLM's layer 0 as learnable, aiming to quickly establish a basic mapping from visual features to the language space while avoiding damage to the LLM's existing language capabilities.
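The pretrain-stage freezing strategy (train only the projection and the LLM's layer 0) can be sketched as follows, using a tiny stand-in module; the real MiniMind-V attribute names may differ.

```python
import torch.nn as nn

# Tiny stand-in for the VLM; real module names in the repo may differ.
model = nn.ModuleDict({
    "vision_proj": nn.Linear(3072, 768),
    "layers": nn.ModuleList([nn.Linear(768, 768) for _ in range(8)]),
})

# Pretrain stage: freeze everything, then re-enable only the projection
# and the first transformer layer, per the text's freeze strategy.
for p in model.parameters():
    p.requires_grad = False
for p in model["vision_proj"].parameters():
    p.requires_grad = True
for p in model["layers"][0].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

In the SFT stage the same idea runs in reverse: every projection and LLM parameter is unfrozen (the Visual Encoder stays frozen throughout both stages).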
train_sft_vlm
The instruction fine-tuning stage learns real Q&A formats from 580K samples, of which ~236K are image-text multi-turn conversations and ~346K are pure text conversations (to preserve LLM base capabilities).
This stage uses a lower learning rate (1e-5 to 1e-6), a max sequence length of 768, and unfreezes all Projection and LLM parameters for full fine-tuning,
enabling the model to conduct multi-turn conversations based on image content, while mitigating catastrophic forgetting through the mixed-in pure text data.
Training Time and Loss Trend (for reference only)
Pretrain [768+8] (dense & moe)
Ⅲ Model Weights
| Format | ModelScope | HuggingFace |
|---|---|---|
| Native PyTorch (*.pth) | minimind-3v-pytorch | minimind-3v-pytorch |
| Transformers | minimind-v collection | minimind-v collection |
Note: The Transformers version is the `MiniMind-V` model after single-image instruction fine-tuning
📌 Test
Effect Test
Single Image Dialogue
| image | minimind-3v | minimind-3v-moe |
|---|---|---|
|
The image features a white and black airplane parked on a grassy terrain. The layers of a building are likely to be filled with air traffic control, such as the building's tall building, the large building, or the overall pavement. There are also two airplanes parked in the background. The airplanes are displayed on a board, and the airplane is flying through the air while the black and white airplane is parked on the ground. | The image features a large jetliner with a large airplane sitting on the ground. It is likely an airplane or an aircraft, possibly a flight jet or a runway. |
|
The image features a white cake in an old-fashioned cake with cake placed on it. It is surrounded by a few cooked ingredients, including the wedding cake. | The image features a white cake in an old-fashioned cake with cake placed on it. It is surrounded by a few cooked ingredients, including the wedding cake. |
|
The image depicts a delicious pizza pizza with fresh toppings, which are likely present in a slice of pizza. The pizza is perfectly crispy, as it has a crispy crust and slightly crispy, making it a delightful pizza presentation. The pizza is filled with fresh toppings, adding to the crispy crust. The pizza is also a bit scrambled, as it has a fresh topping, while the crispy crust is cooked with a pizza pan. The pizza is likely to be a pizza with its crunchy texture and flavorfully. The pizza is also a popular choice for pizza with others, and it is a filling and crispy crust. | The image features a scenic burning pizza in a pasta-style board, surrounded by a wooden cabinet, a garner topping for cheese and vegetables. There is a small cabinet nearby, with a pan-familt, a pizza flatter topping. The pizza is situated on the left side of the board, with a bowl of olives placed on top of the side. |
|
The image shows a yellow yellow and red turf coastline on a street, with a couple of cars and a yellow and red traffic lights in the background. | The image features a couple of female vanity van lying on a car. The car is situated on the ground, surrounded by a wet furniture. The couple is seen in the middle of the car, possibly observing the scene. |
|
IoT tree is a holiday that is often associated with Christmas culture, history, and celebration. The tree has a unique black and white striped pattern, which features a sweet treat, a budget-friendly chocolate cake with brown spots. The cake is burnt and has a balcony with a sweet treat, with the rich, vibrant colors of the striped pattern. The tree has a rich and burnt color, while the rich and vibrant colors are visually appealing. The tree has a striped pattern, which adds to the overall atmosphere of the image. | The image is a colorful scene of a colorful vintage house, with a large pink roof of the wall and a red color scheme. The colorful vintage house has a pink color, and the color scheme appears to be fine, with a red color scheme. |
|
The image features a snowy mountain surrounded by a large, dense mountains. The large body of water is quite dense, with the body of water being hot and the water is sandy. The tall trees are swirls, and the tall trees are standing on the snow-covered ground. The body of water is also sink, creating a savanna-like appearance. | The image is an image of a large mountain visible in a lake, surrounded by the idyllic mountains and the mountains. It appears to be a blanket in the ocean, with the river and idylis watching the sea. The mountains are lined with fresh sand and waves, adding a sense of tranquility to the scene. |
|
The image shows a variety of cooking options, including baking meat, cupcakes, and spinach. The baked vegetables are rich in a variety of flavors, including grilled, sautéed, and baked vegetables. The presence of a baked vegetable with a variety of vegetables in different parts suggests a variety of options, including baking, cooking, and baking. The cooking process is highly recommended, with a mix of vegetables and baking times, making it an ideal choice for those who prefer a variety of cooking options. The cooking process is also highly compatible, with a variety of flavors and textures enjoying the cooking process. | The image features a variety of freshwater salads, glasses, and coworkers displayed on a table. There is a mix of freshwater ingredients, likely a bun, which can be seen in the menu. The freshwater ingredients are placed on the table, and there is a portion of the coworkers displayed in the middle of the room. There is also a bowl filled with various ingredients. |
|
The image features a table filled with people standing together, casing bars, and a pair of brown cats. The table is filled with pink and white cats. | The image is a black and white detail of a miscellaneous bunch of toys, which is likely to be a part of a group or a similar artistic field. |
|
The image features a brown wooden coat. | The image shows a sandy beach with an umbrella on top of a chair, providing a visual appeal for people to sit on. |
|
The image displays a group of people sitting on the bus. They are waiting to be cautious and attentive to their feet, which indicates they are likely to be cautious and followed by the bus. | The image features a large collection of school buses, a brick-and-middle bus, and a stack of cars visible in the background. The school bus is situated next to a school bus, and there are several people watching the bus. The school bus is visible in the background, with one person standing behind the other, while the other person is watching the bus. The bus is positioned behind the school bus, creating a seamless and dynamic visual effect. |
Effect Summary:
Both models can identify image subjects (airplane, cake, car, beach, etc.), but commonly exhibit repetitive expressions and hallucinated details. Limited by model and data scale, the overall performance is at a stage of "understanding the gist but inaccurate on details".
Visual signals are treated as a special foreign language by LLMs, so the "language learning" ability highly depends on the LLM's capacity. The stronger the LLM, the more powerful the corresponding VLM, and the performance boost becomes significant.
Future Areas for Improvement:
> Introduce dynamic resolution and Tile-based encoding (like LLaVA-NeXT) to break through the fixed resolution limit.
> Visual Encoder could be upgraded to stronger vision encoders for finer-grained image features.
> Extend multi-image understanding, video understanding, and Visual Grounding capabilities.
> ...
📌 Acknowledge
If you find `MiniMind-V` helpful, please consider giving it a ⭐ on GitHub.
Given my limited expertise, there may be undiscovered issues; everyone is welcome to discuss and point out problems in Issues, or submit PRs to improve the project.
Your support is the driving force behind continuous improvements to the project. Thank you!
🤝 Contributors
😊 Acknowledgments
@xinyanghuang7: provided the multi-image VLM branch of this repository (up to this version)
Reference Links & Thanks to the following excellent papers or projects
- No particular order
- LLaVA
- LLaVA-VL
- Chinese-LLaVA-Vision-Instructions
🫶Supporter
🎓 Citation
If you find MiniMind-V helpful in your research or work, please cite:
@misc{minimind-v,
title = {MiniMind-V: Train a Tiny VLM from Scratch},
author = {Jingyao Gong},
year = {2024},
url = {https://github.com/jingyaogong/minimind-v},
note = {GitHub repository, accessed 2026}
}
📜 License
This repository is licensed under the Apache-2.0 License.




