Keural VLM (V0.1)

MKD Co., Ltd. | Full vision-language model built on a custom 24.7M-parameter vision encoder trained from scratch.

Status: V0.1 · Phase 2B Complete · SFT (30K steps) + DPO (3K steps) · All Benchmarks Evaluated


[Figure: Keural VLM Architecture]


What This Is

A complete Vision-Language Model (VLM) PoC whose vision encoder is built entirely from scratch: no CLIP, no pretrained vision backbone. The language model is the pretrained Mistral-7B-Instruct-v0.3.

| Component | Details |
|---|---|
| Vision Encoder | 24.7M params, trained from scratch on CC3M + CC12M (~15M pairs) |
| Projector | LevelAwareProjector (384 → 2048 → 4096) |
| LLM | Mistral-7B-Instruct-v0.3 (4-bit NF4 QLoRA) |
| SFT | LLaVA-Instruct-150K, 30,000 steps |
| DPO | RLHF-V Dataset, 5,733 pairs, 3,000 steps |

Architecture

Image (256×256)
    ↓
CNN Stem  →  ATB Tokenizer  →  Spatial Transformer (12 layers, embed_dim=384)
    ↓
KeuralEncoderOutput  {tokens, level_ids, spatial_metadata, saliency_scores, pooled}
    ↓
LevelAwareProjector  (384 → 2048 → 4096)
    ↓
Visual Tokens  (N_vis × 4096)
    ↓
Mistral-7B-Instruct-v0.3  +  SFT LoRA  +  DPO LoRA
    ↓
Text Response
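
For a rough sense of the projector's size, the sketch below counts the weights of a plain two-layer MLP with the 384, 2048, and 4096 dimensions listed for the projector. This is an assumption for illustration only: the actual LevelAwareProjector may add level embeddings, normalization, or other components, so its real parameter count may differ. `mlp_param_count` is a hypothetical helper, not part of the repository.

```python
def mlp_param_count(dims, bias=True):
    """Count parameters of fully connected layers dims[0] -> dims[1] -> ..."""
    total = 0
    for d_in, d_out in zip(dims, dims[1:]):
        total += d_in * d_out      # weight matrix of this layer
        if bias:
            total += d_out         # bias vector of this layer
    return total


# Dimensions taken from the architecture diagram; the plain-MLP layout is assumed.
projector_dims = [384, 2048, 4096]
estimate = mlp_param_count(projector_dims)
print(f"Plain MLP with dims {projector_dims}: {estimate:,} parameters")
```

Under this plain-MLP assumption the projector would hold roughly 9.2M parameters.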

Key Innovations

Adaptive Token Budget (ATB) Tokenization. Token count is a runtime parameter: dense regions get more tokens, blank regions get fewer.

out = encoder(image, token_budget=64)    # fast / cheap
out = encoder(image, token_budget=256)   # default
out = encoder(image, token_budget=1024)  # full fidelity
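
The idea of giving dense regions more tokens can be illustrated with a small budget-allocation sketch. It is not the ATB tokenizer's actual algorithm, which this document does not describe. It only shows one plausible way to split a fixed budget across regions in proportion to a per-region saliency score, using largest-remainder rounding. The function name, the region scores, and the rounding rule are all assumptions.

```python
def allocate_token_budget(saliency, token_budget):
    """Split token_budget across regions in proportion to their saliency.

    Largest-remainder rounding keeps the allocations summing to token_budget.
    Hypothetical helper for illustration; not Keural's implementation.
    """
    if token_budget < 0 or any(s < 0 for s in saliency):
        raise ValueError("token_budget and saliency scores must be non-negative")
    if not saliency:
        return []
    total = sum(saliency)
    if total == 0:
        # No region stands out: fall back to an even split.
        quotas = [token_budget / len(saliency)] * len(saliency)
    else:
        quotas = [token_budget * s / total for s in saliency]
    alloc = [int(q) for q in quotas]          # floor of each ideal share
    leftover = token_budget - sum(alloc)      # tokens lost to flooring
    # Hand the leftover tokens to the regions with the largest fractional parts.
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Made-up saliency scores for four regions (the last one is blank).
region_saliency = [7, 2, 1, 0]
for budget in (64, 256, 1024):
    print(budget, allocate_token_budget(region_saliency, budget))
```

With these made-up scores the blank region receives no tokens at any budget, while the densest region's share grows with the budget.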

Hierarchical Concept Tokenization (HCT). Every token carries a semantic level tag.

out = encoder(image)
print(out.level_ids)  # {0=global, 1=region, 2=detail}
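
As a sketch of how the level tags might be inspected, the snippet below counts tokens per semantic level. It works on a plain Python list of made-up level ids. The assumption is that `out.level_ids` is a tensor that can be converted with something like `out.level_ids.flatten().tolist()`, which this document does not confirm. `summarize_levels` is a hypothetical helper.

```python
from collections import Counter

# Level meanings as listed in the model card: 0=global, 1=region, 2=detail.
LEVEL_NAMES = {0: "global", 1: "region", 2: "detail"}


def summarize_levels(level_ids):
    """Return the number of tokens at each semantic level."""
    counts = Counter(level_ids)
    return {LEVEL_NAMES[level]: counts.get(level, 0) for level in sorted(LEVEL_NAMES)}


# Made-up ids for eight tokens; real values would come from the encoder output.
example_level_ids = [0, 1, 1, 1, 2, 2, 2, 2]
print(summarize_levels(example_level_ids))
```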

Training Pipeline

Phase 1: Vision Encoder Pretraining (COMPLETE)

| Metric | Result |
|---|---|
| Steps | ~75,000 |
| Dataset | CC3M + CC12M (~15.3M pairs) |
| Architecture | CNN Stem + ATB Tokenizer + 12-layer Spatial Transformer |
| Parameters | 24.7M (trained from scratch) |
| Hardware | 1× RTX 5090 (32 GB VRAM) |

Phase 2A: Projector Alignment (COMPLETE)

| Property | Value |
|---|---|
| What trains | LevelAwareProjector + LoRA on LLM (r=64) |
| Vision encoder | Frozen |
| Dataset | LLaVA-Instruct-150K |
| Steps | 10,000 |

Phase 2B: SFT Instruction Fine-tuning (COMPLETE)

| Property | Value |
|---|---|
| What trains | LoRA on LLM (r=64, α=128) |
| Dataset | LLaVA-Instruct-150K |
| Steps | 30,000 |
| Final loss | 1.022 |
| Hardware | 1× RTX 5090 (32 GB VRAM) |

Phase 2B: DPO Alignment (COMPLETE)

| Property | Value |
|---|---|
| What trains | DPO LoRA on LLM (r=16, α=32) |
| Dataset | RLHF-V Dataset (5,733 pairs) |
| Steps | 3,000 |
| Final loss | 0.235 |
| Reward accuracy | 95% |
| Reward margin | 2.11 |
| Training time | 3h 44min |
| Hardware | 1× RTX 5090 (32 GB VRAM) |
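
Two quantities follow directly from the SFT and DPO settings and can be checked with a little arithmetic. The sketch assumes the standard LoRA convention in which the low-rank update is scaled by alpha / r (not a rank-stabilized variant), and it treats the reported DPO training time as pure step time. Neither is confirmed by this document.

```python
def lora_scaling(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update (assumed: alpha / r)."""
    return alpha / r


# Values from the SFT and DPO tables.
sft_scale = lora_scaling(alpha=128, r=64)
dpo_scale = lora_scaling(alpha=32, r=16)

# DPO wall-clock time per step: 3h 44min over 3,000 steps.
dpo_seconds = 3 * 3600 + 44 * 60
dpo_sec_per_step = dpo_seconds / 3000

print(f"SFT LoRA scaling: {sft_scale}, DPO LoRA scaling: {dpo_scale}")
print(f"DPO: {dpo_sec_per_step:.2f} s/step")
```

Under that convention both adapters use the same scaling factor even though their ranks differ.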

Benchmark Results

Evaluated on 1,000 samples each (where applicable). The vision encoder is 12.4× smaller than LLaVA's CLIP encoder (24.7M vs. 307M parameters).

| Benchmark | Keural SFT-30K | Keural SFT+DPO | LLaVA 1.5 (307M enc) | LLaVA 1.6 (307M enc) |
|---|---|---|---|---|
| VQAv2 Accuracy | 12.9% | 43.6% | 78.5% | 81.8% |
| POPE F1 | 66.9% | 67.0% | 85.9% | 86.5% |
| MME Total Score | 704.3 | 838.8 | 1510.7 | 1519.3 |
| TextVQA Accuracy | 0.8% | 6.6% | 58.2% | 64.9% |
| ScienceQA Accuracy | 39.7% | 53.7% | 66.8% | 70.6% |

POPE F1 (67.0%) is the standout result: 19.5pp behind LLaVA 1.6 while using a 12.4× smaller encoder. DPO improved every benchmark, most dramatically VQAv2 (+30.7pp) and ScienceQA (+14.0pp); POPE moved only +0.1pp. TextVQA is low by design because training included no OCR data. EasyOCR integration in the GUI bridges this gap.
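
The deltas quoted in the summary can be recomputed from the benchmark table. The snippet below is only a convenience check on the reported numbers, with no new measurements. The dictionary layout and variable names are illustrative.

```python
# (Keural SFT-30K, Keural SFT+DPO, LLaVA 1.6) copied from the benchmark table.
# All values are percentages except MME, which is a total score.
results = {
    "VQAv2":     (12.9, 43.6, 81.8),
    "POPE":      (66.9, 67.0, 86.5),
    "MME":       (704.3, 838.8, 1519.3),
    "TextVQA":   (0.8, 6.6, 64.9),
    "ScienceQA": (39.7, 53.7, 70.6),
}

# Gain from adding DPO on top of SFT, and remaining gap to LLaVA 1.6.
dpo_gain = {name: round(dpo - sft, 1) for name, (sft, dpo, _) in results.items()}
gap_to_llava16 = {name: round(ref - dpo, 1) for name, (_, dpo, ref) in results.items()}

# Encoder size ratio: LLaVA's 307M CLIP encoder vs. the 24.7M Keural encoder.
encoder_ratio = round(307 / 24.7, 1)

for name in results:
    print(f"{name:10s} DPO gain {dpo_gain[name]:+7.1f}   gap to LLaVA 1.6 {gap_to_llava16[name]:7.1f}")
print(f"Encoder size ratio: {encoder_ratio}x")
```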

[Figure: All Benchmarks, with panels for VQAv2, POPE, MME, and ScienceQA]

DPO Training Curves

| Metric | Step 0 | Step 3000 |
|---|---|---|
| Loss | 0.694 | 0.235 |
| Reward Accuracy | ~50% | 95% |
| Reward Margin | 0.0 | 2.11 |
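
The step-0 loss is close to ln 2 ≈ 0.693, which is what the standard sigmoid DPO loss gives when the policy and reference model agree, so the margin is zero. The sketch below assumes that loss form, -log σ(margin), and assumes that the reported reward margin is the mean per-pair margin entering it, as common DPO trainers log it. The document does not state either, so treat this as a plausibility check and not as the training code.

```python
import math


def dpo_loss_from_margin(margin):
    """Per-pair sigmoid DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Zero margin (policy == reference) gives ln 2, matching the step-0 loss of ~0.694.
initial_loss = dpo_loss_from_margin(0.0)

# Loss a single pair would have at the reported final mean margin of 2.11.
loss_at_mean_margin = dpo_loss_from_margin(2.11)

print(f"loss at margin 0.00: {initial_loss:.3f}")
print(f"loss at margin 2.11: {loss_at_mean_margin:.3f}")
```

The loss at the mean margin is lower than the reported final loss of 0.235. That is expected under these assumptions, because the loss is convex in the margin and about 5% of pairs are still ranked incorrectly, so the average loss over pairs exceeds the loss at the average margin.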

Repository Structure

This repo (keural-vlm-poc) contains:

  • adapter_config.json: DPO LoRA config (stacks on top of SFT LoRA)
  • adapter_model.safetensors: DPO LoRA weights
  • tokenizer.json, tokenizer_config.json: Mistral tokenizer

The SFT LoRA (checkpoint-30000) and projector weights are bundled together. For inference, load: Vision Encoder → Projector → Mistral-7B + SFT LoRA + DPO LoRA.


Usage

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from alignment.projectors import LevelAwareProjector

device = "cuda"

# 1. Load vision encoder (frozen)
encoder = AutoModel.from_pretrained(
    "mkd-hika/keural-vision-encoder-poc",
    trust_remote_code=True, torch_dtype=torch.bfloat16
).to(device).eval()

# 2. Load projector
projector = LevelAwareProjector(encoder_dim=384, hidden_dim=2048, llm_dim=4096)
projector.load_state_dict(torch.load("projector.pt", map_location=device))
projector = projector.to(device, dtype=torch.bfloat16).eval()

# 3. Load LLM + SFT LoRA + DPO LoRA
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                              bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
base_llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_cfg, device_map="auto", torch_dtype=torch.bfloat16
)
llm = PeftModel.from_pretrained(base_llm, "path/to/sft_lora_adapter")
llm = PeftModel.from_pretrained(llm, "mkd-hika/keural-vlm-poc")  # DPO LoRA
llm.eval()

Roadmap

| Phase | Params | Hardware | Status |
|---|---|---|---|
| POC (this model) | 24.7M encoder | 1× RTX 5090 | Complete (SFT + DPO) |
| Mid-level | ~183.4M encoder | 4× H100 80 GB | Planned |
| Commercial | ~1.1B encoder | 64× H100 80 GB | Future |

License

Training data: CC3M, CC12M, LLaVA-Instruct-150K, RLHF-V. The respective data licenses apply.

MKD Co., Ltd., 2026