granite-4.1-3b-vl
Vision-language model that combines a CLIP vision encoder, a learned VisionProjection MLP that maps CLIP features to the LLM hidden size, and a LoRA-fine-tuned causal LM backbone.
Architecture
| Component | Detail |
|---|---|
| Base LLM | ibm-granite/granite-4.1-3b (LoRA merged) |
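| CLIP dim | ? (stored as clip_dim in the config of projector/projector.pt) |
| LLM dim | ? (stored as llm_dim in the config of projector/projector.pt) |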
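The architecture implies a projector that takes CLIP features of width clip_dim and returns vectors of width llm_dim. The following is a minimal sketch of such an MLP, not the actual VisionProjection from core.models, whose layer count, activation, and parameter names are not documented here. The class name `ProjectorSketch`, the Linear/GELU/Linear layout, and the toy dimensions 32 and 64 are all illustrative assumptions.

```python
import torch
from torch import nn


class ProjectorSketch(nn.Module):
    """Hypothetical stand-in for VisionProjection: a two-layer MLP."""

    def __init__(self, clip_dim: int, llm_dim: int):
        super().__init__()
        # Assumed layout: Linear -> GELU -> Linear. The real module may differ.
        self.net = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, clip_dim) -> (batch, num_patches, llm_dim)
        return self.net(clip_features)


# Toy dimensions for the demo only; the real values are in the checkpoint config.
sketch = ProjectorSketch(clip_dim=32, llm_dim=64)
patch_features = torch.randn(2, 5, 32)
projected = sketch(patch_features)
print(projected.shape)  # torch.Size([2, 5, 64])
```

Because the layer names here are invented, the released state dict will most likely not load into this class; use the real VisionProjection for that.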
Files
| Path | Contents |
|---|---|
| projector/projector.pt | VisionProjection weights + config |
| llm/ | Merged LLM (SafeTensors) + tokenizer |
Loading
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
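# VisionProjection is not shipped with transformers; core.models refers to the
# model's own codebase and must be importable (e.g. on PYTHONPATH)
from core.models import VisionProjection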
REPO = "Issactoto/granite-4.1-3b-vl"
# Projector
# map_location="cpu" lets the checkpoint load on machines without a GPU
ckpt = torch.load(hf_hub_download(REPO, "projector/projector.pt"), map_location="cpu")
cfg = ckpt["config"]
projector = VisionProjection(cfg["clip_dim"], cfg["llm_dim"])
projector.load_state_dict(ckpt["state_dict"])
projector.eval()
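# LLM + tokenizer (stored in the repo's llm/ subfolder, so pass subfolder=
# instead of appending it to the repo id)
llm = AutoModelForCausalLM.from_pretrained(REPO, subfolder="llm")
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="llm")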
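Once the pieces are loaded, a common way to feed image features to a causal LM is to project them, concatenate them with the text token embeddings, and pass the result as `inputs_embeds`. This card does not state which CLIP checkpoint the model expects or how image and text tokens are ordered, so the sketch below only shows the tensor bookkeeping with toy stand-ins. The names `toy_projector` and `toy_embed`, all dimensions, and the choice to prepend image tokens are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

clip_dim, llm_dim, vocab = 32, 64, 100  # toy sizes, not the real model's

# Stand-ins: in real use these would be `projector` and
# `llm.get_input_embeddings()` from the loading code.
toy_projector = nn.Linear(clip_dim, llm_dim)
toy_embed = nn.Embedding(vocab, llm_dim)

image_features = torch.randn(1, 5, clip_dim)  # (batch, patches, clip_dim)
input_ids = torch.randint(0, vocab, (1, 7))   # (batch, text_tokens)

with torch.no_grad():
    image_embeds = toy_projector(image_features)  # (1, 5, llm_dim)
    text_embeds = toy_embed(input_ids)            # (1, 7, llm_dim)
    # Assumed ordering: image tokens first, then the text prompt.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

# One mask entry per position in the combined sequence.
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
print(inputs_embeds.shape, attention_mask.shape)
```

With the real modules, the combined `inputs_embeds` and `attention_mask` could then be passed to `llm.generate(inputs_embeds=..., attention_mask=...)`, which recent transformers versions accept for decoder-only models. The prompt template used during training is not given here and may matter for output quality.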