PALL-VLM — A Dental Vision-Language Model

PALL-VLM is a multimodal dental assistant that adds image understanding to the PALL-Text dental LLM. It follows a LLaVA-style recipe: a frozen SigLIP vision tower is grafted onto the dental Llama-3.1-8B backbone through a trainable MLP projector, then trained on dental images.

This repository hosts the final, fully-merged bf16 model (~8.5B parameters).


Model description

PALL-VLM turns the text-only dental specialist into a vision-language model capable of interpreting dental imagery (clinical photos, histopathology, radiographs) alongside text.

Architecture

  • Vision tower: SigLIP-so400m-patch14-384, 384px input, 729 patch tokens/image (frozen).
  • Projector: 2-layer GELU MLP (LLaVA-1.5 style), maps vision features → LLM embedding space.
  • Language model: dental Llama-3.1-8B (PALL-Text), fine-tuned with LoRA (r=16).
  • <image> token index: 128256. Total ≈ 8.5B params (vision ~0.4B, projector ~10M, LLM 8B).
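
The 729 patch tokens follow from the vision tower's geometry: a 384 px input cut into 14 px patches gives a 27 × 27 grid. The sketch below works through that arithmetic. It assumes non-overlapping square patches with no extra class token, and that each `<image>` placeholder is replaced by one embedding per patch; the function names are illustrative and not part of the released code.

```python
def patch_tokens(image_size: int = 384, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image split into square patches."""
    per_side = image_size // patch_size  # 384 // 14 = 27; leftover pixels are dropped
    return per_side * per_side


def expanded_length(text_tokens: int, num_images: int = 1) -> int:
    """Sequence length once every <image> placeholder becomes patch embeddings.

    text_tokens counts the prompt tokens, including one placeholder per image.
    """
    return text_tokens - num_images + num_images * patch_tokens()


print(patch_tokens())       # 729
print(expanded_length(40))  # 768
```

Under these assumptions every image costs 729 positions of context, which matters when budgeting sequence length for multi-image prompts.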

Two-stage training

| Stage | Trainable | Data | Purpose |
| --- | --- | --- | --- |
| 1: Alignment | projector only (vision + LLM frozen) | single-image subset | bind vision features to the LLM embedding space |
| 2: Instruction tuning | LoRA on LLM + projector (vision frozen) | full set incl. multi-image | dental visual question answering & classification |

Trained on a single L40S 48GB GPU. Stage-3 multimodal DPO is deferred (no multimodal preference data yet).
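
The two-stage freezing schedule can be expressed as a rule over parameter names. This is a minimal sketch, not the training code used for PALL-VLM: the prefixes `vision_tower.`, `multi_modal_projector.` and `language_model.`, and the `lora_` marker for adapter weights, are assumptions modeled on common LLaVA-style checkpoints.

```python
# Assumed parameter-name conventions (hypothetical; adjust to the real module names).
VISION_PREFIX = "vision_tower."
PROJECTOR_PREFIX = "multi_modal_projector."
LLM_PREFIX = "language_model."
LORA_MARKER = "lora_"


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if the parameter should receive gradients in the given stage."""
    if stage not in (1, 2):
        raise ValueError(f"unknown stage: {stage}")
    if param_name.startswith(VISION_PREFIX):
        return False  # vision tower stays frozen in both stages
    if param_name.startswith(PROJECTOR_PREFIX):
        return True  # projector trains in both stages
    if param_name.startswith(LLM_PREFIX):
        # Base LLM weights stay frozen; only LoRA adapters train, and only in stage 2.
        return stage == 2 and LORA_MARKER in param_name
    return False


names = [
    "vision_tower.encoder.layers.0.mlp.fc1.weight",
    "multi_modal_projector.linear_1.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
    "language_model.layers.0.self_attn.q_proj.lora_A.weight",
]
for stage in (1, 2):
    print(stage, [n for n in names if is_trainable(n, stage)])
```

With a PyTorch model, the same rule could be applied by looping over `model.named_parameters()` and setting `requires_grad` on each parameter.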

Evaluation note

Because the data is classification-heavy, evaluation includes an image-shuffle control: accuracy must drop when images are randomly permuted, guarding against modality collapse (the model ignoring the image).
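
One way to run such a control is sketched below. It is an illustration under stated assumptions rather than the evaluation script used for this model: `predict` is a hypothetical callable wrapping the model, and images are reassigned by a random cyclic shift so that no question keeps its own image.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[object, str, str]  # (image, question, gold_label)
Predictor = Callable[[object, str], str]  # hypothetical wrapper around the model


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples where the prediction equals the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(img, q) == gold for img, q, gold in examples)
    return correct / len(examples)


def shuffle_images(examples: Sequence[Example], seed: int = 0) -> List[Example]:
    """Pair each question/label with another example's image (cyclic shift)."""
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to shuffle images")
    offset = random.Random(seed).randrange(1, n)  # never 0, so no image stays in place
    return [
        (examples[(i + offset) % n][0], q, gold)
        for i, (_, q, gold) in enumerate(examples)
    ]


def image_shuffle_gap(predict: Predictor, examples: Sequence[Example], seed: int = 0):
    """Return (matched accuracy, shuffled accuracy, gap)."""
    matched = accuracy(predict, examples)
    shuffled = accuracy(predict, shuffle_images(examples, seed))
    return matched, shuffled, matched - shuffled
```

A gap near zero suggests the answers come from the text alone. Because a swapped-in image can share the original's label, shuffled accuracy is expected to fall toward the label prior rather than to zero.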


Usage

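import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "Harisundar/PALL-VLM"

# Load the merged bf16 checkpoint onto the GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("dental_image.jpg").convert("RGB")

# Build the chat prompt; the <image> placeholder marks where the image goes.
text = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat is shown? Give an ICDAS score if applicable."}],
    tokenize=False, add_generation_prompt=True,
)

# Move inputs to the model's device and cast the pixel values to bf16 to match
# the model weights (integer tensors such as input_ids are left unchanged).
batch = processor(images=[image], text=text, return_tensors="pt").to(model.device, torch.bfloat16)

# Greedy decoding, no gradients needed.
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = batch["input_ids"].shape[1]
print(processor.tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))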

Training Data Sources & Acknowledgements

PALL-VLM is trained on 32,884 records / 52,461 images assembled from multiple publicly available dental image datasets. We gratefully acknowledge the creators:

| Source | Records | Task(s) | Attribution |
| --- | --- | --- | --- |
| Oral cancer clinical photos (PQ) | 10,002 | classification | Kaggle oral cancer image dataset contributors |
| CODE oral classification | 7,546 | classification | CODE oral lesion classification dataset |
| Oral cancer histopathology | 5,127 | classification | Community histopathology datasets |
| Dental textbook figures | 3,221 | VQA, caption | Various textbook authors (see PALL-Text card) |
| Radiograph caries (ICDAS) | 1,431 | classification, detection | ICDAS Foundation; Ismail, A.I. et al. (2007). The International Caries Detection and Assessment System (ICDAS). Community Dentistry and Oral Epidemiology, 35(3), 170–178 |
| Dental samples | 1,082 | mixed | Community dental image datasets |
| SMART oral photos | 1,071 | classification | SMART oral lesion dataset contributors |
| Tufts Dental Database | 998 | report generation | Panetta, K., Rajendran, R., Ramesh, A., Rao, S., & Agaian, S. (2022). Tufts Dental Database. IEEE J. Biomed. Health Inform., 26(4), 1650–1659 |
| DENTEX (quadrant detection) | 676 | detection | Hamamci, I.E. et al. (2023). DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays. arXiv:2305.19112 |
| Dental radiology | 580 | classification | Community dental radiology datasets |
| Oral cancer clinical photos (2) | 544 | classification | Kaggle oral cancer datasets |
| DENTEX (disease classification) | 407 | classification | Hamamci, I.E. et al. (2023) (same as above) |
| Dental jaw captions | 144 | captioning | Community dental datasets |
| DENTEX (enumeration) | 50 | enumeration | Hamamci, I.E. et al. (2023) (same as above) |
| Dental image dataset | 5 | mixed | Community contribution |
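
The per-source counts can be tallied to check the stated total and to quantify how classification-heavy the mix is. This is a small sketch using only the record counts and task labels from the table; it counts a source as classification only when its task column names classification explicitly, so the two "mixed" sources are excluded by assumption.

```python
# (source, records, tasks) copied from the table above.
SOURCES = [
    ("Oral cancer clinical photos (PQ)", 10002, {"classification"}),
    ("CODE oral classification", 7546, {"classification"}),
    ("Oral cancer histopathology", 5127, {"classification"}),
    ("Dental textbook figures", 3221, {"VQA", "caption"}),
    ("Radiograph caries (ICDAS)", 1431, {"classification", "detection"}),
    ("Dental samples", 1082, {"mixed"}),
    ("SMART oral photos", 1071, {"classification"}),
    ("Tufts Dental Database", 998, {"report generation"}),
    ("DENTEX (quadrant detection)", 676, {"detection"}),
    ("Dental radiology", 580, {"classification"}),
    ("Oral cancer clinical photos (2)", 544, {"classification"}),
    ("DENTEX (disease classification)", 407, {"classification"}),
    ("Dental jaw captions", 144, {"captioning"}),
    ("DENTEX (enumeration)", 50, {"enumeration"}),
    ("Dental image dataset", 5, {"mixed"}),
]

total = sum(records for _, records, _ in SOURCES)
classification = sum(
    records for _, records, tasks in SOURCES if "classification" in tasks
)

print(total)  # 32884, matching the stated record count
print(f"classification share: {classification / total:.1%}")
```

By this count roughly four fifths of the records carry a classification task, which is the imbalance the image-shuffle control in the evaluation note is meant to guard against.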

Text backbone data

The language backbone (PALL-Text) was trained on 30+ public datasets across CPT/SFT/DPO stages. See the PALL-Text model card for the complete dataset attribution list.


Intended use & limitations

  • Intended: dental image understanding for education and clinical-decision support (VQA, description, classification cues).
  • Out of scope: autonomous diagnosis; primary triage without clinician review; out-of-distribution / non-dental images.
  • Limitations: wide panoramic radiographs are square-resized in v1 (no AnyRes tiling); performance on OOD clinical images is unverified; classification-heavy training may bias toward terse categorical answers.

⚕️ For research and clinical-decision-support only. Not for autonomous diagnosis or treatment.


Citation

@misc{rajendran2026pallvlm,
  title        = {PALL-VLM: A Low-Cost Dental Vision-Language Model via LLaVA-style
                  Grafting on a Dental Llama-3.1-8B},
  author       = {Rajendran, Harisundar},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Harisundar/PALL-VLM}}
}

Foundational works

@inproceedings{liu2023llava,
  title={Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  booktitle={NeurIPS}, year={2023}
}
@inproceedings{zhai2023siglip,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={ICCV}, year={2023}
}
@article{grattafiori2024llama3,
  title={The Llama 3 Herd of Models},
  author={Grattafiori, Aaron and others}, journal={arXiv:2407.21783}, year={2024}
}
@inproceedings{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan
          and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle={ICLR}, year={2022}
}

Key dataset citations

@article{panetta2022tufts,
  title={Tufts Dental Database: A Multimodal Panoramic X-Ray Dataset for Benchmarking Diagnostic Systems},
  author={Panetta, Karen and Rajendran, Rahul and Ramesh, Aruna and Rao, Shishir and Agaian, Sos},
  journal={IEEE Journal of Biomedical and Health Informatics},
  volume={26}, number={4}, pages={1650--1659}, year={2022}, doi={10.1109/JBHI.2021.3117575}
}
@article{hamamci2023dentex,
  title={DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays},
  author={Hamamci, Ibrahim Ethem and Er, Sezgin and Simsar, Enis and Sekuboyina, Anjany
          and Gundogar, Mustafa and Stadlinger, Bernd and Mehl, Albert and Menze, Bjoern},
  journal={arXiv preprint arXiv:2305.19112}, year={2023}
}
@article{ismail2007icdas,
  title={The International Caries Detection and Assessment System (ICDAS): an integrated system for measuring dental caries},
  author={Ismail, Amid I. and Sohn, Woosung and Tellez, Marisol and Amaya, Ashley
          and Sen, Ananda and Hasson, Hana and Pitts, Nigel B.},
  journal={Community Dentistry and Oral Epidemiology}, volume={35}, number={3}, pages={170--178},
  year={2007}, doi={10.1111/j.1600-0528.2007.00347.x}
}

See the text backbone — PALL-Text — for the full CPT→SFT→DPO recipe, text-domain results, and complete training data attribution.
