PALL-VLM — A Dental Vision-Language Model
PALL-VLM is a multimodal dental assistant that adds image understanding to the PALL-Text dental LLM. It follows a LLaVA-style recipe: a frozen SigLIP vision tower is grafted onto the dental Llama-3.1-8B backbone through a trainable MLP projector, then trained on dental images.
This repository hosts the final, fully-merged bf16 model (~8.5B parameters).
- Developed by: Harisundar R
- Architecture: LlavaForConditionalGeneration
- Vision tower: google/siglip-so400m-patch14-384 (frozen)
- Language backbone: Harisundar/PALL-Text (dental CPT+SFT+DPO Llama-3.1-8B)
- Code: PALL on GitHub
- VLM training data: Harisundar/PALL-VLM-data
- License: Llama 3.1 Community License (SigLIP component is Apache-2.0)
Model description
PALL-VLM turns the text-only dental specialist into a vision-language model capable of interpreting dental imagery (clinical photos, histopathology, radiographs) alongside text.
Architecture
- Vision tower: SigLIP-so400m-patch14-384, 384px input, 729 patch tokens/image (frozen).
- Projector: 2-layer GELU MLP (LLaVA-1.5 style), maps vision features → LLM embedding space.
- Language model: dental Llama-3.1-8B (PALL-Text), fine-tuned with LoRA (r=16).
- <image> token index: 128256.
Total ≈ 8.5B params (vision ~0.4B, projector ~10M, LLM 8B).
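The 729 patch tokens per image follow from the input resolution and patch size. A minimal sketch of that arithmetic, assuming the usual ViT-style non-overlapping patch embedding (the helper name `patch_tokens` is illustrative, not part of the model code):

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches cut from a square image."""
    # Pixels at the border that do not fill a whole patch are dropped.
    per_side = image_size // patch_size
    return per_side * per_side


tokens = patch_tokens(384, 14)
print(tokens)  # 27 * 27 = 729
```

Each of these tokens is passed through the projector before reaching the language model, so every image adds 729 positions to the input sequence.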
Two-stage training
| Stage | Trainable | Data | Purpose |
|---|---|---|---|
| 1 — Alignment | projector only (vision + LLM frozen) | single-image subset | bind vision features to the LLM embedding space |
| 2 — Instruction tuning | LoRA on LLM + projector (vision frozen) | full set incl. multi-image | dental visual question answering & classification |
Trained on a single L40S 48GB GPU. Stage-3 multimodal DPO is deferred (no multimodal preference data yet).
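The freezing schedule in the table can be sketched as a lookup from stage to trainable parameter groups. The name prefixes below (`vision.`, `projector.`, `llm.lora.`, `llm.layers.`) are hypothetical placeholders chosen for illustration; the actual module names in the checkpoint may differ.

```python
# Hypothetical parameter-name prefixes; real module names may differ.
STAGE_TRAINABLE = {
    1: ("projector.",),              # Stage 1: alignment, projector only
    2: ("projector.", "llm.lora."),  # Stage 2: projector + LoRA adapters on the LLM
}


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if a parameter would receive gradients in the given stage."""
    return param_name.startswith(STAGE_TRAINABLE[stage])


example_names = [
    "vision.encoder.layer0.weight",  # vision tower: frozen in both stages
    "projector.linear_1.weight",     # projector: trained in both stages
    "llm.lora.q_proj.A",             # LoRA adapter: trained in stage 2 only
    "llm.layers.0.q_proj.weight",    # base LLM weight: frozen in both stages
]
for stage in (1, 2):
    print(stage, [n for n in example_names if is_trainable(n, stage)])
```

In a real training script the same predicate would typically be used to set `requires_grad` on each named parameter before building the optimizer.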
Evaluation note
Because the data is classification-heavy, evaluation includes an image-shuffle control: accuracy must drop when images are randomly permuted, guarding against modality collapse (the model ignoring the image).
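One way such a control could be implemented is sketched below. This is not the project's evaluation code: `predict` stands in for any callable that maps an (image, question) pair to an answer, and the toy data uses strings in place of real images.

```python
import random


def accuracy(predict, images, questions, labels) -> float:
    """Fraction of examples where the prediction equals the label."""
    hits = sum(predict(img, q) == y for img, q, y in zip(images, questions, labels))
    return hits / len(labels)


def image_shuffle_control(predict, images, questions, labels, seed=0):
    """Return (accuracy with true images, accuracy with permuted images)."""
    shuffled = list(images)
    random.Random(seed).shuffle(shuffled)  # questions and labels stay in place
    return (
        accuracy(predict, images, questions, labels),
        accuracy(predict, shuffled, questions, labels),
    )


# Toy stand-ins: each "image" is just a string naming its content.
images = ["caries", "healthy", "lesion", "caries"]
questions = ["What is shown?"] * len(images)
labels = list(images)

uses_image = lambda img, q: img          # answer depends on the image
ignores_image = lambda img, q: "caries"  # modality collapse: constant answer

print(image_shuffle_control(uses_image, images, questions, labels))
print(image_shuffle_control(ignores_image, images, questions, labels))
```

A predictor that ignores the image scores identically in both conditions, which is the signature the control is meant to expose; a predictor that relies on the image can only lose accuracy when the images are permuted.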
Usage
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "Harisundar/PALL-VLM"

# Load the merged bf16 checkpoint onto the GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("dental_image.jpg").convert("RGB")

# Build the prompt; the <image> placeholder marks where the image tokens go.
text = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat is shown? Give an ICDAS score if applicable."}],
    tokenize=False, add_generation_prompt=True,
)
batch = processor(images=[image], text=text, return_tensors="pt").to("cuda")

# Greedy decoding; gradients are not needed for inference.
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
Training Data Sources & Acknowledgements
PALL-VLM is trained on 32,884 records / 52,461 images assembled from multiple publicly available dental image datasets. We gratefully acknowledge the creators:
| Source | Records | Task(s) | Attribution |
|---|---|---|---|
| Oral cancer clinical photos (PQ) | 10,002 | classification | Kaggle oral cancer image dataset contributors |
| CODE oral classification | 7,546 | classification | CODE oral lesion classification dataset |
| Oral cancer histopathology | 5,127 | classification | Community histopathology datasets |
| Dental textbook figures | 3,221 | VQA, caption | Various textbook authors (see PALL-Text card) |
| Radiograph caries (ICDAS) | 1,431 | classification, detection | ICDAS Foundation; Ismail, A.I. et al. (2007). The International Caries Detection and Assessment System (ICDAS). Community Dentistry and Oral Epidemiology, 35(3), 170–178 |
| Dental samples | 1,082 | mixed | Community dental image datasets |
| SMART oral photos | 1,071 | classification | SMART oral lesion dataset contributors |
| Tufts Dental Database | 998 | report generation | Panetta, K., Rajendran, R., Ramesh, A., Rao, S., & Agaian, S. (2022). Tufts Dental Database. IEEE J. Biomed. Health Inform., 26(4), 1650–1659 |
| DENTEX — quadrant detection | 676 | detection | Hamamci, I.E. et al. (2023). DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays. arXiv:2305.19112 |
| Dental radiology | 580 | classification | Community dental radiology datasets |
| Oral cancer clinical photos (2) | 544 | classification | Kaggle oral cancer datasets |
| DENTEX — disease classification | 407 | classification | Hamamci, I.E. et al. (2023) (same as above) |
| Dental jaw captions | 144 | captioning | Community dental datasets |
| DENTEX — enumeration | 50 | enumeration | Hamamci, I.E. et al. (2023) (same as above) |
| Dental image dataset | 5 | mixed | Community contribution |
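As a quick consistency check, the per-source record counts in the table above can be summed and compared with the stated total. The sketch below copies the counts and task labels from the table; treating a source as "classification-only" when classification is its sole listed task is an assumption made here for illustration.

```python
# (records, tasks) per source, copied from the table above.
SOURCES = {
    "Oral cancer clinical photos (PQ)": (10002, {"classification"}),
    "CODE oral classification": (7546, {"classification"}),
    "Oral cancer histopathology": (5127, {"classification"}),
    "Dental textbook figures": (3221, {"VQA", "caption"}),
    "Radiograph caries (ICDAS)": (1431, {"classification", "detection"}),
    "Dental samples": (1082, {"mixed"}),
    "SMART oral photos": (1071, {"classification"}),
    "Tufts Dental Database": (998, {"report generation"}),
    "DENTEX quadrant detection": (676, {"detection"}),
    "Dental radiology": (580, {"classification"}),
    "Oral cancer clinical photos (2)": (544, {"classification"}),
    "DENTEX disease classification": (407, {"classification"}),
    "Dental jaw captions": (144, {"captioning"}),
    "DENTEX enumeration": (50, {"enumeration"}),
    "Dental image dataset": (5, {"mixed"}),
}

total = sum(n for n, _ in SOURCES.values())
classification_only = sum(
    n for n, tasks in SOURCES.values() if tasks == {"classification"}
)

print(total)                                 # matches the stated 32,884 records
print(f"{classification_only / total:.1%}")  # share of classification-only records
```

Under that assumption roughly three quarters of the records come from classification-only sources, which is the imbalance the image-shuffle control in the evaluation note is designed to guard against.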
Text backbone data
The language backbone (PALL-Text) was trained on 30+ public datasets across CPT/SFT/DPO stages. See the PALL-Text model card for the complete dataset attribution list.
Intended use & limitations
- Intended: dental image understanding for education and clinical-decision support (VQA, description, classification cues).
- Out of scope: autonomous diagnosis; primary triage without clinician review; out-of-distribution / non-dental images.
- Limitations: wide panoramic radiographs are square-resized in v1 (no AnyRes tiling); performance on OOD clinical images is unverified; classification-heavy training may bias toward terse categorical answers.
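To make the square-resize limitation concrete, the sketch below computes how unevenly the two axes are scaled when a wide image is squeezed to the 384px square input. The 2000×1000 panoramic size is a made-up example, and `square_resize_scales` is an illustrative helper rather than part of the preprocessing code.

```python
def square_resize_scales(width: int, height: int, target: int = 384):
    """Per-axis scale factors when an image is resized to target x target."""
    scale_x = target / width
    scale_y = target / height
    # Ratio > 1 means one axis is compressed more than the other.
    distortion = max(scale_x, scale_y) / min(scale_x, scale_y)
    return scale_x, scale_y, distortion


# Hypothetical 2:1 panoramic radiograph.
sx, sy, distortion = square_resize_scales(2000, 1000)
print(f"x scale {sx:.3f}, y scale {sy:.3f}, distortion {distortion:.1f}x")
```

For this example the horizontal axis is compressed twice as much as the vertical one, so teeth appear narrower than they are; tiling schemes such as AnyRes avoid this by splitting the image into several square crops.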
⚕️ For research and clinical-decision-support only. Not for autonomous diagnosis or treatment.
Citation
@misc{rajendran2026pallvlm,
title = {PALL-VLM: A Low-Cost Dental Vision-Language Model via LLaVA-style
Grafting on a Dental Llama-3.1-8B},
author = {Rajendran, Harisundar},
year = {2026},
howpublished = {\url{https://huggingface.co/Harisundar/PALL-VLM}},
}
Foundational works
@inproceedings{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
booktitle={NeurIPS}, year={2023}
}
@inproceedings{zhai2023siglip,
title={Sigmoid Loss for Language Image Pre-Training},
author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
booktitle={ICCV}, year={2023}
}
@article{grattafiori2024llama3,
title={The Llama 3 Herd of Models},
author={Grattafiori, Aaron and others}, journal={arXiv:2407.21783}, year={2024}
}
@inproceedings{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan
and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle={ICLR}, year={2022}
}
Key dataset citations
@article{panetta2022tufts,
title={Tufts Dental Database: A Multimodal Panoramic X-Ray Dataset for Benchmarking Diagnostic Systems},
author={Panetta, Karen and Rajendran, Rahul and Ramesh, Aruna and Rao, Shishir and Agaian, Sos},
journal={IEEE Journal of Biomedical and Health Informatics},
volume={26}, number={4}, pages={1650--1659}, year={2022}, doi={10.1109/JBHI.2021.3117575}
}
@article{hamamci2023dentex,
title={DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays},
author={Hamamci, Ibrahim Ethem and Er, Sezgin and Simsar, Enis and Sekuboyina, Anjany
and Gundogar, Mustafa and Stadlinger, Bernd and Mehl, Albert and Menze, Bjoern},
journal={arXiv preprint arXiv:2305.19112}, year={2023}
}
@article{ismail2007icdas,
title={The International Caries Detection and Assessment System (ICDAS): an integrated system for measuring dental caries},
author={Ismail, Amid I. and Sohn, Woosung and Tellez, Marisol and Amaya, Ashley
and Sen, Ananda and Hasson, Hana and Pitts, Nigel B.},
journal={Community Dentistry and Oral Epidemiology}, volume={35}, number={3}, pages={170--178},
year={2007}, doi={10.1111/j.1600-0528.2007.00347.x}
}
See the text backbone — PALL-Text — for the full CPT→SFT→DPO recipe, text-domain results, and complete training data attribution.