Instructions for using Harisundar/PALL-VLM with libraries and local apps.

Libraries

How to use Harisundar/PALL-VLM with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Harisundar/PALL-VLM")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so the generated text is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Harisundar/PALL-VLM")
model = AutoModelForMultimodalLM.from_pretrained("Harisundar/PALL-VLM")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Harisundar/PALL-VLM with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Harisundar/PALL-VLM"
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Harisundar/PALL-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
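The same request can be built from Python. This is a minimal sketch using only the standard library, assuming the server from the command above is listening on localhost:8000; the helper name `build_chat_request` is illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt: str, image_url: str,
                       model: str = "Harisundar/PALL-VLM",
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(request.full_url)

# With the server running, send the request and read the reply:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Passing a different `base_url` (for example one ending in port 30000) should let the same helper target another OpenAI-compatible server.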

Use Docker

docker model run hf.co/Harisundar/PALL-VLM

SGLang

How to use Harisundar/PALL-VLM with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Harisundar/PALL-VLM" \
    --host 0.0.0.0 \
    --port 30000
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Harisundar/PALL-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Harisundar/PALL-VLM" \
        --host 0.0.0.0 \
        --port 30000
# In a second terminal, call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Harisundar/PALL-VLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Harisundar/PALL-VLM with Docker Model Runner:
```
docker model run hf.co/Harisundar/PALL-VLM
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

PALL-VLM — A Dental Vision-Language Model

PALL-VLM is a multimodal dental assistant that adds image understanding to the PALL-Text dental LLM. It follows a LLaVA-style recipe: a frozen SigLIP vision tower is grafted onto the dental Llama-3.1-8B backbone through a trainable MLP projector, then trained on dental images.

This repository hosts the final, fully-merged bf16 model (~8.5B parameters).

Developed by: Harisundar R
Architecture: LlavaForConditionalGeneration
Vision tower: google/siglip-so400m-patch14-384 (frozen)
Language backbone: Harisundar/PALL-Text (dental CPT+SFT+DPO Llama-3.1-8B)
Code: PALL on GitHub
VLM training data: Harisundar/PALL-VLM-data
License: Llama 3.1 Community License (SigLIP component is Apache-2.0)

Model description

PALL-VLM turns the text-only dental specialist into a vision-language model capable of interpreting dental imagery (clinical photos, histopathology, radiographs) alongside text.

Architecture

Vision tower: SigLIP-so400m-patch14-384, 384px input, 729 patch tokens/image (frozen).
Projector: 2-layer GELU MLP (LLaVA-1.5 style), maps vision features → LLM embedding space.
Language model: dental Llama-3.1-8B (PALL-Text), fine-tuned with LoRA (r=16).
<image> token index: 128256.
Total parameters: ≈ 8.5B (vision ~0.4B, projector ~10M, LLM 8B).
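The 729 patch tokens per image listed for the vision tower follow from its geometry. A minimal sketch of the arithmetic, assuming the tower cuts the 384px input into non-overlapping 14px patches and drops the leftover border pixels (the function name is illustrative, not part of any library):

```python
def patch_tokens(image_size: int = 384, patch_size: int = 14) -> int:
    """Number of patch tokens for a square input, assuming non-overlapping patches."""
    per_side = image_size // patch_size  # 384 // 14 = 27; the 6 leftover pixels are dropped
    return per_side * per_side


tokens_per_image = patch_tokens()
print(tokens_per_image)      # 729
print(3 * tokens_per_image)  # 2187 image tokens for a hypothetical 3-image prompt
```

Under that assumption, every image occupies 729 positions of the LLM context before any text is added, which is worth budgeting for in multi-image prompts.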

Two-stage training

Stage	Trainable	Data	Purpose
1 — Alignment	projector only (vision + LLM frozen)	single-image subset	bind vision features to the LLM embedding space
2 — Instruction tuning	LoRA on LLM + projector (vision frozen)	full set incl. multi-image	dental visual question answering & classification

Trained on a single L40S 48GB GPU. Stage-3 multimodal DPO is deferred (no multimodal preference data yet).
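The "Trainable" column of the stage table can be read as a rule over parameter names. The sketch below is one way to express it, not the project's training code; the name fragments (`vision_tower`, `multi_modal_projector`, `lora_`) are assumptions about how the modules and adapter weights are named.

```python
def is_trainable(param_name: str, stage: int) -> bool:
    """Decide whether a parameter receives gradients in the given stage.

    Name fragments are assumptions: "multi_modal_projector" for the MLP
    projector and "lora_" for adapter weights injected into the LLM.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 (alignment) or 2 (instruction tuning)")
    if "vision_tower" in param_name:
        return False  # vision tower stays frozen in both stages
    if "multi_modal_projector" in param_name:
        return True  # projector trains in both stages
    # Remaining parameters belong to the LLM: only LoRA adapters train, and only in stage 2.
    return stage == 2 and "lora_" in param_name


# Hypothetical parameter names, one per component.
example_names = [
    "vision_tower.encoder.layers.0.mlp.fc1.weight",
    "multi_modal_projector.linear_1.weight",
    "language_model.layers.0.self_attn.q_proj.weight",
    "language_model.layers.0.self_attn.q_proj.lora_A.default.weight",
]
for stage in (1, 2):
    print(stage, [is_trainable(name, stage) for name in example_names])
# 1 [False, True, False, False]
# 2 [False, True, False, True]
```

In a PyTorch training script, a rule like this would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()`.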

Evaluation note

Because the data is classification-heavy, evaluation includes an image-shuffle control: accuracy must drop when images are randomly permuted, guarding against modality collapse (the model ignoring the image).
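The image-shuffle control can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for PALL-VLM: `predict` stands in for any function that maps an image and a question to a label, and the toy data uses strings in place of real images.

```python
import random
from typing import Callable, Sequence


def accuracy(predict: Callable[[str, str], str],
             images: Sequence[str],
             questions: Sequence[str],
             labels: Sequence[str]) -> float:
    """Fraction of examples where the predicted label matches the reference."""
    correct = sum(predict(img, q) == y for img, q, y in zip(images, questions, labels))
    return correct / len(labels)


def image_shuffle_control(predict, images, questions, labels, seed: int = 0):
    """Return (accuracy with true pairing, accuracy with images randomly permuted)."""
    shuffled = list(images)
    random.Random(seed).shuffle(shuffled)  # break the image-question pairing
    return (accuracy(predict, images, questions, labels),
            accuracy(predict, shuffled, questions, labels))


# Toy data: each "image" is a string that encodes its own label.
images = [f"img-{i}" for i in range(50)]
questions = ["Which class is shown?"] * 50
labels = [f"class-{i}" for i in range(50)]


def looks_at_image(img: str, question: str) -> str:
    return "class-" + img.split("-")[1]


def ignores_image(img: str, question: str) -> str:
    return "class-0"  # always answers the same thing, whatever the image


print(image_shuffle_control(looks_at_image, images, questions, labels))
print(image_shuffle_control(ignores_image, images, questions, labels))
```

A predictor that uses the image scores perfectly on the true pairing and loses accuracy once images are permuted, while the image-blind predictor scores the same either way. That unchanged score is the signature of modality collapse the control is meant to expose.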

Usage

import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "Harisundar/PALL-VLM"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("dental_image.jpg").convert("RGB")
text = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat is shown? Give an ICDAS score if applicable."}],
    tokenize=False, add_generation_prompt=True,
)
batch = processor(images=[image], text=text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)
print(processor.tokenizer.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))

Training Data Sources & Acknowledgements

PALL-VLM is trained on 32,884 records / 52,461 images assembled from multiple publicly available dental image datasets. We gratefully acknowledge the creators:

Source	Records	Task(s)	Attribution
Oral cancer clinical photos (PQ)	10,002	classification	Kaggle oral cancer image dataset contributors
CODE oral classification	7,546	classification	CODE oral lesion classification dataset
Oral cancer histopathology	5,127	classification	Community histopathology datasets
Dental textbook figures	3,221	VQA, caption	Various textbook authors (see PALL-Text card)
Radiograph caries (ICDAS)	1,431	classification, detection	ICDAS Foundation; Ismail, A.I. et al. (2007). The International Caries Detection and Assessment System (ICDAS). Community Dentistry and Oral Epidemiology, 35(3), 170–178
Dental samples	1,082	mixed	Community dental image datasets
SMART oral photos	1,071	classification	SMART oral lesion dataset contributors
Tufts Dental Database	998	report generation	Panetta, K., Rajendran, R., Ramesh, A., Rao, S., & Agaian, S. (2022). Tufts Dental Database. IEEE J. Biomed. Health Inform., 26(4), 1650–1659
DENTEX — quadrant detection	676	detection	Hamamci, I.E. et al. (2023). DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays. arXiv:2305.19112
Dental radiology	580	classification	Community dental radiology datasets
Oral cancer clinical photos (2)	544	classification	Kaggle oral cancer datasets
DENTEX — disease classification	407	classification	Hamamci, I.E. et al. (2023) (same as above)
Dental jaw captions	144	captioning	Community dental datasets
DENTEX — enumeration	50	enumeration	Hamamci, I.E. et al. (2023) (same as above)
Dental image dataset	5	mixed	Community contribution

Text backbone data

The language backbone (PALL-Text) was trained on 30+ public datasets across CPT/SFT/DPO stages. See the PALL-Text model card for the complete dataset attribution list.

Intended use & limitations

Intended: dental image understanding for education and clinical-decision support (VQA, description, classification cues).
Out of scope: autonomous diagnosis; primary triage without clinician review; out-of-distribution / non-dental images.
Limitations: wide panoramic radiographs are square-resized in v1 (no AnyRes tiling); performance on OOD clinical images is unverified; classification-heavy training may bias toward terse categorical answers.
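To make the square-resize limitation concrete, the sketch below computes the per-axis scale factors when an image is resized straight to 384×384. The 3000×1500 radiograph size is a hypothetical example, not a figure from the training data, and the function name is illustrative.

```python
def square_resize_factors(width: int, height: int, target: int = 384):
    """Per-axis scale factors when an image is resized straight to target x target."""
    scale_x = target / width
    scale_y = target / height
    # Ratio > 1 means the horizontal axis is compressed more than the vertical one.
    anisotropy = scale_y / scale_x
    return scale_x, scale_y, anisotropy


# Hypothetical 2:1 panoramic radiograph, 3000 x 1500 pixels.
sx, sy, ratio = square_resize_factors(3000, 1500)
print(f"x scale {sx:.3f}, y scale {sy:.3f}, horizontal squeeze {ratio:.1f}x")
# x scale 0.128, y scale 0.256, horizontal squeeze 2.0x
```

For such an image, horizontal detail is compressed twice as hard as vertical detail, which may matter for narrow structures such as interproximal regions.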

⚕️ For research and clinical-decision-support only. Not for autonomous diagnosis or treatment.

Citation

@misc{rajendran2026pallvlm,
  title        = {PALL-VLM: A Low-Cost Dental Vision-Language Model via LLaVA-style
                  Grafting on a Dental Llama-3.1-8B},
  author       = {Rajendran, Harisundar},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Harisundar/PALL-VLM}},
}

Foundational works

@inproceedings{liu2023llava,
  title={Visual Instruction Tuning},
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  booktitle={NeurIPS}, year={2023}
}
@inproceedings{zhai2023siglip,
  title={Sigmoid Loss for Language Image Pre-Training},
  author={Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  booktitle={ICCV}, year={2023}
}
@article{grattafiori2024llama3,
  title={The Llama 3 Herd of Models},
  author={Grattafiori, Aaron and others}, journal={arXiv:2407.21783}, year={2024}
}
@inproceedings{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan
          and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  booktitle={ICLR}, year={2022}
}

Key dataset citations

@article{panetta2022tufts,
  title={Tufts Dental Database: A Multimodal Panoramic X-Ray Dataset for Benchmarking Diagnostic Systems},
  author={Panetta, Karen and Rajendran, Rahul and Ramesh, Aruna and Rao, Shishir and Agaian, Sos},
  journal={IEEE Journal of Biomedical and Health Informatics},
  volume={26}, number={4}, pages={1650--1659}, year={2022}, doi={10.1109/JBHI.2021.3117575}
}
@article{hamamci2023dentex,
  title={DENTEX: An Abnormal Tooth Detection with Dental Enumeration and Diagnosis Benchmark for Panoramic X-rays},
  author={Hamamci, Ibrahim Ethem and Er, Sezgin and Simsar, Enis and Sekuboyina, Anjany
          and Gundogar, Mustafa and Stadlinger, Bernd and Mehl, Albert and Menze, Bjoern},
  journal={arXiv preprint arXiv:2305.19112}, year={2023}
}
@article{ismail2007icdas,
  title={The International Caries Detection and Assessment System (ICDAS): an integrated system for measuring dental caries},
  author={Ismail, Amid I. and Sohn, Woosung and Tellez, Marisol and Amaya, Ashley
          and Sen, Ananda and Hasson, Hana and Pitts, Nigel B.},
  journal={Community Dentistry and Oral Epidemiology}, volume={35}, number={3}, pages={170--178},
  year={2007}, doi={10.1111/j.1600-0528.2007.00347.x}
}

See the text backbone — PALL-Text — for the full CPT→SFT→DPO recipe, text-domain results, and complete training data attribution.