Instructions to use joyfox/JoyFox-PawScope-VL with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use joyfox/JoyFox-PawScope-VL with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

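# The model card lists trust_remote_code=True as required for this checkpoint.
pipe = pipeline("image-text-to-text", model="joyfox/JoyFox-PawScope-VL", trust_remote_code=True)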
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

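# The model card lists trust_remote_code=True as required for this checkpoint.
processor = AutoProcessor.from_pretrained("joyfox/JoyFox-PawScope-VL", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("joyfox/JoyFox-PawScope-VL", trust_remote_code=True)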
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use joyfox/JoyFox-PawScope-VL with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "joyfox/JoyFox-PawScope-VL"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joyfox/JoyFox-PawScope-VL",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
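
The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM, and the sketch assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-compatible response shape.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the server follows the OpenAI chat-completions response layout.
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "joyfox/JoyFox-PawScope-VL",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running:
# print(chat("http://localhost:8000", payload))
```

The request body is built separately from the network call so it can be inspected or logged before anything is sent.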

Use Docker

docker model run hf.co/joyfox/JoyFox-PawScope-VL

SGLang

How to use joyfox/JoyFox-PawScope-VL with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "joyfox/JoyFox-PawScope-VL" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "joyfox/JoyFox-PawScope-VL",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "joyfox/JoyFox-PawScope-VL" \
        --host 0.0.0.0 \
        --port 30000

JoyFox-PawScope-VL

JoyFox-PawScope-VL is a domain-adapted visual language model for cat and dog breed understanding. It is built on top of openbmb/MiniCPM-V-4_6 and tuned to describe visible pet traits before giving a natural-language breed judgement and supporting reasons.

Given a pet image and an instruction, the model produces a Chinese response covering:

visible appearance traits such as coat color, coat length, face shape, ears, eyes, muzzle, body proportion, and posture,
age-stage cues such as adult cat/dog, kitten, or puppy when visually inferable,
the most likely cat or dog breed, with a concise reason grounded in the image.

This model is designed for pet image demos, breed-recognition assistants, data annotation workflows, and pet-care product prototypes. It is not a veterinary diagnostic system and should not be used as the sole source of truth for breed certification.

Qualitative Examples

The following examples illustrate the preferred response style: the model first describes visible pet traits, then gives a breed judgement and a short visual rationale. These examples are intended to demonstrate output format and qualitative behavior, not to serve as a benchmark.

Pug

Image: dog_pug_08238.png

Model output

这是一只面部特征非常典型的成年犬，拥有标志性的短平口鼻部和深色的面部褶皱。它的毛色呈现出浅黄褐色，眼睛大而圆，眼神专注，耳朵较小且位置较低，整体面部结构紧凑，展现出成年犬的成熟特征。

判断结果：这只狗更可能是巴哥犬。

理由：该犬具备巴哥犬最核心的品种特征，包括极短的吻部、面部深色的褶皱、大而圆的眼睛以及标志性的浅黄褐色短毛。这些外观特征与巴哥犬的品种标准高度吻合，且其面部比例和成熟感表明这是一只成年犬。

English translation

This is an adult dog with very typical facial features: a signature short, flat muzzle and dark facial wrinkles. Its coat is a light fawn color, its eyes are large, round, and attentive, its ears are small and set low, and its overall facial structure is compact, showing the mature features of an adult dog.

Judgement: this dog is more likely a Pug.

Reason: the dog shows the core breed traits of a Pug, including a very short muzzle, dark facial wrinkles, large round eyes, and the signature short light-fawn coat. These traits closely match the Pug breed standard, and its facial proportions and mature look indicate an adult dog.

Border Collie

Image: dog_border_collie_07425.png

Model output

这只狗拥有典型的黑白双色长毛，面部中央有一条明显的白色条纹延伸至鼻梁，眼睛呈深褐色，眼神专注。它的耳朵直立且尖端带有轻微的折叠，口鼻部宽阔，嘴巴微张露出牙齿，整体体型比例匀称，从其面部轮廓的成熟度和毛发质感来看，这是一只成年犬。

判断结果：这只狗更可能是边境牧羊犬。

理由：该犬的毛色分布、直立的耳型以及面部特征与边境牧羊犬的品种标准高度吻合，尤其是其黑白相间的毛色模式和警觉的神态，是该品种常见的典型外观表现。

English translation

This dog has a typical long black-and-white coat, with a distinct white stripe down the center of its face extending to the bridge of the nose; its eyes are dark brown and attentive. Its ears are erect with slightly folded tips, its muzzle is broad, its mouth is slightly open and shows its teeth, and its overall build is well proportioned. Judging by the maturity of its facial contours and its coat texture, this is an adult dog.

Judgement: this dog is more likely a Border Collie.

Reason: the dog's coat color distribution, erect ears, and facial features closely match the Border Collie breed standard; in particular, its black-and-white coat pattern and alert expression are typical of the breed.

Sphynx Cat

Image: cat_sphynx_01647.png

Model output

这只猫全身几乎没有毛发，皮肤呈现出明显的褶皱，头部轮廓呈现出斯芬克斯猫特有的楔形，耳朵大而直立，眼睛呈杏仁状，整体体型和面部特征显示其处于成年阶段，正趴在绿色的毯子上休息。

判断结果：这只猫更可能是斯芬克斯猫。

理由：该猫最显著的特征是几乎完全无毛的皮肤，这是斯芬克斯猫的典型品种特征。此外，其楔形的头部、大而直立的耳朵以及独特的面部褶皱，都与斯芬克斯猫的品种标准高度吻合，这些特征共同支持了这一判断。

English translation

This cat has almost no fur anywhere on its body, and its skin shows obvious wrinkles. Its head has the wedge shape characteristic of the Sphynx, its ears are large and erect, and its eyes are almond-shaped. Its overall build and facial features indicate that it is an adult, and it is resting on a green blanket.

Judgement: this cat is more likely a Sphynx.

Reason: the cat's most notable trait is its almost completely hairless skin, a typical breed trait of the Sphynx. In addition, its wedge-shaped head, large erect ears, and distinctive facial wrinkles closely match the Sphynx breed standard; together these traits support the judgement.

Highlights

Pet-focused visual intelligence: optimized for cat and dog image understanding rather than generic image captioning.
Natural judgement format: describes visible traits first, then outputs 判断结果 (judgement) and 理由 (reason) in a stable Chinese style.
Fine-grained breed grounding: tuned with breed labels and generated natural-language rationales for detailed pet breed recognition.
Age-stage awareness: prompts and data include puppy/kitten/adult cues when they are visually inferable.
MiniCPM-V compatible: released as a Hugging Face Transformers checkpoint with MiniCPM-V-4.6 remote code.
Practical inference script: the included script infer_pet_vision_hf.py loads the model and runs direct image inference.

Model Details

| Item | Description |
| --- | --- |
| Model name | JoyFox-PawScope-VL |
| Base model | `openbmb/MiniCPM-V-4_6` |
| Model family | MiniCPM-V multimodal model |
| Released format | Hugging Face Transformers checkpoint, Safetensors |
| Primary modality | Image + text instruction |
| Main task | Cat and dog breed image understanding |
| Primary output language | Chinese |
| Model dtype | bf16 |
| Recommended image detail mode | `downsample_mode="4x"`, `max_slice_nums=36` |
| Remote code | Required: `trust_remote_code=True` |

Intended Use

JoyFox-PawScope-VL is intended for applications such as:

cat and dog breed-recognition demos,
pet-care assistants that need image-aware breed explanations,
pet image dataset annotation and review workflows,
structured labeling of cat/dog image collections,
educational tools for comparing common pet breed traits.

The model should be used as an assistive interpretation layer. It can summarize likely visual cues and suggest a likely breed, but it should not replace pedigree documents, professional breed assessment, veterinary care, or direct owner knowledge.

Model Capabilities

JoyFox-PawScope-VL is specialized for structured pet image understanding. It is designed to identify whether the image contains a cat or dog, describe the most relevant visual traits, infer the likely breed, and explain the judgement in a concise Chinese response.

| Capability | Description |
| --- | --- |
| Species focus | Cat and dog images |
| Breed reasoning | Produces a likely breed judgement from visible traits |
| Visual explanation | Mentions coat, face shape, ears, eyes, body proportion, posture, and other cues when visible |
| Age-stage cueing | Can mention adult, kitten, or puppy when visually inferable |
| Output style | Chinese visual description, breed judgement, and rationale |

Model Design

JoyFox-PawScope-VL is designed to produce a concise but interpretable visual judgement rather than a bare class label. Its response style emphasizes observable features first, then the likely breed, followed by a short rationale. This makes the model suitable for product demos, annotation review, and pet-care workflows where users need to understand why a breed was suggested.
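
For annotation or review workflows, the three-part response (description, 判断结果, 理由) can be split into fields. The following is a minimal sketch, assuming the model keeps the documented labels; `BreedResponse` and `parse_breed_response` are hypothetical names that are not shipped with the model, and the sample string is a shortened illustration in the documented format rather than real model output.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreedResponse:
    description: str  # free-form visual description before the labels
    judgement: str    # text after "判断结果"
    reason: str       # text after "理由"


# Accept both the full-width "：" and the ASCII ":" after each label.
_PATTERN = re.compile(
    r"^(?P<description>.*?)\s*判断结果[：:]\s*(?P<judgement>.*?)\s*理由[：:]\s*(?P<reason>.*)$",
    re.DOTALL,
)


def parse_breed_response(text: str) -> Optional[BreedResponse]:
    """Return the three parts, or None when the labels are missing."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return BreedResponse(
        description=match.group("description").strip(),
        judgement=match.group("judgement").strip(),
        reason=match.group("reason").strip(),
    )


# Shortened illustration in the documented format (not real model output).
sample = (
    "这只猫全身几乎没有毛发。\n\n"
    "判断结果：这只猫更可能是斯芬克斯猫。\n\n"
    "理由：几乎完全无毛的皮肤是斯芬克斯猫的典型特征。"
)
parsed = parse_breed_response(sample)
```

Returning `None` instead of raising lets a review pipeline route off-format answers to manual inspection.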

Quick Start with Transformers

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "joyfox/JoyFox-PawScope-VL"
image_path = "your_pet_image.jpg"

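# The prompt stays in Chinese because Chinese is the model's primary output language.
# English gloss: "Look at the pet in the picture. First describe its visible
# appearance naturally, then judge its most likely specific breed and give reasons.
# Answer format: From the picture, ... / Judgement: this cat/dog is more likely ... / Reason: ..."
prompt = """请观察图片中的宠物，先自然说明可见外观特征，再判断它最可能的具体品种，并给出理由。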

回答格式：
从图片看，...

判断结果：这只猫/狗更可能是...。

理由：..."""

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    downsample_mode="4x",
    max_slice_nums=36,
    enable_thinking=False,
)
inputs = inputs.to(next(model.parameters()).device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        downsample_mode="4x",
        max_new_tokens=512,
        do_sample=False,
    )

output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
answer = processor.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(answer.strip())

Using `infer_pet_vision_hf.py`

The included inference script is designed for direct file-based testing. Edit the configuration block at the top of infer_pet_vision_hf.py:

MODEL_PATH = str(MODEL_DIR)
IMAGE_PATH = str(MODEL_DIR / "assets" / "dog_pug_08238.png")
MAX_NEW_TOKENS = 512
DOWNSAMPLE_MODE = "4x"
MAX_SLICE_NUMS = 36
DO_SAMPLE = False

Then run:

python infer_pet_vision_hf.py

The script performs the following steps:

resolves MODEL_PATH and validates the input image path,
converts file-based images to a standard temporary JPEG for robust decoding,
loads the model with AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True),
builds a MiniCPM-V user message with image + prompt,
runs model.generate(...) under torch.inference_mode(),
trims the prompt tokens and prints the decoded answer,
removes empty <think></think> prefixes from display output when present.
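
Three of the steps above (path validation, prompt-token trimming, and removal of an empty `<think></think>` prefix) can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `infer_pet_vision_hf.py`: the function names and the accepted suffix list are assumptions, and the JPEG conversion step is omitted because it needs an imaging library.

```python
import re
from pathlib import Path
from typing import List, Sequence

# Assumption: suffixes accepted here are illustrative, not taken from the script.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def validate_image_path(path: str) -> Path:
    """Resolve the path and fail early if it is not a readable image file."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Image not found: {resolved}")
    if resolved.suffix.lower() not in IMAGE_SUFFIXES:
        raise ValueError(f"Unsupported image suffix: {resolved.suffix}")
    return resolved


def trim_prompt_tokens(input_ids: Sequence[int], output_ids: Sequence[int]) -> List[int]:
    """Drop the echoed prompt so only newly generated tokens remain."""
    return list(output_ids[len(input_ids):])


# Matches only a leading think block that contains nothing but whitespace.
_EMPTY_THINK = re.compile(r"^\s*<think>\s*</think>\s*")


def strip_empty_think(text: str) -> str:
    """Remove a leading empty <think></think> pair; keep non-empty ones."""
    return _EMPTY_THINK.sub("", text, count=1).strip()
```

The trimming helper mirrors the `out[len(inp):]` slice used in the Quick Start code, applied to a single sequence.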

Notes on Warnings

You may see warnings about the MiniCPM-V fast path not being available, or messages such as `Kwargs passed to processor.__call__ have to be in processor_kwargs dict`. These warnings generally do not block image inference.

For detailed breed recognition, `downsample_mode="4x"` and `max_slice_nums=36` preserve more visual detail. For faster but less detailed inference, use `downsample_mode="16x"`.
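
The two settings can be kept in one place and switched with a flag. The following is a small sketch; `image_detail_kwargs` is a hypothetical helper, and leaving `max_slice_nums` unset in the fast mode is an assumption, since the card only specifies `downsample_mode="16x"` for that case.

```python
from typing import Any, Dict


def image_detail_kwargs(detailed: bool = True) -> Dict[str, Any]:
    """Return processor keyword arguments for detailed or fast inference."""
    if detailed:
        # Values recommended in the model card for fine-grained breed recognition.
        return {"downsample_mode": "4x", "max_slice_nums": 36}
    # Assumption: max_slice_nums is left at the processor default in fast mode.
    return {"downsample_mode": "16x"}


detail_kwargs = image_detail_kwargs(detailed=True)
fast_kwargs = image_detail_kwargs(detailed=False)
```

In the Quick Start code, the returned dictionary would be unpacked into `processor.apply_chat_template(...)`, with the same `downsample_mode` value passed to `model.generate(...)`.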

Limitations

The model is focused on cat and dog images. Other animals, mixed scenes, toys, drawings, or non-pet images may produce unreliable outputs.
Breed identification from a single image is inherently uncertain, especially for mixed-breed pets and visually similar breeds.
The model may overstate confidence when the image lacks clear breed-specific features.
Lighting, occlusion, grooming style, age, camera angle, and partial body visibility can reduce reliability.
The model primarily follows a Chinese response format. English output may require a separate prompt and has not been the main tuning target.
The model should not be used for veterinary diagnosis, legal breed certification, insurance decisions, shelter intake decisions, or safety-critical judgement.

License

This model follows the license terms of the base model and the released model metadata. Please also review the license and usage terms of openbmb/MiniCPM-V-4_6 before redistribution or commercial use.

Acknowledgements

JoyFox-PawScope-VL is built on OpenBMB's MiniCPM-V-4.6 multimodal model. The model card structure and usage style are aligned with the JoyFox MiniCPM-PawSense-Audio release.

Model size: 1B params (Safetensors)
Tensor type: BF16
