JoyFox-PawScope-VL
JoyFox-PawScope-VL is a domain-adapted visual language model for cat and dog breed understanding. It is built on top of openbmb/MiniCPM-V-4_6 and tuned to describe visible pet traits before giving a natural-language breed judgement and supporting reasons.
Given a pet image and an instruction, the model produces a Chinese response covering:
- visible appearance traits such as coat color, coat length, face shape, ears, eyes, muzzle, body proportion, and posture,
- age-stage cues such as adult cat/dog, kitten, or puppy when visually inferable,
- the most likely cat or dog breed, with a concise reason grounded in the image.
This model is designed for pet image demos, breed-recognition assistants, data annotation workflows, and pet-care product prototypes. It is not a veterinary diagnostic system and should not be used as the sole source of truth for breed certification.
Qualitative Examples
The following examples illustrate the preferred response style: the model first describes visible pet traits, then gives a breed judgement and a short visual rationale. These examples are intended to demonstrate output format and qualitative behavior, not to serve as a benchmark.
Pug
Image: dog_pug_08238.png
Model output
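这是一只面部特征非常典型的成年犬,拥有标志性的短平口鼻部和深色的面部褶皱。它的毛色呈现出浅黄褐色,眼睛大而圆,眼神专注,耳朵较小且位置较低,整体面部结构紧凑,展现出成年犬的成熟特征。
判断结果:这只狗更可能是巴哥犬。
理由:该犬具备巴哥犬最核心的品种特征,包括极短的吻部、面部深色的褶皱、大而圆的眼睛以及标志性的浅黄褐色短毛。这些外观特征与巴哥犬的品种标准高度吻合,且其面部比例和成熟感表明这是一只成年犬。
English translation:
This is an adult dog with very typical facial features, including the signature short, flat muzzle and dark facial wrinkles. Its coat is a light fawn color, its eyes are large and round with a focused gaze, its ears are small and set low, and its overall facial structure is compact, showing the mature traits of an adult dog.
Judgement: this dog is more likely a Pug.
Reason: the dog shows the core breed traits of a Pug, including a very short muzzle, dark facial wrinkles, large round eyes, and the characteristic light fawn short coat. These traits closely match the Pug breed standard, and its facial proportions and mature look indicate an adult dog.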
Border Collie
Image: dog_border_collie_07425.png
Model output
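这只狗拥有典型的黑白双色长毛,面部中央有一条明显的白色条纹延伸至鼻梁,眼睛呈深褐色,眼神专注。它的耳朵直立且尖端带有轻微的折叠,口鼻部宽阔,嘴巴微张露出牙齿,整体体型比例匀称,从其面部轮廓的成熟度和毛发质感来看,这是一只成年犬。
判断结果:这只狗更可能是边境牧羊犬。
理由:该犬的毛色分布、直立的耳型以及面部特征与边境牧羊犬的品种标准高度吻合,尤其是其黑白相间的毛色模式和警觉的神态,是该品种常见的典型外观表现。
English translation:
This dog has a typical black-and-white long coat, with a distinct white stripe down the center of the face extending to the bridge of the nose, and dark brown eyes with a focused gaze. Its ears are erect with slightly folded tips, the muzzle is broad, the mouth is slightly open showing teeth, and the overall body proportions are well balanced. Judging from the maturity of its facial contours and the texture of its coat, this is an adult dog.
Judgement: this dog is more likely a Border Collie.
Reason: the dog's coat color distribution, erect ear shape, and facial features closely match the Border Collie breed standard. In particular, its black-and-white coat pattern and alert expression are typical of the breed.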
Sphynx Cat
Image: cat_sphynx_01647.png
Model output
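这只猫全身几乎没有毛发,皮肤呈现出明显的褶皱,头部轮廓呈现出斯芬克斯猫特有的楔形,耳朵大而直立,眼睛呈杏仁状,整体体型和面部特征显示其处于成年阶段,正趴在绿色的毯子上休息。
判断结果:这只猫更可能是斯芬克斯猫。
理由:该猫最显著的特征是几乎完全无毛的皮肤,这是斯芬克斯猫的典型品种特征。此外,其楔形的头部、大而直立的耳朵以及独特的面部褶皱,都与斯芬克斯猫的品种标准高度吻合,这些特征共同支持了这一判断。
English translation:
This cat has almost no hair on its body, and its skin shows pronounced wrinkles. The head outline has the wedge shape characteristic of the Sphynx, the ears are large and erect, and the eyes are almond-shaped. Its overall build and facial features indicate that it is an adult, and it is resting on a green blanket.
Judgement: this cat is more likely a Sphynx.
Reason: the cat's most striking feature is its almost completely hairless skin, a typical breed trait of the Sphynx. In addition, its wedge-shaped head, large erect ears, and distinctive facial wrinkles all closely match the Sphynx breed standard. Together, these traits support the judgement.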
Highlights
- Pet-focused visual intelligence: optimized for cat and dog image understanding rather than generic image captioning.
- Natural judgement format: describes visible traits first, then outputs `判断结果` (judgement) and `理由` (reason) in a stable Chinese style.
- Fine-grained breed grounding: tuned with breed labels and generated natural-language rationales for detailed pet breed recognition.
- Age-stage awareness: prompts and data include puppy/kitten/adult cues when they are visually inferable.
- MiniCPM-V compatible: released as a Hugging Face Transformers checkpoint with MiniCPM-V-4.6 remote code.
- Practical inference script: the included script `infer_pet_vision_hf.py` loads the model and runs direct image inference.
Model Details
| Item | Description |
|---|---|
| Model name | JoyFox-PawScope-VL |
| Base model | openbmb/MiniCPM-V-4_6 |
| Model family | MiniCPM-V multimodal model |
| Released format | Hugging Face Transformers checkpoint, Safetensors |
| Primary modality | Image + text instruction |
| Main task | Cat and dog breed image understanding |
| Primary output language | Chinese |
| Model dtype | bf16 |
| Recommended image detail mode | `downsample_mode="4x"`, `max_slice_nums=36` |
| Remote code | Required: `trust_remote_code=True` |
Intended Use
JoyFox-PawScope-VL is intended for applications such as:
- cat and dog breed-recognition demos,
- pet-care assistants that need image-aware breed explanations,
- pet image dataset annotation and review workflows,
- structured labeling of cat/dog image collections,
- educational tools for comparing common pet breed traits.
The model should be used as an assistive interpretation layer. It can summarize likely visual cues and suggest a likely breed, but it should not replace pedigree documents, professional breed assessment, veterinary care, or direct owner knowledge.
Model Capabilities
JoyFox-PawScope-VL is specialized for structured pet image understanding. It is designed to identify whether the image contains a cat or dog, describe the most relevant visual traits, infer the likely breed, and explain the judgement in a concise Chinese response.
| Capability | Description |
|---|---|
| Species focus | Cat and dog images |
| Breed reasoning | Produces a likely breed judgement from visible traits |
| Visual explanation | Mentions coat, face shape, ears, eyes, body proportion, posture, and other cues when visible |
| Age-stage cueing | Can mention adult, kitten, or puppy when visually inferable |
| Output style | Chinese visual description, breed judgement, and rationale |
Model Design
JoyFox-PawScope-VL is designed to produce a concise but interpretable visual judgement rather than a bare class label. Its response style emphasizes observable features first, then the likely breed, followed by a short rationale. This makes the model suitable for product demos, annotation review, and pet-care workflows where users need to understand why a breed was suggested.
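Because the response follows a fixed layout (trait description, then `判断结果`, then `理由`), downstream code can split it into fields for review or annotation. The following is a minimal sketch and not part of the released code: `BreedAnswer` and `parse_breed_answer` are hypothetical names, and the parser assumes each marker starts its own line, as in the qualitative examples.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BreedAnswer:
    """Hypothetical container for one parsed model response."""
    description: str          # trait description that precedes the judgement
    judgement: Optional[str]  # text after the "判断结果" marker, if present
    reason: Optional[str]     # text after the "理由" marker, if present


# Accept both the full-width colon used in the examples and an ASCII colon.
_MARKERS = (
    ("judgement", re.compile(r"判断结果\s*[::]\s*(.*)")),
    ("reason", re.compile(r"理由\s*[::]\s*(.*)")),
)


def parse_breed_answer(text: str) -> BreedAnswer:
    """Split a response into description, judgement, and reason."""
    sections = {"description": [], "judgement": [], "reason": []}
    current = "description"
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # A marker at the start of a line switches the active section.
        for name, pattern in _MARKERS:
            match = pattern.match(line)
            if match:
                current = name
                line = match.group(1).strip()
                break
        if line:
            sections[current].append(line)
    return BreedAnswer(
        description="\n".join(sections["description"]),
        judgement="\n".join(sections["judgement"]) or None,
        reason="\n".join(sections["reason"]) or None,
    )
```

If a marker is missing, the corresponding field is `None`, which gives a review workflow a simple way to flag responses that drifted from the expected format.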
Quick Start with Transformers
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "joyfox/JoyFox-PawScope-VL"
image_path = "your_pet_image.jpg"

# Chinese instruction: describe the visible traits first, then name the most
# likely breed and give the reason, using the response format listed below.
prompt = """请观察图片中的宠物,先自然说明可见外观特征,再判断它最可能的具体品种,并给出理由。
回答格式:
从图片看,...
判断结果:这只猫/狗更可能是...。
理由:..."""

# The checkpoint relies on MiniCPM-V remote code, so trust_remote_code=True is required.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

# Tokenize the conversation with the recommended high-detail image settings.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    downsample_mode="4x",
    max_slice_nums=36,
    enable_thinking=False,
)
inputs = inputs.to(next(model.parameters()).device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        downsample_mode="4x",
        max_new_tokens=512,
        do_sample=False,
    )

# Keep only the newly generated tokens, then decode them.
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
answer = processor.batch_decode(output_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(answer.strip())
```
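For annotation workflows that process a folder of images, the message layout from the Quick Start code can be wrapped in small helpers. This is a sketch under stated assumptions: `build_messages` and `iter_image_paths` are hypothetical names that do not appear in the released files, and the accepted file suffixes are an illustrative choice.

```python
from pathlib import Path


def build_messages(image_path, prompt):
    """Return a single-turn conversation in the Quick Start message layout."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"image not found: {path}")
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": str(path)},
            {"type": "text", "text": prompt},
        ],
    }]


def iter_image_paths(folder, suffixes=(".jpg", ".jpeg", ".png")):
    """Yield image files in a folder in sorted order (suffix list is an assumption)."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path
```

Each result of `build_messages(path, prompt)` can be passed to `processor.apply_chat_template(...)` with the same arguments as in the Quick Start code, one image at a time.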
Using infer_pet_vision_hf.py
The included inference script is designed for direct file-based testing. Edit the configuration block at the top of infer_pet_vision_hf.py:
```python
MODEL_PATH = str(MODEL_DIR)
IMAGE_PATH = str(MODEL_DIR / "assets" / "dog_pug_08238.png")
MAX_NEW_TOKENS = 512
DOWNSAMPLE_MODE = "4x"
MAX_SLICE_NUMS = 36
DO_SAMPLE = False
```
Then run:
```shell
python infer_pet_vision_hf.py
```
The script performs the following steps:
- resolves `MODEL_PATH` and validates the input image path,
- converts file-based images to a standard temporary JPEG for robust decoding,
- loads the model with `AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True)`,
- builds a MiniCPM-V user message with image + prompt,
- runs `model.generate(...)` under `torch.inference_mode()`,
- trims the prompt tokens and prints the decoded answer,
- removes empty `<think></think>` prefixes from display output when present.
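The last two steps (trimming prompt tokens and removing an empty `<think></think>` prefix) can be illustrated without loading the model. The sketch below is not taken from `infer_pet_vision_hf.py`: the function names are hypothetical, token ids are shown as plain lists, and the regular expression is one possible way to match an empty reasoning block.

```python
import re

# Matches an empty <think></think> block (whitespace only) at the start of the text.
_EMPTY_THINK_PREFIX = re.compile(r"^\s*<think>\s*</think>\s*")


def trim_prompt_tokens(input_ids, output_ids):
    """Drop the leading prompt tokens from each generated sequence.

    Assumes each output sequence starts with its prompt tokens, as in the
    Quick Start code.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


def strip_empty_think(text):
    """Remove an empty <think></think> prefix; leave all other text unchanged."""
    return _EMPTY_THINK_PREFIX.sub("", text, count=1).strip()
```

A non-empty reasoning block is deliberately left in place, so only the display-noise case described above is affected.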
Notes on Warnings
You may see warnings about the MiniCPM-V fast path not being available, or messages such as `Kwargs passed to processor.__call__ have to be in processor_kwargs dict`. These warnings generally do not block image inference.
For detailed breed recognition, `downsample_mode="4x"` and `max_slice_nums=36` preserve more visual detail. For faster but less detailed inference, use `downsample_mode="16x"`.
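One way to switch between the two modes is a small helper that returns the keyword arguments for `apply_chat_template` and `generate`. This is an illustrative sketch: `detail_settings` is a hypothetical name, and leaving `max_slice_nums` unset in fast mode is an assumption, since the card only specifies `downsample_mode="16x"` for that case.

```python
def detail_settings(fast=False):
    """Return (chat_template_kwargs, generate_kwargs) for the chosen mode."""
    if fast:
        # Faster, less detailed; max_slice_nums is left at the processor default (assumption).
        template_kwargs = {"downsample_mode": "16x"}
    else:
        # Recommended settings for detailed breed recognition.
        template_kwargs = {"downsample_mode": "4x", "max_slice_nums": 36}
    # The Quick Start code passes downsample_mode to generate() as well.
    generate_kwargs = {"downsample_mode": template_kwargs["downsample_mode"]}
    return template_kwargs, generate_kwargs
```

The two dictionaries can then be unpacked into the existing calls, for example `processor.apply_chat_template(messages, ..., **template_kwargs)` and `model.generate(**inputs, **generate_kwargs, max_new_tokens=512, do_sample=False)`.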
Limitations
- The model is focused on cat and dog images. Other animals, mixed scenes, toys, drawings, or non-pet images may produce unreliable outputs.
- Breed identification from a single image is inherently uncertain, especially for mixed-breed pets and visually similar breeds.
- The model may overstate confidence when the image lacks clear breed-specific features.
- Lighting, occlusion, grooming style, age, camera angle, and partial body visibility can reduce reliability.
- The model primarily follows a Chinese response format. English output may require a separate prompt and has not been the main tuning target.
- The model should not be used for veterinary diagnosis, legal breed certification, insurance decisions, shelter intake decisions, or safety-critical judgement.
License
This model follows the license terms of the base model and the released model metadata. Please also review the license and usage terms of openbmb/MiniCPM-V-4_6 before redistribution or commercial use.
Acknowledgements
JoyFox-PawScope-VL is built on OpenBMB's MiniCPM-V-4.6 multimodal model. The model card structure and usage style are aligned with the JoyFox MiniCPM-PawSense-Audio release.


