JoyFox-PawScope-VL

JoyFox-PawScope-VL is a domain-adapted vision-language model for cat and dog breed understanding. It is built on openbmb/MiniCPM-V-4_6 and tuned to describe visible pet traits first, then give a natural-language breed judgement with supporting reasons.

Given a pet image and an instruction, the model produces a Chinese response covering:

  • visible appearance traits such as coat color, coat length, face shape, ears, eyes, muzzle, body proportion, and posture,
  • age-stage cues such as adult cat/dog, kitten, or puppy when visually inferable,
  • the most likely cat or dog breed, with a concise reason grounded in the image.

This model is designed for pet image demos, breed-recognition assistants, data annotation workflows, and pet-care product prototypes. It is not a veterinary diagnostic system and should not be used as the sole source of truth for breed certification.

Qualitative Examples

The following examples illustrate the preferred response style: the model first describes visible pet traits, then gives a breed judgement and a short visual rationale. These examples are intended to demonstrate output format and qualitative behavior, not to serve as a benchmark.

Pug

Image: dog_pug_08238.png

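Model output (translated from Chinese)

This is an adult dog with very typical facial features: the signature short, flat muzzle and dark facial wrinkles. Its coat is a light fawn color; its eyes are large, round, and attentive; its ears are small and set low. The overall facial structure is compact and shows the mature features of an adult dog.

Judgement (判断结果): This dog is more likely a Pug.

Reason (理由): The dog shows the core breed traits of a Pug, including a very short muzzle, dark facial wrinkles, large round eyes, and the signature light fawn short coat. These traits closely match the Pug breed standard, and its facial proportions and mature look indicate that it is an adult.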

Border Collie

Image: dog_border_collie_07425.png

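Model output (translated from Chinese)

This dog has a typical black-and-white bicolor long coat, with a distinct white stripe down the center of the face extending to the bridge of the nose. Its eyes are dark brown and attentive. Its ears are erect with slightly folded tips, its muzzle is broad, and its mouth is slightly open, showing its teeth. Its overall body proportions are balanced, and the maturity of its facial outline and its coat texture indicate an adult dog.

Judgement (判断结果): This dog is more likely a Border Collie.

Reason (理由): The dog's coat color distribution, erect ears, and facial features closely match the Border Collie breed standard. In particular, the black-and-white coat pattern and alert expression are typical of the breed.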

Sphynx Cat

Image: cat_sphynx_01647.png

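Model output (translated from Chinese)

This cat has almost no fur over its whole body, and its skin shows obvious wrinkles. Its head has the wedge shape characteristic of the Sphynx, with large, erect ears and almond-shaped eyes. Its overall build and facial features indicate that it is an adult, and it is lying on a green blanket, resting.

Judgement (判断结果): This cat is more likely a Sphynx.

Reason (理由): The cat's most notable feature is its almost completely hairless skin, a typical trait of the Sphynx breed. In addition, its wedge-shaped head, large erect ears, and distinctive facial wrinkles closely match the Sphynx breed standard. Together these features support the judgement.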

Highlights

  • Pet-focused visual intelligence: optimized for cat and dog image understanding rather than generic image captioning.
  • Natural judgement format: describes visible traits first, then outputs 判断结果 and 理由 in a stable Chinese style.
  • Fine-grained breed grounding: tuned with breed labels and generated natural-language rationales for detailed pet breed recognition.
  • Age-stage awareness: prompts and data include puppy/kitten/adult cues when they are visually inferable.
  • MiniCPM-V compatible: released as a Hugging Face Transformers checkpoint with MiniCPM-V-4.6 remote code.
  • Practical inference script: the included script infer_pet_vision_hf.py loads the model and runs direct image inference.

Model Details

| Item | Description |
| --- | --- |
| Model name | JoyFox-PawScope-VL |
| Base model | openbmb/MiniCPM-V-4_6 |
| Model family | MiniCPM-V multimodal model |
| Released format | Hugging Face Transformers checkpoint, Safetensors |
| Primary modality | Image + text instruction |
| Main task | Cat and dog breed image understanding |
| Primary output language | Chinese |
| Model dtype | bf16 |
| Recommended image detail mode | downsample_mode="4x", max_slice_nums=36 |
| Remote code | Required: trust_remote_code=True |

Intended Use

JoyFox-PawScope-VL is intended for applications such as:

  • cat and dog breed-recognition demos,
  • pet-care assistants that need image-aware breed explanations,
  • pet image dataset annotation and review workflows,
  • structured labeling of cat/dog image collections,
  • educational tools for comparing common pet breed traits.

The model should be used as an assistive interpretation layer. It can summarize likely visual cues and suggest a likely breed, but it should not replace pedigree documents, professional breed assessment, veterinary care, or direct owner knowledge.
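One way to keep the model in an assistive role during annotation review is to route disagreements to a person rather than overwrite existing labels. The sketch below is only an illustration: `ReviewItem`, `needs_review`, and the field names are hypothetical and not part of this release, the human labels are made up for the example, and it assumes the breed name has already been extracted from the model response as a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    """One image in an annotation queue (hypothetical structure)."""
    image: str
    human_label: Optional[str]  # existing label, if any
    model_breed: Optional[str]  # breed suggested by the model, if extracted


def needs_review(item: ReviewItem) -> bool:
    """Flag items where the model cannot simply confirm the human label."""
    if item.human_label is None or item.model_breed is None:
        return True  # nothing to compare against, so a person should look
    return item.human_label.strip() != item.model_breed.strip()


# Illustrative records; the human labels here are invented for the example.
queue = [
    ReviewItem("dog_pug_08238.png", "巴哥犬", "巴哥犬"),
    ReviewItem("dog_border_collie_07425.png", "澳大利亚牧羊犬", "边境牧羊犬"),
    ReviewItem("cat_sphynx_01647.png", None, "斯芬克斯猫"),
]
flagged = [item.image for item in queue if needs_review(item)]
print(flagged)
```

Exact string comparison is deliberately strict here. A real workflow would likely need a mapping between label vocabularies before comparing.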

Model Capabilities

JoyFox-PawScope-VL is specialized for structured pet image understanding. It is designed to identify whether the image contains a cat or dog, describe the most relevant visual traits, infer the likely breed, and explain the judgement in a concise Chinese response.

| Capability | Description |
| --- | --- |
| Species focus | Cat and dog images |
| Breed reasoning | Produces a likely breed judgement from visible traits |
| Visual explanation | Mentions coat, face shape, ears, eyes, body proportion, posture, and other cues when visible |
| Age-stage cueing | Can mention adult, kitten, or puppy when visually inferable |
| Output style | Chinese visual description, breed judgement, and rationale |

Model Design

JoyFox-PawScope-VL is designed to produce a concise but interpretable visual judgement rather than a bare class label. Its response style emphasizes observable features first, then the likely breed, followed by a short rationale. This makes the model suitable for product demos, annotation review, and pet-care workflows where users need to understand why a breed was suggested.
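Because the response follows a stable three-part layout, downstream code can split it into fields. The following is a minimal sketch, assuming the response uses the 判断结果 and 理由 labels followed by a colon, as in the examples on this card. `split_response` is a hypothetical helper, not something shipped with the model, and the sample text is abbreviated from the Sphynx example.

```python
import re
from typing import Dict, Optional

# Match the two labelled parts; accept both full-width and ASCII colons.
_JUDGEMENT = re.compile(r"判断结果\s*[::]\s*(.+)")
_REASON = re.compile(r"理由\s*[::]\s*(.+)", re.DOTALL)


def split_response(text: str) -> Dict[str, Optional[str]]:
    """Split a response into description, judgement, and rationale.

    A part whose label is missing is returned as None, so callers can
    detect off-format output instead of guessing.
    """
    judgement = _JUDGEMENT.search(text)
    reason = _REASON.search(text)
    # Everything before the first label is treated as the trait description.
    starts = [m.start() for m in (judgement, reason) if m]
    cut = min(starts) if starts else len(text)
    return {
        "description": text[:cut].strip() or None,
        "judgement": judgement.group(1).strip() if judgement else None,
        "reason": reason.group(1).strip() if reason else None,
    }


sample = (
    "这只猫全身几乎没有毛发,耳朵大而直立。\n\n"
    "判断结果:这只猫更可能是斯芬克斯猫。\n\n"
    "理由:几乎完全无毛的皮肤是斯芬克斯猫的典型品种特征。"
)
parts = split_response(sample)
print(parts["judgement"])
```

Returning None for missing parts keeps the helper honest about responses that drift from the expected layout, which matters when the output feeds an annotation review step.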

Quick Start with Transformers

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "joyfox/JoyFox-PawScope-VL"
image_path = "your_pet_image.jpg"

prompt = """请观察图片中的宠物,先自然说明可见外观特征,再判断它最可能的具体品种,并给出理由。

回答格式:
从图片看,...

判断结果:这只猫/狗更可能是...。

理由:..."""

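# trust_remote_code=True is required: the checkpoint relies on MiniCPM-V remote code.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",  # use the dtype stored in the checkpoint (bf16)
    device_map="auto",
    trust_remote_code=True,
)
model.eval()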

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    downsample_mode="4x",
    max_slice_nums=36,
    enable_thinking=False,
)
inputs = inputs.to(next(model.parameters()).device)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        downsample_mode="4x",
        max_new_tokens=512,
        do_sample=False,
    )

# Drop the prompt tokens so that only the newly generated tokens are decoded.
output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
answer = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(answer.strip())
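To run the same prompt over a folder of images, the message-building step can be factored out. The sketch below assumes the message layout used in the Quick Start. `build_messages`, `iter_images`, and the `IMAGE_SUFFIXES` set are illustrative helpers rather than part of the release, and the list of extensions is an assumption to adjust as needed.

```python
from pathlib import Path
from typing import Dict, Iterator, List

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_messages(image_path: Path, prompt: str) -> List[Dict]:
    """Build a single-turn user message with one image part and one text part."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": str(image_path)},
            {"type": "text", "text": prompt},
        ],
    }]


def iter_images(folder: Path) -> Iterator[Path]:
    """Yield image files from a folder in a stable, sorted order."""
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path
```

Each list returned by `build_messages` can then be passed to `processor.apply_chat_template(...)` and `model.generate(...)` in the same way as the single-image example, one image at a time.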

Using infer_pet_vision_hf.py

The included inference script is designed for direct file-based testing. Edit the configuration block at the top of infer_pet_vision_hf.py:

MODEL_PATH = str(MODEL_DIR)
IMAGE_PATH = str(MODEL_DIR / "assets" / "dog_pug_08238.png")
MAX_NEW_TOKENS = 512
DOWNSAMPLE_MODE = "4x"
MAX_SLICE_NUMS = 36
DO_SAMPLE = False

Then run:

python infer_pet_vision_hf.py

The script performs the following steps:

  1. resolves MODEL_PATH and validates the input image path,
  2. converts file-based images to a standard temporary JPEG for robust decoding,
  3. loads the model with AutoModelForImageTextToText.from_pretrained(..., trust_remote_code=True),
  4. builds a MiniCPM-V user message with image + prompt,
  5. runs model.generate(...) under torch.inference_mode(),
  6. trims the prompt tokens and prints the decoded answer,
  7. removes empty <think></think> prefixes from display output when present.
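Steps 1, 6, and 7 are plain Python and can be sketched without loading the model. The helpers below are an approximation written for this card, not the code in infer_pet_vision_hf.py. The function names are hypothetical, and the exact whitespace handling around the `<think></think>` prefix is an assumption.

```python
import re
from pathlib import Path
from typing import List, Sequence

# A leading <think></think> block that contains only whitespace.
_EMPTY_THINK = re.compile(r"^\s*<think>\s*</think>\s*")


def validate_image_path(image_path: str) -> Path:
    """Step 1: resolve the path and fail early if the file is missing."""
    path = Path(image_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    return path


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    output_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Step 6: keep only the tokens generated after each prompt."""
    return [list(out[len(inp):]) for inp, out in zip(input_ids, output_ids)]


def strip_empty_think(text: str) -> str:
    """Step 7: drop a leading empty <think></think> block, if present."""
    return _EMPTY_THINK.sub("", text, count=1).strip()
```

Only an empty thinking block is removed. A non-empty `<think>...</think>` prefix is left untouched so that unexpected output stays visible.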

Notes on Warnings

You may see warnings that the MiniCPM-V fast path is not available, or messages such as "Kwargs passed to processor.__call__ have to be in processor_kwargs dict". These warnings generally do not block image inference.

For detailed breed recognition, downsample_mode="4x" and max_slice_nums=36 preserve more visual detail. For faster but less detailed inference, use downsample_mode="16x".
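The two modes can be kept as named presets so that the choice is made in one place. This is a small sketch: `PRESETS` and `image_kwargs` are hypothetical names, and the "fast" preset sets only downsample_mode because no max_slice_nums value is given for that mode.

```python
from typing import Any, Dict

# "detail" uses the recommended values; "fast" changes only downsample_mode.
PRESETS: Dict[str, Dict[str, Any]] = {
    "detail": {"downsample_mode": "4x", "max_slice_nums": 36},
    "fast": {"downsample_mode": "16x"},
}


def image_kwargs(preset: str = "detail") -> Dict[str, Any]:
    """Return a copy of the keyword arguments for the chosen preset."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    return dict(PRESETS[preset])


print(image_kwargs("fast"))
```

The returned dictionary can be unpacked into the `processor.apply_chat_template(...)` call from the Quick Start in place of the hard-coded downsample_mode and max_slice_nums arguments.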

Limitations

  • The model is focused on cat and dog images. Other animals, mixed scenes, toys, drawings, or non-pet images may produce unreliable outputs.
  • Breed identification from a single image is inherently uncertain, especially for mixed-breed pets and visually similar breeds.
  • The model may overstate confidence when the image lacks clear breed-specific features.
  • Lighting, occlusion, grooming style, age, camera angle, and partial body visibility can reduce reliability.
  • The model primarily follows a Chinese response format. English output may require a separate prompt and has not been the main tuning target.
  • The model should not be used for veterinary diagnosis, legal breed certification, insurance decisions, shelter intake decisions, or safety-critical judgement.

License

This model follows the license terms of the base model and the released model metadata. Please also review the license and usage terms of openbmb/MiniCPM-V-4_6 before redistribution or commercial use.

Acknowledgements

JoyFox-PawScope-VL is built on OpenBMB's MiniCPM-V-4.6 multimodal model. The model card structure and usage style are aligned with the JoyFox MiniCPM-PawSense-Audio release.
