Fine-tuning YOLO-World for Instance Segmentation

Models

We fine-tune YOLO-World on LVIS (LVIS-Base) with mask annotations for open-vocabulary (zero-shot) instance segmentation.

We provide two strategies for fine-tuning YOLO-World towards open-vocabulary instance segmentation:

  • fine-tuning all modules: leads to better LVIS segmentation accuracy but can reduce zero-shot performance.

  • fine-tuning the segmentation head: maintains the zero-shot performance but lowers LVIS segmentation accuracy.
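
The two strategies differ only in which parameters are updated. A minimal sketch of that selection, assuming the segmentation head's parameters share a common name prefix (the `mask_head.` prefix, the `select_trainable` helper, and the toy parameter names below are hypothetical and are not taken from the YOLO-World code base):

```python
def select_trainable(param_names, strategy, head_prefix="mask_head."):
    """Return the set of parameter names to update under a strategy.

    `head_prefix` is a hypothetical prefix for segmentation-head
    parameters; check the real model's parameter names before use.
    """
    if strategy == "all_modules":
        # Every parameter is updated.
        return set(param_names)
    if strategy == "seg_head":
        # Only parameters under the segmentation head are updated.
        return {n for n in param_names if n.startswith(head_prefix)}
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy names standing in for the keys of a model's named parameters.
names = [
    "backbone.conv1.weight",
    "neck.fuse.weight",
    "bbox_head.cls.weight",
    "mask_head.proto.weight",
]

trainable = select_trainable(names, "seg_head")
frozen = sorted(set(names) - trainable)
```

In PyTorch, the result could then be applied by setting `requires_grad` to `False` on every parameter whose name is not in `trainable`, so that the frozen modules keep their pre-trained weights.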

| Model | Fine-tuning Data | Fine-tuning Modules | AP^mask | AP_r | AP_c | AP_f | Weights |
| :-- | :-- | :-- | :-: | :-: | :-: | :-: | :-- |
| YOLO-World-Seg-M | LVIS-Base | all modules | 25.9 | 13.4 | 24.9 | 32.6 | HF Checkpoints 🤗 |
| YOLO-World-v2-Seg-M | LVIS-Base | all modules | 25.9 | 13.4 | 24.9 | 32.6 | HF Checkpoints 🤗 |
| YOLO-World-Seg-L | LVIS-Base | all modules | 28.7 | 15.0 | 28.3 | 35.2 | HF Checkpoints 🤗 |
| YOLO-World-v2-Seg-L | LVIS-Base | all modules | 28.7 | 15.0 | 28.3 | 35.2 | HF Checkpoints 🤗 |
| YOLO-World-Seg-M | LVIS-Base | seg head | 16.7 | 12.6 | 14.6 | 20.8 | HF Checkpoints 🤗 |
| YOLO-World-v2-Seg-M | LVIS-Base | seg head | 17.8 | 13.9 | 15.5 | 22.0 | HF Checkpoints 🤗 |
| YOLO-World-Seg-L | LVIS-Base | seg head | 19.1 | 14.2 | 17.2 | 23.5 | HF Checkpoints 🤗 |
| YOLO-World-v2-Seg-L | LVIS-Base | seg head | 19.8 | 17.2 | 17.5 | 23.6 | HF Checkpoints 🤗 |

NOTE:
  1. Mask AP is evaluated on LVIS val 1.0.
  2. All models are fine-tuned for 80 epochs on LVIS-Base (866 categories, common + frequent).
  3. YOLO-World-Seg models with only the segmentation head fine-tuned retain the original zero-shot detection capability while also producing instance masks.
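
As an illustration of how an LVIS-Base split (common + frequent categories) could be derived, here is a minimal sketch. It assumes the LVIS annotation layout in which each category entry carries a `frequency` field set to `"r"`, `"c"`, or `"f"`; the `base_categories` helper and the toy category entries are hypothetical, and this is not necessarily how the released LVIS-Base annotations were produced.

```python
def base_categories(categories):
    """Keep common ('c') and frequent ('f') categories, drop rare ('r')."""
    return [c for c in categories if c["frequency"] in ("c", "f")]


# Toy category list; a real one would come from the "categories"
# key of an LVIS annotation file loaded with json.load.
toy_categories = [
    {"id": 1, "name": "toy_rare", "frequency": "r"},
    {"id": 2, "name": "toy_common", "frequency": "c"},
    {"id": 3, "name": "toy_frequent", "frequency": "f"},
]

base = base_categories(toy_categories)
base_ids = {c["id"] for c in base}
```

Annotations would then be filtered to those whose `category_id` is in `base_ids`, leaving the rare categories unseen during fine-tuning.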