Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation
Abstract
This paper presents a vision-language dataset distillation method that uses rank-aware hyperbolic alignment to optimize synthetic image-text pairs for efficient contrastive model training while preserving modality-specific diversity.
Vision-language dataset distillation (VLDD) compresses a large image-text paired dataset into a small set of synthetic pairs that can efficiently train contrastive vision-language models under strict data and compute budgets. Most existing methods match expert trajectories or cross-modal statistics, yet still enforce full-dimensional alignment in a Euclidean embedding space. This is often overly restrictive because image-text correlation is rank-deficient: shared semantics concentrate in a low-dimensional range, while the remaining variation spreads across a weakly correlated residual subspace. LoRS relaxes alignment at the similarity level through low-rank factorization, but does not explicitly control the dominant alignment capacity or structure in the representation space. We therefore propose rank-aware hyperbolic alignment (RAHA), which combines hierarchical geometry with explicit alignment-capacity control. RAHA lifts multimodal representations to hyperbolic space and optimizes distilled pairs with asymmetric objectives that enforce geodesic alignment in the shared range while regularizing the residual subspace to preserve modality-private diversity and improve transfer robustness. Experiments on benchmarks show that RAHA achieves competitive cross-modal retrieval performance and improved transfer indicators under fixed budgets.
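The abstract's three ingredients (a low-rank shared range, a lift to hyperbolic space, and geodesic alignment with a separate residual term) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give RAHA's actual objectives, so the choice of the Poincaré ball model, the use of an SVD of the cross-covariance to pick the shared range, the residual "energy" diagnostic, and all function names (`shared_range`, `exp_map_origin`, `poincare_distance`, `rank_aware_losses`) are assumptions made for illustration only.

```python
import numpy as np

def shared_range(X, Y, rank):
    """Top-`rank` singular directions of the image-text cross-covariance.

    Assumption: the "shared range" is approximated by the leading left/right
    singular vectors of the centered cross-covariance matrix.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / X.shape[0]
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :rank], Vt[:rank].T, S

def exp_map_origin(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of a Poincare ball with curvature -c."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between points of the Poincare ball (curvature -c)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    dx = 1.0 - c * np.sum(x ** 2, axis=-1)
    dy = 1.0 - c * np.sum(y ** 2, axis=-1)
    arg = 1.0 + 2.0 * c * sq / np.maximum(dx * dy, eps)
    return np.arccosh(arg) / np.sqrt(c)

def rank_aware_losses(X, Y, rank, c=1.0):
    """Hypothetical pair of terms: geodesic alignment in the shared range,
    plus the mean squared norm left in each modality's residual subspace."""
    U, V, _ = shared_range(X, Y, rank)
    Xs, Ys = X @ U, Y @ V                      # coordinates in the shared range
    align = poincare_distance(
        exp_map_origin(Xs, c), exp_map_origin(Ys, c), c
    ).mean()
    Xr = X - Xs @ U.T                          # image-side residual
    Yr = Y - Ys @ V.T                          # text-side residual
    residual_energy = 0.5 * (
        np.mean(np.sum(Xr ** 2, axis=1)) + np.mean(np.sum(Yr ** 2, axis=1))
    )
    return align, residual_energy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 8)) * 0.1         # stand-in image embeddings
    Y = rng.normal(size=(32, 8)) * 0.1         # stand-in text embeddings
    print(rank_aware_losses(X, Y, rank=3))
```

In a distillation loop, terms like these would be computed on the embeddings of the synthetic pairs and differentiated with respect to the synthetic data, which would call for an autodiff framework rather than plain NumPy. How the residual term should be weighted or signed to preserve modality-private diversity is left open here because the abstract does not specify it.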
Community
Accepted for publication at ECCV 2026
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multimodal Distribution Matching for Vision-Language Dataset Distillation (2026)
- HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding (2026)
- Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data (2026)
- LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives (2026)
- DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency (2026)
- GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective (2026)
- MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation (2026)