When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. Thatβs where our chunk-based approach comes in.
Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:
In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isnβt just a performance boost. Itβs a rethinking of how we manage models and datasets on the Hub.
We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?
Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents Nexusflow/athene-v2-6735b85e505981a794fb02cc
Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed microsoft/orca-agentinstruct-1M-v1
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co/xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.
Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022βSept 2024), we now have insights into what weβre storing. Check out the Gradio app to explore: - Storage growth over time - File types over all repositories - Some simple optimizations we're investigating