As xet-team infrastructure begins backing hundreds of repositories on the Hugging Face Hub, we're getting to put on our researcher hats and peer into the bytes.
IMO, one of the most interesting ideas Xet storage introduces is a globally shared store of data.
When you upload a file through Xet, the contents are split into ~64KB chunks and deduplicated, but what if those same chunks already exist in another repo on the Hub?
Because the store is shared globally, different repositories can share the bytes we store. That opens up something cool: we can draw a graph of which repos actually share data at the chunk level, where:
- Nodes = repositories
- Edges = shared chunks
- Edge thickness = how much they overlap
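To make the graph idea concrete, here is a minimal sketch of how chunk-level overlap between repos could be computed. It is not how Xet or the visualization actually works: the fixed-size chunking, the SHA-256 hashing, the function names, and the repo names are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations

# Assumption: fixed-size 64 KiB chunks stand in for Xet's ~64KB chunks.
CHUNK_SIZE = 64 * 1024


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> set[str]:
    """Return the SHA-256 digests of the fixed-size chunks of `data`."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def shared_chunk_graph(repos: dict[str, list[bytes]]) -> dict[tuple[str, str], int]:
    """Map each pair of repos to the number of distinct chunks they share.

    Nodes are repo names, edges are pairs with at least one shared chunk,
    and the edge weight plays the role of edge thickness.
    """
    repo_chunks = {
        name: set().union(*(chunk_hashes(f) for f in files))
        for name, files in repos.items()
    }
    edges: dict[tuple[str, str], int] = {}
    for a, b in combinations(sorted(repo_chunks), 2):
        overlap = len(repo_chunks[a] & repo_chunks[b])
        if overlap:
            edges[(a, b)] = overlap
    return edges


# Hypothetical repos: "bert-b" is "bert-a" plus one extra chunk of new data.
base = bytes([1]) * CHUNK_SIZE + bytes([2]) * CHUNK_SIZE
repos = {
    "bert-a": [base],
    "bert-b": [base + bytes([3]) * CHUNK_SIZE],
    "other": [bytes([9]) * CHUNK_SIZE],
}
edges = shared_chunk_graph(repos)
```

In this toy example the only edge is between the two hypothetical BERT repos, with a weight of two shared chunks, and "other" stays an isolated node.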
Come find the many BERT islands. Or see how datasets relate in practice, not just in theory. See how libraries or tasks can tie repositories together. You can play around with node size using storage/likes/downloads too.
The result is a super fun visualization from @saba9 and @znation that I've already lost way too much time to. I'm excited to see how the networks grow as we add more repositories!
If you've been following along with the Xet Team's (xet-team) work, you know we've been working to migrate the Hugging Face Hub from Git LFS to Xet.
Recently, we launched a waitlist to join the movement to Xet (join here! https://huggingface.co/join/xet), but getting to this point was a journey.
From the initial proof of concept in August, to launching on the Hub internally, to migrating a set of repositories and routing a small chunk of download traffic on the Hub through our infrastructure, every step of the way has been full of challenges, big and small, and well worth the effort.
Over the past few weeks, with real traffic flowing through our services, we've tackled some truly gnarly issues (unusual upload/download patterns, memory leaks, load imbalances, and more) and resolved each without major disruptions.
If you're curious about how this sliver of Hub infrastructure looked as we routed traffic through it for the first time (and want a deep dive full of Grafana and Kibana charts), I have a post for you.
Here's an inside look into the day of our first migrations and the weeks following, where we pieced together solutions in real time.