Jared Sulzdorf PRO

jsulz

AI & ML interests

Infrastructure, law, policy

Recent Activity

liked a Space 2 days ago
xet-team/repo-graph
liked a model 2 days ago
reach-vb/yolo

Organizations

Hugging Face, Spaces Examples, Blog-explorers, Journalists on Hugging Face, Hugging Face Discord Community, Xet Team, open/ acc, wut?

Posts 14

Post
522
As xet-team infrastructure begins backing hundreds of repositories on the Hugging Face Hub, we’re getting to put on our researcher hats and peer into the bytes. 👀 🤓

IMO, one of the most interesting ideas Xet storage introduces is a globally shared store of data.

When you upload a file through Xet, the contents are split into ~64KB chunks and deduplicated, but what if those same chunks already exist in another repo on the Hub?

If we can detect and reuse them, we can skip uploading those as well, saving time and bandwidth for AI builders. More on how that works here:
🔗 https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation
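To make the idea concrete, here is a minimal sketch of deduplicating an upload against a shared store. It is an illustration only, not how Xet is implemented: the fixed-size splitting, the SHA-256 hashes, and the names `chunk_hashes`, `upload`, and `global_store` are all assumptions made for the example.

```python
import hashlib

# ~64KB, as mentioned in the post. Fixed-size splitting is a simplification
# chosen for this sketch, not a claim about how Xet picks chunk boundaries.
CHUNK_SIZE = 64 * 1024


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into chunks and return one SHA-256 hex digest per chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def upload(data: bytes, global_store: set[str]) -> tuple[int, int]:
    """Pretend to upload data; return (chunks_sent, chunks_skipped).

    global_store is a hypothetical stand-in for the Hub-wide set of
    chunk hashes that are already stored somewhere.
    """
    sent = skipped = 0
    for digest in chunk_hashes(data):
        if digest in global_store:
            skipped += 1  # chunk already exists, nothing to transfer
        else:
            global_store.add(digest)
            sent += 1
    return sent, skipped
```

In this sketch, uploading a second file that shares chunks with an earlier one (even from a different repo) only sends the chunks the store has not seen yet.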

Because of this, different repositories can share bytes we store. That opens up something cool - we can draw a graph of which repos actually share data at the chunk level, where:

- Nodes = repositories
- Edges = shared chunks
- Edge thickness = how much they overlap
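As a rough sketch of how such a graph could be derived, assume we already have a mapping from each repository to the set of chunk hashes it references. The function name `build_repo_graph`, the repo names, and the chunk IDs below are all hypothetical, and this is not the code behind the Space.

```python
from itertools import combinations


def build_repo_graph(repo_chunks: dict[str, set[str]]) -> dict[tuple[str, str], int]:
    """Return edges keyed by (repo_a, repo_b) with the shared-chunk count as weight.

    Nodes are the keys of repo_chunks; an edge exists only when two
    repositories have at least one chunk hash in common.
    """
    edges: dict[tuple[str, str], int] = {}
    for a, b in combinations(sorted(repo_chunks), 2):
        shared = len(repo_chunks[a] & repo_chunks[b])
        if shared:
            edges[(a, b)] = shared  # weight drives edge thickness
    return edges


# Hypothetical repos and chunk IDs, for illustration only.
repos = {
    "org/bert-a": {"c1", "c2", "c3"},
    "org/bert-b": {"c2", "c3", "c4"},
    "org/other": {"c9"},
}
graph = build_repo_graph(repos)
```

Comparing every pair of repositories is quadratic, which is fine for a sketch but would likely need an inverted index (chunk hash to repos) at Hub scale.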

xet-team/repo-graph

Come find the many BERT islands. Or see how datasets relate in practice, not just in theory. See how libraries or tasks can tie repositories together. You can play around with node size using storage/likes/downloads too.

The result is a super fun visualization from @saba9 and @znation that I’ve already lost way too much time to. I'm excited to see how the networks grow as we add more repositories!

Articles 5

Article
139

Welcome Llama 4 Maverick & Scout on Hugging Face!