Hugging Face

Enterprise

company

Verified

https://huggingface.co

huggingface

AI & ML interests

The AI community building the future.

Recent Activity

lysandre updated a dataset about 1 hour ago

huggingface/transformers-metadata

mishig updated a Space about 18 hours ago

huggingface/inference-playground

fdaudens updated a dataset about 19 hours ago

huggingface/documentation-images

Articles

Yay! Organizations can now publish blog Articles

Jan 20

• 39

huggingface's activity

lysandre

updated a dataset about 1 hour ago

huggingface/transformers-metadata

Viewer • Updated about 1 hour ago • 1.62k • 806 • 22

mishig

updated a Space about 18 hours ago

147

Inference Playground

🔋

Set and update website theme based on user preference

fdaudens

updated a dataset about 19 hours ago

huggingface/documentation-images

Viewer • Updated about 19 hours ago • 52 • 3.22M • 60

thomwolf

posted an update about 21 hours ago

If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website, and the Pollen community and people here on the Hub at pollen-robotics.

We're so excited to build and share more open-source robots with the world in the coming months!

mcpotato

in huggingface/documentation-images 1 day ago

pai-6-month

#476 opened 5 days ago by

warmiros

merve

updated a dataset 1 day ago

huggingface/documentation-images

Viewer • Updated about 19 hours ago • 52 • 3.22M • 60

jsulz

posted an update 6 days ago

As xet-team infrastructure begins backing hundreds of repositories on the Hugging Face Hub, we're getting to put on our researcher hats and peer into the bytes. 👀 🤓

IMO, one of the most interesting ideas Xet storage introduces is a globally shared store of data.

When you upload a file through Xet, the contents are split into ~64KB chunks and deduplicated, but what if those same chunks already exist in another repo on the Hub?

If we can detect and reuse them, we skip them as well, saving time and bandwidth for AI builders. More on how that works here:
🔗 https://huggingface.co/blog/from-chunks-to-blocks#scaling-deduplication-with-aggregation
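To make the idea concrete, here is a minimal sketch of chunk-level deduplication against a store shared by every repository. It is not Xet's implementation: `GlobalChunkStore` is a hypothetical in-memory stand-in, the chunks are cut at fixed 64 KiB offsets for simplicity, and SHA-256 is just an assumed choice of chunk hash.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # ~64KB as in the post; fixed-size cuts are a simplification


class GlobalChunkStore:
    """Hypothetical in-memory stand-in for a chunk store shared by all repos."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes

    def upload(self, data):
        """Return (chunk hashes of the file, number of bytes actually sent)."""
        hashes, sent = [], 0
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # not seen in any repo yet
                self.chunks[digest] = chunk
                sent += len(chunk)
            hashes.append(digest)
        return hashes, sent


store = GlobalChunkStore()

# Two toy "files" made of four distinct chunks each; only the last chunk differs.
base = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
variant = base[:3 * CHUNK_SIZE] + bytes([9]) * CHUNK_SIZE

_, sent_first = store.upload(base)      # first repo: every chunk is new
_, sent_second = store.upload(variant)  # second repo: only one chunk is new
print(sent_first, sent_second)
```

In this toy run the second upload sends a single chunk, because the other three already exist in the shared store even though they were uploaded by a different repo.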

Because of this, different repositories can share bytes we store. That opens up something cool - we can draw a graph of which repos actually share data at the chunk level, where:

- Nodes = repositories
- Edges = shared chunks
- Edge thickness = how much they overlap

xet-team/repo-graph
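The graph described above can be sketched in a few lines, assuming we already have the set of chunk hashes each repository references. The repository names and chunk identifiers below are made up for illustration, and `build_repo_graph` is a hypothetical helper, not part of the repo-graph Space.

```python
from itertools import combinations

# Hypothetical input: repository name -> set of chunk hashes it references.
repo_chunks = {
    "org/bert-base": {"a", "b", "c", "d"},
    "org/bert-finetuned": {"a", "b", "c", "x"},
    "org/unrelated": {"y", "z"},
}


def build_repo_graph(repo_chunks):
    """Return {(repo_a, repo_b): shared_chunk_count} for every overlapping pair."""
    edges = {}
    for a, b in combinations(sorted(repo_chunks), 2):
        shared = len(repo_chunks[a] & repo_chunks[b])
        if shared:  # no edge when two repos share nothing
            edges[(a, b)] = shared  # the count plays the role of edge thickness
    return edges


edges = build_repo_graph(repo_chunks)
print(edges)
```

Comparing every pair is fine for a toy example but grows quadratically with the number of repositories, so a real pipeline would likely need a smarter approach.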

Come find the many BERT islands. Or see how datasets relate in practice, not just in theory. See how libraries or tasks can tie repositories together. You can play around with node size using storage/likes/downloads too.

The result is a super fun visualization from @saba9 and @znation that I’ve already lost way too much time to. I'm excited to see how the networks grow as we add more repositories!

jsulz

posted an update 7 days ago

What does it mean when models share the same bytes?

We've investigated some quants and found that quantizations of the same model share a considerable portion of their bytes, which can be deduplicated to save quantizers on the Hub significant upload time.

This space, where we crack open a repo from @bartowski, shows we can get significant dedupe: xet-team/quantization-dedup

You can get a sense of why by reading this write-up: https://github.com/bartowski1182/llm-knowledge/blob/main/quantization/quantization.md

But what about finetuned models?

Since going into production, the xet-team has migrated hundreds of repositories on the Hub to our storage layer, including classic "pre-Hub" open-source models like FacebookAI/xlm-roberta-large (XLM-R) from FacebookAI.

XLM-R, introduced in 2019, set new benchmarks for multilingual NLP by learning shared representations across 100 languages. It was then fine-tuned on English, Spanish, Dutch, and German, generating language-specific derivations for each - check out the paper here: Unsupervised Cross-lingual Representation Learning at Scale (1911.02116)

These finetunes share much of the same architecture and layout as XLM-R with similar training methods and goals. It makes sense that they would share bytes, but it's still fascinating to see.

We put together a similar space to explore these models and see where they overlap - check it out for yourself: xet-team/finetune-dedupe

The darker each block in the heatmap, the more its bytes are shared. Clicking on a repo's blocks shows all other repos that share those blocks.
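One way to get the numbers behind such a heatmap is an inverted index from block to the repos that contain it. This is only a sketch under assumptions: the block identifiers and repo names are invented, and `block_sharing` is a hypothetical helper rather than the Space's actual code.

```python
from collections import defaultdict

# Hypothetical input: repository name -> ordered list of block identifiers.
repo_blocks = {
    "xlm-r-base": ["b0", "b1", "b2"],
    "xlm-r-dutch": ["b0", "b1", "d2"],
    "xlm-r-german": ["b0", "g1", "g2"],
}


def block_sharing(repo_blocks, target):
    """For each block of `target`, list the other repos that contain it."""
    owners = defaultdict(set)  # block id -> repos containing that block
    for repo, blocks in repo_blocks.items():
        for block in blocks:
            owners[block].add(repo)
    return {block: sorted(owners[block] - {target}) for block in repo_blocks[target]}


sharing = block_sharing(repo_blocks, "xlm-r-base")
# The "darkness" of a block is simply how many other repos share it.
heat = {block: len(repos) for block, repos in sharing.items()}
print(sharing)
print(heat)
```

Here `sharing` answers the "click" interaction (which repos share this block) and `heat` gives the per-block intensity.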

severo

posted an update 7 days ago

Need to convert CSV to Parquet?

Use https://www.chatdb.ai/tools/csv-to-parquet-converter. It does the job instantly.

@cfahlgren1 provides many other tools on his website. Approved and bookmarked!