Ann Huang

erinys

AI & ML interests

None yet

Recent Activity

Organizations

Hugging Face, Blog-explorers, Journalists on Hugging Face, Dev Mode Explorers, Xet Team, open/ acc

erinys's activity

published an article 3 months ago
Article

Rearchitecting Hugging Face Uploads and Downloads

• 43
reacted to elliesleightholm's post with 🤗 3 months ago
reacted to jsulz's post with 🔥 3 months ago
Post
2937
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
πŸš€ Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks

In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025. Check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
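
The chunk-level deduplication described in the post above can be sketched roughly as follows. This is a minimal illustration assuming a simple gear-style rolling hash; it is not the Xet team's implementation. The names `split_chunks` and `upload`, the in-memory `store`, and all size parameters are hypothetical choices made for the example.

```python
import hashlib
import random
from typing import Dict, List

# All parameters below are illustrative assumptions, not values used on the Hub.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one fixed random value per byte
MASK = (1 << 6) - 1   # cut when the low 6 hash bits are zero (about 1 in 64 positions)
MIN_CHUNK = 16        # never cut before this many bytes
MAX_CHUNK = 256       # always cut at this many bytes


def split_chunks(data: bytes) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of the content."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: older bytes shift out of the low bits.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (h & MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(store: Dict[str, bytes], data: bytes) -> int:
    """Store only chunks not seen before; return the number of new bytes written."""
    new_bytes = 0
    for piece in split_chunks(data):
        key = hashlib.sha256(piece).hexdigest()  # content address of the chunk
        if key not in store:
            store[key] = piece
            new_bytes += len(piece)
    return new_bytes


if __name__ == "__main__":
    rng = random.Random(1)
    v1 = bytes(rng.getrandbits(8) for _ in range(20_000))
    v2 = v1[:10_000] + b"a small edit" + v1[10_000:]  # same file with one insertion
    store: Dict[str, bytes] = {}
    print("v1 new bytes:", upload(store, v1))
    print("v2 new bytes:", upload(store, v2))
```

Every chunk that ends before the edit is byte-identical in both versions, so the second `upload` call skips it. Because boundaries depend on content rather than fixed offsets, the chunks after the insertion will usually realign with the old ones as well, which is what fixed-size blocks cannot do.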
reacted to reach-vb's post with 🚀🔥 3 months ago
Post
4449
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗
published an article 3 months ago
Article

From Files to Chunks: Improving Hugging Face Storage Efficiency

• 51
reacted to maxiw's post with 🤗❤️ 3 months ago
Post
4655
I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
❀️: 7048 times
πŸ”₯: 5921 times
πŸ‘: 4856 times
πŸš€: 2549 times
πŸ€—: 2065 times
published an article 3 months ago
Article

Share your open ML datasets on Hugging Face Hub!

• 27
updated a Space 4 months ago
posted an update 4 months ago
upvoted an article 4 months ago
Article

How to optimize your data labelling project with custom interfaces

By burtenshaw and 9 others
• 18
reacted to jsulz's post with 🔥 4 months ago
Post
1662
The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co/xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.

Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022–Sept 2024), we now have insights into what we’re storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating

xet-team/lfs-analysis
New activity in xet-team/lfs-analysis 4 months ago
upvoted an article 5 months ago
Article

Improving Parquet Dedupe on Hugging Face Hub

• 32
New activity in xet-team/lfs-analysis 5 months ago

Suggested text changes

1
#1 opened 5 months ago by
erinys