We LZ4-compress everything automatically, and when we encounter a model, we also apply a format-agnostic byte grouping, inspired by ZipNN, before the LZ4 step. Empirically, this saves about 20%.
https://github.com/huggingface/xet-core/blob/main/cas_object/src/byte_grouping/bg4.rs
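To make the idea concrete, here is a minimal sketch of byte grouping, assuming a group width of 4 (suggested only by the `bg4` file name). The function names `bg4_split` and `bg4_regroup` are hypothetical and are not the xet-core API; the linked Rust code may differ in layout and details. The intuition is that in float tensors, bytes at the same position within each 4-byte word (for example the sign/exponent byte) tend to resemble each other, so placing them next to each other can give the compressor longer runs of similar data.

```python
def bg4_split(data: bytes) -> bytes:
    """Reorder bytes so that all bytes at offset 0 of each 4-byte word
    come first, then all bytes at offset 1, and so on.
    Hypothetical helper for illustration only."""
    return b"".join(data[i::4] for i in range(4))


def bg4_regroup(grouped: bytes) -> bytes:
    """Invert bg4_split, restoring the original byte order."""
    n = len(grouped)
    out = bytearray(n)
    start = 0
    for i in range(4):
        # Group i holds every 4th byte starting at offset i,
        # which is ceil((n - i) / 4) bytes (0 if n <= i).
        size = (n - i + 3) // 4
        out[i::4] = grouped[start:start + size]
        start += size
    return bytes(out)


if __name__ == "__main__":
    import struct

    # A few float32 values stand in for model weights.
    raw = struct.pack("<6f", 0.01, 0.02, -0.015, 0.03, 0.011, -0.02)
    grouped = bg4_split(raw)
    # The transform is lossless: regrouping returns the original bytes.
    assert bg4_regroup(grouped) == raw
```

The grouped buffer is what would then be handed to the compressor (LZ4 in the setup described above). The transform never inspects the file format, which is why it can be applied to any model file, and it works on inputs whose length is not a multiple of 4.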
yuchenglow