Jared Sulzdorf PRO

jsulz

AI & ML interests

NLP + (Law|Medicine) & Ethics

Recent Activity

Articles

Organizations

Hugging Face, Spaces Examples, Blog-explorers, Journalists on Hugging Face, Hugging Face Discord Community, Xet Team, open/ acc

jsulz's activity

reacted to cfahlgren1's post with 👍🔥🚀 about 3 hours ago
Post
1314
We just dropped an LLM inside the SQL Console 🤯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025, you shouldn't be hand-writing SQL! This is a big step toward letting anyone do in-depth analysis on a dataset. Let us know what you think 🤗
reacted to fdaudens's post with 🚀❤️ about 22 hours ago
Post
993
Keeping up with open-source AI in 2024 = overwhelming.

Here's help: We're launching our Year in Review on what actually matters, starting today!

Fresh content dropping daily until year end. Come along for the ride - first piece out now with @clem's predictions for 2025.

Think of it as your end-of-year AI chocolate calendar.

Kudos to @BrigitteTousi @clefourrier @Wauplin @thomwolf for making it happen. We teamed up with aiworld.eu for awesome visualizations to make this digestible - it's a charm to work with their team.

Check it out: huggingface/open-source-ai-year-in-review-2024
reacted to prithivMLmods's post with 🤗❤️🔥 6 days ago
Post
3179
HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

🥠 The one thing that needs to be remembered is the 'username'.

🥠 And yeah, thank you, @maxiw, for creating the awesome dataset and sharing it here! 🙌

🥠 [ Dataset ] : maxiw/hf-posts

.
.
.
@prithivMLmods
replied to their post 7 days ago

Great question, we've talked about torrents before, actually!

How would you include torrents in your workflows today?

There's nothing stopping us from doing it, but the user/developer experience doesn't quite align with what we're trying to support right now. There are benefits to leveraging CDNs as we do today, and this integrates relatively seamlessly with existing clients (e.g., huggingface_hub) that are used across the Hub.

Maybe if there's enough interest in the future!

posted an update 7 days ago
Post
1451
Something I love about working at Hugging Face is the opportunity to design and work in public. Right now, we're redesigning the architecture that supports uploads and downloads on the Hub.

Datasets and models are growing fast, and so are the challenges of storing and transferring them efficiently. To keep up, we're introducing a new protocol for uploads and downloads, supported by a content-addressed store (CAS).

Here's what's coming:

📦 Smarter uploads: Chunk-level management enables advanced deduplication and compression, and reduces redundant transfers, speeding up uploads.
⚡ Efficient downloads: High throughput and low latency ensure fast access, even during high-demand model releases.
🔒 Enhanced security: Validate uploads before storage to block malicious or invalid data.
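
To make the idea of a content-addressed store concrete, here is a minimal Python sketch. It is not the Hub's implementation: the `ContentAddressedStore` class, its method names, the fixed 4-byte chunk size, and the choice of SHA-256 are all assumptions made for illustration. It only shows the general principle that chunks keyed by their own hash are stored and transferred once, however many files contain them.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store (hypothetical) that keys each chunk by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}          # digest -> chunk bytes
        self.bytes_received = 0   # bytes that actually had to be stored

    def put_chunk(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in self.chunks:      # identical content is stored only once
            self.chunks[digest] = chunk
            self.bytes_received += len(chunk)
        return digest

    def put_file(self, data: bytes, chunk_size: int = 4) -> list[str]:
        """Split data into fixed-size chunks (a simplification) and return their digests."""
        return [self.put_chunk(data[i:i + chunk_size])
                for i in range(0, len(data), chunk_size)]

    def get_file(self, digests: list[str]) -> bytes:
        """Rebuild a file from its ordered list of chunk digests."""
        return b"".join(self.chunks[d] for d in digests)


store = ContentAddressedStore()
v1 = store.put_file(b"aaaabbbbcccc")
v2 = store.put_file(b"aaaabbbbdddd")   # shares two of its three chunks with v1

print(len(store.chunks))      # 4 unique chunks across both files
print(store.bytes_received)   # 16 bytes stored for 24 bytes of file content
```

Each file is then just an ordered list of digests, so `store.get_file(v2)` rebuilds the second version from chunks it partly shares with the first.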

We analyzed 24 hours of global upload activity in October (88 countries, 130TB of data!) to design a system that scales with your needs.

The result? A proposed infrastructure with CAS nodes in us-east-1, eu-west-3, and ap-southeast-1.

🔗 Read the blog post for the full details: https://huggingface.co/blog/rearchitecting-uploads-and-downloads

🌟 Check out our interactive demo to explore the data yourself!
xet-team/cas-analysis

We'd love to hear your feedback - let us know if you have questions or want to see more.
reacted to davanstrien's post with 🔥 8 days ago
Post
1334
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
  • 1 reply
reacted to reach-vb's post with ❤️ 9 days ago
Post
2417
Massive week for Open AI/ ML:

Mistral Pixtral & Instruct Large - ~123B, 128K context, multilingual, json + function calling & open weights
mistralai/Pixtral-Large-Instruct-2411
mistralai/Mistral-Large-Instruct-2411

Allen AI Tülu 70B & 8B - competitive with claude 3.5 haiku, beats all major open models like llama 3.1 70B, qwen 2.5 and nemotron
allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5
allenai/tulu-3-datasets-673b8df14442393f7213f372

Llava o1 - vlm capable of spontaneous, systematic reasoning, similar to GPT-o1, 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision
Xkev/Llama-3.2V-11B-cot

Black Forest Labs Flux.1 tools - four new state of the art model checkpoints & 2 adapters for fill, depth, canny & redux, open weights
reach-vb/black-forest-labs-flux1-6743847bde9997dd26609817

Jina AI Jina CLIP v2 - general purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, matryoshka representations (1024 to 64)
jinaai/jina-clip-v2

Apple AIM v2 & CoreML MobileCLIP - large scale vision encoders outperform CLIP and SigLIP. CoreML optimised MobileCLIP models
apple/aimv2-6720fe1558d94c7805f7688c
apple/coreml-mobileclip

A lot more got released, like OpenScholar (OpenScholar/openscholar-v1-67376a89f6a80f448da411a6), smoltalk (HuggingFaceTB/smoltalk), Hymba (nvidia/hymba-673c35516c12c4b98b5e845f), Open ASR Leaderboard (hf-audio/open_asr_leaderboard), and much more.

Can't wait for the next week! 🤗
reacted to BrigitteTousi's post with 🚀 11 days ago
reacted to fdaudens's post with ❤️ 11 days ago
Post
1867
🦋 Hug the butterfly! You can now add your Bluesky handle to your Hugging Face profile! ✨
reacted to elliesleightholm's post with 🤗 12 days ago
posted an update 13 days ago
Post
2885
When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That's where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.
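
As a rough illustration of why variable-sized, content-defined chunks help, the Python sketch below cuts a chunk wherever a rolling hash of the last few bytes matches a bit pattern, then inserts 8 bytes into the middle of a file and counts how many chunks are new. This is not the Xet implementation: the polynomial rolling hash, `BASE`, `MOD`, the 16-byte window, the `0x3F` mask, and the helper names are all assumptions chosen to keep the example short.

```python
import hashlib
import random

BASE = 257
MOD = (1 << 31) - 1   # modulus for the rolling hash (illustrative choice)


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Cut a chunk wherever the hash of the last `window` bytes matches `mask`."""
    top = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])         # boundary depends only on content
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


rng = random.Random(0)
v1 = bytes(rng.randrange(256) for _ in range(8192))
inserted = b"new data"                         # 8 bytes
v2 = v1[:4096] + inserted + v1[4096:]          # edit in the middle of the file

c1, c2 = cdc_chunks(v1), cdc_chunks(v2)
known = {digest(c) for c in c1}                # chunks already stored for v1
new_chunks = [c for c in c2 if digest(c) not in known]
print(f"{len(c2)} chunks in v2, {len(new_chunks)} of them new")
```

Because each boundary depends only on the bytes inside the window, chunks that lie entirely before or after the edited region are byte-identical in both versions, so only the chunks around the insertion need to be uploaded. With fixed-size chunks, the same insertion would typically shift every later chunk and force all of them to be re-sent.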

In our benchmarks, we found that using content-defined chunking (CDC) to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn't just a performance boost. It's a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
reacted to reach-vb's post with 🤗🔥 15 days ago
Post
4222
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code generation LLMs, with 32B tackling giants like Gemini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗
reacted to m-ric's post with 🔥 17 days ago
Post
3702
The next big social network is not 🦋, it's Hub Posts! [INSERT STONKS MEME WITH LASER EYES]

See below: I got 105k impressions since regularly posting Hub Posts, coming close to my 275k on Twitter!

⚙️ Computed with the great dataset maxiw/hf-posts
⚙️ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!

cc @merve who's far in front of me
reacted to cfahlgren1's post with 🔥 18 days ago
Post
2214
Why use Google Drive when you can have:

• Free storage with generous limits 🆓
• Dataset Viewer (Sorting, Filtering, FTS) 🔍
• Third Party Library Support
• SQL Console 🟧
• Security 🔒
• Community, Reach, and Visibility 📈

It's a no-brainer!

Check out our post on what you get instantly out of the box when you create a dataset.
https://huggingface.co/blog/researcher-dataset-sharing
  • 1 reply