Uplimit (Uplimit)

davidberenstein1957

posted an update 15 days ago

Post

3196

🚀 Find banger tools for your smolagents!

I created the Tools gallery, which makes tools specifically developed by/for smolagents searchable and visible. This will help with:
- inspiration
- best practices
- finding cool tools

Space: davidberenstein1957/smolagents-and-tools

1 reply


davidberenstein1957

posted an update 16 days ago

Post

2428

Fine-tune Deepseek-R1 with a Synthetic Reasoning Dataset

Blog: https://huggingface.co/blog/sdiazlor/fine-tune-deepseek-with-a-synthetic-reasoning-data

davidberenstein1957

posted an update 21 days ago

Post

2046

Agentic RAG: Applied, visual, and step-by-step! 🐾

Get familiar with the agents and tools, not the bells and whistles!

Retrieve - Augment and now GENERATE.

part 3: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-3-generate

davidberenstein1957

posted an update 22 days ago

Post

2852

Anyone can create free hosted tools for their AI agents! 🔥

Agentic RAG stack part 2 - augment
Augmenting retrieval results with reranking improves the quality of the retrieved content without adding much latency.
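
A minimal sketch of the rerank step, assuming the candidates were already retrieved. The `overlap_score` function is a toy stand-in for a real reranker model, and all names here are hypothetical rather than taken from the blog or the ai-blueprint repo.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage.

    A real stack would call a reranker model here; this is only a placeholder.
    """
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, candidates, top_k=3, score_fn=overlap_score):
    """Reorder retrieved candidates by score and keep the top_k best."""
    ranked = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:top_k]


retrieved = [
    "DuckDB reads Parquet files",
    "Reranking reorders retrieved passages by relevance",
    "Agents call tools",
]
best = rerank("how does reranking order retrieved passages", retrieved, top_k=1)
```

Only the handful of retrieved candidates gets scored, not the whole index, which is why the extra step stays cheap.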

part2: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-2-augment
code: https://github.com/huggingface/ai-blueprint

davidberenstein1957

posted an update 23 days ago

Post

1994

Creating an agentic RAG stack on the Hugging Face Hub - part 1 - retrieval (1/5).

🚀 Web apps and microservices included!

Chunk, embed and index documents at a huge scale without overhead.
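
To make the "chunk" step concrete, here is a rough sketch of fixed-size chunking with overlap. The character-based windows and the `chunk_size` / `overlap` defaults are assumptions for illustration, not the settings used in the blog.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Each chunk would then be embedded and written to the index. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.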

Blog: https://huggingface.co/blog/davidberenstein1957/ai-blueprint-agentic-rag-part-1-retrieve

davidberenstein1957

posted an update 28 days ago

Post

1624

TL;DR: Parquet is awesome, DuckDB too!

Datasets on the Hugging Face Hub rely on parquet files. We can interact with these files using DuckDB as a fast in-memory database system. One of DuckDB’s features is vector similarity search which can be used with or without an index.
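
For intuition, this is roughly what a similarity search without an index computes: score every row against the query vector and keep the best ones. It is a plain-Python sketch, not DuckDB's API. DuckDB would run the equivalent as SQL over the Parquet files, and the rows and vectors below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, rows, k=2):
    """Brute-force search: score every (text, vector) row and return the top k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical rows standing in for a dataset with an embeddings column.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
hits = search([1.0, 0.0], rows, k=2)
```

An index trades this full scan for an approximate lookup, which matters once the dataset no longer fits a quick linear pass.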

blog:
https://huggingface.co/learn/cookbook/vector_search_with_hub_as_backend

davidberenstein1957

posted an update about 1 month ago

Post

1795

Let's uncover the post-training dataset from DeepSeek-R1 with Magpie!

Pass the pre-query tokens <｜begin▁of▁sentence｜>User: and let the model generate the rest.
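
The idea can be sketched without a model: send only the pre-query tokens, then cut the completion where the assistant turn would start. `fake_generate` is a stub standing in for a real inference call, and the `Assistant:` stop marker is an assumption. Check the gist for the actual setup.

```python
PRE_QUERY = "<｜begin▁of▁sentence｜>User: "  # pre-query tokens from the post
ASSISTANT_MARKER = "Assistant:"  # assumed stop marker, not confirmed by the post


def extract_instruction(completion: str, stop_marker: str = ASSISTANT_MARKER) -> str:
    """Keep only the generated user turn, dropping anything after the stop marker."""
    instruction, _, _ = completion.partition(stop_marker)
    return instruction.strip()


def magpie_sample(generate, pre_query: str = PRE_QUERY) -> str:
    """Prompt with the pre-query tokens only and return the instruction the model wrote."""
    completion = generate(pre_query)
    return extract_instruction(completion)


def fake_generate(prompt: str) -> str:
    """Stub for a model call; a real setup would send `prompt` to an inference endpoint."""
    return "How do I reverse a list in Python?\n\nAssistant: Use reversed()."


instruction = magpie_sample(fake_generate)
```

Because the prompt ends right where a user message would begin, the model fills in a plausible user query on its own.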

We can get realistic examples!

Gist: https://gist.github.com/davidberenstein1957/3f20046ce57395a6aba13f8b4e956b59

6 replies


davidberenstein1957

posted an update about 1 month ago

Post

1884

The RAG's in the bag!

You can now use the Synthetic Data Generator with your own domain-specific seed data to generate a dataset for fine-tuning retrieval or reranking models.

GitHub: https://buff.ly/49IDSmd
Spaces: https://buff.ly/3Y1S99z
Blog: https://huggingface.co/blog/sdiazlor/fine-tune-modernbert-for-rag-with-synthetic-data

1 reply


davidberenstein1957

posted an update about 1 month ago

Post

1250

You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥

Install, configure, launch!

Space: argilla/synthetic-data-generator
Examples: https://github.com/argilla-io/synthetic-data-generator/tree/main/examples

davidberenstein1957

posted an update about 1 month ago

Post

2112

🔦 What? The Hub as a vector search backend!

code: https://gist.github.com/davidberenstein1957/f0157a471ec59d9dd44ae6957f1d52ec
built on DuckDB: https://huggingface.co/docs/hub/en/datasets-duckdb

davidberenstein1957

posted an update about 2 months ago

Post

1949

Fine-tune a SmolLM on domain-specific synthetic data from an LLM

Blog: https://huggingface.co/blog/davidberenstein1957/fine-tune-a-smollm-on-synthetic-data-of-llm

1 reply


davidberenstein1957

posted an update about 2 months ago

Post

2029

Fine-tuning ModernBERT for text classification using synthetic data generation

From prompt to model in 3 steps.
1 dataset description
20 minutes of generating
60 minutes of fine-tuning on my MacBook Pro

Tutorial: https://nbsanity.com/static/552eb50cbd91bedb4e5b73fddca2664a/fine-tune-modernbert-classifier.html

davidberenstein1957

posted an update 2 months ago

Post

1369

🐇 Tumble down the AI rabbit hole without any technical knowledge!

Explore AI models on the Hub with a simple, quick search

Demo: davidberenstein1957/transformers-pipeline-playground

davidberenstein1957

posted an update 2 months ago

Post

4229

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process makes dataset creation a non-technical breeze, so anyone can create datasets and models in minutes without writing any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator

4 replies


davidberenstein1957

posted an update 3 months ago

Post

2085

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.

https://huggingface.co/blog/image-preferences

davidberenstein1957

posted an update 3 months ago

Post

1191

This is amazing for cheap model fine-tunes without the hassle of actual deployment! TIL: LoRA fine-tunes for models on the Hub can be used directly for inference!

davidberenstein1957

posted an update 3 months ago

Post

3473

The Data Is Better Together community is set to release the first Apache 2.0 licensed image preference dataset!

Great work and let's give this a final push :)

@aashish1904 congrats on your month of HF Pro. There is more to win during this sprint!

@aashish1904 @AnyaDesdein @davidberenstein1957 @Malalatiana @beta3 @fffiloni @munish0838 @Reza2kn @bbunzeck @Creazycreator @andrei-saceleanu @jafhaponiuk @rca-etl @kf120 @burtenshaw @mmhamdy @grib0ed0v @Doopus @AnyaDes @ttkap @Xceron @Lewox @davanstrien @Azazelle @adirik @Ashish08 @AntonVic @kenantang @sdiazlor @g-ronimo @dennis-rall @prithivMLmods @girtss3 @flozi00 @WaveCut @Taylor658 @Wildminder @Sara9999 @phaelishall @sararob @dvilasuero @pgabrys @plaguss @CDS899 @timajwilliams @rudzinskimaciej @pavel-ai @aggr8 @ignacioct @MouseAI @Leeps @MaksKul @NicolasDmln @Muinez @kusht55 @caiolang @Jakub-Brand24 @loamy @Demijan @eliab96 @Viewegger @JosephCatrambone @p1atdev @mrshu @o639 @Targezed @Aviv-anthonnyolime @thliang01 @Ahmed-Amine @glards @pranaykoppula @nataliaElv @MaPirlet @alvarobartt @gabrielmbmb @zlicastro @Jaydip @Chouettecheveche @lilcheaty @ruyrdiaz @robintema @fdaudens @ggcristian @a-r-r-o-w @pates @joheras @stopsatgreen @bezo97 @chachi902 @iamyann @liamcripwell @dmb23 @korbih @anonymous7743 @akbdx18 @OVAWARE @severo @akontra @lichorosario @lhoestq @SebastianBodza @Vishnou @ameerazam08 @appoose @Mukei @mearco @joaquincabezas @Fizzarolli @thomastraum @igortopolski @OxxoCodes @patrickfleith @asoria @bn22 @sitammeur @Krodolf @bergr7f @Sbxxn @wietsevenema @sugatoray @Iamladi @MikeTrizna @feveromo @mokady @Bolero @prath @Dowwie @kfahn @decodingchris @alili2050 @RahulRaman @yzimmermann @Ameeeee @ecyht2 @MattMC001 @hemanthkumarak @Thegorgibus @akos2 @LawRun @ramithuh @SuperMuel @sjans @peterizsak @mosama @Eyel @mtr3 @cfahlgren1 @legentil @clem @Citaman @Aurelien-Morgan @AntoineBourgois @TotoB12 @Stanmey @osanseviero @multimodalart @maxiw @ariG23498 @ngk89 @femboysLover @dvs @tacohiddink @blanchon @DavidJimenez

1 reply


davidberenstein1957

posted an update 3 months ago

Post

1595

🔥 Dataset Drop - Open Image Preferences

Black Forest Labs FLUX.1 [dev] vs. Stability AI Stable Diffusion 3.5 Large

Together with the data-is-better-together community, we've worked on an Apache 2.0 licensed open image preference dataset based on the fal.ai imgsys prompts dataset. Thanks to the awesome community, we have managed to get 5K preference pairs in less than 2 days. The agreement among annotators is great too.

Aashish Kumar won a month of Hugging Face Pro by making the most contributions! Congrats from the entire team 🥇

The best thing?! We are not done yet! Let's keep the annotations coming for 5K more in the second part of the sprint (with more prizes to go around)!

Dataset: https://huggingface.co/datasets/data-is-better-together/image-preferences-results

davidberenstein1957

posted an update 3 months ago

Post

1718

Let’s make a generation of amazing image-generation models

The best image generation models are trained on human preference datasets, where annotators have selected the best image from a choice of two. Unfortunately, many of these datasets are closed source so the community cannot train open models on them. Let’s change that!

The community can contribute image preferences to an open-source dataset for building text-to-image models like the FLUX or Stable Diffusion families. Because the dataset is open source, everyone can use it to train models that we can all use.
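
As a rough picture of what "selecting the best image from a choice of two" turns into, here is a sketch that aggregates annotator votes into chosen/rejected pairs by majority vote. The `Vote` fields, the tie handling, and the file names are assumptions for illustration, not the schema of the actual dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One annotator's choice between two images generated for the same prompt."""
    prompt: str
    image_a: str
    image_b: str
    winner: str  # "a" or "b"


def aggregate(votes):
    """Majority-vote each (prompt, image_a, image_b) pair; exact ties are skipped."""
    tallies = {}
    for vote in votes:
        if vote.winner not in ("a", "b"):
            raise ValueError(f"winner must be 'a' or 'b', got {vote.winner!r}")
        key = (vote.prompt, vote.image_a, vote.image_b)
        tallies.setdefault(key, Counter())[vote.winner] += 1

    pairs = []
    for (prompt, image_a, image_b), counts in tallies.items():
        if counts["a"] == counts["b"]:
            continue  # no majority, so no preference signal
        chosen, rejected = (image_a, image_b) if counts["a"] > counts["b"] else (image_b, image_a)
        total = counts["a"] + counts["b"]
        pairs.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "agreement": max(counts.values()) / total,  # share of votes for the winner
        })
    return pairs


votes = [
    Vote("a cat", "model_1.png", "model_2.png", "a"),
    Vote("a cat", "model_1.png", "model_2.png", "a"),
    Vote("a cat", "model_1.png", "model_2.png", "b"),
]
pairs = aggregate(votes)
```

The chosen/rejected layout is the shape preference-tuning methods typically expect, and the agreement value gives a quick check on how consistent the annotators were.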

Blog: https://huggingface.co/blog/burtenshaw/image-preferences

davidberenstein1957

posted an update 3 months ago

Post

971

Watch and learn!

Let's observe Qwen2.5-coder:0.5b on OpenAI HumanEval.

pip install observers

And start collecting your data on the Hugging Face Hub.
Dataset: davidberenstein1957/openai_records
Library: https://github.com/cfahlgren1/observers

Uplimit

AI & ML interests

Uplimit's activity


Team members 1
