AI & ML interests

Webhooks are now publicly available on Hugging Face!

Recent Activity

webhooks-explorers's activity

davanstrien 
posted an update 1 day ago
view post
Post
792
Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
julien-c 
posted an update 11 days ago
view post
Post
7359
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
·
julien-c 
posted an update 22 days ago
view post
Post
2145
wow 😮

INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.

PrimeIntellect/INTELLECT-1-Instruct
davanstrien 
posted an update 23 days ago
view post
Post
462
Increasingly, LLMs are becoming very useful for helping scale annotation tasks, i.e. labelling and filtering. When combined with the structured generation, this can be a very scalable way of doing some pre-annotation without requiring a large team of human annotators.

However, there are quite a few cases where it still doesn't work well. This is a nice paper looking at the limitations of LLM as an annotator for Low Resource Languages: On Limitations of LLM as Annotator for Low Resource Languages (2411.17637).

Humans will still have an important role in the loop to help improve models for all languages (and domains).
davanstrien 
posted an update 25 days ago
view post
Post
2465
First dataset for the new Hugging Face Bluesky community organisation: bluesky-community/one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

Excited to see people build more open tools for a more open social media platform!
davanstrien 
posted an update 26 days ago
view post
Post
1348
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community

My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API 🚰

bluesky-community/bluesky-posts-over-time
  • 1 reply
·
davanstrien 
posted an update about 1 month ago
awacke1 
posted an update about 1 month ago
view post
Post
813
🕊️Hope🕊️ and ⚖️Justice⚖️ AI
🚲 Stolen bike in Denver FOUND - Sometimes hope & justice DO prevail.

🎬 So I Created an AI+Art+Music tribute:
-🧠 AI App that Evaluates GPT-4o vs Claude:
awacke1/RescuerOfStolenBikes
https://x.com/Aaron_Wacker/status/1857640877986033980?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1857640877986033980%7Ctwgr%5E203a5022b0eb4c41ee8c1dd9f158330216ac5be1%7Ctwcon%5Es1_c10&ref_url=https%3A%2F%2Fpublish.twitter.com%2F%3Furl%3Dhttps%3A%2F%2Ftwitter.com%2FAaron_Wacker%2Fstatus%2F1857640877986033980

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">QT your 🕊️Hope🕊️ and ⚖️Justice⚖️ art🎨<br><br>🚲 Stolen bike in Denver FOUND! <br> - Sometimes hope &amp; justice DO prevail! <br><br>🎬 Created an AI+Art+Music tribute: <br> -🧠 AI App that Evaluates GPT-4o vs Claude: <a href="https://t.co/odrYdaeizZ">https://t.co/odrYdaeizZ</a><br> <a href="https://twitter.com/hashtag/GPT?src=hash&amp;ref_src=twsrc%5Etfw">#GPT</a> <a href="https://twitter.com/hashtag/Claude?src=hash&amp;ref_src=twsrc%5Etfw">#Claude</a> <a href="https://twitter.com/hashtag/Huggingface?src=hash&amp;ref_src=twsrc%5Etfw">#Huggingface</a> <a href="https://twitter.com/OpenAI?ref_src=twsrc%5Etfw">@OpenAI</a> <a href="https://twitter.com/AnthropicAI?ref_src=twsrc%5Etfw">@AnthropicAI</a> <a href="https://t.co/Q9wGNzLm5C">pic.twitter.com/Q9wGNzLm5C</a></p>&mdash; Aaron Wacker (@Aaron_Wacker) <a href="https://twitter.com/Aaron_Wacker/status/1857640877986033980?ref_src=twsrc%5Etfw">November 16, 2024</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


#GPT #Claude #Huggingface
@OpenAI
@AnthropicAI
chansung 
posted an update about 1 month ago
view post
Post
1792
🎙️ Listen to the audio "Podcast" of every single Hugging Face Daily Papers.

Now, "AI Paper Reviewer" project can automatically generates audio podcasts on any papers published on arXiv, and this is integrated into the GitHub Action pipeline. I sounds pretty similar to hashtag#NotebookLM in my opinion.

🎙️ Try out yourself at https://deep-diver.github.io/ai-paper-reviewer/

This audio podcast is powered by Google technologies: 1) Google DeepMind Gemini 1.5 Flash model to generate scripts of a podcast, then 2) Google Cloud Vertex AI's Text to Speech model to synthesize the voice turning the scripts into the natural sounding voices (with latest addition of "Journey" voice style)

"AI Paper Reviewer" is also an open source project. Anyone can use it to build and own a personal blog on any papers of your interests. Hence, checkout the project repository below if you are interested in!
: https://github.com/deep-diver/paper-reviewer

This project is going to support other models including open weights soon for both text-based content generation and voice synthesis for the podcast. The only reason I chose Gemini model is that it offers a "free-tier" which is enough to shape up this projects with non-realtime batch generations. I'm excited to see how others will use this tool to explore the world of AI research, hence feel free to share your feedback and suggestions!
  • 2 replies
·
chansung 
posted an update about 2 months ago
view post
Post
4628
Effortlessly stay up-to-date with AI research trends using a new AI tool, "AI Paper Reviewer" !!

It analyzes a list of Hugging Face Daily Papers(w/ @akhaliq ) and turn them into insightful blog posts. This project leverages Gemini models (1.5 Pro, 1.5 Flash, and 1.5 Flash-8B) for content generation and Upstage Document Parse for parsing the layout and contents.
blog link: https://deep-diver.github.io/ai-paper-reviewer/

Also, here is the link of GitHub repository for parsing and generating pipeline. By using this, you can easily build your own GitHub static pages based on any arXiv papers with your own interest!
: https://github.com/deep-diver/paper-reviewer
davanstrien 
posted an update about 2 months ago
awacke1 
posted an update about 2 months ago
view post
Post
1867
Since 2022 I have been trying to understand how to support advancement of the two best python patterns for AI development which are:
1. Streamlit
2. Gradio

The reason I chose them in this order was the fact that the streamlit library had the timing drop on gradio by being available with near perfection about a year or two before training data tap of GPT.

Nowadays its important that if you want current code to be right on generation it requires understanding of consistency in code method names so no manual intervention is required with each try.

With GPT and Claude being my top two for best AI pair programming models, I gravitate towards streamlit since aside from common repeat errors on cache and experimental functions circa 2022 were not solidified.
Its consistency therefore lacks human correction needs. Old dataset error situations are minimal.

Now, I seek to make it consistent on gradio side. Why? Gradio lapped streamlit for blocks paradigm and API for free which are I feel are amazing features which change software engineering forever.

For a few months I thought BigCode would become the new best model due to its training corpus datasets, yet I never felt it got to market as the next best AI coder model.

I am curious on Gradio's future and how. If the two main models (GPT and Claude) pick up the last few years, I could then code with AI without manual intervention. As it stands today Gradio is better if you could get the best coding models to not repeatedly confuse old syntax as current syntax yet we do live in an imperfect world!

Is anyone using an AI pair programming model that rocks with Gradio's latest syntax? I would like to code with a model that knows how to not miss the advancements and syntax changes that gradio has had in the past few years. Trying grok2 as well.

My IDE coding love is HF. Its hands down faster (100x) than other cloud paradigms. Any tips on models best for gradio coding I can use?

--Aaron
·
mrfakename 
posted an update 2 months ago
view post
Post
5771
I just released an unofficial demo for Moonshine ASR!

Moonshine is a fast, efficient, & accurate ASR model released by Useful Sensors. It's designed for on-device inference and licensed under the MIT license!

HF Space (unofficial demo): mrfakename/Moonshine
GitHub repo for Moonshine: https://github.com/usefulsensors/moonshine
awacke1 
posted an update 2 months ago
view post
Post
698
Today I was able to solve a very difficult coding session with GPT-4o which ended up solving integrations on a very large scale. So I decided to look a bit more into how its reasoners work. Below is a fun markdown emoji outline about what I learned today and what I'm pursuing.

Hope you enjoy! Cheers, Aaron.

Also here are my favorite last 4 spaces I am working on:
1. GPT4O: awacke1/GPT-4o-omni-text-audio-image-video
2. Claude:
awacke1/AnthropicClaude3.5Sonnet-ACW
3. MSGraph M365: awacke1/MSGraphAPI
4. Azure Cosmos DB: Now with Research AI! awacke1/AzureCosmosDBUI

# 🚀 OpenAI's O1 Models: A Quantum Leap in AI

## 1. 🤔 From 🦜 to 🧠: O1's Evolution

- **Thinking AI**: O1 ponders before replying; GPT models just predict. 💡

## 2. 📚 AI Memory: 💾 + 🧩 = 🧠

- **Embeddings & Tokens**: Words ➡️ vectors, building knowledge. 📖

## 3. 🔍 Swift Knowledge Retrieval

- **Vector Search & Indexing**: O1 finds info fast, citing reliable sources. 🔎📖

## 4. 🌳 Logic Trees with Mermaid Models

- **Flowchart Reasoning**: O1 structures thoughts like diagrams. 🎨🌐

## 5. 💻 Coding Mastery

- **Multilingual & Current**: Speaks many code languages, always up-to-date. 💻🔄

## 6. 🏆 Breaking Records

- **92.3% MMLU Score**: O1 outperforms humans, setting new AI standards. 🏅

## 7. 💡 Versatile Applications

- **Ultimate Assistant**: From fixing code to advancing research. 🛠️🔬

## 8. 🏁 Racing Toward AGI

- **OpenAI Leads**: O1 brings us closer to true AI intelligence. 🚀

## 9. 🤖 O1's Reasoning Pillars

- **🧠 Chain of Thought**: Step-by-step logic.
- **🎲 MCTS**: Simulates options, picks best path.
- **🔍 Reflection**: Self-improves autonomously.
- **🏋️‍♂️ Reinforcement Learning**: Gets smarter over time.

---

*Stay curious, keep coding!* 🚀
awacke1 
posted an update 2 months ago
view post
Post
577
I have finally completed a working full Azure and Microsoft MS Graph API implementation which can use all the interesting MS AI features in M365 products to manage CRUD patterns for the graph features across products.

This app shows initial implementation of security, authentication, scopes, and access to Outlook, Calendar, Tasks, Onedrive and other apps for CRUD pattern as AI agent service skills to integrate with your AI workflow.


Below are initial screens showing integration:

URL: awacke1/MSGraphAPI
Discussion: awacke1/MSGraphAPI#5

Best of AI on @Azure and @Microsoft on @HuggingFace :
https://huggingface.co/microsoft
https://www.microsoft.com/en-us/research/
---
Aaron
davanstrien 
posted an update 3 months ago
awacke1 
posted an update 3 months ago
view post
Post
993
Updated my 📺RTV🖼️ - Real Time Video AI app this morning.
URL: awacke1/stable-video-diffusion

It uses Stable Diffusion to dynamically create videos from images in input directory or uploaded using A10 GPU on Huggingface.


Samples below.

I may transition this to Zero GPU if I can. During Christmas when I revised this I had my highest billing from HF yet due to GPU usage. It is still the best turn key GPU out and Image2Video is a killer app. Thanks HF for the possibilities!
awacke1 
posted an update 3 months ago
davanstrien 
posted an update 3 months ago
view post
Post
2187
Yesterday, I shared a blog post on generating data for fine-tuning ColPali using the Qwen/Qwen2-VL-7B-Instruct model.

To simplify testing this approach, I created a Space that lets you generate queries from an input document page image: davanstrien/ColPali-Query-Generator

I think there is much room for improvement, but I'm excited about the potential for relatively small VLMs to create synthetic data.

You can read the original blog post that goes into more detail here: https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html