congratulations! well deserved!
Clem 🤗 PRO
clem
AI & ML interests
multi-modal, time-series, biology and chemistry
Organizations
clem's activity
replied to danielhanchen's post 7 days ago
gotta catch them all!
you should create an org on HF for it
posted an update 11 days ago
Post · 2059
Great in-depth Llama-3 tests from @wolfram, of the models from Meta of course but also @MaziyarPanahi, @emozilla, @turboderp: https://huggingface.co/blog/wolfram/llm-comparison-test-llama-3
Spotted by @jack-kumar
replied to their post 11 days ago
posted an update 12 days ago
Post · 2651
Already almost 1,000 llama3 model variations have been shared publicly on HF (many more in private use at companies): https://huggingface.co/models?p=5&sort=trending&search=llama3.
Everyone should fine-tune their own models for their use-cases, languages, industry, infra constraints,...
10,000 llama3 variants by the end of next week?
replied to visheratin's post 16 days ago
Thank you! You should tweet it mentioning @elonmuskceo!
posted an update 17 days ago
Post · 2637
We noticed that all the open-source models and datasets from https://huggingface.co/WizardLM in their personal Hugging Face account & in the Microsoft Hugging Face organization (https://huggingface.co/microsoft) have been made private by the author, which will lead some demos to fail (these models were collectively downloaded over a hundred thousand times a month).
This is the explanation that @WizardLM communicated a few hours ago: https://huggingface.co/posts/WizardLM/329547800484476#661e0d17bca1a6038b60503e
We apologize for the inconvenience & are trying to get in touch with the author & Microsoft in order to try to find a good resolution for community members. Let us know if you have any questions!
posted an update 18 days ago
Post · 2400
Fun dataset added last week by @esind from https://huggingface.co/Anthropic to compare persuasiveness between AI and human outputs:
Anthropic/persuasion
posted an update 29 days ago
Post · 2490
Introducing gretelai/synthetic_text_to_sql by https://huggingface.co/gretelai
It stands as the largest and most diverse synthetic Text-to-SQL dataset available to-date.
The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct domains/verticals
- Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting
- Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations
- Database context, including table and view create statements
- Natural language explanations of what the SQL query is doing
- Contextual tags to optimize model training
Blogpost: https://gretel.ai/blog/synthetic-text-to-sql-dataset
Dataset: gretelai/synthetic_text_to_sql
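As a quick sanity check on the figures quoted above (a minimal sketch using only the numbers from the post itself, not the actual dataset):

```python
# Split sizes as quoted in the post above.
train_records = 100_000
test_records = 5_851

# The post states 105,851 records in total; check the partition adds up.
total = train_records + test_records
print(total)                  # 105851
print(total == 105_851)       # True

# It also quotes ~23M total tokens, ~12M of them SQL tokens,
# i.e. roughly half the tokens in the dataset are SQL.
print(round(12e6 / 23e6, 2))  # 0.52
```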
Thanks for sharing!
Welcome @josefprusa!
posted an update 2 months ago
Post
Terribly excited about open-source + on-device AI these days! Great to see @qualcomm release 80+ models optimized and curated for their devices and chips on HF: https://huggingface.co/qualcomm
replied to dvilasuero's post 2 months ago
Unpopular opinion: this is the most impactful release of the day (because open)!
replied to DmitryRyumin's post 2 months ago
would be cool to have some integration with the HF hub
replied to trisfromgoogle's post 2 months ago
This is awesome!
very cool!
🇫🇷🇫🇷🇫🇷
replied to dvilasuero's post 3 months ago
Very cool!
replied to clefourrier's post 3 months ago
very useful! This is the link to the leaderboard btw: https://huggingface.co/spaces/PatronusAI/enterprise_scenarios_leaderboard
very cool!
posted an update 3 months ago
Post
So impressed with the speed and accuracy of vikhyatk/moondream1 by @vikhyatk (especially the last answer!).
Open multi-modal models have come a long way!
Model: vikhyatk/moondream1
posted an update 3 months ago
Post
With the Google announcement last week, I think we're now officially the only AI startup out there who has commercial collaborations with all the major cloud providers (AWS, GCP, Azure) and hardware providers (Nvidia, AMD, Intel, Qualcomm,...), making our vision of being the independent and agnostic platform for all AI builders truer than ever!
Let's go!
posted an update 3 months ago
Post
In 2024, we're expanding from open weights to open EVERYTHING (datasets, training scripts,...).
Excited to see this dataset release in French by @Pclanglais @carbonbasedLLM @anastasiastasenko :
PleIAs/French-PD-Newspapers
"To give you an idea of the size, the full French Wikipedia is about 2 billion words. This is 40 times larger."
Very cool!
posted an update 3 months ago
Post
Google + Hugging Face + Open-Source AI = 🔥🔥🔥
https://huggingface.co/blog/gcp-partnership
https://finance.yahoo.com/video/google-hugging-face-alliance-spur-173016882.html
https://www.theverge.com/2024/1/25/24050445/google-cloud-hugging-face-ai-developer-access
https://www.bloomberg.com/news/articles/2024-01-25/google-to-team-up-with-startup-hugging-face-to-host-ai-software
https://www.reuters.com/technology/google-cloud-partners-with-hugging-face-attract-ai-developers-2024-01-25/
posted an update 3 months ago
Post
Re-posting @karpathy's blogpost here because it's down on https://karpathy.github.io/2024/01/21/selfdriving-agi. What do you all think?
very cool!
posted an update 4 months ago
Post
Most upvoted papers of 2023 on HF. What do you think are going to be the most prominent research topics in AI for 2024? (Also, don't forget to add your papers to the hub this year!)
From: hysts/daily-papers
replied to their post 4 months ago
🔥🔥🔥
posted an update 4 months ago
Post
Is synthetic data the future of AI? 🔥🔥🔥
@HugoLaurencon @Leyo & @VictorSanh are introducing HuggingFaceM4/WebSight, a multimodal dataset featuring 823,000 pairs of synthetically generated HTML/CSS codes along with screenshots of the corresponding rendered websites to train GPT4-V-like models
While crafting their upcoming foundation vision language model, they faced the challenge of converting website screenshots into usable HTML/CSS codes. Most VLMs suck at this and there was no public dataset available for this specific task, so they decided to create their own.
They prompted existing LLMs to generate 823k HTML/CSS codes of very simple websites. Through supervised fine-tuning of a vision language model on WebSight, they were able to generate the code to reproduce a website component, given a screenshot.
You can explore the dataset here: HuggingFaceM4/WebSight
What do you think?
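The pipeline described above (synthetically generated HTML/CSS plus a screenshot of the rendered page, paired up for supervised fine-tuning) can be sketched as a data shape. The `ScreenshotCodePair` type and the field names below are illustrative assumptions, not WebSight's actual schema:

```python
from dataclasses import dataclass

# Illustrative sketch of one WebSight-style training pair: a screenshot
# of a rendered website plus the HTML/CSS code that produced it.
@dataclass
class ScreenshotCodePair:
    screenshot_path: str  # image of the rendered page
    html_css: str         # source code that generated it

def to_sft_example(pair: ScreenshotCodePair) -> dict:
    """Format one pair for supervised fine-tuning of a vision language
    model: the screenshot is the input, the code is the target."""
    return {"input_image": pair.screenshot_path, "target": pair.html_css}

example = to_sft_example(
    ScreenshotCodePair("site_0001.png", "<html><body>Hello</body></html>")
)
print(example["target"])  # <html><body>Hello</body></html>
```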
very cool!