AI & ML interests

None defined yet.

Recent Activity

blog-explorers's activity

merve 
posted an update about 12 hours ago
view post
Post
597
What a beginning to this year in open ML 🤠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook — 22k hours worth of samples from instruction videos 🤯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs 💬
> Microsoft released Phi-4, sota open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs 📕
> Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings 🔖
> @MoritzLaurer released zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
davanstrien 
posted an update about 16 hours ago
view post
Post
713
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data ( HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is on way of improving the quality of the data you use for training an LLM. Very importantly this approach can also reduce the amount of data needed for pertaining.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder only model to label the full dataset. However, this may not work well for languages outside of english. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and huerisitcs for refining fineweb2 further for different languages.

This week the following languages where done:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
  • 1 reply
·
merve 
posted an update 1 day ago
view post
Post
1273
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license 💗 ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation) both for images and videos ⏯️

> The models come in 1B, 4B and 8B and are based on InternVL2.5 for base architecture and Qwen2, Qwen2.5 and InternLM2 for language model part (depending on the checkpoint)

> The model is very interesting, it has different encoders for different modalities each (visual prompt, text prompt, image and video) then it concatenates these to feed into LLM 💬

the output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️

> Their annotation pipeline is also interesting, they seems to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.
  • 1 reply
·
davidberenstein1957 
posted an update 7 days ago
Xenova 
posted an update 9 days ago
tomaarsen 
posted an update 10 days ago
view post
Post
2616
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!

Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible

Try it out here: nomic-ai/modernbert-embed-base

Very nice work by Zach Nussbaum and colleagues at Nomic AI.
merve 
posted an update 10 days ago
view post
Post
4710
supercharge your LLM apps with smolagents 🔥

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog for you to get started https://huggingface.co/blog/smolagents
davidberenstein1957 
posted an update 12 days ago
davanstrien 
posted an update 15 days ago
view post
Post
3126
🇸🇰 Hovorte po slovensky? Help build better AI for Slovak!

We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset ( data-is-better-together/fineweb-c) release!

Your contribution will help create better language models for 5+ million Slovak speakers.

Annotate here: data-is-better-together/fineweb-c.

Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community
  • 3 replies
·
merve 
posted an update 17 days ago
davanstrien 
posted an update 21 days ago
view post
Post
1759
Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu the community is labelling the educational quality of texts for many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c
davidberenstein1957 
posted an update 23 days ago
Xenova 
posted an update 23 days ago
view post
Post
3352
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web
·
merve 
posted an update 24 days ago
view post
Post
2787
Aya by Cohere For AI can now see! 👀

C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 works on 8 languages! 🗣️

The authors extend Llava dataset using Aya's translation capabilities with 558k examples!
ry it here kkr5155/maya_demo

Dataset maya-multimodal/pretrain

Model maya-multimodal/maya 👏
kudos @nahidalam and team
  • 1 reply
·