Meta dropped swiss army knives for vision with A2.0 license ๐ > image/video encoders for vision language modelling and spatial understanding (object detection etc) ๐ > The vision LM outperforms InternVL3 and Qwen2.5VL ๐ > They also release gigantic video and image datasets
The authors attempt to come up with single versatile vision encoder to align on diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms latest sota SigLIP2 ๐
> Among fine-tuned ones, first one is PE-Spatial. It's a model to detect bounding boxes, segmentation, depth estimation and it outperforms all other models ๐ฎ
> Second one is PLM, Perception Language Model, where they combine PE-Core with Qwen2.5 LM 7B. it outperforms all other models (including InternVL3 which was trained with Qwen2.5LM too!)
The authors release the following checkpoints in sizes base, large and giant:
Authors release following datasets ๐ > PE Video: Gigantic video datasete of 1M videos with 120k expert annotations โฏ๏ธ > PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks > PLM-VideoBench: New video benchmark on MCQA
Most of the vision LMs focus on image as a whole, lacking localized references in captions, and not taking in visual prompts (points, boxes, drawings around objects)
DAM addresses this on two levels: new vision backbone that takes in focal crops and the image itself, and a large scale dataset ๐
They generate a dataset by extending existing segmentation and referring expression generation datasets like REFCOCO, by passing in the images and classes to VLMs and generating captions.
Lastly, they also release a new benchmark again with self-supervision, they use an LLM to evaluate the detailed captions focusing on localization ๐
multimodal > Moonshot AI released Kimi VL Thinking, first working open-source multimodal reasoning model and Kimi VL Instruct, both 16B MoEs with 3B active params (OS) > InternVL3 released based on Qwen2.5VL, 7 ckpts with various sizes (1B to 78B)
LLMs > NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1 an LLM built on Llama 405B for reasoning, chat and tool use > Agentica released DeepCoder-14B-Preview, fine-tuned version of DeepSeek-R1-Distilled-Qwen-14B on problem-test pairs, along with the compiled dataset > Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS) > Skywork-OR1-32B-Preview is a new reasoning model by Skywork
Image Generation > HiDream releases three new models, HiDream I1 Dev, I1 Full, and I1 fast for image generation (OS)
๐ Multimodal > Mistral AI released a 24B vision LM, both base and instruction FT versions, sota ๐ฅ (OS) > with IBM we released SmolDocling, a sota 256M document parser with Apache 2.0 license (OS) > SpatialLM is a new vision LM that outputs 3D bounding boxes, comes with 0.5B (QwenVL based) and 1B (Llama based) variants > SkyWork released SkyWork-R1V-38B, new vision reasoning model (OS)
๐ฌ LLMs > NVIDIA released new Nemotron models in 49B and 8B with their post-training dataset > LG released EXAONE, new reasoning models in 2.4B, 7.8B and 32B > Dataset: Glaive AI released a new reasoning dataset of 22M+ examples > Dataset: NVIDIA released new helpfulness dataset HelpSteer3 > Dataset: OpenManusRL is a new agent dataset based on ReAct framework (OS) > Open-R1 team released OlympicCoder, new competitive coder model in 7B and 32B > Dataset: GeneralThought-430K is a new reasoning dataset (OS)
๐ผ๏ธ Image Generation/Computer Vision > Roboflow released RF-DETR, new real-time sota object detector (OS) ๐ฅ > YOLOE is a new real-time zero-shot object detector with text and visual prompts ๐ฅน > Stability AI released Stable Virtual Camera, a new novel view synthesis model > Tencent released Hunyuan3D-2mini, new small and fast 3D asset generation model > ByteDance released InfiniteYou, new realistic photo generation model > StarVector is a new 8B model that generates svg from images > FlexWorld is a new model that expands 3D views (OS)
๐ค Audio > Sesame released CSM-1B new speech generation model (OS)
๐ค Robotics > NVIDIA released GR00T, new robotics model for generalized reasoning and skills, along with the dataset
*OS ones have Apache 2.0 or MIT license
reacted to AdinaY's
post with ๐about 1 month ago
Google just released PaliGemma 2 Mix: new versatile instruction vision language models ๐ฅ
> Three new models: 3B, 10B, 28B with res 224, 448 ๐ > Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything ๐คฏ
๐ Multimodal > OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context > AIDC released Ovis2 model family along with Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support > ColQwenStella-2b is a multilingual visual retrieval model that is sota in it's size > Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning, long context video understanding
๐ฌ LLMs A lot of math models! > Open-R1 team released OpenR1-Math-220k large scale math reasoning dataset, along with Qwen2.5-220K-Math fine-tuned on the dataset, OpenR1-Qwen-7B > Nomic AI released new Nomic Embed multilingual retrieval model, a MoE with 500 params with 305M active params, outperforming other models > DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math > LIMO is a new fine-tune of Qwen2.5-32B-Instruct on Math
๐ฃ๏ธ Audio > Zonos-v0.1 is a new family of speech recognition models, which contains the model itself and embeddings
๐ผ๏ธ Vision and Image Generation > We have ported DepthPro of Apple to transformers for your convenience! > illustrious-xl-v1.0 is a new illustration generation model
๐ค Robotics > Pi0, first open-source foundation vision-language action model was released in Le Robot (Apache 2.0)
๐ฌ LLMs > Groundbreaking: s1 is simpler approach to test-time scaling, the release comes with small s1K dataset of 1k question-reasoning trace pairs (from Gemini-Thinking Exp) they fine-tune Qwen2.5-32B-Instruct to get s1-32B, outperforming o1-preview on math ๐คฏ s1-32B and s1K is out! > Adyen released DABstep, a new benchmark along with it's leaderboard demo for agents doing data analysis > Krutrim released Krutrim-2 instruct, new 12B model based on NeMo12B trained and aligned on Indic languages, a new multilingual sentence embedding model (based on STSB-XLM-R), and a translation model for Indic languages
๐ Multimodal > PKU released Align-DS-V, a model aligned using their new technique called LLF for all modalities (image-text-audio), along with the dataset Align Anything > OLA-7B is a new any-to-any model by Tencent that can take text, image, video, audio data with context window of 32k tokens and output text and speech in English and Chinese > Krutrim released Chitrarth, a new vision language model for Indic languages and English
๐ผ๏ธ Vision > BiRefNet_HR is a new higher resolution BiRefNet for background removal
๐ฃ๏ธ Audio > kyutai released Hibiki, it's a real-time speech-to-speech translation model ๐คฏ it's available for French-English translation > Krutrim released Dhwani, a new STT model for Indic languages > They also release a new dataset for STT-TTS
๐ผ๏ธ Image Generation > Lumina released Lumina-Image-2.0, a 2B parameter-flow based DiT for text to image generation > Tencent released Hunyuan3D-2, a 3D asset generation model based on DiT and Hunyuan3D-Paint > boreal-hl-v1 is a new boring photorealistic image generation LoRA based on Hunyuan
This week in open AI was ๐ฅ Let's recap! ๐ค merve/january-31-releases-679a10669bd4030090c5de4d LLMs ๐ฌ > Huge: AllenAI released new Tรผlu models that outperform DeepSeek R1 using Reinforcement Learning with Verifiable Reward (RLVR) based on Llama 3.1 405B ๐ฅ > Mistral AI is back to open-source with their "small" 24B models (base & SFT), with Apache 2.0 license ๐ฑ > Alibaba Qwen released their 1M context length models Qwen2.5-Instruct-1M, great for agentic use with Apache 2.0 license ๐ฅ > Arcee AI released Virtuoso-medium, 32.8B LLMs distilled from DeepSeek V3 with dataset of 5B+ tokens > Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages > OpenThinker-7B is fine-tuned version of Qwen2.5-7B-Instruct on OpenThoughts dataset
VLMs & vision ๐ > Alibaba Qwen is back with Qwen2.5VL, amazing new capabilities ranging from agentic computer use to zero-shot localization ๐ฅ > NVIDIA released new series of Eagle2 models with 1B and 9B sizes > DeepSeek released Janus-Pro, new any-to-any model (image-text generation from image-text input) with MIT license > BEN2 is a new background removal model with MIT license!
Audio ๐ฃ๏ธ > YuE is a new open-source music generation foundation model, lyrics-to-song generation
Given an input image, it generates several queries along with explanations to justify them. This approach can generate synthetic data for fine-tuning ColPali models.
reacted to m-ric's
post with ๐โค๏ธ๐ฅ3 months ago
โ Hosting our own inference was not enough: now the Hub 4 new inference providers: fal, Replicate, SambaNova Systems, & Together AI.
Check model cards on the Hub: you can now, in 1 click, use inference from various providers (cf video demo)
Their inference can also be used through our Inference API client. There, you can use either your custom provider key, or your HF token, then billing will be handled directly on your HF account, as a way to centralize all expenses.
๐ธ Also, PRO users get 2$ inference credits per month!
Multimodal ๐ฌ - We have released SmolVLM -- tiniest VLMs that come in 256M and 500M, with it's retrieval models ColSmol for multimodal RAG ๐ - UI-TARS are new models by ByteDance to unlock agentic GUI control ๐คฏ in 2B, 7B and 72B - Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B - MiniMaxAI released Minimax-VL-01, where decoder is based on MiniMax-Text-01 456B MoE model with long context - Dataset: Yale released a new benchmark called MMVU - Dataset: CAIS released Humanity's Last Exam (HLE) a new challenging MM benchmark
LLMs ๐ - DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, and six distilled dense models, on par with o1 with MIT license! ๐คฏ - Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B - NVIDIA released AceMath and AceInstruct, new family of models and their datasets (SFT and reward ones too!)
Audio ๐ฃ๏ธ - Llasa is a new speech synthesis model based on Llama that comes in 1B,3B, and 8B - TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation โฏ๏ธ - Flex.1-alpha is a new 8B pre-trained diffusion model by ostris similar to Flux - tencent released Hunyuan3D-2, new 3D asset generation from images
7 replies
ยท
reacted to m-ric's
post with ๐โค๏ธ๐ฅ3 months ago
Today we make the biggest release in smolagents so far: ๐๐ฒ ๐ฒ๐ป๐ฎ๐ฏ๐น๐ฒ ๐๐ถ๐๐ถ๐ผ๐ป ๐บ๐ผ๐ฑ๐ฒ๐น๐, ๐๐ต๐ถ๐ฐ๐ต ๐ฎ๐น๐น๐ผ๐๐ ๐๐ผ ๐ฏ๐๐ถ๐น๐ฑ ๐ฝ๐ผ๐๐ฒ๐ฟ๐ณ๐๐น ๐๐ฒ๐ฏ ๐ฏ๐ฟ๐ผ๐๐๐ถ๐ป๐ด ๐ฎ๐ด๐ฒ๐ป๐๐! ๐ฅณ
Our agents can now casually open up a web browser, and navigate on it by scrolling, clicking elements on the webpage, going back, just like a user would.
The demo below shows Claude-3.5-Sonnet browsing GitHub for task: "Find how many commits the author of the current top trending repo did over last year." Hi @mlabonne !
Go try it out, it's the most cracked agentic stuff I've seen in a while ๐คฏ (well, along with OpenAI's Operator who beat us by one day)
smolagents can see ๐ฅ we just shipped vision support to smolagents ๐ค agentic computers FTW
you can now: ๐ป let the agent get images dynamically (e.g. agentic web browser) ๐ pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc) with few LoC change! ๐คฏ you can use transformers models locally (like Qwen2VL) OR plug-in your favorite multimodal inference provider (gpt-4o, antrophic & co) ๐ค