smolagents can see: we just shipped vision support to smolagents. Agentic computers FTW!
You can now:
> let the agent fetch images dynamically (e.g. an agentic web browser)
> pass images at agent init (e.g. chatting with documents, filling forms automatically, etc.) with only a few lines of code changed!
You can use transformers models locally (like Qwen2-VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co).
Multimodal
> ByteDance released Sa2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook: 22k hours worth of samples from instruction videos
> Dataset: SciCap, a captioning benchmark dataset for scientific documents, is released along with a challenge!
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT large
> KaLM is a new family of performant multilingual embedding models with an MIT license, built on Qwen2-0.5B
Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models that generate worlds from images, videos and text
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m: Cosmos-tokenized samples from OpenVid-1M
Others
> Prior Labs released TabPFNv2, the best tabular transformer yet, for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding