Daniel Serrano

dnlserrano

AI & ML interests

None yet

Organizations

None yet

dnlserrano's activity

upvoted a paper 17 days ago

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published 20 days ago • 135

reacted to merve's post with 🔥 about 2 months ago

Post

5431

Another great week in open ML!
Here's a small recap 🫰🏻

Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models under the Apache 2.0 license, available in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs in sizes 125M, 350M and 600M

🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with a commercially permissive license

🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released! It is a new LLM that takes image, text and audio input and outputs speech

Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2

reacted to merve's post with 🔥 3 months ago

Post

3770

Meta AI vision has been cooking @facebook
They shipped multiple models and demos for their papers at @ECCV 🤗

Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more; all models have open weights and demos 👏

All models have their own demos and even TorchScript checkpoints!
A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art model for consistent 3D generation from images

Model: facebook/vfusion3d
Demo: facebook/VFusion3D

- CoTracker is the state-of-the-art point (pixel) tracking model

Demo: facebook/cotracker
Model: facebook/cotracker

liked 7 models 3 months ago