m-ric
posted an update Aug 1
SAM 2 released: New SOTA on segmentation, by combining synthetic data with human feedback 🚀

It's a model for object segmentation, for both images and video:
👉 input = a prompt on a specific object, such as a click (or a box or mask)
👉 output = the model draws a mask around the object. In video segmentation, the mask should follow the object's movements (it is then called a masklet)
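
To make "masklet" concrete, here is a minimal sketch of one possible representation: one object's mask stored per frame. The names `Masklet` and `square` are hypothetical, chosen for illustration only, and are not part of SAM 2's API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

Pixel = Tuple[int, int]   # (row, col)
Mask = FrozenSet[Pixel]   # the set of pixels covered by the object


@dataclass
class Masklet:
    """Hypothetical container: one object's mask on each video frame."""
    object_id: int
    masks: Dict[int, Mask] = field(default_factory=dict)

    def add(self, frame_idx: int, mask: Mask) -> None:
        self.masks[frame_idx] = mask

    def centroid(self, frame_idx: int) -> Tuple[float, float]:
        """Mean (row, col) of the mask pixels on one frame."""
        pixels = self.masks[frame_idx]
        n = len(pixels)
        return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def square(top: int, left: int, size: int) -> Mask:
    """Toy mask: a filled square with its top-left corner at (top, left)."""
    return frozenset((top + r, left + c) for r in range(size) for c in range(size))


# A 2x2 object that moves one pixel to the right on each frame.
masklet = Masklet(object_id=0)
for t in range(3):
    masklet.add(t, square(top=0, left=t, size=2))

print(masklet.centroid(0), masklet.centroid(2))
```

The mask follows the object: its centroid shifts right from frame to frame while its shape stays the same.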

💪 SAM 2 is 6x faster than the previous version, now also works on video, and beats SOTA by far on both image and video segmentation tasks.

How did they pull that off?

The main blocker for video segmentation was that data is really hard to collect: to build your training dataset, should you manually draw masks on every frame? That would be way too costly! ➡️ As a result, existing video segmentation datasets have a real lack of coverage: few examples, few masklets drawn.

💡 Key idea: the researchers decided to use a segmentation model to help them collect the dataset.

But then it's a chicken-and-egg problem: you need the model to create the dataset, and the dataset to train the model? 🤔

⇒ To solve this, they build a data generation system that they scale up progressively in 3 successive manual annotation phases:

Step 1: Annotators use only SAM + manual editing tools on each frame ⇒ Create 16k masklets across 1.4k videos

Step 2: Then train a first SAM 2, add it in the loop to temporally propagate masks across frames, and correct by redoing a mask manually when an error has occurred ⇒ This gets a 5.1x speedup over data collection in phase 1! 🏃 Collect 60k masklets

Step 3: Now that SAM 2 is more powerful, it has the “single click” prompting option, so annotators can use it with simple clicks to re-annotate data.

They even add a completely automatic step to generate 350k more masklets!
And in turn, the model perf gradually increases.
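
The model-in-the-loop idea can be sketched with a toy simulation. This is only an illustration under strong assumptions, not the researchers' actual pipeline: masks are pixel sets, the "model" is a hypothetical `shift_right` function that guesses the object moved one pixel, and the "annotator" is simulated by comparing the proposal to a known ground truth with IoU. All names and the 0.8 threshold are made up for the example.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Mask = FrozenSet[Tuple[int, int]]  # set of (row, col) pixels


def square(left: int, size: int = 4) -> Mask:
    """Toy ground-truth mask: a filled square whose left edge is at `left`."""
    return frozenset((r, left + c) for r in range(size) for c in range(size))


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two non-empty pixel sets."""
    return len(a & b) / len(a | b)


def annotate_video(
    truth: List[Mask],
    propagate: Callable[[Mask], Optional[Mask]],
    iou_threshold: float = 0.8,  # assumed acceptance threshold
) -> Tuple[List[Mask], int]:
    """Annotate every frame; return the masks and the number of manual edits."""
    masks: List[Mask] = []
    manual_edits = 0
    for t, target in enumerate(truth):
        # The model proposes a mask from the previous frame, when there is one.
        proposal = propagate(masks[-1]) if t > 0 else None
        if proposal is None or iou(proposal, target) < iou_threshold:
            masks.append(target)  # simulated annotator redraws the mask
            manual_edits += 1
        else:
            masks.append(proposal)  # proposal accepted as is
    return masks, manual_edits


def no_model(prev: Mask) -> Optional[Mask]:
    """Phase-1-like setting: no propagation, every frame is drawn by hand."""
    return None


def shift_right(prev: Mask) -> Optional[Mask]:
    """Toy stand-in for a propagation model: assume a 1-pixel move right."""
    return frozenset((r, c + 1) for r, c in prev)


# The object drifts right one pixel per frame, with one sudden jump at frame 3.
truth = [square(left) for left in (0, 1, 2, 5, 6, 7)]

_, edits_manual = annotate_video(truth, no_model)
masks_loop, edits_loop = annotate_video(truth, shift_right)
print(edits_manual, edits_loop)
```

With the toy model in the loop, the simulated annotator only intervenes on the first frame and on the frame where the propagation goes wrong, instead of on every frame.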

I find this a great example of combining synthetic data generation with human annotation 👏