Spaces:
Running
on
T4
Running
on
T4
File size: 8,404 Bytes
dfd72c0 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 |
# Segment Anything Video (SA-V) Dataset
## Overview
[Segment Anything Video (SA-V)](https://ai.meta.com/datasets/segment-anything-video/), consists of 51K diverse videos and 643K high-quality spatio-temporal segmentation masks (i.e., masklets). The dataset is released under the CC by 4.0 license. Browse the dataset [here](https://sam2.metademolab.com/dataset).
![SA-V dataset](../assets/sa_v_dataset.jpg?raw=true)
## Getting Started
### Download the dataset
Visit [here](https://ai.meta.com/datasets/segment-anything-video-downloads/) to download SA-V including the training, val and test sets.
### Dataset Stats
| | Num Videos | Num Masklets |
| ---------- | ---------- | ----------------------------------------- |
| SA-V train | 50,583 | 642,036 (auto 451,720 and manual 190,316) |
| SA-V val | 155 | 293 |
| SA-V test | 150 | 278 |
### Notebooks
To load and visualize the SA-V training set annotations, refer to the example [sav_visualization_example.ipynb](./sav_visualization_example.ipynb) notebook.
### SA-V train
For SA-V training set we release the mp4 videos and store the masklet annotations per video as json files . Automatic masklets and manual masklets are stored separately as two json files: `{video_id}_auto.json` and `{video_id}_manual.json`. They can be loaded as dictionaries in python in the format below.
```
{
"video_id" : str; video id
"video_duration" : float64; the duration in seconds of this video
"video_frame_count" : float64; the number of frames in the video
"video_height" : float64; the height of the video
"video_width" : float64; the width of the video
"video_resolution" : float64; video_height $\times$ video_width
"video_environment" : List[str]; "Indoor" or "Outdoor"
"video_split" : str; "train" for training set
"masklet" : List[List[Dict]]; masklet annotations in list of list of RLEs.
The outer list is over frames in the video and the inner list
is over objects in the video.
"masklet_id" : List[int]; the masklet ids
"masklet_size_rel" : List[float]; the average mask area normalized by resolution
across all the frames where the object is visible
"masklet_size_abs" : List[float]; the average mask area (in pixels)
across all the frames where the object is visible
"masklet_size_bucket" : List[str]; "small": $1$ <= masklet_size_abs < $32^2$,
"medium": $32^2$ <= masklet_size_abs < $96^2$,
and "large": masklet_size_abs > $96^2$
"masklet_visibility_changes" : List[int]; the number of times where the visibility changes
after the first appearance (e.g., invisible -> visible
or visible -> invisible)
"masklet_first_appeared_frame" : List[int]; the index of the frame where the object appears
the first time in the video. Always 0 for auto masklets.
"masklet_frame_count" : List[int]; the number of frames being annotated. Note that
videos are annotated at 6 fps (annotated every 4 frames)
while the videos are at 24 fps.
"masklet_edited_frame_count" : List[int]; the number of frames being edited by human annotators.
Always 0 for auto masklets.
"masklet_type" : List[str]; "auto" or "manual"
"masklet_stability_score" : Optional[List[List[float]]]; per-mask stability scores. Auto annotation only.
"masklet_num" : int; the number of manual/auto masklets in the video
}
```
Note that in SA-V train, there are in total 50,583 videos where all of them have manual annotations. Among the 50,583 videos there are 48,436 videos that also have automatic annotations.
### SA-V val and test
For SA-V val and test sets, we release the extracted frames as jpeg files, and the masks as png files with the following directory structure:
```
sav_val(sav_test)
βββ sav_val.txt (sav_test.txt): a list of video ids in the split
βββ JPEGImages_24fps # videos are extracted at 24 fps
β βββ {video_id}
β β βββ 00000.jpg # video frame
β β βββ 00001.jpg # video frame
β β βββ 00002.jpg # video frame
β β βββ 00003.jpg # video frame
β β βββ ...
β βββ {video_id}
β βββ {video_id}
β βββ ...
βββ Annotations_6fps # videos are annotated at 6 fps
βββ {video_id}
β βββ 000 # obj 000
β β βββ 00000.png # mask for object 000 in 00000.jpg
β β βββ 00004.png # mask for object 000 in 00004.jpg
β β βββ 00008.png # mask for object 000 in 00008.jpg
β β βββ 00012.png # mask for object 000 in 00012.jpg
β β βββ ...
β βββ 001 # obj 001
β βββ 002 # obj 002
β βββ ...
βββ {video_id}
βββ {video_id}
βββ ...
```
All masklets in val and test sets are manually annotated in every frame by annotators. For each annotated object in a video, we store the annotated masks in a single png. This is because the annotated objects may overlap, e.g., it is possible in our SA-V dataset for there to be a mask for the whole person as well as a separate mask for their hands.
## SA-V Val and Test Evaluation
We provide an evaluator to compute the common J and F metrics on SA-V val and test sets. To run the evaluation, we need to first install a few dependencies as follows:
```
pip install -r requirements.txt
```
Then we can evaluate the predictions as follows:
```
python sav_evaluator.py --gt_root {GT_ROOT} --pred_root {PRED_ROOT}
```
or run
```
python sav_evaluator.py --help
```
to print a complete help message.
The evaluator expects the `GT_ROOT` to be one of the following folder structures, and `GT_ROOT` and `PRED_ROOT` to have the same structure.
- Same as SA-V val and test directory structure
```
{GT_ROOT} # gt root folder
βββ {video_id}
β βββ 000 # all masks associated with obj 000
β β βββ 00000.png # mask for object 000 in frame 00000 (binary mask)
β β βββ ...
β βββ 001 # all masks associated with obj 001
β βββ 002 # all masks associated with obj 002
β βββ ...
βββ {video_id}
βββ {video_id}
βββ ...
```
In the paper for the experiments on SA-V val and test, we run inference on the 24 fps videos, and evaluate on the subset of frames where we have ground truth annotations (first and last annotated frames dropped). The evaluator will ignore the masks in frames where we don't have ground truth annotations.
- Same as [DAVIS](https://github.com/davisvideochallenge/davis2017-evaluation) directory structure
```
{GT_ROOT} # gt root folder
βββ {video_id}
β βββ 00000.png # annotations in frame 00000 (may contain multiple objects)
β βββ ...
βββ {video_id}
βββ {video_id}
βββ ...
```
## License
The evaluation code is licensed under the [BSD 3 license](./LICENSE). Please refer to the paper for more details on the models. The videos and annotations in SA-V Dataset are released under CC BY 4.0.
Third-party code: the evaluation software is heavily adapted from [`VOS-Benchmark`](https://github.com/hkchengrex/vos-benchmark) and [`DAVIS`](https://github.com/davisvideochallenge/davis2017-evaluation) (with their licenses in [`LICENSE_DAVIS`](./LICENSE_DAVIS) and [`LICENSE_VOS_BENCHMARK`](./LICENSE_VOS_BENCHMARK)).
|