# Data

By default, we cache the data in `.data.cache/`.

## About loading with `datasets`

We use `datasets` to load data.

For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because the data are accessed via random indices.

In contrast, loading `.tar` or `.tsv` files is faster because the data are accessed in order.

As a result, we only use `streaming=True` when loading `SA1B-Cap`, due to its huge memory consumption, whereas for VG and RefCOCOs we set `streaming=False`.
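
A hedged sketch of the corresponding loading calls (the script paths are assumptions, not the actual repo layout):

```python
from datasets import load_dataset

# SA1B-Cap is huge: stream it instead of materializing it in the cache.
sa1b = load_dataset("path/to/sa1b_cap.py", split="train", streaming=True)

# VG / RefCOCOs are zip-backed; random access makes streaming slow,
# so download and cache them fully instead.
vg = load_dataset("path/to/visual_genome.py", split="train", streaming=False)
```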

TODO: use webdataset for OpenImages (and SA1B).

## About data preprocessing

`data/transforms.py` takes each sample and processes all the regions inside it (a sketch follows the list):
1. Image: use the SAM processor to resize and pad images to 1024x1024.
2. Region box / point / mask: use the SAM processor to process the prompts.
3. Region captions: use the LM processor for tokenization; for SCA, we need to add a "virtual" <BOS> and a true <EOS>.
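
A minimal sketch of the per-sample flow, assuming hypothetical `sam_processor` and `tokenizer` handles (the names and exact keys are illustrative, not the actual code):

```python
def transform(sample, sam_processor, tokenizer):
    # 1./2. Resize and pad the image to 1024x1024; the same call also
    # rescales the region prompts (boxes here) to the padded resolution.
    inputs = sam_processor(
        images=sample["image"],
        input_boxes=[sample["region_boxes"]],
        return_tensors="pt",
    )
    # 3. Tokenize the region captions. Append the true <EOS> so the LM
    # learns to stop; the "virtual" <BOS> is prepended on the model side.
    captions = [c + tokenizer.eos_token for c in sample["region_captions"]]
    inputs["labels"] = tokenizer(
        captions, padding=True, return_tensors="pt"
    ).input_ids
    return inputs
```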

`data/collator.py` takes in multiple processed samples and forms the batch tensors (a sketch follows the list):
1. If the number of regions differs among the samples, we truncate each sample to the minimum number of regions.
2. For captions, we pad with <PAD> tokens during batching.
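
A sketch of the batching logic, under the same hypothetical names:

```python
import torch
import torch.nn.functional as F

def collate(samples, pad_token_id):
    # 1. Truncate every sample to the minimum region count in the batch
    # so the region tensors can be stacked.
    n_regions = min(s["labels"].shape[0] for s in samples)
    labels = [s["labels"][:n_regions] for s in samples]
    # 2. Pad the captions to the longest sequence with <PAD> tokens.
    max_len = max(l.shape[1] for l in labels)
    labels = [F.pad(l, (0, max_len - l.shape[1]), value=pad_token_id)
              for l in labels]
    return {"labels": torch.stack(labels)}
```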

### Code dev

1. Implement the transform in `src/data/transforms`.
2. Add arguments in `src/arguments.py`.
3. Pass the arguments to the corresponding function in `src/train.py`.

The problem: generating random numbers with NumPy in a multi-process data loader (workers return identical numbers).
- https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
```
transformers/trainer_utils.py
detectron2/data/build.py
```
However, we use `datasets`'s `map`, which does not use subprocesses.
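
For reference, the standard fix from the PyTorch FAQ reseeds NumPy in each worker:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Each worker forks the parent's NumPy RNG state; reseed from the
    # per-worker torch seed so workers do not draw identical numbers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

# loader = DataLoader(my_dataset, num_workers=4, worker_init_fn=worker_init_fn)
```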


## Visual Genome

Adapted from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py so that we can load the data stored on Azure.

- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).

If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs.
If you want to load data from Azure, you **MUST UPDATE THE SAS KEY**.

## RefCOCO series

We use refer2 for referring expression generation (the paper is SLR); a usage sketch follows below.
- https://github.com/lichengunc/refer2
- https://arxiv.org/abs/1612.09542
- Thanks to [easy-to-understand-REG](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), which points out the data-evolution problem and uploads the evaluation sentences.

- refcoco: contains location words
- refcoco+: no location words
- refcocog: with or without location words
- The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.

## SA1B-Cap

### Load with azcopy

First, each `.tar` or `.tsv` file is downloaded to the local host with `azcopy`, into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.

After all file handles on it are released, the file is removed.

### Legacy solution: the streaming implementation in `datasets`

Python's built-in `open` is extended with streaming from the Internet by `xopen` in [`datasets.download.streaming_download_manager`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/download/streaming_download_manager.py#L471).

After that, `xopen` is further patched into `open` by [`datasets.streaming`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/streaming.py#L80).

The `dl_manager` object in data scripts has an `is_streaming` attribute that indicates whether the data are loaded in streaming mode.
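
In a dataset script this might be used as follows (a sketch):

```python
def _split_generators(self, dl_manager):
    if dl_manager.is_streaming:
        # Files are opened lazily over the network via the patched
        # `open` (`xopen`); avoid random-access patterns here.
        ...
    else:
        # Files are fully downloaded and cached before reading.
        ...
```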


## OpenImages

### Webdataset and pytorch-dalle

There is an OpenImages dump (probably V6) in webdataset format (i.e., `.tar` shards); see
https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch

```
cd ~
mkdir -p webdataset-openimages
cd webdataset-openimages
# Shards: http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```

Train split: 523 GB
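
The shards could later be read with the `webdataset` library, roughly like this (the field names follow the webdataset getting-started example for this dump):

```python
import webdataset as wds

urls = "webdataset-openimages/openimages-train-{000000..000554}.tar"
dataset = (
    wds.WebDataset(urls)
    .decode("pil")                 # decode images with PIL
    .to_tuple("jpg;png", "json")   # (image, annotation) pairs
)
for image, annotation in dataset:
    break  # inspect the first sample
```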

### Fiftyone

FiftyOne supports OpenImages V6 and V7.

(Using FiftyOne to load the `train` split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)

https://docs.voxel51.com/integrations/open_images.html
https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset

Full split stats:
- Train split: 1,743,042 images (513 GB)
- Test split: 125,436 images (36 GB)
- Validation split: 41,620 images (12 GB)

Download OpenImagesV7 detections from fiftyone:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download detection labels (and the corresponding images) for each split.
validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
# The train split is by far the largest (see the stats above).
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```


## Detection data: COCO instance, Objects365, v3det

The default `task_type` is `recognition`.

If you want to activate the task tokens for `caption`, use the `*task_type_caption*.yaml` configs.

Also see [./MODEL.md#multitaskv2](./MODEL.md#multitaskv2).

## Panoptic Segmentation Data: COCO Panoptic, ADE20k panoptic

From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md
- It provides code to convert data to detectron2's panoptic format.
- It requires `detectron2` and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data into detectron2 format.

### COCO panoptic

https://cocodataset.org/#download

```
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip

unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip

DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```

### ADE20k Panoptic

http://sceneparsing.csail.mit.edu/

```
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016

wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar

DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py

DETECTRON2_DATASETS=/home/t-yutonglin/xiaoke/segment-caption-anything-v2/tmp/data/mask2former_data python datasets/prepare_ade20k_ins_seg.py
```

The format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html.
Usage (a minimal registration sketch follows the list):
1. Register the custom dataset in `DatasetCatalog`;
2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment them, and convert them to tensors);
3. `MetadataCatalog` contains info shared by all samples, such as class labels.
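
A minimal registration sketch (all names are illustrative):

```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset():
    # Return a list of dicts in detectron2's standard dataset format.
    return [{"file_name": "000001.jpg", "image_id": 0,
             "height": 480, "width": 640, "annotations": []}]

DatasetCatalog.register("my_dataset_train", get_my_dataset)
MetadataCatalog.get("my_dataset_train").thing_classes = ["person", "car"]
```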

Reading order for the code:
1. Check the dataset registration;
2. Check how the data are loaded with the built-in functions;
3. Check the mapper.

## Comparing data (image) loading between [[detectron 2]] and [[hugging face - datasets library]]

Seen from the [[hugging face - datasets library]] side, the two are similar:

1. Alike: the data script is the dataset, providing image paths and labels (it loads a JSON).
  1. Difference: we merge different datasets at this stage; we should merge later instead.
2. Then a transform function loads and processes the images and labels.
3. We define a collator for the dataloader.
  1. Improvement: this is the place to merge multiple datasets, by merging the dataloaders; in [[OpenSEED]], each step returns `{"coco": coco_batch, "o365": o365_batch}` (a sketch follows).
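
A sketch of that merged-dataloader pattern (names are illustrative):

```python
from torch.utils.data import DataLoader

def merged_loader(coco_loader: DataLoader, o365_loader: DataLoader):
    # Each step yields one batch per dataset, following OpenSEED's
    # {"coco": ..., "o365": ...} convention.
    for coco_batch, o365_batch in zip(coco_loader, o365_loader):
        yield {"coco": coco_batch, "o365": o365_batch}
```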