# Data
By default, we cache the data in `.data.cache/`.
## About datasets loading
We use `datasets` to load data.
For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because the data are accessed via random indexes.
In contrast, loading `.tar` or `.tsv` files is faster, as the data are accessed in order.
As a result, we only use `streaming=True` when loading `SA1B-Cap`, since loading it fully would consume a huge amount of memory, whereas for VG and RefCOCOs we set `streaming=False`.
TODO: use webdataset for OpenImages (and SA1B).
## About Data preprocessing
`data/transforms.py`: takes each sample and processes all the regions inside it (a hedged sketch follows the list):
1. Image: use the SAM processor to resize and pad images to 1024x1024.
2. Region box / point / mask: use the SAM processor to process the prompts.
3. Region captions: use the LM processor for tokenization; for SCA, we need to add a "virtual" <BOS> and a true <EOS>.
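A minimal sketch of the per-sample transform, assuming a SAM processor from `transformers` and a GPT-2-style tokenizer; the actual processors and field names live in `data/transforms.py` and are configured via `src/arguments.py`:
```python
from transformers import SamProcessor, AutoTokenizer

# Hypothetical processor choices, for illustration only.
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def transform(sample):
    # 1. Resize and pad the image to 1024x1024 and preprocess the region box prompts.
    inputs = sam_processor(
        images=sample["image"],
        input_boxes=[sample["boxes"]],  # one list of [x1, y1, x2, y2] boxes per image
        return_tensors="pt",
    )
    # 2. Tokenize the region captions and append an explicit <EOS>;
    #    the "virtual" <BOS> is handled on the model side for SCA.
    captions = [c + tokenizer.eos_token for c in sample["captions"]]
    inputs["caption_input_ids"] = tokenizer(captions)["input_ids"]
    return inputs
```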
`data/collator.py`: takes multiple processed samples and forms tensors in batch format (see the sketch below):
1. If the number of regions differs among the samples, we trim each of them to the minimum number of regions.
2. For captions, we pad with <PAD> tokens when batchifying.
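A minimal collator sketch under the same assumptions (field names are illustrative; the real logic is in `data/collator.py`):
```python
import torch

def collate_fn(samples, pad_token_id=0):
    # 1. Trim every sample to the minimum number of regions in the batch.
    min_regions = min(len(s["caption_input_ids"]) for s in samples)
    all_ids = [ids for s in samples for ids in s["caption_input_ids"][:min_regions]]
    # 2. Pad the caption token ids to the longest caption in the batch.
    max_len = max(len(ids) for ids in all_ids)
    input_ids = torch.full((len(all_ids), max_len), pad_token_id, dtype=torch.long)
    for i, ids in enumerate(all_ids):
        input_ids[i, : len(ids)] = torch.as_tensor(ids)
    return {"caption_input_ids": input_ids}
```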
### Code dev
1. Implement the transform in `src/data/transforms`.
2. Add the corresponding arguments in `src/arguments.py`.
3. Wire the new arguments into the corresponding function in `src/train.py`.
A known pitfall: generating random numbers with numpy in a multi-process data loader yields identical numbers across workers.
- https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
```
transformers/trainer_utils.py
detectron2/data/build.py
```
However, we use the `map` of `datasets`, which does not use sub-processes by default.
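For reference, the standard fix from the PyTorch FAQ is to re-seed numpy in every worker via `worker_init_fn` (a generic sketch, not this repo's code):
```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class NoiseDataset(Dataset):
    """Toy dataset that draws one random number per item."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return np.random.rand()

def worker_init_fn(worker_id):
    # Derive a distinct numpy seed for each worker from the torch base seed.
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(NoiseDataset(), num_workers=2, worker_init_fn=worker_init_fn)
```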
## Visual Genome
Our loading script is edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py so that the data stored on Azure can be loaded.
- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).
If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs.
If you want to load data from Azure, you **MUST UPDATE THE SAS KEY**.
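For illustration, loading with the edited script might look like this (the script path is an assumption; the config name comes from the upstream Hugging Face `visual_genome` script):
```python
from datasets import load_dataset

# Hypothetical script location; point it at the edited visual_genome.py in this repo.
vg = load_dataset(
    "src/data/visual_genome.py",
    "region_descriptions_v1.2.0",  # config name from the upstream script
    streaming=False,
)
```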
## RefCOCO series
We use refer2 for referring expression generation; the corresponding paper is SLR (Speaker-Listener-Reinforcer).
- https://github.com/lichengunc/refer2
- https://arxiv.org/abs/1612.09542
- Thanks to [easy-to-understand-REG](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), which points out the data-evolution problem and uploads the evaluation sentences.
The three datasets differ in how expressions may describe location:
- refcoco: location words allowed.
- refcoco+: no location words.
- refcocog: with or without location words.
The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.
## SA1B-Cap
### The implementation of streaming loading in `datasets`
### Load with azcopy
First, each tar or tsv file is downloaded to the local host with `azcopy` into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.
After all file handles on it are released, the file is removed.
### Legacy solution
Python's built-in `open` is extended with streaming loading from the Internet by `xopen` in [`datasets.download.streaming_download_manager`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/download/streaming_download_manager.py#L471).
After that, `xopen` is further patched over `open` by [`datasets.streaming`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/streaming.py#L80).
The `dl_manager` object in data scripts has an attribute called `is_streaming`, which indicates whether the data are loaded in streaming mode (see the fragment below).
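A fragment showing how a loading script might branch on `is_streaming` inside `_split_generators` (illustrative only; `_TAR_URL` is a placeholder):
```python
import datasets

def _split_generators(self, dl_manager):
    archive = dl_manager.download(_TAR_URL)  # placeholder URL
    if dl_manager.is_streaming:
        # Streaming mode: tar members are read lazily over the network via xopen.
        files = dl_manager.iter_archive(archive)
    else:
        # Non-streaming mode: extract locally and iterate over the files on disk.
        files = dl_manager.iter_files(dl_manager.extract(archive))
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})
    ]
```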
## OpenImages
### Webdataset and pytorch-dalle
OpenImages (likely V6) is available in webdataset format (i.e., `tar` shards); see https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch.
```
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```
Train split: 523 GB
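Once downloaded, the shards can be read with the `webdataset` library, e.g. (a sketch; the `jpg;png` and `json` keys depend on what the shards actually contain):
```python
import webdataset as wds

# The brace pattern matches the shards fetched by the wget loop above.
shards = "webdataset-openimages/openimages-train-{000000..000554}.tar"
dataset = (
    wds.WebDataset(shards)
    .decode("pil")                 # decode image bytes into PIL images
    .to_tuple("jpg;png", "json")   # yield (image, metadata) pairs
)
image, meta = next(iter(dataset))
print(image.size)
```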
### Fiftyone
FiftyOne supports OpenImages V6 and V7.
(Using FiftyOne to load the 'train' split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)
https://docs.voxel51.com/integrations/open_images.html
https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset
Full split stats:
- Train split: 1,743,042 images (513 GB)
- Test split: 125,436 images (36 GB)
- Validation split: 41,620 images (12 GB)
Download OpenImagesV7 detections from fiftyone:
```python
import fiftyone as fo
import fiftyone.zoo as foz
validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```
## Detection data: COCO instance, Objects365, v3det
The default task_type is `recognition`.
If you want to activate the task tokens for `caption`, please use the `*task_type_caption*.yaml` configs.
Also see [./MODEL.md#multitaskv2](./MODEL.md#multitaskv2).
## Panoptic Segmentation Data: COCO Panoptic, ADE20k panoptic
From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md
- It provides code to convert the data into detectron2's panoptic format.
- It requires `Detectron2` and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data to detectron2 format.
### COCO panoptic
https://cocodataset.org/#download
```
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip
DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```
### ADE20k Panoptic
http://sceneparsing.csail.mit.edu/
```
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016
wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar
DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py
```
The data format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html.
Usage (a hedged registration sketch follows the list):
1. Register the custom dataset in `DatasetCatalog`;
2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
3. `MetadataCatalog` contains info shared by all samples, such as class labels.
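A minimal registration sketch with detectron2's catalogs (the dataset name, file paths, and classes are made up):
```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset_dicts():
    # Return a list of dicts in detectron2's standard dataset format.
    return [
        {
            "file_name": "images/000001.jpg",
            "image_id": 1,
            "height": 480,
            "width": 640,
            "annotations": [],  # per-instance annotations go here
        }
    ]

DatasetCatalog.register("my_custom_train", get_my_dataset_dicts)
# Metadata shared by all samples, e.g. the class labels.
MetadataCatalog.get("my_custom_train").thing_classes = ["person", "car"]
```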
To trace the pipeline: check the dataset registration, then check how the data are loaded by the built-in functions, and finally check the mapper.
## Comparing the (image) data loading between [[detectron 2]] and [[hugging face - datasets library]]
Seen from the [[hugging face - datasets library]] side, the two pipelines are similar:
1. Similarity: the data script is the dataset that provides image paths and labels (it loads a JSON).
   1. Difference: here we merge different datasets at this stage; we should merge later instead.
2. Then we use a transform function to load and process images and labels.
3. We define a collator for the dataloader.
   1. Improvement: this is the place to merge multiple datasets, by merging the dataloaders. In [[OpenSEED]], each step returns `{"coco": coco_batch, "o365": o365_batch}` (see the sketch below).
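A hedged sketch of that improvement: wrap per-dataset dataloaders so each training step receives a dict of batches, similar to OpenSEED:
```python
def merged_loader(loaders):
    """Yield {"name": batch, ...} dicts until the shortest loader is exhausted.

    `loaders` is a dict of already-built PyTorch DataLoaders, e.g.
    {"coco": coco_loader, "o365": o365_loader} (names are illustrative).
    """
    iterators = {name: iter(dl) for name, dl in loaders.items()}
    while True:
        batch = {}
        for name, it in iterators.items():
            try:
                batch[name] = next(it)
            except StopIteration:
                return
        yield batch
```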