# Data
By default, we cache the data in `.data.cache/`.
## About datasets loading
We use `datasets` to load data.
For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because the data are accessed via random indexes.
In contrast, loading `.tar` or `.tsv` files is faster, as the data are accessed in order.
As a result, we only use `streaming=True` when loading `SA1B-Cap`, since loading it fully would consume a huge amount of memory, whereas for VG and RefCOCOs we set `streaming=False`.
TODO: use webdataset for OpenImages (and SA1B).
## About Data preprocessing
`data/transforms.py`: takes each sample and processes all the regions inside it (a hedged sketch follows the list):
1. Image: use the SAM processor to resize and pad images to 1024x1024.
2. Region box / point / mask: use the SAM processor to process the prompts.
3. Region captions: use the LM processor for tokenization; for SCA, we need to add a "virtual" <BOS> and a true <EOS>.
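A minimal sketch of the per-sample transform, assuming a SAM processor from `transformers` and a GPT-2-style tokenizer; the actual processors and field names live in `data/transforms.py` and are configured via `src/arguments.py`:
```python
from transformers import SamProcessor, AutoTokenizer

# Hypothetical processor choices, for illustration only.
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def transform(sample):
    # 1. Resize and pad the image to 1024x1024 and preprocess the region box prompts.
    inputs = sam_processor(
        images=sample["image"],
        input_boxes=[sample["boxes"]],  # one list of [x1, y1, x2, y2] boxes per image
        return_tensors="pt",
    )
    # 2. Tokenize the region captions and append an explicit <EOS>;
    #    the "virtual" <BOS> is handled on the model side for SCA.
    captions = [c + tokenizer.eos_token for c in sample["captions"]]
    inputs["caption_input_ids"] = tokenizer(captions)["input_ids"]
    return inputs
```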
`data/collator.py`: takes multiple processed samples and forms tensors in batch format (see the sketch below):
1. If the number of regions differs among the samples, we trim each of them to the minimum number of regions.
2. For captions, we pad with <PAD> tokens when batchifying.
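A minimal collator sketch under the same assumptions (field names are illustrative; the real logic is in `data/collator.py`):
```python
import torch

def collate_fn(samples, pad_token_id=0):
    # 1. Trim every sample to the minimum number of regions in the batch.
    min_regions = min(len(s["caption_input_ids"]) for s in samples)
    all_ids = [ids for s in samples for ids in s["caption_input_ids"][:min_regions]]
    # 2. Pad the caption token ids to the longest caption in the batch.
    max_len = max(len(ids) for ids in all_ids)
    input_ids = torch.full((len(all_ids), max_len), pad_token_id, dtype=torch.long)
    for i, ids in enumerate(all_ids):
        input_ids[i, : len(ids)] = torch.as_tensor(ids)
    return {"caption_input_ids": input_ids}
```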
### Code dev
1. Implement the transform in `src/data/transforms`.
2. Add the corresponding arguments in `src/arguments.py`.
3. Wire the new arguments into the corresponding function in `src/train.py`.
A known pitfall: generating random numbers with numpy in a multi-process data loader yields identical numbers across workers.
- https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
```
transformers/trainer_utils.py
detectron2/data/build.py
```
However, we use the `map` of `datasets`, which does not use sub-processes by default.
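For reference, the standard fix from the PyTorch FAQ is to re-seed numpy in every worker via `worker_init_fn` (a generic sketch, not this repo's code):
```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class NoiseDataset(Dataset):
    """Toy dataset that draws one random number per item."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return np.random.rand()

def worker_init_fn(worker_id):
    # Derive a distinct numpy seed for each worker from the torch base seed.
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(NoiseDataset(), num_workers=2, worker_init_fn=worker_init_fn)
```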
## Visual Genome
Our loading script is edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py so that the data stored on Azure can be loaded.
- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).
If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs.
If you want to load data from Azure, you **MUST UPDATE THE SAS KEY**.
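For illustration, loading with the edited script might look like this (the script path is an assumption; the config name comes from the upstream Hugging Face `visual_genome` script):
```python
from datasets import load_dataset

# Hypothetical script location; point it at the edited visual_genome.py in this repo.
vg = load_dataset(
    "src/data/visual_genome.py",
    "region_descriptions_v1.2.0",  # config name from the upstream script
    streaming=False,
)
```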
## RefCOCO series
We use refer2 for referring expression generation; the corresponding paper is SLR (Speaker-Listener-Reinforcer).
- https://github.com/lichengunc/refer2
- https://arxiv.org/abs/1612.09542
- Thanks to [easy-to-understand-REG](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), which points out the data-evolution problem and uploads the evaluation sentences.
The three datasets differ in how expressions may describe location:
- refcoco: location words allowed.
- refcoco+: no location words.
- refcocog: with or without location words.
The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.
## SA1B-Cap
### The implementation of streaming loading in `datasets`
### Load with azcopy
First, each tar or tsv file is downloaded to the local host with `azcopy` into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.
After all file handles on it are released, the file is removed.
### Legacy solution
Python's built-in `open` is extended with streaming loading from the Internet by `xopen` in [`datasets.download.streaming_download_manager`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/download/streaming_download_manager.py#L471).
After that, `xopen` is further patched over `open` by [`datasets.streaming`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/streaming.py#L80).
The `dl_manager` object in data scripts has an attribute called `is_streaming`, which indicates whether the data are loaded in streaming mode (see the fragment below).
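A fragment showing how a loading script might branch on `is_streaming` inside `_split_generators` (illustrative only; `_TAR_URL` is a placeholder):
```python
import datasets

def _split_generators(self, dl_manager):
    archive = dl_manager.download(_TAR_URL)  # placeholder URL
    if dl_manager.is_streaming:
        # Streaming mode: tar members are read lazily over the network via xopen.
        files = dl_manager.iter_archive(archive)
    else:
        # Non-streaming mode: extract locally and iterate over the files on disk.
        files = dl_manager.iter_files(dl_manager.extract(archive))
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})
    ]
```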
## OpenImages
### Webdataset and pytorch-dalle
OpenImages (likely V6) is available in webdataset format (i.e., `tar` shards); see https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch.
```
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```
Train split: 523 GB
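Once downloaded, the shards can be read with the `webdataset` library, e.g. (a sketch; the `jpg;png` and `json` keys depend on what the shards actually contain):
```python
import webdataset as wds

# The brace pattern matches the shards fetched by the wget loop above.
shards = "webdataset-openimages/openimages-train-{000000..000554}.tar"
dataset = (
    wds.WebDataset(shards)
    .decode("pil")                 # decode image bytes into PIL images
    .to_tuple("jpg;png", "json")   # yield (image, metadata) pairs
)
image, meta = next(iter(dataset))
print(image.size)
```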
### Fiftyone
FiftyOne supports OpenImages V6 and V7.
(Using FiftyOne to load the 'train' split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)
https://docs.voxel51.com/integrations/open_images.html
https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset
Full split stats:
- Train split: 1,743,042 images (513 GB)
- Test split: 125,436 images (36 GB)
- Validation split: 41,620 images (12 GB)
Download OpenImagesV7 detections from fiftyone:
```python
import fiftyone as fo
import fiftyone.zoo as foz
validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```
## Detection data: COCO instance, Objects365, v3det
The default task_type is `recognition`.
If you want to activate the task tokens for `caption`, please use the `*task_type_caption*.yaml` configs.
Also see [./MODEL.md#multitaskv2](./MODEL.md#multitaskv2).
## Panoptic Segmentation Data: COCO Panoptic, ADE20k panoptic
From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md
- It provides code to convert the data into detectron2's panoptic format.
- It requires `Detectron2` and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data to detectron2 format.
### COCO panoptic
https://cocodataset.org/#download
```
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip
DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```
### ADE20k Panoptic
http://sceneparsing.csail.mit.edu/
```
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016
wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar
DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py
```
The data format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html.
Usage (a hedged registration sketch follows the list):
1. Register the custom dataset in `DatasetCatalog`;
2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
3. `MetadataCatalog` contains info shared by all samples, such as class labels.
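A minimal registration sketch with detectron2's catalogs (the dataset name, file paths, and classes are made up):
```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset_dicts():
    # Return a list of dicts in detectron2's standard dataset format.
    return [
        {
            "file_name": "images/000001.jpg",
            "image_id": 1,
            "height": 480,
            "width": 640,
            "annotations": [],  # per-instance annotations go here
        }
    ]

DatasetCatalog.register("my_custom_train", get_my_dataset_dicts)
# Metadata shared by all samples, e.g. the class labels.
MetadataCatalog.get("my_custom_train").thing_classes = ["person", "car"]
```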
To trace the pipeline: check the dataset registration, then check how the data are loaded by the built-in functions, and finally check the mapper.
## Comparing the (image) data loading between [[detectron 2]] and [[hugging face - datasets library]]
Seen from the [[hugging face - datasets library]] side, the two pipelines are similar:
1. Similarity: the data script is the dataset that provides image paths and labels (it loads a JSON).
   1. Difference: here we merge different datasets at this stage; we should merge later instead.
2. Then we use a transform function to load and process images and labels.
3. We define a collator for the dataloader.
   1. Improvement: this is the place to merge multiple datasets, by merging the dataloaders. In [[OpenSEED]], each step returns `{"coco": coco_batch, "o365": o365_batch}` (see the sketch below).
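A hedged sketch of that improvement: wrap per-dataset dataloaders so each training step receives a dict of batches, similar to OpenSEED:
```python
def merged_loader(loaders):
    """Yield {"name": batch, ...} dicts until the shortest loader is exhausted.

    `loaders` is a dict of already-built PyTorch DataLoaders, e.g.
    {"coco": coco_loader, "o365": o365_loader} (names are illustrative).
    """
    iterators = {name: iter(dl) for name, dl in loaders.items()}
    while True:
        batch = {}
        for name, it in iterators.items():
            try:
                batch[name] = next(it)
            except StopIteration:
                return
        yield batch
```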