
# Data

By default, we cache the data in `.data.cache/`.

## About `datasets` loading

We use `datasets` to load data.

For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because the data are accessed by random indexes.

In contrast, loading `.tar` or `.tsv` files is faster, as the data are accessed in order.

As a result, we only use `streaming=True` when loading SA1B-Cap, due to its huge memory consumption, whereas for VG and RefCOCOs we set `streaming=False`.
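
A minimal illustration of the two modes (the script paths below are placeholders for our actual data scripts):

```python
from datasets import load_dataset

# SA1B-Cap is huge, so stream it instead of materializing it in the cache.
sa1b_cap = load_dataset("path/to/sa1b_cap.py", split="train", streaming=True)

# VG and RefCOCOs are zip-backed; random access while streaming is slow,
# so load them fully instead.
vg = load_dataset("path/to/visual_genome.py", split="train", streaming=False)
```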

TODO: use webdataset for OpenImages (and SA1B).

## About data preprocessing

`data/transforms.py` takes each sample and processes all the regions inside it:

  1. Image: use the SAM processor to resize and pad images to 1024x1024.
  2. Region box / point / mask: use the SAM processor to process the prompts.
  3. Region captions: use the LM processor for tokenization; for SCA, we need to add the "virtual" and true tokens.
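
A minimal sketch of such a transform, assuming the SAM processor from `transformers` and a generic LM tokenizer (the checkpoint names and the sample fields `image`, `region_boxes`, and `region_captions` are illustrative, not our actual schema):

```python
from transformers import SamProcessor, AutoTokenizer

sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")  # resizes + pads to 1024x1024
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the LM processor

def transform(sample):
    # 1.-2. Resize/pad the image, and map the region prompts (boxes here)
    #       into the 1024x1024 coordinate space.
    inputs = sam_processor(
        images=sample["image"],
        input_boxes=[sample["region_boxes"]],  # one list of [x1, y1, x2, y2] per image
        return_tensors="pt",
    )
    # 3. Tokenize one caption per region; padding is deferred to the collator.
    inputs["input_ids"] = [tokenizer(c).input_ids for c in sample["region_captions"]]
    return inputs
```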

`data/collator.py` takes in multiple processed samples and forms tensors in batch format:

  1. If the number of regions is not the same across samples, we chunk each sample down to the minimum number of regions.
  2. For captions, we need to pad the tokens during batchifying.
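
A sketch of both steps (field names follow the transform sketch above; only the caption tokens are shown, the SAM tensors stack the same way):

```python
import torch

def collate_fn(samples):
    # 1. Samples may carry different numbers of regions; chunk every
    #    sample down to the minimum region count so tensors can stack.
    n_regions = min(len(s["input_ids"]) for s in samples)
    token_lists = [ids for s in samples for ids in s["input_ids"][:n_regions]]

    # 2. Pad the caption tokens to the longest caption in the batch.
    pad_id = 0  # placeholder; use tokenizer.pad_token_id in practice
    max_len = max(len(ids) for ids in token_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_lists]

    # Shape: (batch, regions, tokens).
    return {"input_ids": torch.tensor(padded).view(len(samples), n_regions, max_len)}
```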

## Code dev

  1. Add the transform in `src/data/transforms`.
  2. Add arguments in `src/arguments.py`.
  3. Add arguments to the corresponding function in `src/train.py`.

### The problem: generating random numbers with numpy in a multi-process data loader

See how this is handled in:

  • `transformers/trainer_utils.py`
  • `detectron2/data/build.py`

However, we use `datasets`'s `map`, which does not use sub-processes, so this issue does not affect our preprocessing.
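
For reference, the standard fix used by those files is to re-seed numpy inside each worker via `worker_init_fn`, since forked workers otherwise inherit identical RNG state; a minimal sketch:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # After fork, every worker inherits the same numpy RNG state, so all
    # workers would draw identical "random" numbers. Re-seed numpy from
    # torch's per-worker seed to make each worker's stream distinct.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

dataset = list(range(100))  # any map-style dataset
loader = DataLoader(dataset, batch_size=8, num_workers=4, worker_init_fn=seed_worker)
```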

## Visual Genome

Edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py, our loading script can load the data stored on Azure.

If all parameters in `src/conf/data/vg_densecap.yaml` are set to null, the loading script will use the default URLs. If you want to load data from Azure, you MUST UPDATE THE SAS KEY.
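
Purely for illustration (the actual parameter names are defined by `src/conf/data/vg_densecap.yaml` and our edited script; `base_image_url` and the URL below are made up), overriding the data location might look like:

```python
from datasets import load_dataset

# Hypothetical override: extra keyword arguments to load_dataset are
# forwarded to the dataset script's config, so an Azure URL with a fresh
# SAS key could be injected like this.
vg = load_dataset(
    "src/data/visual_genome.py",  # placeholder path to our edited script
    base_image_url="https://<account>.blob.core.windows.net/vg/images.zip?<SAS_KEY>",
)
```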

## RefCOCO series

Use refer2 for referring expression generation. The paper is SLR.

  • refcoco: location words allowed
  • refcoco+: no location words
  • refcocog: with or without location words

The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.

## SA1B-Cap

How the streaming loading is implemented in `datasets`:

### Load with azcopy

First, each `.tar` or `.tsv` file is downloaded to the local host with azcopy, into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.

After all file loading handles are released, the file is removed.
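
A sketch of the idea (the helper name, the exact azcopy invocation, and the handle tracking are illustrative, not the actual implementation):

```python
import hashlib
import os
import subprocess
import tempfile

def fetch_with_azcopy(url: str, prefix: str = "sca") -> str:
    # One temporary directory per URL, keyed by the hash of the URL.
    url_hash = hashlib.md5(url.encode()).hexdigest()
    local_dir = os.path.join(tempfile.gettempdir(), f"{prefix}-{url_hash}")
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(url.split("?")[0]))
    if not os.path.exists(local_path):
        subprocess.run(["azcopy", "copy", url, local_path], check=True)
    # The real implementation tracks open handles and deletes the file
    # once the last handle is released.
    return local_path
```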

### Legacy solution

Python's built-in `open` function is extended with streaming loading from the Internet by `xopen` in `datasets.download.streaming_download_manager`.

After that, `xopen` is further patched over `open` by `datasets.streaming`.

The `dl_manager` object in data scripts has an attribute called `is_streaming`, which indicates whether the data are loaded in streaming mode or not.
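
A quick check of that flag using `datasets`'s public download managers:

```python
from datasets.download import DownloadManager, StreamingDownloadManager

# A data script's _split_generators(self, dl_manager) receives one of these;
# branch on dl_manager.is_streaming when streaming needs special handling.
print(DownloadManager().is_streaming)           # False: files are materialized locally
print(StreamingDownloadManager().is_streaming)  # True: files are opened lazily via xopen
```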

## OpenImages

### Webdataset and DALLE-pytorch

There is a V6 dump (probably) in webdataset format (i.e., `.tar` files): see https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch.

```bash
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```

Train split: 523 GB

### FiftyOne

FiftyOne supports OpenImages V6 and V7.

(Using FiftyOne to load the 'train' split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)

  • https://docs.voxel51.com/integrations/open_images.html
  • https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset

Full split stats:

  • Train split: 1,743,042 images (513 GB)
  • Test split: 125,436 images (36 GB)
  • Validation split: 41,620 images (12 GB)

Download OpenImages V7 detections with FiftyOne:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download only the detection labels (and the corresponding images) per split.
validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
# Note: loading the train split is extremely slow (see the note above).
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```

## Detection data: COCO instances, Objects365, V3Det

The default `task_type` is recognition.

If you want to activate the task tokens for captioning, please use the `*task_type_caption*.yaml` configs.

Also see ./MODEL.md#multitaskv2.

## Panoptic segmentation data: COCO Panoptic, ADE20k Panoptic

From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md

  • It provides code to convert data to the panoptic format of detectron2.
  • It requires Detectron2 and git+https://github.com/cocodataset/panopticapi.git@7bb4655 to preprocess the data into detectron2 format.

### COCO Panoptic

https://cocodataset.org/#download

```bash
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip

unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip

# Point DETECTRON2_DATASETS at your dataset root before running.
DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```

### ADE20k Panoptic

http://sceneparsing.csail.mit.edu/

```bash
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016

wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar

# Point DETECTRON2_DATASETS at your dataset root before running.
DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py

# For example:
DETECTRON2_DATASETS=/home/t-yutonglin/xiaoke/segment-caption-anything-v2/tmp/data/mask2former_data python datasets/prepare_ade20k_ins_seg.py
```

The format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html. Usage:

  1. Add the custom dataset class in DatasetCatalog;
  2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
  3. MetadataCatalog contains info that is shared by all samples, like class labels.
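
A minimal sketch of steps 1 and 3 with detectron2's catalogs (the dataset name and the loader function are made up; step 2's mapper is passed to the dataloader builder instead):

```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset_dicts():
    # Load annotations (e.g., from a JSON) and return a list of dicts in
    # detectron2's standard format: file_name, height, width, annotations, ...
    return [{"file_name": "img0.jpg", "height": 480, "width": 640, "annotations": []}]

# 1. Register the dataset under a name.
DatasetCatalog.register("my_dataset_train", get_my_dataset_dicts)

# 3. Metadata shared by all samples, e.g., the class labels.
MetadataCatalog.get("my_dataset_train").thing_classes = ["person", "dog"]
```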

Check the data registrator, then check how the data is loaded with the built-in functions, then check the mapper.

Compare the data loading (of images) between [[detectron 2]] and [[hugging face - datasets library]].

From the perspective of [[hugging face - datasets library]], they are similar:

  1. Alike: the data script is the dataset that provides image paths and labels (it loads a JSON).
  2. Difference: we merge different datasets here; we should merge later instead.
  3. Then we use a transform function to load and process images and labels.
  4. We define a collator for the dataloader.
  5. Improvement: this is the place to merge multiple datasets, by merging the dataloaders. In [[OpenSEED]], it returns {"coco": coco_batch, "o365": o365_batch} (see the sketch below).
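
A sketch of that improvement in the OpenSEED style, zipping per-dataset loaders into one dict-of-batches loader (all names here are illustrative):

```python
class MergedLoader:
    """Yield one batch per dataset each step, e.g. {"coco": ..., "o365": ...}."""

    def __init__(self, loaders):
        self.loaders = loaders  # e.g. {"coco": coco_loader, "o365": o365_loader}

    def __iter__(self):
        iterators = {name: iter(dl) for name, dl in self.loaders.items()}
        while True:
            try:
                yield {name: next(it) for name, it in iterators.items()}
            except StopIteration:
                # Stop when the shortest dataset is exhausted.
                return

# Usage: for batch in MergedLoader({"coco": coco_loader, "o365": o365_loader}): ...
```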