# Data
By default, we cache the data in `.data.cache/`.
## About loading with `datasets`
We use `datasets` to load data.
For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because the data are accessed at random indexes.
In contrast, loading `.tar` or `.tsv` files is fast because the data are accessed in order.
As a result, we only use `streaming=True` when loading `SA1B-Cap`, due to its huge memory consumption; for VG and RefCOCOs, we set `streaming=False`.
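A minimal sketch of the two modes; the `region_descriptions_v1.2.0` config name and the local `SA1B-Cap` script path are assumptions:
```python
from datasets import load_dataset

# VG is loaded fully, since streaming .zip data is slow (streaming=False).
# The "region_descriptions_v1.2.0" config name is an assumption.
vg = load_dataset("visual_genome", "region_descriptions_v1.2.0", streaming=False)

# SA1B-Cap is streamed due to its huge memory consumption.
# "path/to/sa1b_cap.py" is a hypothetical local loading script.
sa1b = load_dataset("path/to/sa1b_cap.py", streaming=True)
```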
TODO: use `webdataset` for OpenImages (and SA1B).
## About data preprocessing
`data/transforms.py` takes each sample and processes all the regions inside it (see the sketch below):
1. Image: use the SAM processor to resize and pad images to 1024x1024.
2. Region box / point / mask: use the SAM processor to process the prompts.
3. Region captions: use the LM processor for tokenization; for SCA, we need to add a "virtual" `<BOS>` and a true `<EOS>`.
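A minimal sketch of the per-sample transform, assuming `transformers`' `SamProcessor`; the checkpoint names and sample field names are assumptions:
```python
from transformers import SamProcessor, AutoTokenizer

processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical LM checkpoint
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def transform(sample):
    # (1) resize/pad the image to 1024x1024 and (2) preprocess region boxes.
    inputs = processor(
        images=sample["image"],
        input_boxes=[sample["boxes"]],  # one list of [x1, y1, x2, y2] per image
        return_tensors="pt",
    )
    # (3) tokenize region captions and append the true <EOS>; the "virtual"
    # <BOS> is handled on the model side in SCA.
    captions = [c + tokenizer.eos_token for c in sample["captions"]]
    inputs["labels"] = tokenizer(captions, padding=True, return_tensors="pt").input_ids
    return inputs
```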
`data/collator.py` takes in multiple processed samples and forms tensors in batch format (see the sketch below):
1. If the number of regions differs across samples, we chunk each sample to the minimum number of regions in the batch.
2. For captions, we pad with `<PAD>` tokens during batching.
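A minimal sketch of the collator logic; the field names are illustrative:
```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate(samples, pad_token_id=0):
    # 1. Chunk every sample to the minimum number of regions in the batch.
    min_regions = min(len(s["input_ids"]) for s in samples)
    caption_ids = [ids for s in samples for ids in s["input_ids"][:min_regions]]
    # 2. Pad the caption token ids to the longest caption in the batch.
    labels = pad_sequence(
        [torch.as_tensor(ids) for ids in caption_ids],
        batch_first=True,
        padding_value=pad_token_id,
    )
    pixel_values = torch.stack([s["pixel_values"] for s in samples])
    return {"pixel_values": pixel_values, "labels": labels}
```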
### Code dev
1. Implement the transform in `src/data/transforms`
2. Add arguments in `src/arguments.py`
3. Wire the arguments into the corresponding function in `src/train.py`
A known pitfall: generating random numbers with NumPy in a multi-process data loader (workers can return identical numbers). See the PyTorch FAQ, and how the following files handle worker seeding:
- https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers
```
transformers/trainer_utils.py
detectron2/data/build.py
```
However, we use `datasets`'s `map`, which does not use subprocesses, so this pitfall does not apply to our preprocessing.
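For reference, the standard fix from the PyTorch FAQ when a multi-worker `DataLoader` is involved is a `worker_init_fn` that seeds each worker:
```python
import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Each DataLoader worker gets a distinct torch seed; derive the NumPy
    # and stdlib seeds from it so workers don't return identical numbers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# loader = torch.utils.data.DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker)
```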
## Visual Genome
Edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py, so that we can load the data stored on Azure.
- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).
If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs.
If you want to load data from Azure, you **MUST UPDATE THE SAS KEY**.
## RefCOCO series
Use `refer2` for referring expression generation; the accompanying paper is SLR.
- https://github.com/lichengunc/refer2
- https://arxiv.org/abs/1612.09542
- Thanks to [easy-to-understand-REG](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), which points out the problem of the data evolving across versions and uploads the evaluation sentences.
- `refcoco`: expressions contain location words.
- `refcoco+`: no location words allowed.
- `refcocog`: expressions with or without location words.
"testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively.
## SA1B-Cap
### The implementation of streaming loading in `datasets`
Two approaches are used below: downloading shards with `azcopy` (current), and the patched streaming `open` (legacy).
### Load with azcopy
First, each `tar` or `tsv` file is downloaded to the local host with `azcopy` into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.
After all file handles on it are released, the file is removed.
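Roughly, the download step can be sketched as follows; the function name, prefix, and hashing choice are assumptions, and the cleanup-on-release logic is omitted:
```python
import hashlib
import os
import subprocess
import tempfile

def fetch_with_azcopy(url: str, prefix: str = "sca-data") -> str:
    # Mirror the scheme above: /tmp/$PREFIX-$HASH_OF_URL holds the shard.
    url_hash = hashlib.md5(url.encode()).hexdigest()
    local_dir = os.path.join(tempfile.gettempdir(), f"{prefix}-{url_hash}")
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(url.split("?")[0]))
    if not os.path.exists(local_path):
        # `azcopy copy <source> <destination>` downloads the blob.
        subprocess.run(["azcopy", "copy", url, local_path], check=True)
    return local_path
```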
### Legacy solution
Python's built-in `open` is extended with streaming loading from the Internet by `xopen` in [`datasets.download.streaming_download_manager`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/download/streaming_download_manager.py#L471).
After that, `xopen` is further patched into `open` by [`datasets.streaming`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/streaming.py#L80).
The `dl_manager` object in data scripts has an `is_streaming` attribute that indicates whether the data are loaded in streaming mode.
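Inside a loading script, `_split_generators` can branch on that attribute; a sketch, where the shard URLs and split wiring are illustrative:
```python
import datasets

def _split_generators(self, dl_manager):
    # In streaming mode, archives are opened lazily over the network via
    # the patched open/xopen; otherwise files are downloaded to disk first.
    urls = ["https://example.com/shard-000.tar"]  # hypothetical shard URLs
    if dl_manager.is_streaming:
        archives = [dl_manager.iter_archive(u) for u in urls]
    else:
        local_paths = dl_manager.download(urls)
        archives = [dl_manager.iter_archive(p) for p in local_paths]
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN, gen_kwargs={"archives": archives}
        )
    ]
```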
## OpenImages
### Webdataset and DALLE-pytorch
There is a copy (probably V6) in webdataset format (i.e., `tar` shards):
- https://webdataset.github.io/webdataset/gettingstarted/
- https://github.com/lucidrains/DALLE-pytorch
```
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
echo $i
wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```
Train split: 523 GB
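Once downloaded, the shards can be read sequentially with `webdataset`; a minimal sketch, where the `jpg`/`json` keys inside each shard are assumptions:
```python
import webdataset as wds

shards = "webdataset-openimages/openimages-train-{000000..000554}.tar"
dataset = (
    wds.WebDataset(shards)
    .decode("pil")            # decode image bytes into PIL images
    .to_tuple("jpg", "json")  # assumed keys inside each shard
)
for image, meta in dataset:
    break  # first sample
```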
### FiftyOne
FiftyOne supports OpenImages V6 and V7.
(Using FiftyOne to load the `train` split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)
- https://docs.voxel51.com/integrations/open_images.html
- https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset
Full split stats:
- Train split: 1,743,042 images (513 GB)
- Test split: 125,436 images (36 GB)
- Validation split: 41,620 images (12 GB)
Download OpenImages V7 detections with FiftyOne:
```python
import fiftyone as fo
import fiftyone.zoo as foz

validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```
## Detection data: COCO instance, Objects365, v3det
The default `task_type` is `recognition`.
To activate the task tokens for `caption`, use the `*task_type_caption*.yaml` configs.
Also see [./MODEL.md#multitaskv2](./MODEL.md#multitaskv2).
## Panoptic Segmentation Data: COCO Panoptic, ADE20k Panoptic
From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md
- It provides code to convert data to panoptic format of detectron2.
- It requires `Detectron2` and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data to detectron2 format.
### COCO panoptic
https://cocodataset.org/#download
```
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip
# point DETECTRON2_DATASETS to your dataset root
DETECTRON2_DATASETS=<path/to/datasets> python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```
### ADE20k Panoptic
http://sceneparsing.csail.mit.edu/
```
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016
wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar
# point DETECTRON2_DATASETS to your dataset root
DETECTRON2_DATASETS=<path/to/datasets> python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS=<path/to/datasets> python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS=<path/to/datasets> python datasets/prepare_ade20k_ins_seg.py
```
The data format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html.
Usage:
1. Register the custom dataset in `DatasetCatalog`;
2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
3. `MetadataCatalog` contains info shared by all samples, such as class labels.
To trace the pipeline: check the dataset registration, then check how the data are loaded with the built-in functions, then check the mapper. A registration sketch follows.
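A minimal registration sketch against detectron2's catalogs; the dataset name, loader, and class list are hypothetical:
```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def load_my_dataset():
    # Return a list[dict] in detectron2's standard dataset format.
    return [
        {"file_name": "images/0001.jpg", "image_id": 0,
         "height": 480, "width": 640, "annotations": []},
    ]

# 1. Register the dataset under a name ("my_dataset_train" is hypothetical).
DatasetCatalog.register("my_dataset_train", load_my_dataset)
# 3. Attach metadata shared by all samples, e.g. class labels.
MetadataCatalog.get("my_dataset_train").thing_classes = ["person", "car"]
```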
## Comparing data (image) loading between detectron2 and the Hugging Face `datasets` library
From the point of view of the `datasets` library, the two pipelines are similar:
1. Alike: the data script plays the role of the dataset, providing image paths and labels (it loads a JSON).
   1. Difference: we currently merge different datasets at this stage; we should merge later instead.
2. Then a transform function loads and processes images and labels.
3. We define a collator for the dataloader.
   1. Improvement: this is the place to merge multiple datasets, by merging the dataloaders; in OpenSeeD, each step returns `{"coco": coco_batch, "o365": o365_batch}` (see the sketch below).
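A minimal sketch of that merged-dataloader idea; the class and names are illustrative, and OpenSeeD's actual implementation may differ:
```python
from typing import Dict, Iterator
from torch.utils.data import DataLoader

class MergedLoader:
    """Zip several dataloaders; each step yields {"name": batch, ...}."""

    def __init__(self, loaders: Dict[str, DataLoader]):
        self.loaders = loaders

    def __iter__(self) -> Iterator[dict]:
        # zip stops at the shortest loader; other sampling strategies are possible.
        for batches in zip(*self.loaders.values()):
            yield dict(zip(self.loaders.keys(), batches))

# usage: merged = MergedLoader({"coco": coco_loader, "o365": o365_loader})
# each iteration yields {"coco": coco_batch, "o365": o365_batch}
```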