|
# Data |
|
|
|
By default, we cache the data in `.data.cache/`. |
|
|
|
## About loading data with `datasets`
|
|
|
We use `datasets` to load data. |
|
|
|
For `.zip` files (e.g., VG, RefCOCOs), streaming is extremely slow because samples are fetched via random indices.
|
|
|
In contrast, loading `.tar` or `.tsv` files is faster because the data are accessed sequentially.
|
|
|
As a result, we only use `streaming=True` when loading `SA1B-Cap`, due to its huge memory consumption; for VG and RefCOCOs, we set `streaming=False`.
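
For illustration, the switch is just the `streaming` flag of `load_dataset`; the script paths below are placeholders, not the actual config values:

```python
from datasets import load_dataset

# SA1B-Cap: stream to avoid materializing the huge dataset in memory.
sa1b_cap = load_dataset("path/to/sa1b_cap_script.py", streaming=True)

# VG / RefCOCOs: zip-backed, so random access under streaming is slow;
# load them fully instead.
vg = load_dataset("path/to/vg_script.py", streaming=False)
```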
|
|
|
TODO: use webdataset for OpenImages (and SA1B).
|
|
|
## About data preprocessing
|
|
|
`data/transforms.py` takes each sample and processes all the regions inside it (see the sketch after this list):
|
1. Image: use the SAM processor to resize and pad images to 1024x1024.
2. Region box/point/mask: use the SAM processor to process the prompts.
3. Region captions: use the LM processor for tokenization; for SCA, we need to add a "virtual" `<BOS>` and a true `<EOS>`.
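
A minimal sketch of the per-sample transform under these three steps, assuming the HF `SamProcessor` and a GPT-2 tokenizer; all names are illustrative, not the actual code in `data/transforms.py`:

```python
from transformers import SamProcessor, AutoTokenizer

sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

def transform(sample):
    # Steps 1-2: resize + pad the image to 1024x1024; the processor
    # rescales the box/point prompts to the padded resolution as well.
    inputs = sam_processor(
        images=sample["image"],
        input_boxes=[sample["boxes"]],  # (1, num_regions, 4)
        return_tensors="pt",
    )
    # Step 3: tokenize region captions, appending the true <EOS>;
    # the "virtual" <BOS> is prepended on the model side.
    captions = tokenizer(
        [c + tokenizer.eos_token for c in sample["captions"]],
        padding=True,
        return_tensors="pt",
    )
    inputs["input_ids"] = captions["input_ids"]
    return inputs
```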
|
|
|
`data/collator.py` takes in multiple processed samples and forms batched tensors (see the sketch after this list):
|
1. If the number of regions differs across samples, we truncate each sample to the minimum number of regions in the batch.
2. For captions, we pad with `<PAD>` tokens when batchifying.
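
A minimal sketch of that collator logic; field names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def collate_fn(samples, pad_token_id=0):
    # 1. Truncate every sample to the minimum region count in the batch
    # so the region tensors can be stacked.
    n = min(s["input_boxes"].shape[0] for s in samples)
    boxes = torch.stack([s["input_boxes"][:n] for s in samples])

    # 2. Right-pad captions to the longest sequence with <PAD>.
    caps = [s["input_ids"][:n] for s in samples]
    max_len = max(c.shape[-1] for c in caps)
    caps = [F.pad(c, (0, max_len - c.shape[-1]), value=pad_token_id) for c in caps]
    return {"input_boxes": boxes, "input_ids": torch.stack(caps)}
```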
|
|
|
### Code dev |
|
|
|
1. Implement the transform in `src/data/transforms`.
2. Add arguments in `src/arguments.py`.
3. Add the arguments to the corresponding function in `src/train.py`.
|
|
|
The problem: generating random numbers with numpy in a multi-process data loader.
|
- https://pytorch.org/docs/stable/notes/faq.html#my-data-loader-workers-return-identical-random-numbers |
|
```
transformers/trainer_utils.py
detectron2/data/build.py
```
|
However, we use `datasets`'s `map`, which does not use sub-processes, so this issue does not arise there.
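
For reference, the standard fix from that FAQ when a multi-process `DataLoader` *is* used is to reseed NumPy per worker (a minimal sketch):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker (base_seed + worker_id),
    # so deriving the NumPy seed from it gives each worker a distinct stream.
    np.random.seed(torch.initial_seed() % 2**32)

dataset = list(range(100))  # any map-style dataset
loader = DataLoader(dataset, batch_size=8, num_workers=4, worker_init_fn=worker_init_fn)
```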
|
|
|
|
|
## Visual Genome |
|
|
|
Edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py, our script can load the data stored on Azure.
|
|
|
- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).
|
|
|
If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs.
|
If you want to load data from Azure, you **MUST UPDATE THE SAS KEY**. |
|
|
|
## RefCOCO series |
|
|
|
Use refer2 for referring expression generation; the paper is SLR ("A Joint Speaker-Listener-Reinforcer Model for Referring Expressions"). A loading sketch follows the links below.
|
- https://github.com/lichengunc/refer2 |
|
- https://arxiv.org/abs/1612.09542 |
|
- Thanks to [easy-to-understand-REG](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), which points out the data-evolution problem and uploads the evaluation sentences.
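
A minimal loading sketch, assuming refer2 keeps the `REFER` interface of the original refer toolkit; paths and split names are illustrative and may need adjusting:

```python
from refer import REFER  # from the refer2 repo, on PYTHONPATH

refer = REFER("data", dataset="refcocog", splitBy="umd")
ref_ids = refer.getRefIds(split="val")
for ref in refer.loadRefs(ref_ids[:3]):
    print(ref["ann_id"], [s["sent"] for s in ref["sentences"]])
```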
|
|
|
- RefCOCO: location words allowed.
- RefCOCO+: no location words.
- RefCOCOg: with or without location words.

The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.
|
|
|
## SA1B-Cap |
|
|
|
### The implementation of streaming loading in `datasets` |
|
|
|
#### Load with azcopy
|
|
|
First, each tar or tsv file is downloaded to the local host with `azcopy` into a temporary directory `/tmp/$PREFIX-$HASH_OF_URL`.
|
|
|
After all file handles on it are released, the file is removed.
|
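
A sketch of that download step; the helper name and `$PREFIX` handling below are hypothetical, not the actual implementation:

```python
import hashlib
import subprocess
from pathlib import Path

def fetch_with_azcopy(url: str, prefix: str = "sa1b-cap") -> Path:
    # Hypothetical helper: download `url` with azcopy into
    # /tmp/$PREFIX-$HASH_OF_URL and return the local file path.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    local_dir = Path(f"/tmp/{prefix}-{url_hash}")
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / url.split("?")[0].rsplit("/", 1)[-1]
    if not local_path.exists():
        subprocess.run(["azcopy", "copy", url, str(local_path)], check=True)
    return local_path
```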
|
|
|
#### Legacy solution
|
|
|
Python's `open` function is extended with streaming from the Internet by `xopen` in [`datasets.download.streaming_download_manager`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/download/streaming_download_manager.py#L471).
|
|
|
After that, `xopen` is further patched in as `open` by [`datasets.streaming`](https://github.com/huggingface/datasets/blob/029227a116c14720afca71b9b22e78eb2a1c09a6/src/datasets/streaming.py#L80).
|
|
|
The `dl_manager` object passed to data scripts has an `is_streaming` attribute that indicates whether the data are being loaded in streaming mode.
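
For example, a data script can branch on it in `_split_generators` (a sketch; the URL and split logic are illustrative):

```python
import datasets

_URL = "https://example.com/data.tar"  # placeholder

class MyDataset(datasets.GeneratorBasedBuilder):
    def _split_generators(self, dl_manager):
        if dl_manager.is_streaming:
            # Files are opened lazily through the patched `open`/`xopen`.
            archive = dl_manager.download(_URL)
        else:
            archive = dl_manager.download_and_extract(_URL)
        ...
```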
|
|
|
|
|
## OpenImages |
|
|
|
### Webdataset and DALLE-pytorch
|
|
|
OpenImages V6 (maybe) is available in webdataset format (i.e., `tar` shards); see https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch.
|
|
|
```
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```
|
|
|
Train split: 523 GB |
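
A minimal loading sketch with the `webdataset` library over the shards downloaded above; the `jpg`/`json` field names are typical for such shards but not verified here:

```python
import webdataset as wds

url = "openimages-train-{000000..000554}.tar"  # local shards from above
dataset = (
    wds.WebDataset(url)
    .decode("pil")                # decode image bytes with PIL
    .to_tuple("jpg;png", "json")  # (image, metadata) pairs
)
image, meta = next(iter(dataset))
print(image.size)
```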
|
|
|
### FiftyOne
|
|
|
FiftyOne supports OpenImages V6 and V7.
|
|
|
(Using FiftyOne to load the `train` split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)
|
|
|
https://docs.voxel51.com/integrations/open_images.html |
|
https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset |
|
|
|
Full split stats: |
|
- Train split: 1,743,042 images (513 GB) |
|
- Test split: 125,436 images (36 GB) |
|
- Validation split: 41,620 images (12 GB) |
|
|
|
Download OpenImages V7 detections with FiftyOne:
|
|
|
```python
import fiftyone as fo
import fiftyone.zoo as foz

validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```
|
|
|
|
|
## Detection data: COCO instances, Objects365, V3Det
|
|
|
The default task_type is `recognition`. |
|
|
|
If you want to activate the task tokens for `caption`, use the `*task_type_caption*.yaml` configs.
|
|
|
Also see [./MODEL.md#multitaskv2](./MODEL.md#multitaskv2). |
|
|
|
## Panoptic Segmentation Data: COCO Panoptic, ADE20k panoptic |
|
|
|
From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md |
|
- It provides code to convert data to detectron2's panoptic format.
- It requires `Detectron2` and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data into the detectron2 format.
|
|
|
### COCO panoptic |
|
|
|
https://cocodataset.org/#download |
|
|
|
```
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip

unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip

DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```
|
|
|
### ADE20k Panoptic
|
|
|
http://sceneparsing.csail.mit.edu/ |
|
|
|
```
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016

wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar

DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py

DETECTRON2_DATASETS=/home/t-yutonglin/xiaoke/segment-caption-anything-v2/tmp/data/mask2former_data python datasets/prepare_ade20k_ins_seg.py
```
|
|
|
The data format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html.

Usage (a registration sketch follows the list):

1. Register the custom dataset class in `DatasetCatalog`;
2. Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
3. `MetadataCatalog` contains info shared across all samples, such as class labels.
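
A minimal registration sketch; the dataset name and fields are illustrative:

```python
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.structures import BoxMode

def get_my_dicts():
    # Each dict follows the detectron2 standard dataset format linked above.
    return [
        {
            "file_name": "images/0001.jpg",
            "image_id": 0,
            "height": 480,
            "width": 640,
            "annotations": [
                {"bbox": [10, 10, 100, 100], "bbox_mode": BoxMode.XYXY_ABS, "category_id": 0},
            ],
        }
    ]

DatasetCatalog.register("my_dataset_train", get_my_dicts)
MetadataCatalog.get("my_dataset_train").thing_classes = ["person"]
```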
|
|
|
Code reading order: check the dataset registration, then how the data are loaded by the built-in functions, then the mapper.
|
|
|
## Comparing data loading (images) between [[detectron 2]] and [[hugging face - datasets library]]
|
|
|
From the viewpoint of [[hugging face - datasets library]], the two are similar:
|
|
|
1. Alike: the data script is the dataset; it provides image paths and labels (loads a JSON).
    1. Difference: we merge the different datasets at this stage; we should merge later instead.
2. Then we use a transform function to load and process images and labels.
3. We define a collator for the dataloader.
    1. Improvement: this is the place to merge multiple datasets, by merging the dataloaders; in [[OpenSEED]], each step returns `{"coco": coco_batch, "o365": o365_batch}` (see the sketch below).
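
A minimal sketch of that OpenSEED-style merging; the datasets below are placeholders:

```python
from torch.utils.data import DataLoader

# Placeholders standing in for the real COCO / Objects365 datasets.
coco_dataset, o365_dataset = list(range(8)), list(range(12))

loaders = {
    "coco": DataLoader(coco_dataset, batch_size=2),
    "o365": DataLoader(o365_dataset, batch_size=2),
}

def merged_batches(loaders):
    # Each step yields one batch per dataset, e.g.
    # {"coco": coco_batch, "o365": o365_batch}. zip stops at the
    # shortest loader; cycle the smaller ones if that matters.
    for batches in zip(*loaders.values()):
        yield dict(zip(loaders.keys(), batches))

for batch in merged_batches(loaders):
    print(batch)
```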
|
|