Data
By default, we cache the data in `.data.cache/`.
About datasets loading
We use `datasets` to load data.
For `.zip` files (e.g., VG, RefCOCOs), streaming fetching is extremely slow because the data are accessed via random indexes. In contrast, loading `.tar` or `.tsv` files is faster as the data are accessed in order.
As a result, we only use `streaming=True` when loading SA1B-Cap, because of its huge memory consumption, whereas for VG and RefCOCOs we set `streaming=False`.
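For reference, a minimal sketch of the two loading modes with the `datasets` library (the script paths here are placeholders, not the actual loading scripts):

```python
from datasets import load_dataset

# Streaming mode: samples are fetched lazily in storage order, so the whole
# dataset is never materialized in memory (what we do for SA1B-Cap).
streamed = load_dataset("path/to/sa1b_cap_script.py", split="train", streaming=True)

# Non-streaming mode: the dataset is downloaded and prepared once, then
# supports fast random access (what we do for VG and RefCOCOs).
materialized = load_dataset("path/to/vg_script.py", split="train", streaming=False)
```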
TODO: use webdataset for OpenImages (and SA1B).
About Data preprocessing
`data/transforms.py`: takes each sample and processes all the regions inside it:
- image: use the SAM processor to resize and pad images to 1024x1024.
- region box / point / mask: use the SAM processor to process the prompts.
- region captions: use the LM processor to do tokenization; for SCA, we need to add "virtual" and true .
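A hedged sketch of the per-sample transform described above, using the `transformers` SAM processor and an LM tokenizer (the actual processor classes and field names in `data/transforms.py` may differ):

```python
from transformers import SamProcessor, AutoTokenizer

sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder LM tokenizer

def transform(sample):
    # Resize/pad the image to 1024x1024 and preprocess the region boxes as SAM prompts.
    enc = sam_processor(
        images=sample["image"],
        input_boxes=[sample["region_boxes"]],  # list of [x1, y1, x2, y2] boxes for this image
        return_tensors="pt",
    )
    # Tokenize the region captions with the LM tokenizer.
    captions = tokenizer(sample["region_captions"])
    return {**enc, "caption_input_ids": captions["input_ids"]}
```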
`data/collator.py`: takes in multiple processed samples and forms tensors in the batch format:
- If the number of regions differs among the samples, we chunk each of them to the minimum number of regions.
- For captions, we pad the tokens during batchifying.
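A hedged sketch of the collation logic described above (the field names are illustrative, not necessarily those used in `data/collator.py`):

```python
import torch

def collate_fn(samples, pad_token_id=0):
    # Chunk every sample to the minimum number of regions in the batch so the
    # region dimension matches across samples.
    min_regions = min(s["boxes"].shape[0] for s in samples)
    boxes = torch.stack([s["boxes"][:min_regions] for s in samples])
    pixel_values = torch.stack([s["pixel_values"] for s in samples])

    # Pad the caption tokens of the kept regions to the longest caption in the batch.
    flat_captions = [
        torch.as_tensor(c) for s in samples for c in s["caption_input_ids"][:min_regions]
    ]
    max_len = max(len(c) for c in flat_captions)
    caption_input_ids = torch.full((len(flat_captions), max_len), pad_token_id, dtype=torch.long)
    for i, c in enumerate(flat_captions):
        caption_input_ids[i, : len(c)] = c

    return {"pixel_values": pixel_values, "boxes": boxes, "caption_input_ids": caption_input_ids}
```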
Code dev
`src/data/transforms`
- Add arguments in `src/arguments.py`
- Add arguments in the function in `src/train.py`
The problem: generating random numbers with numpy in a multi-process data loader. Relevant references:
- `transformers/trainer_utils.py`
- `detectron2/data/build.py`
However, we use `datasets`'s `map`, which does not use sub-processes.
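For context, the usual fix when sub-process workers are involved (similar in spirit to what those two files do) is to seed numpy per worker via `worker_init_fn`; a minimal sketch:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a distinct seed for each dataloader worker from the torch initial
    # seed, so numpy does not produce identical random numbers in every worker.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

# `dataset` is a placeholder for any map-style dataset:
# loader = DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker)
```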
Visual Genome
Edited from https://huggingface.co/datasets/visual_genome/blob/main/visual_genome.py, we can load the data stored on Azure.
- The broken links are fixed in https://huggingface.co/datasets/visual_genome/discussions/3#649d99c26a066a00a087b80d (as of 06/30/2023).
If all parameters in `src/conf/data/vg_densecap.yaml` are set to `null`, the loading script will use the default URLs. If you want to load data from Azure, you MUST UPDATE THE SAS KEY.
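A hypothetical sketch of that fallback behavior (the function name and default URL are placeholders, not the actual fields in `vg_densecap.yaml`):

```python
# Hypothetical: when a config value is null (None after parsing), fall back to the default URL.
DEFAULT_ANNOTATION_URL = "https://huggingface.co/datasets/visual_genome/..."  # placeholder

def resolve_url(configured_url):
    return configured_url if configured_url is not None else DEFAULT_ANNOTATION_URL
```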
RefCOCO series
Use refer2 for referring expression generation. The paper is SLR.
- https://github.com/lichengunc/refer2
- https://arxiv.org/abs/1612.09542
- Thanks to easy-to-understand-REG, which points out the data-evolving problem and uploads the evaluation sentences.
- refcoco: location
- refcoco+: no location
- refcocog: with or without location
- The "testA" and "testB" sets in RefCOCO and RefCOCO+ contain only people and only non-people, respectively.
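A hedged usage sketch, assuming refer2 keeps the same `REFER` interface as the original lichengunc/refer package:

```python
from refer import REFER  # module name assumed to match the refer2 repo

refer = REFER(data_root="data/refer", dataset="refcoco", splitBy="unc")
ref_ids = refer.getRefIds(split="testA")  # people-only split in RefCOCO
for ref in refer.loadRefs(ref_ids)[:3]:
    # Each ref carries the referring expressions used as region captions.
    print(ref["image_id"], [s["sent"] for s in ref["sentences"]])
```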
SA1B-Cap
The implementation of streaming loading in `datasets`:
Load with azcopy
First, each tar or tsv file is downloaded to the local host with `azcopy` to a temporary directory `/tmp/$PRFIX-$HASH_OF_URL`. After all file loading handles are released, the file will be removed.
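A hedged sketch of that download-then-clean-up flow (the prefix, hashing scheme, and cleanup hook are assumptions, not the exact implementation):

```python
import hashlib
import os
import subprocess

def fetch_with_azcopy(url, prefix="sa1b"):
    # Mirror one remote tar/tsv file into /tmp/<prefix>-<hash_of_url> with azcopy.
    url_hash = hashlib.md5(url.encode()).hexdigest()
    local_dir = f"/tmp/{prefix}-{url_hash}"
    os.makedirs(local_dir, exist_ok=True)
    subprocess.run(["azcopy", "copy", url, local_dir], check=True)
    return local_dir

# Once every handle on the local copy has been released, remove the directory,
# e.g. with shutil.rmtree(local_dir).
```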
Legacy solution
The `open` function of Python is extended with streaming loading from the Internet by `xopen` in `datasets.download.streaming_download_manager`. After that, `xopen` is further patched into `open` by `datasets.streaming`.
There is an attribute called `is_streaming` on the `dl_manager` object in data scripts, which indicates whether the data are loaded in streaming mode or not.
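For illustration, a hedged sketch of both pieces (the exact import path of `xopen` depends on the `datasets` version):

```python
from datasets.download.streaming_download_manager import xopen  # path may differ across versions

# xopen behaves like Python's open() but can also stream remote files over HTTP(S), e.g.:
# with xopen("https://.../shard-000000.tar", "rb") as f:
#     header = f.read(512)

class MyDataset:  # stand-in for a datasets.GeneratorBasedBuilder subclass
    def _split_generators(self, dl_manager):
        # dl_manager.is_streaming tells the script whether streaming mode is active.
        if dl_manager.is_streaming:
            ...  # iterate the archives sequentially
        else:
            ...  # download/extract first, then use random access
```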
OpenImages
Webdataset and pytorch-dalle
OpenImages V6 (maybe) is available in webdataset format (i.e., `.tar`): see https://webdataset.github.io/webdataset/gettingstarted/ and https://github.com/lucidrains/DALLE-pytorch
```bash
cd ~
mkdir webdataset-openimages
cd webdataset-openimages
# for i in http://storage.googleapis.com/nvdata-openimages/openimages-train-{000000..000554}.tar; do
for i in {000000..000554}; do
  echo $i
  wget http://storage.googleapis.com/nvdata-openimages/openimages-train-$i.tar
done
cd ..
```
Train split: 523 GB
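A hedged sketch of reading these shards with the `webdataset` library (the per-sample keys, e.g. `jpg`/`json`, are assumptions about how the shards were written):

```python
import webdataset as wds

shard_pattern = "webdataset-openimages/openimages-train-{000000..000554}.tar"
dataset = (
    wds.WebDataset(shard_pattern)
    .decode("pil")                # decode images to PIL
    .to_tuple("jpg;png", "json")  # assumed keys inside each tar sample
)
for image, annotation in dataset:
    break  # samples are read sequentially, which is why tar shards are fast
```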
Fiftyone
Openimages v6 and v7
(Using Fiftyone to load the 'train' split of OpenImages is extremely slow, as it loads the data into memory, which takes about 3 hours.)
https://docs.voxel51.com/integrations/open_images.html https://docs.voxel51.com/api/fiftyone.zoo.datasets.base.html#fiftyone.zoo.datasets.base.OpenImagesV7Dataset
Full split stats:
- Train split: 1,743,042 images (513 GB)
- Test split: 125,436 images (36 GB)
- Validation split: 41,620 images (12 GB)
Download OpenImagesV7 detections from fiftyone:
```python
import fiftyone as fo
import fiftyone.zoo as foz

validation_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="validation",
    label_types=["detections"],
)
test_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="test",
    label_types=["detections"],
)
train_dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
)
```
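A hedged follow-up showing how the downloaded labels can be read back (the `detections` field name follows `label_types` above but is an assumption):

```python
# Peek at a few samples; bounding boxes are relative [x, y, w, h].
for sample in validation_dataset.take(5):
    dets = sample["detections"]
    if dets is None:
        continue
    for det in dets.detections:
        print(sample.filepath, det.label, det.bounding_box)
```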
Detection data: COCO instance, Objects365, v3det
The default `task_type` is `recognition`. If you want to activate the task tokens for `caption`, please use `*task_type_caption*.yaml`. Also see ./MODEL.md#multitaskv2.
Panoptic Segmentation Data: COCO Panoptic, ADE20k panoptic
From Mask2Former: https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md
- It provides code to convert data to the panoptic format of detectron2.
- It requires Detectron2 and `git+https://github.com/cocodataset/panopticapi.git@7bb4655` to preprocess the data into the detectron2 format.
COCO panoptic
https://cocodataset.org/#download
```bash
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
unzip train2017.zip
unzip val2017.zip
unzip panoptic_annotations_trainval2017.zip
unzip annotations/panoptic_train2017.zip
unzip annotations/panoptic_val2017.zip
DETECTRON2_DATASETS= python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py
```
ADE20k Panoptic
http://sceneparsing.csail.mit.edu/
```bash
wget http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip
unzip ADEChallengeData2016.zip
cd ADEChallengeData2016
wget http://sceneparsing.csail.mit.edu/data/ChallengeData2017/annotations_instance.tar
tar -xvf annotations_instance.tar
DETECTRON2_DATASETS= python datasets/prepare_ade20k_sem_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_pan_seg.py
DETECTRON2_DATASETS= python datasets/prepare_ade20k_ins_seg.py
DETECTRON2_DATASETS=/home/t-yutonglin/xiaoke/segment-caption-anything-v2/tmp/data/mask2former_data python datasets/prepare_ade20k_ins_seg.py
```
The format should follow https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html. Usage:
- Add the custom dataset class in `DatasetCatalog`;
- Add a mapper to convert the arbitrary custom dataset to the standard format (load images from paths, augment images, and convert images to tensors);
- `MetadataCatalog` contains info that is shared across all samples, like class labels.
Check the data registrator, then check how the data is loaded with the built-in function, then check the mapper.
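A minimal sketch of that registration pattern with detectron2 (the dataset name and loader function are placeholders):

```python
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_my_dataset_dicts():
    # Return a list of dicts in detectron2's standard format
    # (file_name, height, width, annotations, ...), e.g. loaded from a json.
    return []

DatasetCatalog.register("my_dataset_train", get_my_dataset_dicts)
# MetadataCatalog stores info shared by all samples, e.g. class labels.
MetadataCatalog.get("my_dataset_train").thing_classes = ["person", "car"]
```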
Compare the data loading (image) between [[detectron 2]] and [[hugging face - datasets library]].
From the [[hugging face - datasets library]] point of view, they are similar:
- Alike: the data script is the dataset that provides image paths and labels (loads a json).
- Difference: we merge different datasets here; we should merge later.
- Then we use a transform function to load and process images and labels.
- We define a collator for the dataloader.
- Improvement: here is the place to merge multiple datasets, by merging the dataloaders. In [[OpenSEED]], it returns `{"coco": coco_batch, "o365": o365_batch}`.