Vision-CAIR committed
Commit deedda5
1 Parent(s): c55fc6f

Delete dataset
dataset/convert_cc_sbu.py DELETED
@@ -1,20 +0,0 @@
- import json
- import csv
-
- # specify input and output file paths
- input_file = 'ccs_synthetic_filtered_large.json'
- output_file = 'ccs_synthetic_filtered_large.tsv'
-
- # load JSON data from input file
- with open(input_file, 'r') as f:
-     data = json.load(f)
-
- # extract header and data from JSON
- header = data[0].keys()
- rows = [x.values() for x in data]
-
- # write data to TSV file
- with open(output_file, 'w') as f:
-     writer = csv.writer(f, delimiter='\t')
-     writer.writerow(header)
-     writer.writerows(rows)
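The deleted conversion logic can be sketched as a standalone function (the sample data in the test and the function name are illustrative, not from the repo). One detail worth noting: passing `newline=''` to `open` keeps `csv.writer` from emitting doubled line endings on Windows, and indexing each row by the header keys guards against records whose dict ordering differs:

```python
import csv
import json


def json_to_tsv(input_file: str, output_file: str) -> None:
    """Convert a JSON array of flat records into a TSV file."""
    with open(input_file, 'r') as f:
        data = json.load(f)

    # Use the first record's keys as the header; this assumes all
    # records share the same fields, as in the BLIP annotation files.
    header = list(data[0].keys())
    rows = [[record[key] for key in header] for record in data]

    # newline='' prevents csv.writer from doubling line endings.
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(header)
        writer.writerows(rows)
```

The original script's `x.values()` approach works only if every record's keys come back in the same order; indexing by the shared header makes that assumption explicit.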
dataset/convert_laion.py DELETED
@@ -1,20 +0,0 @@
- import json
- import csv
-
- # specify input and output file paths
- input_file = 'laion_synthetic_filtered_large.json'
- output_file = 'laion_synthetic_filtered_large.tsv'
-
- # load JSON data from input file
- with open(input_file, 'r') as f:
-     data = json.load(f)
-
- # extract header and data from JSON
- header = data[0].keys()
- rows = [x.values() for x in data]
-
- # write data to TSV file
- with open(output_file, 'w') as f:
-     writer = csv.writer(f, delimiter='\t')
-     writer.writerow(header)
-     writer.writerows(rows)
dataset/download_cc_sbu.sh DELETED
@@ -1,6 +0,0 @@
- #!/bin/bash
-
- img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv" \
-     --url_col "url" --caption_col "caption" --output_format webdataset \
-     --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 256 \
-     --enable_wandb True
dataset/download_laion.sh DELETED
@@ -1,6 +0,0 @@
- #!/bin/bash
-
- img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv" \
-     --url_col "url" --caption_col "caption" --output_format webdataset \
-     --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 256 \
-     --enable_wandb True
dataset/readme.md DELETED
@@ -1,92 +0,0 @@
- ## Download the filtered Conceptual Captions, SBU, and LAION datasets
-
- ### Pre-training datasets download:
- We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).
-
- Storing the LAION and CC3M+CC12M+SBU datasets requires about 2.3 TB of disk space.
-
- Image source | Filtered synthetic caption by ViT-L
- --- | :---:
- CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
- LAION115M | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>
-
- This will download two JSON files:
- ```
- ccs_synthetic_filtered_large.json
- laion_synthetic_filtered_large.json
- ```
-
- ## Prepare the data step by step
-
-
- ### Set up the dataset folders and move the annotation files to the data storage folder
- ```
- export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
- mkdir ${MINIGPT4_DATASET}/cc_sbu
- mkdir ${MINIGPT4_DATASET}/laion
- mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
- mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
- ```
-
- ### Copy the conversion and download scripts to the data storage folder
- ```
- cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
- cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
- cp convert_laion.py ${MINIGPT4_DATASET}/laion
- cp download_laion.sh ${MINIGPT4_DATASET}/laion
- ```
-
-
- ### Convert the laion and cc_sbu annotation files to the img2dataset format
- ```
- cd ${MINIGPT4_DATASET}/cc_sbu
- python convert_cc_sbu.py
-
- cd ${MINIGPT4_DATASET}/laion
- python convert_laion.py
- ```
-
- ### Download the datasets with img2dataset
- ```
- cd ${MINIGPT4_DATASET}/cc_sbu
- sh download_cc_sbu.sh
- cd ${MINIGPT4_DATASET}/laion
- sh download_laion.sh
- ```
-
-
- The final dataset structure:
-
- ```
- .
- └── ${MINIGPT4_DATASET}
-     ├── cc_sbu
-     │   ├── convert_cc_sbu.py
-     │   ├── download_cc_sbu.sh
-     │   ├── ccs_synthetic_filtered_large.json
-     │   ├── ccs_synthetic_filtered_large.tsv
-     │   └── cc_sbu_dataset
-     │       ├── 00000.tar
-     │       ├── 00000.parquet
-     │       ...
-     └── laion
-         ├── convert_laion.py
-         ├── download_laion.sh
-         ├── laion_synthetic_filtered_large.json
-         ├── laion_synthetic_filtered_large.tsv
-         └── laion_dataset
-             ├── 00000.tar
-             ├── 00000.parquet
-             ...
- ```
-
-
- ## Set up the dataset configuration files
-
- Then, set up the LAION dataset loading path in [here](../minigpt4/configs/datasets/laion/defaults.yaml#L13) at Line 13 as ${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar
-
- Similarly, set up the Conceptual Caption and SBU dataset loading path in [here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L13) at Line 13 as ${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
-
-
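Once the download finishes, the shard layout described in the deleted README can be sanity-checked with a short stdlib-only script (the paths and function names here are illustrative). Each webdataset shard produced by img2dataset is a plain tar archive, so `tarfile` is enough to peek inside:

```python
import glob
import os
import tarfile


def count_shards(dataset_dir: str) -> int:
    """Count the .tar shards img2dataset wrote into dataset_dir."""
    return len(glob.glob(os.path.join(dataset_dir, '*.tar')))


def list_first_members(shard_path: str, limit: int = 5) -> list:
    """Return the names of the first few members of one shard."""
    names = []
    with tarfile.open(shard_path) as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

For example, `count_shards(os.path.expandvars('${MINIGPT4_DATASET}/laion/laion_dataset'))` should report a number matching the `{00000..10488}` range used in the config, and each shard's members come in per-sample groups (image, caption, and metadata files sharing a key prefix).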