liangfeng committed on
Commit
b92a792
1 Parent(s): 0839f49
CODE_OF_CONDUCT.md DELETED
@@ -1,80 +0,0 @@
1
- # Code of Conduct
2
-
3
- ## Our Pledge
4
-
5
- In the interest of fostering an open and welcoming environment, we as
6
- contributors and maintainers pledge to make participation in our project and
7
- our community a harassment-free experience for everyone, regardless of age, body
8
- size, disability, ethnicity, sex characteristics, gender identity and expression,
9
- level of experience, education, socio-economic status, nationality, personal
10
- appearance, race, religion, or sexual identity and orientation.
11
-
12
- ## Our Standards
13
-
14
- Examples of behavior that contributes to creating a positive environment
15
- include:
16
-
17
- * Using welcoming and inclusive language
18
- * Being respectful of differing viewpoints and experiences
19
- * Gracefully accepting constructive criticism
20
- * Focusing on what is best for the community
21
- * Showing empathy towards other community members
22
-
23
- Examples of unacceptable behavior by participants include:
24
-
25
- * The use of sexualized language or imagery and unwelcome sexual attention or
26
- advances
27
- * Trolling, insulting/derogatory comments, and personal or political attacks
28
- * Public or private harassment
29
- * Publishing others' private information, such as a physical or electronic
30
- address, without explicit permission
31
- * Other conduct which could reasonably be considered inappropriate in a
32
- professional setting
33
-
34
- ## Our Responsibilities
35
-
36
- Project maintainers are responsible for clarifying the standards of acceptable
37
- behavior and are expected to take appropriate and fair corrective action in
38
- response to any instances of unacceptable behavior.
39
-
40
- Project maintainers have the right and responsibility to remove, edit, or
41
- reject comments, commits, code, wiki edits, issues, and other contributions
42
- that are not aligned to this Code of Conduct, or to ban temporarily or
43
- permanently any contributor for other behaviors that they deem inappropriate,
44
- threatening, offensive, or harmful.
45
-
46
- ## Scope
47
-
48
- This Code of Conduct applies within all project spaces, and it also applies when
49
- an individual is representing the project or its community in public spaces.
50
- Examples of representing a project or community include using an official
51
- project e-mail address, posting via an official social media account, or acting
52
- as an appointed representative at an online or offline event. Representation of
53
- a project may be further defined and clarified by project maintainers.
54
-
55
- This Code of Conduct also applies outside the project spaces when there is a
56
- reasonable belief that an individual's behavior may have a negative impact on
57
- the project or its community.
58
-
59
- ## Enforcement
60
-
61
- Instances of abusive, harassing, or otherwise unacceptable behavior may be
62
- reported by contacting the project team at <opensource-conduct@fb.com>. All
63
- complaints will be reviewed and investigated and will result in a response that
64
- is deemed necessary and appropriate to the circumstances. The project team is
65
- obligated to maintain confidentiality with regard to the reporter of an incident.
66
- Further details of specific enforcement policies may be posted separately.
67
-
68
- Project maintainers who do not follow or enforce the Code of Conduct in good
69
- faith may face temporary or permanent repercussions as determined by other
70
- members of the project's leadership.
71
-
72
- ## Attribution
73
-
74
- This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
75
- available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
76
-
77
- [homepage]: https://www.contributor-covenant.org
78
-
79
- For answers to common questions about this code of conduct, see
80
- https://www.contributor-covenant.org/faq
CONTRIBUTING.md DELETED
@@ -1,32 +0,0 @@
1
- # Contributing to OVSeg
2
- We want to make contributing to this project as easy and transparent as
3
- possible.
4
-
5
- ## Pull Requests
6
- We actively welcome your pull requests.
7
-
8
- 1. Fork the repo and create your branch from `main`.
9
- 2. If you've added code that should be tested, add tests.
10
- 3. If you've changed APIs, update the documentation.
11
- 4. Ensure the test suite passes.
12
- 5. Make sure your code lints.
13
- 6. If you haven't already, complete the Contributor License Agreement ("CLA").
14
-
15
- ## Contributor License Agreement ("CLA")
16
- In order to accept your pull request, we need you to submit a CLA. You only need
17
- to do this once to work on any of Meta's open source projects.
18
-
19
- Complete your CLA here: <https://code.facebook.com/cla>
20
-
21
- ## Issues
22
- We use GitHub issues to track public bugs. Please ensure your description is
23
- clear and has sufficient instructions to be able to reproduce the issue.
24
-
25
- Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
26
- disclosure of security bugs. In those cases, please go through the process
27
- outlined on that page and do not file a public issue.
28
-
29
-
30
- ## License
31
- By contributing to OVSeg, you agree that your contributions will be licensed
32
- under the LICENSE file in the root directory of this source tree.
GETTING_STARTED.md DELETED
@@ -1,99 +0,0 @@
1
- ## Getting started with OVSeg
2
-
3
-
4
- ### Try demo
5
-
6
- We release our largest model (Swin-Base + CLIP-ViT-L/14) [ovseg_swinbase_vitL14_ft_mpt.pth](https://drive.google.com/file/d/1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy/view?usp=sharing) (md5: <tt>526080</tt>).
7
-
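The checkpoint is identified above only by a short md5 prefix. A quick way to check a download against that prefix — a minimal sketch, assuming the file was saved under its release name in the current directory — is:

```python
import hashlib

ckpt_path = "ovseg_swinbase_vitL14_ft_mpt.pth"  # adjust to your download location

md5 = hashlib.md5()
with open(ckpt_path, "rb") as f:
    # Hash in 1 MB chunks so the large checkpoint is not read into memory at once.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

print(md5.hexdigest())
# The doc lists only the leading characters of the digest (526080).
assert md5.hexdigest().startswith("526080"), "checkpoint looks corrupted or incomplete"
```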
8
- - Test on sample image
9
- ```bash
10
- python demo.py --config-file configs/ovseg_swinB_vitL_demo.yaml --class-names 'Oculus' 'Ukulele' --input ./resources/demo_samples/sample_03.jpeg --output ./pred --opts MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth
11
- ```
12
-
13
- ### Evaluation with pre-trained weights
14
-
15
- We release our largest model (Swin-Base + CLIP-ViT-L/14) [ovseg_swinbase_vitL14_ft_mpt.pth](https://drive.google.com/file/d/1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy/view?usp=sharing) (md5: <tt>526080</tt>).
16
-
17
- - Test on ADE20K-150 and ADE-847
18
- ```bash
19
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth DATASETS.TEST \(\"ade20k_sem_seg_val\",\"ade20k_full_sem_seg_val\"\)
20
- ```
21
-
22
- - Test on PascalContext-59 and PascalContext-459
23
- ```bash
24
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT 0.6 DATASETS.TEST \(\"pascal_context_59_sem_seg_val\",\"pascal_context_459_sem_seg_val\",\)
25
- ```
26
-
27
- - Test on PascalVOC-20
28
- ```bash
29
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT 0.45 DATASETS.TEST \(\"pascalvoc20_sem_seg_val\",\)
30
- ```
31
-
32
- #### Performance benchmark
33
-
34
- | method | backbone | training dataset | A-847 | PC-459 | A-150 | PC-59 | PAS-20 |
35
- |------------------------------------|----------|------------------|:-----:|:------:|:-----:|:-----:|:------:|
36
- | Open-vocabulary generalist models. | | | | | | | |
37
- | SPNet | R-101 | PASCAL-15 | - | - | - | 24.3 | 18.3 |
38
- | ZS3Net | R-101 | PASCAL-15 | - | - | - | 19.4 | 38.3 |
39
- | LSeg | R-101 | PASCAL-15 | - | - | - | - | 47.4 |
40
- | LSeg+ | R-101 | COCO Panoptic | 2.5 | 5.2 | 13.0 | 36.0 | 59.0 |
41
- | SimBaseline | R-101c | COCO-Stuff-156 | - | - | 15.3 | - | 74.5 |
42
- | ZegFormer | R-50 | COCO-Stuff-156 | - | - | 16.4 | - | 80.7 |
43
- | OpenSeg | R-101 | COCO Panoptic | 4.0 | 6.5 | 15.3 | 36.9 | 60.0 |
44
- | OVSeg (Ours) | R-101c | COCO-Stuff-171 | 7.1 | 11.0 | 24.8 | 53.3 | 92.6 |
45
- | LSeg+ | Eff-B7 | COCO Panoptic | 3.8 | 7.8 | 18.0 | 46.5 | - |
46
- | OpenSeg | Eff-B7 | COCO Panoptic | 6.3 | 9.0 | 21.1 | 42.1 | - |
47
- | OVSeg (Ours) | Swin-B | COCO-Stuff-171 | 9.0 | 12.4 | 29.6 | 55.7 | 94.5 |
48
- | Supervised specialist models. | | | | | | | |
49
- | FCN | FCN-8s | Same as test | - | - | 29.4 | 37.8 | - |
50
- | Deeplab | R-101 | Same as test | - | - | - | 45.7 | 77.7 |
51
- | SelfTrain | Eff-L2 | Same as test | - | - | - | - | 90.0 |
52
-
53
- #### Ablation study
54
-
55
- - Mask prompt tuning can bring significant improvement without changing CLIP weights (Table 3 in [paper](https://arxiv.org/pdf/2210.04150.pdf))
56
-
57
- Download the checkpoint with mpt only [ovseg_swinbase_vitL14_mpt_only.pt](https://drive.google.com/file/d/1LJGWFjHw76OGDNy9r9KQIaACfIm9KMhQ/view?usp=sharing) (md5: <tt>2dd495</tt>).
58
-
59
- ```bash
60
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_mpt_only.pt DATASETS.TEST \(\"ade20k_sem_seg_val\",\"ade20k_full_sem_seg_val\"\)
61
- ```
62
-
63
- - Mask prompt tuning can improve over fully finetuned model (Table 3 in [paper](https://arxiv.org/pdf/2210.04150.pdf))
64
-
65
- With the same [ovseg_swinbase_vitL14_ft_mpt.pth](https://drive.google.com/file/d/1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy/view?usp=sharing) checkpoint, set `MASK_PROMPT_FWD` to `False`
66
-
67
- ```bash
68
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.CLIP_ADAPTER.MASK_PROMPT_FWD False MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth DATASETS.TEST \(\"ade20k_sem_seg_val\",\"ade20k_full_sem_seg_val\"\)
69
- ```
70
-
71
- - The effects of class prediction ensemble (Table 6 in [paper](https://arxiv.org/pdf/2210.04150.pdf))
72
-
73
- With the same [ovseg_swinbase_vitL14_ft_mpt.pth](https://drive.google.com/file/d/1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy/view?usp=sharing) checkpoint, set `CLIP_ENSEMBLE` to `False`.
74
-
75
- ```bash
76
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE False MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth DATASETS.TEST \(\"ade20k_sem_seg_val\",\"ade20k_full_sem_seg_val\"\)
77
- ```
78
-
79
- ### Training Segmentation model
80
-
81
- Our model is trained on COCO-Stuff
82
-
83
- - Training baseline w/ original CLIP
84
- ```
85
- python train_net.py --num-gpu 8 --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.CLIP_ADAPTER.MASK_PROMPT_FWD False
86
- ```
87
-
88
- To reproduce our final results, you may want to use our mask-adapted CLIP.
89
-
90
- - Training ovseg w/ mask-adapted CLIP
91
- ```
92
- python train_net.py --num-gpu 8 --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml MODEL.CLIP_ADAPTER.CLIP_MODEL_NAME #PATH_TO_MASKADAPTED_CLIP
93
- ```
94
-
95
- CAUTION: The final results are sensitive to the ensemble (appendix A.5 in [paper](https://arxiv.org/pdf/2210.04150.pdf)). Thus, you may want to use ```tools/search_thr_ensemble_w.sh``` to find the best ensemble hyper-parameters.
96
-
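`tools/search_thr_ensemble_w.sh` is the supported way to run that search. The sketch below only illustrates the idea — re-running the eval-only command while sweeping the ensemble weight defined in the config — and assumes the same command-line layout as the evaluation examples above, with a placeholder checkpoint path:

```python
import subprocess

# Sweep MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT (defined in ovseg_swinB_vitL_bs32_120k.yaml)
# and compare the reported mIoU between runs; the checkpoint path is a placeholder.
for weight in [0.5, 0.6, 0.7, 0.8]:
    subprocess.run(
        [
            "python", "train_net.py", "--num-gpu", "8", "--eval-only",
            "--config-file", "configs/ovseg_swinB_vitL_bs32_120k.yaml",
            "MODEL.WEIGHTS", "/path/to/ovseg_swinbase_vitL14_ft_mpt.pth",
            "MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT", str(weight),
            "DATASETS.TEST", '("ade20k_sem_seg_val",)',
        ],
        check=True,
    )
```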
97
- ### Fine-tuning CLIP with collected mask-category pairs
98
-
99
- We are still working on this part, stay tuned!
INSTALL.md DELETED
@@ -1,33 +0,0 @@
1
- ## Installation
2
-
3
- ### Requirements
4
- - Linux with Python ≥ 3.6
5
- - PyTorch ≥ 1.8 and [torchvision](https://github.com/pytorch/vision/) that matches the PyTorch installation.
6
- Install them together at [pytorch.org](https://pytorch.org) to make sure of this. Note: please check that the
7
- installed PyTorch version matches the one required by Detectron2.
8
- - Detectron2: follow [Detectron2 installation instructions](https://detectron2.readthedocs.io/tutorials/install.html).
9
-
10
- ### Usage
11
-
12
- Install required packages.
13
-
14
- ```bash
15
- conda create --name ovseg python=3.8
16
- conda activate ovseg
17
- conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
18
- pip install -r requirements.txt
19
- ```
20
-
21
- You need to install `detectron2==0.6` following the [instructions](https://detectron2.readthedocs.io/en/latest/tutorials/install.html)
22
-
23
- ```bash
24
- python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
25
- ```
26
-
27
-
28
- Furthermore, install the modified CLIP package.
29
-
30
- ```bash
31
- cd third_party/CLIP
32
- python -m pip install -Ue .
33
- ```
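After these steps, a short sanity check — a sketch that only assumes the packages installed above, and that the modified package keeps the upstream `clip` API — can confirm the environment is consistent:

```python
import torch
import torchvision
import detectron2
import clip  # the modified package installed from third_party/CLIP

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("clip models:", clip.available_models())  # should include ViT-L/14
```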
LICENSE DELETED
@@ -1,399 +0,0 @@
1
- Attribution-NonCommercial 4.0 International
2
-
3
- =======================================================================
4
-
5
- Creative Commons Corporation ("Creative Commons") is not a law firm and
6
- does not provide legal services or legal advice. Distribution of
7
- Creative Commons public licenses does not create a lawyer-client or
8
- other relationship. Creative Commons makes its licenses and related
9
- information available on an "as-is" basis. Creative Commons gives no
10
- warranties regarding its licenses, any material licensed under their
11
- terms and conditions, or any related information. Creative Commons
12
- disclaims all liability for damages resulting from their use to the
13
- fullest extent possible.
14
-
15
- Using Creative Commons Public Licenses
16
-
17
- Creative Commons public licenses provide a standard set of terms and
18
- conditions that creators and other rights holders may use to share
19
- original works of authorship and other material subject to copyright
20
- and certain other rights specified in the public license below. The
21
- following considerations are for informational purposes only, are not
22
- exhaustive, and do not form part of our licenses.
23
-
24
- Considerations for licensors: Our public licenses are
25
- intended for use by those authorized to give the public
26
- permission to use material in ways otherwise restricted by
27
- copyright and certain other rights. Our licenses are
28
- irrevocable. Licensors should read and understand the terms
29
- and conditions of the license they choose before applying it.
30
- Licensors should also secure all rights necessary before
31
- applying our licenses so that the public can reuse the
32
- material as expected. Licensors should clearly mark any
33
- material not subject to the license. This includes other CC-
34
- licensed material, or material used under an exception or
35
- limitation to copyright. More considerations for licensors:
36
- wiki.creativecommons.org/Considerations_for_licensors
37
-
38
- Considerations for the public: By using one of our public
39
- licenses, a licensor grants the public permission to use the
40
- licensed material under specified terms and conditions. If
41
- the licensor's permission is not necessary for any reason--for
42
- example, because of any applicable exception or limitation to
43
- copyright--then that use is not regulated by the license. Our
44
- licenses grant only permissions under copyright and certain
45
- other rights that a licensor has authority to grant. Use of
46
- the licensed material may still be restricted for other
47
- reasons, including because others have copyright or other
48
- rights in the material. A licensor may make special requests,
49
- such as asking that all changes be marked or described.
50
- Although not required by our licenses, you are encouraged to
51
- respect those requests where reasonable. More_considerations
52
- for the public:
53
- wiki.creativecommons.org/Considerations_for_licensees
54
-
55
- =======================================================================
56
-
57
- Creative Commons Attribution-NonCommercial 4.0 International Public
58
- License
59
-
60
- By exercising the Licensed Rights (defined below), You accept and agree
61
- to be bound by the terms and conditions of this Creative Commons
62
- Attribution-NonCommercial 4.0 International Public License ("Public
63
- License"). To the extent this Public License may be interpreted as a
64
- contract, You are granted the Licensed Rights in consideration of Your
65
- acceptance of these terms and conditions, and the Licensor grants You
66
- such rights in consideration of benefits the Licensor receives from
67
- making the Licensed Material available under these terms and
68
- conditions.
69
-
70
- Section 1 -- Definitions.
71
-
72
- a. Adapted Material means material subject to Copyright and Similar
73
- Rights that is derived from or based upon the Licensed Material
74
- and in which the Licensed Material is translated, altered,
75
- arranged, transformed, or otherwise modified in a manner requiring
76
- permission under the Copyright and Similar Rights held by the
77
- Licensor. For purposes of this Public License, where the Licensed
78
- Material is a musical work, performance, or sound recording,
79
- Adapted Material is always produced where the Licensed Material is
80
- synched in timed relation with a moving image.
81
-
82
- b. Adapter's License means the license You apply to Your Copyright
83
- and Similar Rights in Your contributions to Adapted Material in
84
- accordance with the terms and conditions of this Public License.
85
-
86
- c. Copyright and Similar Rights means copyright and/or similar rights
87
- closely related to copyright including, without limitation,
88
- performance, broadcast, sound recording, and Sui Generis Database
89
- Rights, without regard to how the rights are labeled or
90
- categorized. For purposes of this Public License, the rights
91
- specified in Section 2(b)(1)-(2) are not Copyright and Similar
92
- Rights.
93
- d. Effective Technological Measures means those measures that, in the
94
- absence of proper authority, may not be circumvented under laws
95
- fulfilling obligations under Article 11 of the WIPO Copyright
96
- Treaty adopted on December 20, 1996, and/or similar international
97
- agreements.
98
-
99
- e. Exceptions and Limitations means fair use, fair dealing, and/or
100
- any other exception or limitation to Copyright and Similar Rights
101
- that applies to Your use of the Licensed Material.
102
-
103
- f. Licensed Material means the artistic or literary work, database,
104
- or other material to which the Licensor applied this Public
105
- License.
106
-
107
- g. Licensed Rights means the rights granted to You subject to the
108
- terms and conditions of this Public License, which are limited to
109
- all Copyright and Similar Rights that apply to Your use of the
110
- Licensed Material and that the Licensor has authority to license.
111
-
112
- h. Licensor means the individual(s) or entity(ies) granting rights
113
- under this Public License.
114
-
115
- i. NonCommercial means not primarily intended for or directed towards
116
- commercial advantage or monetary compensation. For purposes of
117
- this Public License, the exchange of the Licensed Material for
118
- other material subject to Copyright and Similar Rights by digital
119
- file-sharing or similar means is NonCommercial provided there is
120
- no payment of monetary compensation in connection with the
121
- exchange.
122
-
123
- j. Share means to provide material to the public by any means or
124
- process that requires permission under the Licensed Rights, such
125
- as reproduction, public display, public performance, distribution,
126
- dissemination, communication, or importation, and to make material
127
- available to the public including in ways that members of the
128
- public may access the material from a place and at a time
129
- individually chosen by them.
130
-
131
- k. Sui Generis Database Rights means rights other than copyright
132
- resulting from Directive 96/9/EC of the European Parliament and of
133
- the Council of 11 March 1996 on the legal protection of databases,
134
- as amended and/or succeeded, as well as other essentially
135
- equivalent rights anywhere in the world.
136
-
137
- l. You means the individual or entity exercising the Licensed Rights
138
- under this Public License. Your has a corresponding meaning.
139
-
140
- Section 2 -- Scope.
141
-
142
- a. License grant.
143
-
144
- 1. Subject to the terms and conditions of this Public License,
145
- the Licensor hereby grants You a worldwide, royalty-free,
146
- non-sublicensable, non-exclusive, irrevocable license to
147
- exercise the Licensed Rights in the Licensed Material to:
148
-
149
- a. reproduce and Share the Licensed Material, in whole or
150
- in part, for NonCommercial purposes only; and
151
-
152
- b. produce, reproduce, and Share Adapted Material for
153
- NonCommercial purposes only.
154
-
155
- 2. Exceptions and Limitations. For the avoidance of doubt, where
156
- Exceptions and Limitations apply to Your use, this Public
157
- License does not apply, and You do not need to comply with
158
- its terms and conditions.
159
-
160
- 3. Term. The term of this Public License is specified in Section
161
- 6(a).
162
-
163
- 4. Media and formats; technical modifications allowed. The
164
- Licensor authorizes You to exercise the Licensed Rights in
165
- all media and formats whether now known or hereafter created,
166
- and to make technical modifications necessary to do so. The
167
- Licensor waives and/or agrees not to assert any right or
168
- authority to forbid You from making technical modifications
169
- necessary to exercise the Licensed Rights, including
170
- technical modifications necessary to circumvent Effective
171
- Technological Measures. For purposes of this Public License,
172
- simply making modifications authorized by this Section 2(a)
173
- (4) never produces Adapted Material.
174
-
175
- 5. Downstream recipients.
176
-
177
- a. Offer from the Licensor -- Licensed Material. Every
178
- recipient of the Licensed Material automatically
179
- receives an offer from the Licensor to exercise the
180
- Licensed Rights under the terms and conditions of this
181
- Public License.
182
-
183
- b. No downstream restrictions. You may not offer or impose
184
- any additional or different terms or conditions on, or
185
- apply any Effective Technological Measures to, the
186
- Licensed Material if doing so restricts exercise of the
187
- Licensed Rights by any recipient of the Licensed
188
- Material.
189
-
190
- 6. No endorsement. Nothing in this Public License constitutes or
191
- may be construed as permission to assert or imply that You
192
- are, or that Your use of the Licensed Material is, connected
193
- with, or sponsored, endorsed, or granted official status by,
194
- the Licensor or others designated to receive attribution as
195
- provided in Section 3(a)(1)(A)(i).
196
-
197
- b. Other rights.
198
-
199
- 1. Moral rights, such as the right of integrity, are not
200
- licensed under this Public License, nor are publicity,
201
- privacy, and/or other similar personality rights; however, to
202
- the extent possible, the Licensor waives and/or agrees not to
203
- assert any such rights held by the Licensor to the limited
204
- extent necessary to allow You to exercise the Licensed
205
- Rights, but not otherwise.
206
-
207
- 2. Patent and trademark rights are not licensed under this
208
- Public License.
209
-
210
- 3. To the extent possible, the Licensor waives any right to
211
- collect royalties from You for the exercise of the Licensed
212
- Rights, whether directly or through a collecting society
213
- under any voluntary or waivable statutory or compulsory
214
- licensing scheme. In all other cases the Licensor expressly
215
- reserves any right to collect such royalties, including when
216
- the Licensed Material is used other than for NonCommercial
217
- purposes.
218
-
219
- Section 3 -- License Conditions.
220
-
221
- Your exercise of the Licensed Rights is expressly made subject to the
222
- following conditions.
223
-
224
- a. Attribution.
225
-
226
- 1. If You Share the Licensed Material (including in modified
227
- form), You must:
228
-
229
- a. retain the following if it is supplied by the Licensor
230
- with the Licensed Material:
231
-
232
- i. identification of the creator(s) of the Licensed
233
- Material and any others designated to receive
234
- attribution, in any reasonable manner requested by
235
- the Licensor (including by pseudonym if
236
- designated);
237
-
238
- ii. a copyright notice;
239
-
240
- iii. a notice that refers to this Public License;
241
-
242
- iv. a notice that refers to the disclaimer of
243
- warranties;
244
-
245
- v. a URI or hyperlink to the Licensed Material to the
246
- extent reasonably practicable;
247
-
248
- b. indicate if You modified the Licensed Material and
249
- retain an indication of any previous modifications; and
250
-
251
- c. indicate the Licensed Material is licensed under this
252
- Public License, and include the text of, or the URI or
253
- hyperlink to, this Public License.
254
-
255
- 2. You may satisfy the conditions in Section 3(a)(1) in any
256
- reasonable manner based on the medium, means, and context in
257
- which You Share the Licensed Material. For example, it may be
258
- reasonable to satisfy the conditions by providing a URI or
259
- hyperlink to a resource that includes the required
260
- information.
261
-
262
- 3. If requested by the Licensor, You must remove any of the
263
- information required by Section 3(a)(1)(A) to the extent
264
- reasonably practicable.
265
-
266
- 4. If You Share Adapted Material You produce, the Adapter's
267
- License You apply must not prevent recipients of the Adapted
268
- Material from complying with this Public License.
269
-
270
- Section 4 -- Sui Generis Database Rights.
271
-
272
- Where the Licensed Rights include Sui Generis Database Rights that
273
- apply to Your use of the Licensed Material:
274
-
275
- a. for the avoidance of doubt, Section 2(a)(1) grants You the right
276
- to extract, reuse, reproduce, and Share all or a substantial
277
- portion of the contents of the database for NonCommercial purposes
278
- only;
279
-
280
- b. if You include all or a substantial portion of the database
281
- contents in a database in which You have Sui Generis Database
282
- Rights, then the database in which You have Sui Generis Database
283
- Rights (but not its individual contents) is Adapted Material; and
284
-
285
- c. You must comply with the conditions in Section 3(a) if You Share
286
- all or a substantial portion of the contents of the database.
287
-
288
- For the avoidance of doubt, this Section 4 supplements and does not
289
- replace Your obligations under this Public License where the Licensed
290
- Rights include other Copyright and Similar Rights.
291
-
292
- Section 5 -- Disclaimer of Warranties and Limitation of Liability.
293
-
294
- a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
295
- EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
296
- AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
297
- ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
298
- IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
299
- WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
300
- PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
301
- ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
302
- KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
303
- ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
304
-
305
- b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
306
- TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
307
- NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
308
- INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
309
- COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
310
- USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
311
- ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
312
- DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
313
- IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
314
-
315
- c. The disclaimer of warranties and limitation of liability provided
316
- above shall be interpreted in a manner that, to the extent
317
- possible, most closely approximates an absolute disclaimer and
318
- waiver of all liability.
319
-
320
- Section 6 -- Term and Termination.
321
-
322
- a. This Public License applies for the term of the Copyright and
323
- Similar Rights licensed here. However, if You fail to comply with
324
- this Public License, then Your rights under this Public License
325
- terminate automatically.
326
-
327
- b. Where Your right to use the Licensed Material has terminated under
328
- Section 6(a), it reinstates:
329
-
330
- 1. automatically as of the date the violation is cured, provided
331
- it is cured within 30 days of Your discovery of the
332
- violation; or
333
-
334
- 2. upon express reinstatement by the Licensor.
335
-
336
- For the avoidance of doubt, this Section 6(b) does not affect any
337
- right the Licensor may have to seek remedies for Your violations
338
- of this Public License.
339
-
340
- c. For the avoidance of doubt, the Licensor may also offer the
341
- Licensed Material under separate terms or conditions or stop
342
- distributing the Licensed Material at any time; however, doing so
343
- will not terminate this Public License.
344
-
345
- d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
346
- License.
347
-
348
- Section 7 -- Other Terms and Conditions.
349
-
350
- a. The Licensor shall not be bound by any additional or different
351
- terms or conditions communicated by You unless expressly agreed.
352
-
353
- b. Any arrangements, understandings, or agreements regarding the
354
- Licensed Material not stated herein are separate from and
355
- independent of the terms and conditions of this Public License.
356
-
357
- Section 8 -- Interpretation.
358
-
359
- a. For the avoidance of doubt, this Public License does not, and
360
- shall not be interpreted to, reduce, limit, restrict, or impose
361
- conditions on any use of the Licensed Material that could lawfully
362
- be made without permission under this Public License.
363
-
364
- b. To the extent possible, if any provision of this Public License is
365
- deemed unenforceable, it shall be automatically reformed to the
366
- minimum extent necessary to make it enforceable. If the provision
367
- cannot be reformed, it shall be severed from this Public License
368
- without affecting the enforceability of the remaining terms and
369
- conditions.
370
-
371
- c. No term or condition of this Public License will be waived and no
372
- failure to comply consented to unless expressly agreed to by the
373
- Licensor.
374
-
375
- d. Nothing in this Public License constitutes or may be interpreted
376
- as a limitation upon, or waiver of, any privileges and immunities
377
- that apply to the Licensor or You, including from the legal
378
- processes of any jurisdiction or authority.
379
-
380
- =======================================================================
381
-
382
- Creative Commons is not a party to its public
383
- licenses. Notwithstanding, Creative Commons may elect to apply one of
384
- its public licenses to material it publishes and in those instances
385
- will be considered the “Licensor.” The text of the Creative Commons
386
- public licenses is dedicated to the public domain under the CC0 Public
387
- Domain Dedication. Except for the limited purpose of indicating that
388
- material is shared under a Creative Commons public license or as
389
- otherwise permitted by the Creative Commons policies published at
390
- creativecommons.org/policies, Creative Commons does not authorize the
391
- use of the trademark "Creative Commons" or any other trademark or logo
392
- of Creative Commons without its prior written consent including,
393
- without limitation, in connection with any unauthorized modifications
394
- to any of its public licenses or any other arrangements,
395
- understandings, or agreements concerning use of licensed material. For
396
- the avoidance of doubt, this paragraph does not form part of the
397
- public licenses.
398
-
399
- Creative Commons may be contacted at creativecommons.org.
README.md CHANGED
@@ -10,60 +10,4 @@ pinned: false
10
  license: cc-by-nc-4.0
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
-
15
- # [OVSeg] Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
16
-
17
- <img src="resources/pytorch-logo-dark.png" width="10%">
18
-
19
- This is the official PyTorch implementation of our paper: <br>
20
- **Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP**<br>
21
- [Feng Liang](https://jeff-liangf.github.io/), [Bichen Wu](https://www.linkedin.com/in/bichenwu), [Xiaoliang Dai](https://sites.google.com/view/xiaoliangdai/), [Kunpeng Li](https://kunpengli1994.github.io/), [Yinan Zhao](https://yinan-zhao.github.io/), [Hang Zhang](https://hangzhang.org/), [Peizhao Zhang](https://www.linkedin.com/in/peizhao-zhang-14846042/), [Peter Vajda](https://sites.google.com/site/vajdap), [Diana Marculescu](https://www.ece.utexas.edu/people/faculty/diana-marculescu)
22
-
23
- [[arXiv](https://arxiv.org/abs/2210.04150)] [[Project](https://jeff-liangf.github.io/projects/ovseg/)]
24
-
25
- <p align="center">
26
- <img src="resources/ovseg.gif" width="100%">
27
- </p>
28
-
29
-
30
- ## Installation
31
-
32
- Please see [installation guide](./INSTALL.md).
33
-
34
- ## Data Preparation
35
-
36
- Please see [datasets preparation](./datasets/DATASETS.md).
37
-
38
- ## Getting started
39
-
40
- Please see [getting started instruction](./GETTING_STARTED.md).
41
-
42
- ## LICENSE
43
-
44
- Shield: [![CC BY-NC 4.0][cc-by-nc-shield]][cc-by-nc]
45
-
46
- The majority of OVSeg is licensed under a
47
- [Creative Commons Attribution-NonCommercial 4.0 International License](LICENSE).
48
-
49
- [![CC BY-NC 4.0][cc-by-nc-image]][cc-by-nc]
50
-
51
- [cc-by-nc]: http://creativecommons.org/licenses/by-nc/4.0/
52
- [cc-by-nc-image]: https://licensebuttons.net/l/by-nc/4.0/88x31.png
53
- [cc-by-nc-shield]: https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg
54
-
55
- However, portions of the project are under separate license terms: CLIP and ZSSEG are licensed under the [MIT license](https://github.com/openai/CLIP/blob/main/LICENSE); MaskFormer is licensed under the [CC-BY-NC](https://github.com/facebookresearch/MaskFormer/blob/main/LICENSE); openclip is licensed under the license at [its repo](https://github.com/mlfoundations/open_clip/blob/main/LICENSE).
56
-
57
-
58
- ## Citing OVSeg :pray:
59
-
60
- If you use OVSeg in your research or wish to refer to the baseline results published in the paper, please use the following BibTeX entry.
61
-
62
- ```BibTeX
63
- @article{liang2022open,
64
- title={Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP},
65
- author={Liang, Feng and Wu, Bichen and Dai, Xiaoliang and Li, Kunpeng and Zhao, Yinan and Zhang, Hang and Zhang, Peizhao and Vajda, Peter and Marculescu, Diana},
66
- journal={arXiv preprint arXiv:2210.04150},
67
- year={2022}
68
- }
69
- ```
10
  license: cc-by-nc-4.0
11
  ---
12
 
13
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -6,6 +6,13 @@ import multiprocessing as mp
6
  import numpy as np
7
  from PIL import Image
8
 
 
 
 
 
 
 
 
9
  from detectron2.config import get_cfg
10
 
11
  from detectron2.projects.deeplab import add_deeplab_config
@@ -15,6 +22,12 @@ from open_vocab_seg.utils import VisualizationDemo
15
 
16
  import gradio as gr
17
 
 
 
 
 
 
 
18
  def setup_cfg(config_file):
19
  # load config from file and command-line arguments
20
  cfg = get_cfg()
@@ -27,7 +40,7 @@ def setup_cfg(config_file):
27
 
28
  def inference(class_names, input_img):
29
  mp.set_start_method("spawn", force=True)
30
- config_file = './configs/ovseg_swinB_vitL_demo.yaml'
31
  cfg = setup_cfg(config_file)
32
 
33
  demo = VisualizationDemo(cfg)
@@ -38,19 +51,18 @@ def inference(class_names, input_img):
38
 
39
  return Image.fromarray(np.uint8(visualized_output.get_image())).convert('RGB')
40
 
41
- # demo = gr.Interface(fn=greet, inputs="text", outputs="text")
42
- # demo.launch()
43
-
44
 
45
- examples = [['Oculus, Ukulele', './resources/demo_samples/sample_03.jpeg'],]
 
 
46
  output_labels = ['segmentation map']
47
 
48
  title = 'OVSeg'
49
 
50
  description = """
51
- Gradio Demo for Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP \n
52
- You may click on of the examples or upload your own image. \n
53
- OVSeg could perform open vocabulary segmentation, you may input more classes (seperate by comma).
54
  """
55
 
56
  article = """
@@ -59,7 +71,7 @@ article = """
59
  Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
60
  </a>
61
  |
62
- <a href='https://github.com' target='_blank'>Github Repo</a></p>
63
  """
64
 
65
  gr.Interface(
6
  import numpy as np
7
  from PIL import Image
8
 
9
+
10
+ try:
11
+     import detectron2
12
+ except:
13
+     import os
14
+     os.system('pip install git+https://github.com/facebookresearch/detectron2.git')
15
+
16
  from detectron2.config import get_cfg
17
 
18
  from detectron2.projects.deeplab import add_deeplab_config
22
 
23
  import gradio as gr
24
 
25
+ import gdown
26
+
27
+ ckpt_url = 'https://drive.google.com/uc?id=1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy'
28
+ output = './ovseg_swinbase_vitL14_ft_mpt.pth'
29
+ gdown.download(ckpt_url, output, quiet=False)
30
+
31
  def setup_cfg(config_file):
32
  # load config from file and command-line arguments
33
  cfg = get_cfg()
40
 
41
  def inference(class_names, input_img):
42
  mp.set_start_method("spawn", force=True)
43
+ config_file = './ovseg_swinB_vitL_demo.yaml'
44
  cfg = setup_cfg(config_file)
45
 
46
  demo = VisualizationDemo(cfg)
51
 
52
  return Image.fromarray(np.uint8(visualized_output.get_image())).convert('RGB')
53
 
 
 
 
54
 
55
+ examples = [['Oculus, Ukulele', './resources/demo_samples/sample_03.jpeg'],
56
+ ['Saturn V, toys, blossom', './resources/demo_samples/sample_01.jpeg'],
57
+ ['Golden gate, yacht', './resources/demo_samples/sample_02.jpeg'],]
58
  output_labels = ['segmentation map']
59
 
60
  title = 'OVSeg'
61
 
62
  description = """
63
+ Gradio Demo for Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. \n
64
+ OVSeg can perform open-vocabulary segmentation; you may input more classes (separated by commas). You may click one of the examples or upload your own image. \n
65
+ It might take some time to process. Cheers!
66
  """
67
 
68
  article = """
71
  Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
72
  </a>
73
  |
74
+ <a href='https://github.com/facebookresearch/ov-seg' target='_blank'>Github Repo</a></p>
75
  """
76
 
77
  gr.Interface(
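The updated `app.py` now fetches the checkpoint with `gdown` on every startup. A small guard — a sketch that reuses the `ckpt_url` and `output` names from the diff above — would skip the download when the file is already cached on disk:

```python
import os

import gdown

ckpt_url = 'https://drive.google.com/uc?id=1cn-ohxgXDrDfkzC1QdO-fi8IjbjXmgKy'
output = './ovseg_swinbase_vitL14_ft_mpt.pth'

# Only download when the checkpoint is not already present; gdown otherwise
# re-downloads the multi-GB file on every restart.
if not os.path.exists(output):
    gdown.download(ckpt_url, output, quiet=False)
```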
configs/ovseg_swinB_vitL_bs32_120k.yaml DELETED
@@ -1,100 +0,0 @@
1
- MODEL:
2
- META_ARCHITECTURE: "OVSeg"
3
- BACKBONE:
4
- FREEZE_AT: 0
5
- NAME: "D2SwinTransformer"
6
- SWIN:
7
- EMBED_DIM: 128
8
- DEPTHS: [2, 2, 18, 2]
9
- NUM_HEADS: [4, 8, 16, 32]
10
- WINDOW_SIZE: 12
11
- APE: False
12
- DROP_PATH_RATE: 0.3
13
- PATCH_NORM: True
14
- PRETRAIN_IMG_SIZE: 384
15
- WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
16
- PIXEL_MEAN: [123.675, 116.280, 103.530]
17
- PIXEL_STD: [58.395, 57.120, 57.375]
18
- SEM_SEG_HEAD:
19
- NAME: "OpenVocabMaskFormerHead"
20
- IN_FEATURES: ["res2", "res3", "res4", "res5"]
21
- IGNORE_VALUE: 255
22
- NUM_CLASSES: 171 # number of categories in training set
23
- EMBEDDING_DIM: 768
24
- EMBED_LAYERS: 2
25
- COMMON_STRIDE: 4 # not used, hard-coded
26
- LOSS_WEIGHT: 1.0
27
- CONVS_DIM: 256
28
- MASK_DIM: 256
29
- NORM: "GN"
30
- MASK_FORMER:
31
- TRANSFORMER_IN_FEATURE: "res5"
32
- DEEP_SUPERVISION: True
33
- NO_OBJECT_WEIGHT: 0.1
34
- DICE_WEIGHT: 1.0
35
- MASK_WEIGHT: 20.0
36
- HIDDEN_DIM: 256
37
- NUM_OBJECT_QUERIES: 100
38
- NHEADS: 8
39
- DROPOUT: 0.1
40
- DIM_FEEDFORWARD: 2048
41
- ENC_LAYERS: 0
42
- DEC_LAYERS: 6
43
- PRE_NORM: False
44
- CLIP_ADAPTER:
45
- TEXT_TEMPLATES: "vild"
46
- CLIP_MODEL_NAME: "ViT-L/14"
47
- MASK_FILL: "mean"
48
- MASK_EXPAND_RATIO: 1.0
49
- MASK_THR: 0.4 # choose the foreground objects
50
- MASK_MATTING: False # use soft background, default not used
51
- MASK_PROMPT_DEPTH: 3
52
- MASK_PROMPT_FWD: True # use mask prompt during forward
53
- REGION_RESIZED: True # resize to the input of clip, e.g., 224
54
- CLIP_ENSEMBLE: True # use ensemble of two classification branches
55
- CLIP_ENSEMBLE_WEIGHT: 0.7
56
- DATASETS:
57
- TRAIN: ("coco_2017_train_stuff_sem_seg",)
58
- TEST: ("ade20k_sem_seg_val",)
59
- SOLVER:
60
- IMS_PER_BATCH: 32
61
- BASE_LR: 0.00006
62
- MAX_ITER: 120000
63
- WARMUP_FACTOR: 1e-6
64
- WARMUP_ITERS: 1500
65
- LR_SCHEDULER_NAME: "WarmupPolyLR"
66
- WEIGHT_DECAY: 0.01
67
- WEIGHT_DECAY_NORM: 0.0
68
- WEIGHT_DECAY_EMBED: 0.0
69
- BACKBONE_MULTIPLIER: 1.0
70
- TEST_IMS_PER_BATCH: 1
71
- CLIP_GRADIENTS:
72
- ENABLED: True
73
- CLIP_TYPE: "full_model"
74
- CLIP_VALUE: 0.01
75
- NORM_TYPE: 2.0
76
- INPUT:
77
- MIN_SIZE_TRAIN: !!python/object/apply:eval ["[int(x * 0.1 * 640) for x in range(5, 21)]"]
78
- MIN_SIZE_TRAIN_SAMPLING: "choice"
79
- MIN_SIZE_TEST: 640
80
- MAX_SIZE_TRAIN: 2560
81
- MAX_SIZE_TEST: 2560
82
- CROP:
83
- ENABLED: True
84
- TYPE: "absolute"
85
- SIZE: (640, 640)
86
- SINGLE_CATEGORY_MAX_AREA: 1.0
87
- COLOR_AUG_SSD: True
88
- SIZE_DIVISIBILITY: 640 # used in dataset mapper
89
- FORMAT: "RGB"
90
- TEST:
91
- EVAL_PERIOD: 5000
92
- AUG:
93
- ENABLED: False
94
- MIN_SIZES: [256, 384, 512, 640, 768, 896]
95
- MAX_SIZE: 3584
96
- FLIP: True
97
- DATALOADER:
98
- FILTER_EMPTY_ANNOTATIONS: True
99
- NUM_WORKERS: 4
100
- VERSION: 2
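This file is consumed through detectron2's `CfgNode` machinery, mirroring the `setup_cfg` pattern in `app.py`. A rough sketch of loading it and overriding individual keys — note that the OVSeg-specific groups (`MASK_FORMER`, `CLIP_ADAPTER`, ...) must first be registered on the config by the project's own config helpers, which this diff does not show:

```python
from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config

cfg = get_cfg()
add_deeplab_config(cfg)
# Project-specific keys need to be added to cfg before merge_from_file will accept them.
cfg.merge_from_file("configs/ovseg_swinB_vitL_bs32_120k.yaml")
# Command-line style overrides, e.g. a different ensemble weight for PASCAL evaluation.
cfg.merge_from_list(["MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT", "0.6"])
cfg.freeze()
```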
datasets/DATASETS.md DELETED
@@ -1,122 +0,0 @@
1
- ## Prepare Datasets for OVSeg
2
-
3
- This doc is a modification/extension of [MaskFormer](https://github.com/facebookresearch/MaskFormer/blob/main/datasets/README.md) following the [Detectron2 format](https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html).
4
-
5
- A dataset can be used by accessing [DatasetCatalog](https://detectron2.readthedocs.io/modules/data.html#detectron2.data.DatasetCatalog)
6
- for its data, or [MetadataCatalog](https://detectron2.readthedocs.io/modules/data.html#detectron2.data.MetadataCatalog) for its metadata (class names, etc).
7
- This document explains how to set up the builtin datasets so they can be used by the above APIs.
8
- [Use Custom Datasets](https://detectron2.readthedocs.io/tutorials/datasets.html) gives a deeper dive on how to use `DatasetCatalog` and `MetadataCatalog`,
9
- and how to add new datasets to them.
10
-
11
- OVSeg has builtin support for a few datasets.
12
- The datasets are assumed to exist in a directory specified by the environment variable
13
- `DETECTRON2_DATASETS`.
14
- Under this directory, detectron2 will look for datasets in the structure described below, if needed.
15
- ```
16
- $DETECTRON2_DATASETS/
17
- coco/ # COCOStuff-171
18
- ADEChallengeData2016/ # ADE20K-150
19
- ADE20K_2021_17_01/ # ADE20K-847
20
- VOCdevkit/
21
- VOC2012/ # PASCALVOC-20
22
- VOC2010/ # PASCALContext-59, PASCALContext-459
23
- ```
24
-
25
- You can set the location for builtin datasets by `export DETECTRON2_DATASETS=/path/to/datasets`.
26
- If left unset, the default is `./datasets` relative to your current working directory.
27
-
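A small check — a sketch that only assumes the layout listed above — can confirm the expected top-level directories exist before training or evaluation:

```python
import os

root = os.environ.get("DETECTRON2_DATASETS", "./datasets")
expected = [
    "coco",                  # COCOStuff-171
    "ADEChallengeData2016",  # ADE20K-150
    "ADE20K_2021_17_01",     # ADE20K-847
    "VOCdevkit/VOC2012",     # PASCALVOC-20
    "VOCdevkit/VOC2010",     # PASCALContext-59 / PASCALContext-459
]
for d in expected:
    path = os.path.join(root, d)
    print(("ok      " if os.path.isdir(path) else "MISSING ") + path)
```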
28
- Unless specified otherwise, our model is trained on COCOStuff-171 and evaluated on ADE20K-150, ADE20K-847, PASCALVOC-20, PASCALContext-59, and PASCALContext-459.
29
-
30
- | dataset | split | # images | # categories |
31
- |:--------------:|:---------:|:--------:|:------------:|
32
- | COCO Stuff | train2017 | 118K | 171 |
33
- | ADE20K | val | 2K | 150/847 |
34
- | Pascal VOC | val | 1.5K | 20 |
35
- | Pascal Context | val | 5K | 59/459 |
36
-
37
-
38
- ### Expected dataset structure for [COCO Stuff](https://github.com/nightrome/cocostuff):
39
- ```
40
- coco/
41
- train2017/ # http://images.cocodataset.org/zips/train2017.zip
42
- annotations/ # http://images.cocodataset.org/annotations/annotations_trainval2017.zip
43
- stuffthingmaps/
44
- stuffthingmaps_trainval2017.zip # http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip
45
- train2017/
46
- # below are generated
47
- stuffthingmaps_detectron2/
48
- train2017/
49
- ```
50
-
51
- The directory `stuffthingmaps_detectron2` is generated by running `python datasets/prepare_coco_stuff_sem_seg.py`.
52
-
53
-
54
-
55
- ### Expected dataset structure for [ADE20k Scene Parsing (ADE20K-150)](http://sceneparsing.csail.mit.edu/):
56
- ```
57
- ADEChallengeData2016/
58
- annotations/
59
- images/
60
- objectInfo150.txt
61
- # below are generated
62
- annotations_detectron2/
63
- ```
64
- The directory `annotations_detectron2` is generated by running `python datasets/prepare_ade20k_sem_seg.py`.
65
-
66
-
67
- ### Expected dataset structure for [ADE20k-Full (ADE20K-847)](https://github.com/CSAILVision/ADE20K#download):
68
- ```
69
- ADE20K_2021_17_01/
70
- images/
71
- index_ade20k.pkl
72
- objects.txt
73
- # below are generated
74
- images_detectron2/
75
- annotations_detectron2/
76
- ```
77
- The directories `images_detectron2` and `annotations_detectron2` are generated by running `python datasets/prepare_ade20k_full_sem_seg.py`.
78
-
79
- ### Expected dataset structure for [Pascal VOC 2012 (PASCALVOC-20)](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/#devkit):
80
- ```
81
- VOCdevkit/VOC2012/
82
- Annotations/
83
- ImageSets/
84
- JPEGImages/
85
- SegmentationClass/
86
- SegmentationObject/
87
- SegmentationClassAug/ # https://github.com/kazuto1011/deeplab-pytorch/blob/master/data/datasets/voc12/README.md
88
- # below are generated
89
- images_detectron2/
90
- annotations_detectron2/
91
- ```
92
-
93
- It starts with a tar file `VOCtrainval_11-May-2012.tar`.
94
-
95
- We use the SBD-augmented training data as `SegmentationClassAug`, following [Deeplab](https://github.com/kazuto1011/deeplab-pytorch/blob/master/data/datasets/voc12/README.md)
96
-
97
- The directories `images_detectron2` and `annotations_detectron2` are generated by running `python datasets/prepare_voc_sem_seg.py`.
98
-
99
-
100
- ### Expected dataset structure for [Pascal Context](https://www.cs.stanford.edu/~roozbeh/pascal-context/):
101
-
102
- ```
103
- VOCdevkit/VOC2010/
104
- Annotations/
105
- ImageSets/
106
- JPEGImages/
107
- SegmentationClass/
108
- SegmentationObject/
109
- # below are from https://www.cs.stanford.edu/~roozbeh/pascal-context/trainval.tar.gz
110
- trainval/
111
- labels.txt
112
- 59_labels.txt # https://www.cs.stanford.edu/~roozbeh/pascal-context/59_labels.txt
113
- pascalcontext_val.txt # https://drive.google.com/file/d/1BCbiOKtLvozjVnlTJX51koIveUZHCcUh/view?usp=sharing
114
- # below are generated
115
- annotations_detectron2/
116
- pc459_val
117
- pc59_val
118
- ```
119
- It starts with a tar file `VOCtrainval_03-May-2010.tar`. You may want to download the 5K validation set [here](https://drive.google.com/file/d/1BCbiOKtLvozjVnlTJX51koIveUZHCcUh/view?usp=sharing).
120
-
121
- The directory `annotations_detectron2` is generated by running `python datasets/prepare_pascal_context.py`.
122
-
datasets/prepare_ade20k_full_sem_seg.py DELETED
@@ -1,1011 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import os
5
- import pickle as pkl
6
- from pathlib import Path
7
-
8
- import cv2
9
- import numpy as np
10
- import tqdm
11
- from PIL import Image
12
-
13
- ADE20K_SEM_SEG_FULL_CATEGORIES = [
14
- {"name": "wall", "id": 2978, "trainId": 0},
15
- {"name": "building, edifice", "id": 312, "trainId": 1},
16
- {"name": "sky", "id": 2420, "trainId": 2},
17
- {"name": "tree", "id": 2855, "trainId": 3},
18
- {"name": "road, route", "id": 2131, "trainId": 4},
19
- {"name": "floor, flooring", "id": 976, "trainId": 5},
20
- {"name": "ceiling", "id": 447, "trainId": 6},
21
- {"name": "bed", "id": 165, "trainId": 7},
22
- {"name": "sidewalk, pavement", "id": 2377, "trainId": 8},
23
- {"name": "earth, ground", "id": 838, "trainId": 9},
24
- {"name": "cabinet", "id": 350, "trainId": 10},
25
- {"name": "person, individual, someone, somebody, mortal, soul", "id": 1831, "trainId": 11},
26
- {"name": "grass", "id": 1125, "trainId": 12},
27
- {"name": "windowpane, window", "id": 3055, "trainId": 13},
28
- {"name": "car, auto, automobile, machine, motorcar", "id": 401, "trainId": 14},
29
- {"name": "mountain, mount", "id": 1610, "trainId": 15},
30
- {"name": "plant, flora, plant life", "id": 1910, "trainId": 16},
31
- {"name": "table", "id": 2684, "trainId": 17},
32
- {"name": "chair", "id": 471, "trainId": 18},
33
- {"name": "curtain, drape, drapery, mantle, pall", "id": 687, "trainId": 19},
34
- {"name": "door", "id": 774, "trainId": 20},
35
- {"name": "sofa, couch, lounge", "id": 2473, "trainId": 21},
36
- {"name": "sea", "id": 2264, "trainId": 22},
37
- {"name": "painting, picture", "id": 1735, "trainId": 23},
38
- {"name": "water", "id": 2994, "trainId": 24},
39
- {"name": "mirror", "id": 1564, "trainId": 25},
40
- {"name": "house", "id": 1276, "trainId": 26},
41
- {"name": "rug, carpet, carpeting", "id": 2178, "trainId": 27},
42
- {"name": "shelf", "id": 2329, "trainId": 28},
43
- {"name": "armchair", "id": 57, "trainId": 29},
44
- {"name": "fence, fencing", "id": 907, "trainId": 30},
45
- {"name": "field", "id": 913, "trainId": 31},
46
- {"name": "lamp", "id": 1395, "trainId": 32},
47
- {"name": "rock, stone", "id": 2138, "trainId": 33},
48
- {"name": "seat", "id": 2272, "trainId": 34},
49
- {"name": "river", "id": 2128, "trainId": 35},
50
- {"name": "desk", "id": 724, "trainId": 36},
51
- {"name": "bathtub, bathing tub, bath, tub", "id": 155, "trainId": 37},
52
- {"name": "railing, rail", "id": 2053, "trainId": 38},
53
- {"name": "signboard, sign", "id": 2380, "trainId": 39},
54
- {"name": "cushion", "id": 689, "trainId": 40},
55
- {"name": "path", "id": 1788, "trainId": 41},
56
- {"name": "work surface", "id": 3087, "trainId": 42},
57
- {"name": "stairs, steps", "id": 2530, "trainId": 43},
58
- {"name": "column, pillar", "id": 581, "trainId": 44},
59
- {"name": "sink", "id": 2388, "trainId": 45},
60
- {"name": "wardrobe, closet, press", "id": 2985, "trainId": 46},
61
- {"name": "snow", "id": 2454, "trainId": 47},
62
- {"name": "refrigerator, icebox", "id": 2096, "trainId": 48},
63
- {"name": "base, pedestal, stand", "id": 137, "trainId": 49},
64
- {"name": "bridge, span", "id": 294, "trainId": 50},
65
- {"name": "blind, screen", "id": 212, "trainId": 51},
66
- {"name": "runway", "id": 2185, "trainId": 52},
67
- {"name": "cliff, drop, drop-off", "id": 524, "trainId": 53},
68
- {"name": "sand", "id": 2212, "trainId": 54},
69
- {"name": "fireplace, hearth, open fireplace", "id": 943, "trainId": 55},
70
- {"name": "pillow", "id": 1869, "trainId": 56},
71
- {"name": "screen door, screen", "id": 2251, "trainId": 57},
72
- {"name": "toilet, can, commode, crapper, pot, potty, stool, throne", "id": 2793, "trainId": 58},
73
- {"name": "skyscraper", "id": 2423, "trainId": 59},
74
- {"name": "grandstand, covered stand", "id": 1121, "trainId": 60},
75
- {"name": "box", "id": 266, "trainId": 61},
76
- {"name": "pool table, billiard table, snooker table", "id": 1948, "trainId": 62},
77
- {"name": "palm, palm tree", "id": 1744, "trainId": 63},
78
- {"name": "double door", "id": 783, "trainId": 64},
79
- {"name": "coffee table, cocktail table", "id": 571, "trainId": 65},
80
- {"name": "counter", "id": 627, "trainId": 66},
81
- {"name": "countertop", "id": 629, "trainId": 67},
82
- {"name": "chest of drawers, chest, bureau, dresser", "id": 491, "trainId": 68},
83
- {"name": "kitchen island", "id": 1374, "trainId": 69},
84
- {"name": "boat", "id": 223, "trainId": 70},
85
- {"name": "waterfall, falls", "id": 3016, "trainId": 71},
86
- {
87
- "name": "stove, kitchen stove, range, kitchen range, cooking stove",
88
- "id": 2598,
89
- "trainId": 72,
90
- },
91
- {"name": "flower", "id": 978, "trainId": 73},
92
- {"name": "bookcase", "id": 239, "trainId": 74},
93
- {"name": "controls", "id": 608, "trainId": 75},
94
- {"name": "book", "id": 236, "trainId": 76},
95
- {"name": "stairway, staircase", "id": 2531, "trainId": 77},
96
- {"name": "streetlight, street lamp", "id": 2616, "trainId": 78},
97
- {
98
- "name": "computer, computing machine, computing device, data processor, electronic computer, information processing system",
99
- "id": 591,
100
- "trainId": 79,
101
- },
102
- {
103
- "name": "bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle",
104
- "id": 327,
105
- "trainId": 80,
106
- },
107
- {"name": "swivel chair", "id": 2679, "trainId": 81},
108
- {"name": "light, light source", "id": 1451, "trainId": 82},
109
- {"name": "bench", "id": 181, "trainId": 83},
110
- {"name": "case, display case, showcase, vitrine", "id": 420, "trainId": 84},
111
- {"name": "towel", "id": 2821, "trainId": 85},
112
- {"name": "fountain", "id": 1023, "trainId": 86},
113
- {"name": "embankment", "id": 855, "trainId": 87},
114
- {
115
- "name": "television receiver, television, television set, tv, tv set, idiot box, boob tube, telly, goggle box",
116
- "id": 2733,
117
- "trainId": 88,
118
- },
119
- {"name": "van", "id": 2928, "trainId": 89},
120
- {"name": "hill", "id": 1240, "trainId": 90},
121
- {"name": "awning, sunshade, sunblind", "id": 77, "trainId": 91},
122
- {"name": "poster, posting, placard, notice, bill, card", "id": 1969, "trainId": 92},
123
- {"name": "truck, motortruck", "id": 2880, "trainId": 93},
124
- {"name": "airplane, aeroplane, plane", "id": 14, "trainId": 94},
125
- {"name": "pole", "id": 1936, "trainId": 95},
126
- {"name": "tower", "id": 2828, "trainId": 96},
127
- {"name": "court", "id": 631, "trainId": 97},
128
- {"name": "ball", "id": 103, "trainId": 98},
129
- {
130
- "name": "aircraft carrier, carrier, flattop, attack aircraft carrier",
131
- "id": 3144,
132
- "trainId": 99,
133
- },
134
- {"name": "buffet, counter, sideboard", "id": 308, "trainId": 100},
135
- {"name": "hovel, hut, hutch, shack, shanty", "id": 1282, "trainId": 101},
136
- {"name": "apparel, wearing apparel, dress, clothes", "id": 38, "trainId": 102},
137
- {"name": "minibike, motorbike", "id": 1563, "trainId": 103},
138
- {"name": "animal, animate being, beast, brute, creature, fauna", "id": 29, "trainId": 104},
139
- {"name": "chandelier, pendant, pendent", "id": 480, "trainId": 105},
140
- {"name": "step, stair", "id": 2569, "trainId": 106},
141
- {"name": "booth, cubicle, stall, kiosk", "id": 247, "trainId": 107},
142
- {"name": "bicycle, bike, wheel, cycle", "id": 187, "trainId": 108},
143
- {"name": "doorframe, doorcase", "id": 778, "trainId": 109},
144
- {"name": "sconce", "id": 2243, "trainId": 110},
145
- {"name": "pond", "id": 1941, "trainId": 111},
146
- {"name": "trade name, brand name, brand, marque", "id": 2833, "trainId": 112},
147
- {"name": "bannister, banister, balustrade, balusters, handrail", "id": 120, "trainId": 113},
148
- {"name": "bag", "id": 95, "trainId": 114},
149
- {"name": "traffic light, traffic signal, stoplight", "id": 2836, "trainId": 115},
150
- {"name": "gazebo", "id": 1087, "trainId": 116},
151
- {"name": "escalator, moving staircase, moving stairway", "id": 868, "trainId": 117},
152
- {"name": "land, ground, soil", "id": 1401, "trainId": 118},
153
- {"name": "board, plank", "id": 220, "trainId": 119},
154
- {"name": "arcade machine", "id": 47, "trainId": 120},
155
- {"name": "eiderdown, duvet, continental quilt", "id": 843, "trainId": 121},
156
- {"name": "bar", "id": 123, "trainId": 122},
157
- {"name": "stall, stand, sales booth", "id": 2537, "trainId": 123},
158
- {"name": "playground", "id": 1927, "trainId": 124},
159
- {"name": "ship", "id": 2337, "trainId": 125},
160
- {"name": "ottoman, pouf, pouffe, puff, hassock", "id": 1702, "trainId": 126},
161
- {
162
- "name": "ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin",
163
- "id": 64,
164
- "trainId": 127,
165
- },
166
- {"name": "bottle", "id": 249, "trainId": 128},
167
- {"name": "cradle", "id": 642, "trainId": 129},
168
- {"name": "pot, flowerpot", "id": 1981, "trainId": 130},
169
- {
170
- "name": "conveyer belt, conveyor belt, conveyer, conveyor, transporter",
171
- "id": 609,
172
- "trainId": 131,
173
- },
174
- {"name": "train, railroad train", "id": 2840, "trainId": 132},
175
- {"name": "stool", "id": 2586, "trainId": 133},
176
- {"name": "lake", "id": 1393, "trainId": 134},
177
- {"name": "tank, storage tank", "id": 2704, "trainId": 135},
178
- {"name": "ice, water ice", "id": 1304, "trainId": 136},
179
- {"name": "basket, handbasket", "id": 146, "trainId": 137},
180
- {"name": "manhole", "id": 1494, "trainId": 138},
181
- {"name": "tent, collapsible shelter", "id": 2739, "trainId": 139},
182
- {"name": "canopy", "id": 389, "trainId": 140},
183
- {"name": "microwave, microwave oven", "id": 1551, "trainId": 141},
184
- {"name": "barrel, cask", "id": 131, "trainId": 142},
185
- {"name": "dirt track", "id": 738, "trainId": 143},
186
- {"name": "beam", "id": 161, "trainId": 144},
187
- {"name": "dishwasher, dish washer, dishwashing machine", "id": 747, "trainId": 145},
188
- {"name": "plate", "id": 1919, "trainId": 146},
189
- {"name": "screen, crt screen", "id": 3109, "trainId": 147},
190
- {"name": "ruins", "id": 2179, "trainId": 148},
191
- {"name": "washer, automatic washer, washing machine", "id": 2989, "trainId": 149},
192
- {"name": "blanket, cover", "id": 206, "trainId": 150},
193
- {"name": "plaything, toy", "id": 1930, "trainId": 151},
194
- {"name": "food, solid food", "id": 1002, "trainId": 152},
195
- {"name": "screen, silver screen, projection screen", "id": 2254, "trainId": 153},
196
- {"name": "oven", "id": 1708, "trainId": 154},
197
- {"name": "stage", "id": 2526, "trainId": 155},
198
- {"name": "beacon, lighthouse, beacon light, pharos", "id": 160, "trainId": 156},
199
- {"name": "umbrella", "id": 2901, "trainId": 157},
200
- {"name": "sculpture", "id": 2262, "trainId": 158},
201
- {"name": "aqueduct", "id": 44, "trainId": 159},
202
- {"name": "container", "id": 597, "trainId": 160},
203
- {"name": "scaffolding, staging", "id": 2235, "trainId": 161},
204
- {"name": "hood, exhaust hood", "id": 1260, "trainId": 162},
205
- {"name": "curb, curbing, kerb", "id": 682, "trainId": 163},
206
- {"name": "roller coaster", "id": 2151, "trainId": 164},
207
- {"name": "horse, equus caballus", "id": 3107, "trainId": 165},
208
- {"name": "catwalk", "id": 432, "trainId": 166},
209
- {"name": "glass, drinking glass", "id": 1098, "trainId": 167},
210
- {"name": "vase", "id": 2932, "trainId": 168},
211
- {"name": "central reservation", "id": 461, "trainId": 169},
212
- {"name": "carousel", "id": 410, "trainId": 170},
213
- {"name": "radiator", "id": 2046, "trainId": 171},
214
- {"name": "closet", "id": 533, "trainId": 172},
215
- {"name": "machine", "id": 1481, "trainId": 173},
216
- {"name": "pier, wharf, wharfage, dock", "id": 1858, "trainId": 174},
217
- {"name": "fan", "id": 894, "trainId": 175},
218
- {"name": "inflatable bounce game", "id": 1322, "trainId": 176},
219
- {"name": "pitch", "id": 1891, "trainId": 177},
220
- {"name": "paper", "id": 1756, "trainId": 178},
221
- {"name": "arcade, colonnade", "id": 49, "trainId": 179},
222
- {"name": "hot tub", "id": 1272, "trainId": 180},
223
- {"name": "helicopter", "id": 1229, "trainId": 181},
224
- {"name": "tray", "id": 2850, "trainId": 182},
225
- {"name": "partition, divider", "id": 1784, "trainId": 183},
226
- {"name": "vineyard", "id": 2962, "trainId": 184},
227
- {"name": "bowl", "id": 259, "trainId": 185},
228
- {"name": "bullring", "id": 319, "trainId": 186},
229
- {"name": "flag", "id": 954, "trainId": 187},
230
- {"name": "pot", "id": 1974, "trainId": 188},
231
- {"name": "footbridge, overcrossing, pedestrian bridge", "id": 1013, "trainId": 189},
232
- {"name": "shower", "id": 2356, "trainId": 190},
233
- {"name": "bag, traveling bag, travelling bag, grip, suitcase", "id": 97, "trainId": 191},
234
- {"name": "bulletin board, notice board", "id": 318, "trainId": 192},
235
- {"name": "confessional booth", "id": 592, "trainId": 193},
236
- {"name": "trunk, tree trunk, bole", "id": 2885, "trainId": 194},
237
- {"name": "forest", "id": 1017, "trainId": 195},
238
- {"name": "elevator door", "id": 851, "trainId": 196},
239
- {"name": "laptop, laptop computer", "id": 1407, "trainId": 197},
240
- {"name": "instrument panel", "id": 1332, "trainId": 198},
241
- {"name": "bucket, pail", "id": 303, "trainId": 199},
242
- {"name": "tapestry, tapis", "id": 2714, "trainId": 200},
243
- {"name": "platform", "id": 1924, "trainId": 201},
244
- {"name": "jacket", "id": 1346, "trainId": 202},
245
- {"name": "gate", "id": 1081, "trainId": 203},
246
- {"name": "monitor, monitoring device", "id": 1583, "trainId": 204},
247
- {
248
- "name": "telephone booth, phone booth, call box, telephone box, telephone kiosk",
249
- "id": 2727,
250
- "trainId": 205,
251
- },
252
- {"name": "spotlight, spot", "id": 2509, "trainId": 206},
253
- {"name": "ring", "id": 2123, "trainId": 207},
254
- {"name": "control panel", "id": 602, "trainId": 208},
255
- {"name": "blackboard, chalkboard", "id": 202, "trainId": 209},
256
- {"name": "air conditioner, air conditioning", "id": 10, "trainId": 210},
257
- {"name": "chest", "id": 490, "trainId": 211},
258
- {"name": "clock", "id": 530, "trainId": 212},
259
- {"name": "sand dune", "id": 2213, "trainId": 213},
260
- {"name": "pipe, pipage, piping", "id": 1884, "trainId": 214},
261
- {"name": "vault", "id": 2934, "trainId": 215},
262
- {"name": "table football", "id": 2687, "trainId": 216},
263
- {"name": "cannon", "id": 387, "trainId": 217},
264
- {"name": "swimming pool, swimming bath, natatorium", "id": 2668, "trainId": 218},
265
- {"name": "fluorescent, fluorescent fixture", "id": 982, "trainId": 219},
266
- {"name": "statue", "id": 2547, "trainId": 220},
267
- {
268
- "name": "loudspeaker, speaker, speaker unit, loudspeaker system, speaker system",
269
- "id": 1474,
270
- "trainId": 221,
271
- },
272
- {"name": "exhibitor", "id": 877, "trainId": 222},
273
- {"name": "ladder", "id": 1391, "trainId": 223},
274
- {"name": "carport", "id": 414, "trainId": 224},
275
- {"name": "dam", "id": 698, "trainId": 225},
276
- {"name": "pulpit", "id": 2019, "trainId": 226},
277
- {"name": "skylight, fanlight", "id": 2422, "trainId": 227},
278
- {"name": "water tower", "id": 3010, "trainId": 228},
279
- {"name": "grill, grille, grillwork", "id": 1139, "trainId": 229},
280
- {"name": "display board", "id": 753, "trainId": 230},
281
- {"name": "pane, pane of glass, window glass", "id": 1747, "trainId": 231},
282
- {"name": "rubbish, trash, scrap", "id": 2175, "trainId": 232},
283
- {"name": "ice rink", "id": 1301, "trainId": 233},
284
- {"name": "fruit", "id": 1033, "trainId": 234},
285
- {"name": "patio", "id": 1789, "trainId": 235},
286
- {"name": "vending machine", "id": 2939, "trainId": 236},
287
- {"name": "telephone, phone, telephone set", "id": 2730, "trainId": 237},
288
- {"name": "net", "id": 1652, "trainId": 238},
289
- {
290
- "name": "backpack, back pack, knapsack, packsack, rucksack, haversack",
291
- "id": 90,
292
- "trainId": 239,
293
- },
294
- {"name": "jar", "id": 1349, "trainId": 240},
295
- {"name": "track", "id": 2830, "trainId": 241},
296
- {"name": "magazine", "id": 1485, "trainId": 242},
297
- {"name": "shutter", "id": 2370, "trainId": 243},
298
- {"name": "roof", "id": 2155, "trainId": 244},
299
- {"name": "banner, streamer", "id": 118, "trainId": 245},
300
- {"name": "landfill", "id": 1402, "trainId": 246},
301
- {"name": "post", "id": 1957, "trainId": 247},
302
- {"name": "altarpiece, reredos", "id": 3130, "trainId": 248},
303
- {"name": "hat, chapeau, lid", "id": 1197, "trainId": 249},
304
- {"name": "arch, archway", "id": 52, "trainId": 250},
305
- {"name": "table game", "id": 2688, "trainId": 251},
306
- {"name": "bag, handbag, pocketbook, purse", "id": 96, "trainId": 252},
307
- {"name": "document, written document, papers", "id": 762, "trainId": 253},
308
- {"name": "dome", "id": 772, "trainId": 254},
309
- {"name": "pier", "id": 1857, "trainId": 255},
310
- {"name": "shanties", "id": 2315, "trainId": 256},
311
- {"name": "forecourt", "id": 1016, "trainId": 257},
312
- {"name": "crane", "id": 643, "trainId": 258},
313
- {"name": "dog, domestic dog, canis familiaris", "id": 3105, "trainId": 259},
314
- {"name": "piano, pianoforte, forte-piano", "id": 1849, "trainId": 260},
315
- {"name": "drawing", "id": 791, "trainId": 261},
316
- {"name": "cabin", "id": 349, "trainId": 262},
317
- {
318
- "name": "ad, advertisement, advertizement, advertising, advertizing, advert",
319
- "id": 6,
320
- "trainId": 263,
321
- },
322
- {"name": "amphitheater, amphitheatre, coliseum", "id": 3114, "trainId": 264},
323
- {"name": "monument", "id": 1587, "trainId": 265},
324
- {"name": "henhouse", "id": 1233, "trainId": 266},
325
- {"name": "cockpit", "id": 559, "trainId": 267},
326
- {"name": "heater, warmer", "id": 1223, "trainId": 268},
327
- {"name": "windmill, aerogenerator, wind generator", "id": 3049, "trainId": 269},
328
- {"name": "pool", "id": 1943, "trainId": 270},
329
- {"name": "elevator, lift", "id": 853, "trainId": 271},
330
- {"name": "decoration, ornament, ornamentation", "id": 709, "trainId": 272},
331
- {"name": "labyrinth", "id": 1390, "trainId": 273},
332
- {"name": "text, textual matter", "id": 2748, "trainId": 274},
333
- {"name": "printer", "id": 2007, "trainId": 275},
334
- {"name": "mezzanine, first balcony", "id": 1546, "trainId": 276},
335
- {"name": "mattress", "id": 1513, "trainId": 277},
336
- {"name": "straw", "id": 2600, "trainId": 278},
337
- {"name": "stalls", "id": 2538, "trainId": 279},
338
- {"name": "patio, terrace", "id": 1790, "trainId": 280},
339
- {"name": "billboard, hoarding", "id": 194, "trainId": 281},
340
- {"name": "bus stop", "id": 326, "trainId": 282},
341
- {"name": "trouser, pant", "id": 2877, "trainId": 283},
342
- {"name": "console table, console", "id": 594, "trainId": 284},
343
- {"name": "rack", "id": 2036, "trainId": 285},
344
- {"name": "notebook", "id": 1662, "trainId": 286},
345
- {"name": "shrine", "id": 2366, "trainId": 287},
346
- {"name": "pantry", "id": 1754, "trainId": 288},
347
- {"name": "cart", "id": 418, "trainId": 289},
348
- {"name": "steam shovel", "id": 2553, "trainId": 290},
349
- {"name": "porch", "id": 1951, "trainId": 291},
350
- {"name": "postbox, mailbox, letter box", "id": 1963, "trainId": 292},
351
- {"name": "figurine, statuette", "id": 918, "trainId": 293},
352
- {"name": "recycling bin", "id": 2086, "trainId": 294},
353
- {"name": "folding screen", "id": 997, "trainId": 295},
354
- {"name": "telescope", "id": 2731, "trainId": 296},
355
- {"name": "deck chair, beach chair", "id": 704, "trainId": 297},
356
- {"name": "kennel", "id": 1365, "trainId": 298},
357
- {"name": "coffee maker", "id": 569, "trainId": 299},
358
- {"name": "altar, communion table, lord's table", "id": 3108, "trainId": 300},
359
- {"name": "fish", "id": 948, "trainId": 301},
360
- {"name": "easel", "id": 839, "trainId": 302},
361
- {"name": "artificial golf green", "id": 63, "trainId": 303},
362
- {"name": "iceberg", "id": 1305, "trainId": 304},
363
- {"name": "candlestick, candle holder", "id": 378, "trainId": 305},
364
- {"name": "shower stall, shower bath", "id": 2362, "trainId": 306},
365
- {"name": "television stand", "id": 2734, "trainId": 307},
366
- {
367
- "name": "wall socket, wall plug, electric outlet, electrical outlet, outlet, electric receptacle",
368
- "id": 2982,
369
- "trainId": 308,
370
- },
371
- {"name": "skeleton", "id": 2398, "trainId": 309},
372
- {"name": "grand piano, grand", "id": 1119, "trainId": 310},
373
- {"name": "candy, confect", "id": 382, "trainId": 311},
374
- {"name": "grille door", "id": 1141, "trainId": 312},
375
- {"name": "pedestal, plinth, footstall", "id": 1805, "trainId": 313},
376
- {"name": "jersey, t-shirt, tee shirt", "id": 3102, "trainId": 314},
377
- {"name": "shoe", "id": 2341, "trainId": 315},
378
- {"name": "gravestone, headstone, tombstone", "id": 1131, "trainId": 316},
379
- {"name": "shanty", "id": 2316, "trainId": 317},
380
- {"name": "structure", "id": 2626, "trainId": 318},
381
- {"name": "rocking chair, rocker", "id": 3104, "trainId": 319},
382
- {"name": "bird", "id": 198, "trainId": 320},
383
- {"name": "place mat", "id": 1896, "trainId": 321},
384
- {"name": "tomb", "id": 2800, "trainId": 322},
385
- {"name": "big top", "id": 190, "trainId": 323},
386
- {"name": "gas pump, gasoline pump, petrol pump, island dispenser", "id": 3131, "trainId": 324},
387
- {"name": "lockers", "id": 1463, "trainId": 325},
388
- {"name": "cage", "id": 357, "trainId": 326},
389
- {"name": "finger", "id": 929, "trainId": 327},
390
- {"name": "bleachers", "id": 209, "trainId": 328},
391
- {"name": "ferris wheel", "id": 912, "trainId": 329},
392
- {"name": "hairdresser chair", "id": 1164, "trainId": 330},
393
- {"name": "mat", "id": 1509, "trainId": 331},
394
- {"name": "stands", "id": 2539, "trainId": 332},
395
- {"name": "aquarium, fish tank, marine museum", "id": 3116, "trainId": 333},
396
- {"name": "streetcar, tram, tramcar, trolley, trolley car", "id": 2615, "trainId": 334},
397
- {"name": "napkin, table napkin, serviette", "id": 1644, "trainId": 335},
398
- {"name": "dummy", "id": 818, "trainId": 336},
399
- {"name": "booklet, brochure, folder, leaflet, pamphlet", "id": 242, "trainId": 337},
400
- {"name": "sand trap", "id": 2217, "trainId": 338},
401
- {"name": "shop, store", "id": 2347, "trainId": 339},
402
- {"name": "table cloth", "id": 2686, "trainId": 340},
403
- {"name": "service station", "id": 2300, "trainId": 341},
404
- {"name": "coffin", "id": 572, "trainId": 342},
405
- {"name": "drawer", "id": 789, "trainId": 343},
406
- {"name": "cages", "id": 358, "trainId": 344},
407
- {"name": "slot machine, coin machine", "id": 2443, "trainId": 345},
408
- {"name": "balcony", "id": 101, "trainId": 346},
409
- {"name": "volleyball court", "id": 2969, "trainId": 347},
410
- {"name": "table tennis", "id": 2692, "trainId": 348},
411
- {"name": "control table", "id": 606, "trainId": 349},
412
- {"name": "shirt", "id": 2339, "trainId": 350},
413
- {"name": "merchandise, ware, product", "id": 1533, "trainId": 351},
414
- {"name": "railway", "id": 2060, "trainId": 352},
415
- {"name": "parterre", "id": 1782, "trainId": 353},
416
- {"name": "chimney", "id": 495, "trainId": 354},
417
- {"name": "can, tin, tin can", "id": 371, "trainId": 355},
418
- {"name": "tanks", "id": 2707, "trainId": 356},
419
- {"name": "fabric, cloth, material, textile", "id": 889, "trainId": 357},
420
- {"name": "alga, algae", "id": 3156, "trainId": 358},
421
- {"name": "system", "id": 2683, "trainId": 359},
422
- {"name": "map", "id": 1499, "trainId": 360},
423
- {"name": "greenhouse", "id": 1135, "trainId": 361},
424
- {"name": "mug", "id": 1619, "trainId": 362},
425
- {"name": "barbecue", "id": 125, "trainId": 363},
426
- {"name": "trailer", "id": 2838, "trainId": 364},
427
- {"name": "toilet tissue, toilet paper, bathroom tissue", "id": 2792, "trainId": 365},
428
- {"name": "organ", "id": 1695, "trainId": 366},
429
- {"name": "dishrag, dishcloth", "id": 746, "trainId": 367},
430
- {"name": "island", "id": 1343, "trainId": 368},
431
- {"name": "keyboard", "id": 1370, "trainId": 369},
432
- {"name": "trench", "id": 2858, "trainId": 370},
433
- {"name": "basket, basketball hoop, hoop", "id": 145, "trainId": 371},
434
- {"name": "steering wheel, wheel", "id": 2565, "trainId": 372},
435
- {"name": "pitcher, ewer", "id": 1892, "trainId": 373},
436
- {"name": "goal", "id": 1103, "trainId": 374},
437
- {"name": "bread, breadstuff, staff of life", "id": 286, "trainId": 375},
438
- {"name": "beds", "id": 170, "trainId": 376},
439
- {"name": "wood", "id": 3073, "trainId": 377},
440
- {"name": "file cabinet", "id": 922, "trainId": 378},
441
- {"name": "newspaper, paper", "id": 1655, "trainId": 379},
442
- {"name": "motorboat", "id": 1602, "trainId": 380},
443
- {"name": "rope", "id": 2160, "trainId": 381},
444
- {"name": "guitar", "id": 1151, "trainId": 382},
445
- {"name": "rubble", "id": 2176, "trainId": 383},
446
- {"name": "scarf", "id": 2239, "trainId": 384},
447
- {"name": "barrels", "id": 132, "trainId": 385},
448
- {"name": "cap", "id": 394, "trainId": 386},
449
- {"name": "leaves", "id": 1424, "trainId": 387},
450
- {"name": "control tower", "id": 607, "trainId": 388},
451
- {"name": "dashboard", "id": 700, "trainId": 389},
452
- {"name": "bandstand", "id": 116, "trainId": 390},
453
- {"name": "lectern", "id": 1425, "trainId": 391},
454
- {"name": "switch, electric switch, electrical switch", "id": 2676, "trainId": 392},
455
- {"name": "baseboard, mopboard, skirting board", "id": 141, "trainId": 393},
456
- {"name": "shower room", "id": 2360, "trainId": 394},
457
- {"name": "smoke", "id": 2449, "trainId": 395},
458
- {"name": "faucet, spigot", "id": 897, "trainId": 396},
459
- {"name": "bulldozer", "id": 317, "trainId": 397},
460
- {"name": "saucepan", "id": 2228, "trainId": 398},
461
- {"name": "shops", "id": 2351, "trainId": 399},
462
- {"name": "meter", "id": 1543, "trainId": 400},
463
- {"name": "crevasse", "id": 656, "trainId": 401},
464
- {"name": "gear", "id": 1088, "trainId": 402},
465
- {"name": "candelabrum, candelabra", "id": 373, "trainId": 403},
466
- {"name": "sofa bed", "id": 2472, "trainId": 404},
467
- {"name": "tunnel", "id": 2892, "trainId": 405},
468
- {"name": "pallet", "id": 1740, "trainId": 406},
469
- {"name": "wire, conducting wire", "id": 3067, "trainId": 407},
470
- {"name": "kettle, boiler", "id": 1367, "trainId": 408},
471
- {"name": "bidet", "id": 188, "trainId": 409},
472
- {
473
- "name": "baby buggy, baby carriage, carriage, perambulator, pram, stroller, go-cart, pushchair, pusher",
474
- "id": 79,
475
- "trainId": 410,
476
- },
477
- {"name": "music stand", "id": 1633, "trainId": 411},
478
- {"name": "pipe, tube", "id": 1885, "trainId": 412},
479
- {"name": "cup", "id": 677, "trainId": 413},
480
- {"name": "parking meter", "id": 1779, "trainId": 414},
481
- {"name": "ice hockey rink", "id": 1297, "trainId": 415},
482
- {"name": "shelter", "id": 2334, "trainId": 416},
483
- {"name": "weeds", "id": 3027, "trainId": 417},
484
- {"name": "temple", "id": 2735, "trainId": 418},
485
- {"name": "patty, cake", "id": 1791, "trainId": 419},
486
- {"name": "ski slope", "id": 2405, "trainId": 420},
487
- {"name": "panel", "id": 1748, "trainId": 421},
488
- {"name": "wallet", "id": 2983, "trainId": 422},
489
- {"name": "wheel", "id": 3035, "trainId": 423},
490
- {"name": "towel rack, towel horse", "id": 2824, "trainId": 424},
491
- {"name": "roundabout", "id": 2168, "trainId": 425},
492
- {"name": "canister, cannister, tin", "id": 385, "trainId": 426},
493
- {"name": "rod", "id": 2148, "trainId": 427},
494
- {"name": "soap dispenser", "id": 2465, "trainId": 428},
495
- {"name": "bell", "id": 175, "trainId": 429},
496
- {"name": "canvas", "id": 390, "trainId": 430},
497
- {"name": "box office, ticket office, ticket booth", "id": 268, "trainId": 431},
498
- {"name": "teacup", "id": 2722, "trainId": 432},
499
- {"name": "trellis", "id": 2857, "trainId": 433},
500
- {"name": "workbench", "id": 3088, "trainId": 434},
501
- {"name": "valley, vale", "id": 2926, "trainId": 435},
502
- {"name": "toaster", "id": 2782, "trainId": 436},
503
- {"name": "knife", "id": 1378, "trainId": 437},
504
- {"name": "podium", "id": 1934, "trainId": 438},
505
- {"name": "ramp", "id": 2072, "trainId": 439},
506
- {"name": "tumble dryer", "id": 2889, "trainId": 440},
507
- {"name": "fireplug, fire hydrant, plug", "id": 944, "trainId": 441},
508
- {"name": "gym shoe, sneaker, tennis shoe", "id": 1158, "trainId": 442},
509
- {"name": "lab bench", "id": 1383, "trainId": 443},
510
- {"name": "equipment", "id": 867, "trainId": 444},
511
- {"name": "rocky formation", "id": 2145, "trainId": 445},
512
- {"name": "plastic", "id": 1915, "trainId": 446},
513
- {"name": "calendar", "id": 361, "trainId": 447},
514
- {"name": "caravan", "id": 402, "trainId": 448},
515
- {"name": "check-in-desk", "id": 482, "trainId": 449},
516
- {"name": "ticket counter", "id": 2761, "trainId": 450},
517
- {"name": "brush", "id": 300, "trainId": 451},
518
- {"name": "mill", "id": 1554, "trainId": 452},
519
- {"name": "covered bridge", "id": 636, "trainId": 453},
520
- {"name": "bowling alley", "id": 260, "trainId": 454},
521
- {"name": "hanger", "id": 1186, "trainId": 455},
522
- {"name": "excavator", "id": 871, "trainId": 456},
523
- {"name": "trestle", "id": 2859, "trainId": 457},
524
- {"name": "revolving door", "id": 2103, "trainId": 458},
525
- {"name": "blast furnace", "id": 208, "trainId": 459},
526
- {"name": "scale, weighing machine", "id": 2236, "trainId": 460},
527
- {"name": "projector", "id": 2012, "trainId": 461},
528
- {"name": "soap", "id": 2462, "trainId": 462},
529
- {"name": "locker", "id": 1462, "trainId": 463},
530
- {"name": "tractor", "id": 2832, "trainId": 464},
531
- {"name": "stretcher", "id": 2617, "trainId": 465},
532
- {"name": "frame", "id": 1024, "trainId": 466},
533
- {"name": "grating", "id": 1129, "trainId": 467},
534
- {"name": "alembic", "id": 18, "trainId": 468},
535
- {"name": "candle, taper, wax light", "id": 376, "trainId": 469},
536
- {"name": "barrier", "id": 134, "trainId": 470},
537
- {"name": "cardboard", "id": 407, "trainId": 471},
538
- {"name": "cave", "id": 434, "trainId": 472},
539
- {"name": "puddle", "id": 2017, "trainId": 473},
540
- {"name": "tarp", "id": 2717, "trainId": 474},
541
- {"name": "price tag", "id": 2005, "trainId": 475},
542
- {"name": "watchtower", "id": 2993, "trainId": 476},
543
- {"name": "meters", "id": 1545, "trainId": 477},
544
- {
545
- "name": "light bulb, lightbulb, bulb, incandescent lamp, electric light, electric-light bulb",
546
- "id": 1445,
547
- "trainId": 478,
548
- },
549
- {"name": "tracks", "id": 2831, "trainId": 479},
550
- {"name": "hair dryer", "id": 1161, "trainId": 480},
551
- {"name": "skirt", "id": 2411, "trainId": 481},
552
- {"name": "viaduct", "id": 2949, "trainId": 482},
553
- {"name": "paper towel", "id": 1769, "trainId": 483},
554
- {"name": "coat", "id": 552, "trainId": 484},
555
- {"name": "sheet", "id": 2327, "trainId": 485},
556
- {"name": "fire extinguisher, extinguisher, asphyxiator", "id": 939, "trainId": 486},
557
- {"name": "water wheel", "id": 3013, "trainId": 487},
558
- {"name": "pottery, clayware", "id": 1986, "trainId": 488},
559
- {"name": "magazine rack", "id": 1486, "trainId": 489},
560
- {"name": "teapot", "id": 2723, "trainId": 490},
561
- {"name": "microphone, mike", "id": 1549, "trainId": 491},
562
- {"name": "support", "id": 2649, "trainId": 492},
563
- {"name": "forklift", "id": 1020, "trainId": 493},
564
- {"name": "canyon", "id": 392, "trainId": 494},
565
- {"name": "cash register, register", "id": 422, "trainId": 495},
566
- {"name": "leaf, leafage, foliage", "id": 1419, "trainId": 496},
567
- {"name": "remote control, remote", "id": 2099, "trainId": 497},
568
- {"name": "soap dish", "id": 2464, "trainId": 498},
569
- {"name": "windshield, windscreen", "id": 3058, "trainId": 499},
570
- {"name": "cat", "id": 430, "trainId": 500},
571
- {"name": "cue, cue stick, pool cue, pool stick", "id": 675, "trainId": 501},
572
- {"name": "vent, venthole, vent-hole, blowhole", "id": 2941, "trainId": 502},
573
- {"name": "videos", "id": 2955, "trainId": 503},
574
- {"name": "shovel", "id": 2355, "trainId": 504},
575
- {"name": "eaves", "id": 840, "trainId": 505},
576
- {"name": "antenna, aerial, transmitting aerial", "id": 32, "trainId": 506},
577
- {"name": "shipyard", "id": 2338, "trainId": 507},
578
- {"name": "hen, biddy", "id": 1232, "trainId": 508},
579
- {"name": "traffic cone", "id": 2834, "trainId": 509},
580
- {"name": "washing machines", "id": 2991, "trainId": 510},
581
- {"name": "truck crane", "id": 2879, "trainId": 511},
582
- {"name": "cds", "id": 444, "trainId": 512},
583
- {"name": "niche", "id": 1657, "trainId": 513},
584
- {"name": "scoreboard", "id": 2246, "trainId": 514},
585
- {"name": "briefcase", "id": 296, "trainId": 515},
586
- {"name": "boot", "id": 245, "trainId": 516},
587
- {"name": "sweater, jumper", "id": 2661, "trainId": 517},
588
- {"name": "hay", "id": 1202, "trainId": 518},
589
- {"name": "pack", "id": 1714, "trainId": 519},
590
- {"name": "bottle rack", "id": 251, "trainId": 520},
591
- {"name": "glacier", "id": 1095, "trainId": 521},
592
- {"name": "pergola", "id": 1828, "trainId": 522},
593
- {"name": "building materials", "id": 311, "trainId": 523},
594
- {"name": "television camera", "id": 2732, "trainId": 524},
595
- {"name": "first floor", "id": 947, "trainId": 525},
596
- {"name": "rifle", "id": 2115, "trainId": 526},
597
- {"name": "tennis table", "id": 2738, "trainId": 527},
598
- {"name": "stadium", "id": 2525, "trainId": 528},
599
- {"name": "safety belt", "id": 2194, "trainId": 529},
600
- {"name": "cover", "id": 634, "trainId": 530},
601
- {"name": "dish rack", "id": 740, "trainId": 531},
602
- {"name": "synthesizer", "id": 2682, "trainId": 532},
603
- {"name": "pumpkin", "id": 2020, "trainId": 533},
604
- {"name": "gutter", "id": 1156, "trainId": 534},
605
- {"name": "fruit stand", "id": 1036, "trainId": 535},
606
- {"name": "ice floe, floe", "id": 1295, "trainId": 536},
607
- {"name": "handle, grip, handgrip, hold", "id": 1181, "trainId": 537},
608
- {"name": "wheelchair", "id": 3037, "trainId": 538},
609
- {"name": "mousepad, mouse mat", "id": 1614, "trainId": 539},
610
- {"name": "diploma", "id": 736, "trainId": 540},
611
- {"name": "fairground ride", "id": 893, "trainId": 541},
612
- {"name": "radio", "id": 2047, "trainId": 542},
613
- {"name": "hotplate", "id": 1274, "trainId": 543},
614
- {"name": "junk", "id": 1361, "trainId": 544},
615
- {"name": "wheelbarrow", "id": 3036, "trainId": 545},
616
- {"name": "stream", "id": 2606, "trainId": 546},
617
- {"name": "toll plaza", "id": 2797, "trainId": 547},
618
- {"name": "punching bag", "id": 2022, "trainId": 548},
619
- {"name": "trough", "id": 2876, "trainId": 549},
620
- {"name": "throne", "id": 2758, "trainId": 550},
621
- {"name": "chair desk", "id": 472, "trainId": 551},
622
- {"name": "weighbridge", "id": 3028, "trainId": 552},
623
- {"name": "extractor fan", "id": 882, "trainId": 553},
624
- {"name": "hanging clothes", "id": 1189, "trainId": 554},
625
- {"name": "dish, dish aerial, dish antenna, saucer", "id": 743, "trainId": 555},
626
- {"name": "alarm clock, alarm", "id": 3122, "trainId": 556},
627
- {"name": "ski lift", "id": 2401, "trainId": 557},
628
- {"name": "chain", "id": 468, "trainId": 558},
629
- {"name": "garage", "id": 1061, "trainId": 559},
630
- {"name": "mechanical shovel", "id": 1523, "trainId": 560},
631
- {"name": "wine rack", "id": 3059, "trainId": 561},
632
- {"name": "tramway", "id": 2843, "trainId": 562},
633
- {"name": "treadmill", "id": 2853, "trainId": 563},
634
- {"name": "menu", "id": 1529, "trainId": 564},
635
- {"name": "block", "id": 214, "trainId": 565},
636
- {"name": "well", "id": 3032, "trainId": 566},
637
- {"name": "witness stand", "id": 3071, "trainId": 567},
638
- {"name": "branch", "id": 277, "trainId": 568},
639
- {"name": "duck", "id": 813, "trainId": 569},
640
- {"name": "casserole", "id": 426, "trainId": 570},
641
- {"name": "frying pan", "id": 1039, "trainId": 571},
642
- {"name": "desk organizer", "id": 727, "trainId": 572},
643
- {"name": "mast", "id": 1508, "trainId": 573},
644
- {"name": "spectacles, specs, eyeglasses, glasses", "id": 2490, "trainId": 574},
645
- {"name": "service elevator", "id": 2299, "trainId": 575},
646
- {"name": "dollhouse", "id": 768, "trainId": 576},
647
- {"name": "hammock", "id": 1172, "trainId": 577},
648
- {"name": "clothes hanging", "id": 537, "trainId": 578},
649
- {"name": "photocopier", "id": 1847, "trainId": 579},
650
- {"name": "notepad", "id": 1664, "trainId": 580},
651
- {"name": "golf cart", "id": 1110, "trainId": 581},
652
- {"name": "footpath", "id": 1014, "trainId": 582},
653
- {"name": "cross", "id": 662, "trainId": 583},
654
- {"name": "baptismal font", "id": 121, "trainId": 584},
655
- {"name": "boiler", "id": 227, "trainId": 585},
656
- {"name": "skip", "id": 2410, "trainId": 586},
657
- {"name": "rotisserie", "id": 2165, "trainId": 587},
658
- {"name": "tables", "id": 2696, "trainId": 588},
659
- {"name": "water mill", "id": 3005, "trainId": 589},
660
- {"name": "helmet", "id": 1231, "trainId": 590},
661
- {"name": "cover curtain", "id": 635, "trainId": 591},
662
- {"name": "brick", "id": 292, "trainId": 592},
663
- {"name": "table runner", "id": 2690, "trainId": 593},
664
- {"name": "ashtray", "id": 65, "trainId": 594},
665
- {"name": "street box", "id": 2607, "trainId": 595},
666
- {"name": "stick", "id": 2574, "trainId": 596},
667
- {"name": "hangers", "id": 1188, "trainId": 597},
668
- {"name": "cells", "id": 456, "trainId": 598},
669
- {"name": "urinal", "id": 2913, "trainId": 599},
670
- {"name": "centerpiece", "id": 459, "trainId": 600},
671
- {"name": "portable fridge", "id": 1955, "trainId": 601},
672
- {"name": "dvds", "id": 827, "trainId": 602},
673
- {"name": "golf club", "id": 1111, "trainId": 603},
674
- {"name": "skirting board", "id": 2412, "trainId": 604},
675
- {"name": "water cooler", "id": 2997, "trainId": 605},
676
- {"name": "clipboard", "id": 528, "trainId": 606},
677
- {"name": "camera, photographic camera", "id": 366, "trainId": 607},
678
- {"name": "pigeonhole", "id": 1863, "trainId": 608},
679
- {"name": "chips", "id": 500, "trainId": 609},
680
- {"name": "food processor", "id": 1001, "trainId": 610},
681
- {"name": "post box", "id": 1958, "trainId": 611},
682
- {"name": "lid", "id": 1441, "trainId": 612},
683
- {"name": "drum", "id": 809, "trainId": 613},
684
- {"name": "blender", "id": 210, "trainId": 614},
685
- {"name": "cave entrance", "id": 435, "trainId": 615},
686
- {"name": "dental chair", "id": 718, "trainId": 616},
687
- {"name": "obelisk", "id": 1674, "trainId": 617},
688
- {"name": "canoe", "id": 388, "trainId": 618},
689
- {"name": "mobile", "id": 1572, "trainId": 619},
690
- {"name": "monitors", "id": 1584, "trainId": 620},
691
- {"name": "pool ball", "id": 1944, "trainId": 621},
692
- {"name": "cue rack", "id": 674, "trainId": 622},
693
- {"name": "baggage carts", "id": 99, "trainId": 623},
694
- {"name": "shore", "id": 2352, "trainId": 624},
695
- {"name": "fork", "id": 1019, "trainId": 625},
696
- {"name": "paper filer", "id": 1763, "trainId": 626},
697
- {"name": "bicycle rack", "id": 185, "trainId": 627},
698
- {"name": "coat rack", "id": 554, "trainId": 628},
699
- {"name": "garland", "id": 1066, "trainId": 629},
700
- {"name": "sports bag", "id": 2508, "trainId": 630},
701
- {"name": "fish tank", "id": 951, "trainId": 631},
702
- {"name": "towel dispenser", "id": 2822, "trainId": 632},
703
- {"name": "carriage", "id": 415, "trainId": 633},
704
- {"name": "brochure", "id": 297, "trainId": 634},
705
- {"name": "plaque", "id": 1914, "trainId": 635},
706
- {"name": "stringer", "id": 2619, "trainId": 636},
707
- {"name": "iron", "id": 1338, "trainId": 637},
708
- {"name": "spoon", "id": 2505, "trainId": 638},
709
- {"name": "flag pole", "id": 955, "trainId": 639},
710
- {"name": "toilet brush", "id": 2786, "trainId": 640},
711
- {"name": "book stand", "id": 238, "trainId": 641},
712
- {"name": "water faucet, water tap, tap, hydrant", "id": 3000, "trainId": 642},
713
- {"name": "ticket office", "id": 2763, "trainId": 643},
714
- {"name": "broom", "id": 299, "trainId": 644},
715
- {"name": "dvd", "id": 822, "trainId": 645},
716
- {"name": "ice bucket", "id": 1288, "trainId": 646},
717
- {"name": "carapace, shell, cuticle, shield", "id": 3101, "trainId": 647},
718
- {"name": "tureen", "id": 2894, "trainId": 648},
719
- {"name": "folders", "id": 992, "trainId": 649},
720
- {"name": "chess", "id": 489, "trainId": 650},
721
- {"name": "root", "id": 2157, "trainId": 651},
722
- {"name": "sewing machine", "id": 2309, "trainId": 652},
723
- {"name": "model", "id": 1576, "trainId": 653},
724
- {"name": "pen", "id": 1810, "trainId": 654},
725
- {"name": "violin", "id": 2964, "trainId": 655},
726
- {"name": "sweatshirt", "id": 2662, "trainId": 656},
727
- {"name": "recycling materials", "id": 2087, "trainId": 657},
728
- {"name": "mitten", "id": 1569, "trainId": 658},
729
- {"name": "chopping board, cutting board", "id": 503, "trainId": 659},
730
- {"name": "mask", "id": 1505, "trainId": 660},
731
- {"name": "log", "id": 1468, "trainId": 661},
732
- {"name": "mouse, computer mouse", "id": 1613, "trainId": 662},
733
- {"name": "grill", "id": 1138, "trainId": 663},
734
- {"name": "hole", "id": 1256, "trainId": 664},
735
- {"name": "target", "id": 2715, "trainId": 665},
736
- {"name": "trash bag", "id": 2846, "trainId": 666},
737
- {"name": "chalk", "id": 477, "trainId": 667},
738
- {"name": "sticks", "id": 2576, "trainId": 668},
739
- {"name": "balloon", "id": 108, "trainId": 669},
740
- {"name": "score", "id": 2245, "trainId": 670},
741
- {"name": "hair spray", "id": 1162, "trainId": 671},
742
- {"name": "roll", "id": 2149, "trainId": 672},
743
- {"name": "runner", "id": 2183, "trainId": 673},
744
- {"name": "engine", "id": 858, "trainId": 674},
745
- {"name": "inflatable glove", "id": 1324, "trainId": 675},
746
- {"name": "games", "id": 1055, "trainId": 676},
747
- {"name": "pallets", "id": 1741, "trainId": 677},
748
- {"name": "baskets", "id": 149, "trainId": 678},
749
- {"name": "coop", "id": 615, "trainId": 679},
750
- {"name": "dvd player", "id": 825, "trainId": 680},
751
- {"name": "rocking horse", "id": 2143, "trainId": 681},
752
- {"name": "buckets", "id": 304, "trainId": 682},
753
- {"name": "bread rolls", "id": 283, "trainId": 683},
754
- {"name": "shawl", "id": 2322, "trainId": 684},
755
- {"name": "watering can", "id": 3017, "trainId": 685},
756
- {"name": "spotlights", "id": 2510, "trainId": 686},
757
- {"name": "post-it", "id": 1960, "trainId": 687},
758
- {"name": "bowls", "id": 265, "trainId": 688},
759
- {"name": "security camera", "id": 2282, "trainId": 689},
760
- {"name": "runner cloth", "id": 2184, "trainId": 690},
761
- {"name": "lock", "id": 1461, "trainId": 691},
762
- {"name": "alarm, warning device, alarm system", "id": 3113, "trainId": 692},
763
- {"name": "side", "id": 2372, "trainId": 693},
764
- {"name": "roulette", "id": 2166, "trainId": 694},
765
- {"name": "bone", "id": 232, "trainId": 695},
766
- {"name": "cutlery", "id": 693, "trainId": 696},
767
- {"name": "pool balls", "id": 1945, "trainId": 697},
768
- {"name": "wheels", "id": 3039, "trainId": 698},
769
- {"name": "spice rack", "id": 2494, "trainId": 699},
770
- {"name": "plant pots", "id": 1908, "trainId": 700},
771
- {"name": "towel ring", "id": 2827, "trainId": 701},
772
- {"name": "bread box", "id": 280, "trainId": 702},
773
- {"name": "video", "id": 2950, "trainId": 703},
774
- {"name": "funfair", "id": 1044, "trainId": 704},
775
- {"name": "breads", "id": 288, "trainId": 705},
776
- {"name": "tripod", "id": 2863, "trainId": 706},
777
- {"name": "ironing board", "id": 1342, "trainId": 707},
778
- {"name": "skimmer", "id": 2409, "trainId": 708},
779
- {"name": "hollow", "id": 1258, "trainId": 709},
780
- {"name": "scratching post", "id": 2249, "trainId": 710},
781
- {"name": "tricycle", "id": 2862, "trainId": 711},
782
- {"name": "file box", "id": 920, "trainId": 712},
783
- {"name": "mountain pass", "id": 1607, "trainId": 713},
784
- {"name": "tombstones", "id": 2802, "trainId": 714},
785
- {"name": "cooker", "id": 610, "trainId": 715},
786
- {"name": "card game, cards", "id": 3129, "trainId": 716},
787
- {"name": "golf bag", "id": 1108, "trainId": 717},
788
- {"name": "towel paper", "id": 2823, "trainId": 718},
789
- {"name": "chaise lounge", "id": 476, "trainId": 719},
790
- {"name": "sun", "id": 2641, "trainId": 720},
791
- {"name": "toilet paper holder", "id": 2788, "trainId": 721},
792
- {"name": "rake", "id": 2070, "trainId": 722},
793
- {"name": "key", "id": 1368, "trainId": 723},
794
- {"name": "umbrella stand", "id": 2903, "trainId": 724},
795
- {"name": "dartboard", "id": 699, "trainId": 725},
796
- {"name": "transformer", "id": 2844, "trainId": 726},
797
- {"name": "fireplace utensils", "id": 942, "trainId": 727},
798
- {"name": "sweatshirts", "id": 2663, "trainId": 728},
799
- {
800
- "name": "cellular telephone, cellular phone, cellphone, cell, mobile phone",
801
- "id": 457,
802
- "trainId": 729,
803
- },
804
- {"name": "tallboy", "id": 2701, "trainId": 730},
805
- {"name": "stapler", "id": 2540, "trainId": 731},
806
- {"name": "sauna", "id": 2231, "trainId": 732},
807
- {"name": "test tube", "id": 2746, "trainId": 733},
808
- {"name": "palette", "id": 1738, "trainId": 734},
809
- {"name": "shopping carts", "id": 2350, "trainId": 735},
810
- {"name": "tools", "id": 2808, "trainId": 736},
811
- {"name": "push button, push, button", "id": 2025, "trainId": 737},
812
- {"name": "star", "id": 2541, "trainId": 738},
813
- {"name": "roof rack", "id": 2156, "trainId": 739},
814
- {"name": "barbed wire", "id": 126, "trainId": 740},
815
- {"name": "spray", "id": 2512, "trainId": 741},
816
- {"name": "ear", "id": 831, "trainId": 742},
817
- {"name": "sponge", "id": 2503, "trainId": 743},
818
- {"name": "racket", "id": 2039, "trainId": 744},
819
- {"name": "tins", "id": 2774, "trainId": 745},
820
- {"name": "eyeglasses", "id": 886, "trainId": 746},
821
- {"name": "file", "id": 919, "trainId": 747},
822
- {"name": "scarfs", "id": 2240, "trainId": 748},
823
- {"name": "sugar bowl", "id": 2636, "trainId": 749},
824
- {"name": "flip flop", "id": 963, "trainId": 750},
825
- {"name": "headstones", "id": 1218, "trainId": 751},
826
- {"name": "laptop bag", "id": 1406, "trainId": 752},
827
- {"name": "leash", "id": 1420, "trainId": 753},
828
- {"name": "climbing frame", "id": 526, "trainId": 754},
829
- {"name": "suit hanger", "id": 2639, "trainId": 755},
830
- {"name": "floor spotlight", "id": 975, "trainId": 756},
831
- {"name": "plate rack", "id": 1921, "trainId": 757},
832
- {"name": "sewer", "id": 2305, "trainId": 758},
833
- {"name": "hard drive", "id": 1193, "trainId": 759},
834
- {"name": "sprinkler", "id": 2517, "trainId": 760},
835
- {"name": "tools box", "id": 2809, "trainId": 761},
836
- {"name": "necklace", "id": 1647, "trainId": 762},
837
- {"name": "bulbs", "id": 314, "trainId": 763},
838
- {"name": "steel industry", "id": 2560, "trainId": 764},
839
- {"name": "club", "id": 545, "trainId": 765},
840
- {"name": "jack", "id": 1345, "trainId": 766},
841
- {"name": "door bars", "id": 775, "trainId": 767},
842
- {
843
- "name": "control panel, instrument panel, control board, board, panel",
844
- "id": 603,
845
- "trainId": 768,
846
- },
847
- {"name": "hairbrush", "id": 1163, "trainId": 769},
848
- {"name": "napkin holder", "id": 1641, "trainId": 770},
849
- {"name": "office", "id": 1678, "trainId": 771},
850
- {"name": "smoke detector", "id": 2450, "trainId": 772},
851
- {"name": "utensils", "id": 2915, "trainId": 773},
852
- {"name": "apron", "id": 42, "trainId": 774},
853
- {"name": "scissors", "id": 2242, "trainId": 775},
854
- {"name": "terminal", "id": 2741, "trainId": 776},
855
- {"name": "grinder", "id": 1143, "trainId": 777},
856
- {"name": "entry phone", "id": 862, "trainId": 778},
857
- {"name": "newspaper stand", "id": 1654, "trainId": 779},
858
- {"name": "pepper shaker", "id": 1826, "trainId": 780},
859
- {"name": "onions", "id": 1689, "trainId": 781},
860
- {
861
- "name": "central processing unit, cpu, c p u , central processor, processor, mainframe",
862
- "id": 3124,
863
- "trainId": 782,
864
- },
865
- {"name": "tape", "id": 2710, "trainId": 783},
866
- {"name": "bat", "id": 152, "trainId": 784},
867
- {"name": "coaster", "id": 549, "trainId": 785},
868
- {"name": "calculator", "id": 360, "trainId": 786},
869
- {"name": "potatoes", "id": 1982, "trainId": 787},
870
- {"name": "luggage rack", "id": 1478, "trainId": 788},
871
- {"name": "salt", "id": 2203, "trainId": 789},
872
- {"name": "street number", "id": 2612, "trainId": 790},
873
- {"name": "viewpoint", "id": 2956, "trainId": 791},
874
- {"name": "sword", "id": 2681, "trainId": 792},
875
- {"name": "cd", "id": 437, "trainId": 793},
876
- {"name": "rowing machine", "id": 2171, "trainId": 794},
877
- {"name": "plug", "id": 1933, "trainId": 795},
878
- {"name": "andiron, firedog, dog, dog-iron", "id": 3110, "trainId": 796},
879
- {"name": "pepper", "id": 1824, "trainId": 797},
880
- {"name": "tongs", "id": 2803, "trainId": 798},
881
- {"name": "bonfire", "id": 234, "trainId": 799},
882
- {"name": "dog dish", "id": 764, "trainId": 800},
883
- {"name": "belt", "id": 177, "trainId": 801},
884
- {"name": "dumbbells", "id": 817, "trainId": 802},
885
- {"name": "videocassette recorder, vcr", "id": 3145, "trainId": 803},
886
- {"name": "hook", "id": 1262, "trainId": 804},
887
- {"name": "envelopes", "id": 864, "trainId": 805},
888
- {"name": "shower faucet", "id": 2359, "trainId": 806},
889
- {"name": "watch", "id": 2992, "trainId": 807},
890
- {"name": "padlock", "id": 1725, "trainId": 808},
891
- {"name": "swimming pool ladder", "id": 2667, "trainId": 809},
892
- {"name": "spanners", "id": 2484, "trainId": 810},
893
- {"name": "gravy boat", "id": 1133, "trainId": 811},
894
- {"name": "notice board", "id": 1667, "trainId": 812},
895
- {"name": "trash bags", "id": 2847, "trainId": 813},
896
- {"name": "fire alarm", "id": 932, "trainId": 814},
897
- {"name": "ladle", "id": 1392, "trainId": 815},
898
- {"name": "stethoscope", "id": 2573, "trainId": 816},
899
- {"name": "rocket", "id": 2140, "trainId": 817},
900
- {"name": "funnel", "id": 1046, "trainId": 818},
901
- {"name": "bowling pins", "id": 264, "trainId": 819},
902
- {"name": "valve", "id": 2927, "trainId": 820},
903
- {"name": "thermometer", "id": 2752, "trainId": 821},
904
- {"name": "cups", "id": 679, "trainId": 822},
905
- {"name": "spice jar", "id": 2493, "trainId": 823},
906
- {"name": "night light", "id": 1658, "trainId": 824},
907
- {"name": "soaps", "id": 2466, "trainId": 825},
908
- {"name": "games table", "id": 1057, "trainId": 826},
909
- {"name": "slotted spoon", "id": 2444, "trainId": 827},
910
- {"name": "reel", "id": 2093, "trainId": 828},
911
- {"name": "scourer", "id": 2248, "trainId": 829},
912
- {"name": "sleeping robe", "id": 2432, "trainId": 830},
913
- {"name": "desk mat", "id": 726, "trainId": 831},
914
- {"name": "dumbbell", "id": 816, "trainId": 832},
915
- {"name": "hammer", "id": 1171, "trainId": 833},
916
- {"name": "tie", "id": 2766, "trainId": 834},
917
- {"name": "typewriter", "id": 2900, "trainId": 835},
918
- {"name": "shaker", "id": 2313, "trainId": 836},
919
- {"name": "cheese dish", "id": 488, "trainId": 837},
920
- {"name": "sea star", "id": 2265, "trainId": 838},
921
- {"name": "racquet", "id": 2043, "trainId": 839},
922
- {"name": "butane gas cylinder", "id": 332, "trainId": 840},
923
- {"name": "paper weight", "id": 1771, "trainId": 841},
924
- {"name": "shaving brush", "id": 2320, "trainId": 842},
925
- {"name": "sunglasses", "id": 2646, "trainId": 843},
926
- {"name": "gear shift", "id": 1089, "trainId": 844},
927
- {"name": "towel rail", "id": 2826, "trainId": 845},
928
- {"name": "adding machine, totalizer, totaliser", "id": 3148, "trainId": 846},
929
- ]
930
-
931
-
932
- def loadAde20K(file):
933
- fileseg = file.replace(".jpg", "_seg.png")
934
- with Image.open(fileseg) as io:
935
- seg = np.array(io)
936
-
937
- R = seg[:, :, 0]
938
- G = seg[:, :, 1]
939
- ObjectClassMasks = (R / 10).astype(np.int32) * 256 + (G.astype(np.int32))
940
-
941
- return {"img_name": file, "segm_name": fileseg, "class_mask": ObjectClassMasks}
942
-
943
-
944
- if __name__ == "__main__":
945
- dataset_dir = Path(os.getenv("DETECTRON2_DATASETS", "datasets"))
946
- index_file = dataset_dir / "ADE20K_2021_17_01" / "index_ade20k.pkl"
947
- print('Caution: we only generate the validation set!')
948
- with open(index_file, "rb") as f:
949
- index_ade20k = pkl.load(f)
950
-
951
- id_map = {}
952
- for cat in ADE20K_SEM_SEG_FULL_CATEGORIES:
953
- id_map[cat["id"]] = cat["trainId"]
954
-
955
- # make output dir
956
- for name in ["training", "validation"]:
957
- image_dir = dataset_dir / "ADE20K_2021_17_01" / "images_detectron2" / name
958
- image_dir.mkdir(parents=True, exist_ok=True)
959
- annotation_dir = dataset_dir / "ADE20K_2021_17_01" / "annotations_detectron2" / name
960
- annotation_dir.mkdir(parents=True, exist_ok=True)
961
-
962
- # process image and gt
963
- for i, (folder_name, file_name) in tqdm.tqdm(
964
- enumerate(zip(index_ade20k["folder"], index_ade20k["filename"])),
965
- total=len(index_ade20k["filename"]),
966
- ):
967
- split = "validation" if file_name.split("_")[1] == "val" else "training"
968
- if split == 'training':
969
- # FIXME: delete this condition if you want to generate the training set as well
970
- continue
971
- info = loadAde20K(str(dataset_dir / folder_name / file_name))
972
-
973
- # resize image and label
974
- img = np.asarray(Image.open(info["img_name"]))
975
- lab = np.asarray(info["class_mask"])
976
-
977
- h, w = img.shape[0], img.shape[1]
978
- max_size = 512
979
- resize = True
980
- if w >= h > max_size:
981
- h_new, w_new = max_size, round(w / float(h) * max_size)
982
- elif h >= w > max_size:
983
- h_new, w_new = round(h / float(w) * max_size), max_size
984
- else:
985
- resize = False
986
-
987
- if resize:
988
- img = cv2.resize(img, (w_new, h_new), interpolation=cv2.INTER_LINEAR)
989
- lab = cv2.resize(lab, (w_new, h_new), interpolation=cv2.INTER_NEAREST)
990
-
991
- assert img.dtype == np.uint8
992
- assert lab.dtype == np.int32
993
-
994
- # apply label conversion and save into uint16 images
995
- output = np.zeros_like(lab, dtype=np.uint16) + 65535
996
- for obj_id in np.unique(lab):
997
- if obj_id in id_map:
998
- output[lab == obj_id] = id_map[obj_id]
999
-
1000
- output_img = dataset_dir / "ADE20K_2021_17_01" / "images_detectron2" / split / file_name
1001
- output_lab = (
1002
- dataset_dir
1003
- / "ADE20K_2021_17_01"
1004
- / "annotations_detectron2"
1005
- / split
1006
- / file_name.replace(".jpg", ".tif")
1007
- )
1008
- Image.fromarray(img).save(output_img)
1009
-
1010
- assert output.dtype == np.uint16
1011
- Image.fromarray(output).save(output_lab)
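For reference, the loop above stores the remapped ADE20K-full labels as uint16 TIFFs, using 65535 for pixels whose object id has no entry in id_map; the object id itself is decoded from the *_seg.png channels as (R // 10) * 256 + G, as in loadAde20K. A minimal sketch for reading one generated annotation back; the file name is only an illustrative placeholder:

import numpy as np
from PIL import Image

# Illustrative path; any .tif produced by the script above would do.
ann = "datasets/ADE20K_2021_17_01/annotations_detectron2/validation/ADE_val_00000001.tif"
lab = np.asarray(Image.open(ann))
assert lab.dtype == np.uint16
ids = np.unique(lab)
print("train ids present:", ids[ids != 65535])  # 65535 marks unmapped/ignored pixels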
 
datasets/prepare_ade20k_sem_seg.py DELETED
@@ -1,35 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import os
5
- from pathlib import Path
6
-
7
- import numpy as np
8
- import tqdm
9
- from PIL import Image
10
-
11
-
12
- def convert(input, output, index=None):
13
- img = np.asarray(Image.open(input))
14
- assert img.dtype == np.uint8
15
- img = img - 1 # 0 (ignore) becomes 255. others are shifted by 1
16
- if index is not None:
17
- mapping = {i: k for k, i in enumerate(index)}
18
- img = np.vectorize(lambda x: mapping[x] if x in mapping else 255)(
19
- img.astype(np.float64)
20
- ).astype(np.uint8)
21
- Image.fromarray(img).save(output)
22
-
23
-
24
- if __name__ == "__main__":
25
- dataset_dir = (
26
- Path(os.getenv("DETECTRON2_DATASETS", "datasets")) / "ADEChallengeData2016"
27
- )
28
- print('Caution: we only generate the validation set!')
29
- for name in ["validation"]:
30
- annotation_dir = dataset_dir / "annotations" / name
31
- output_dir = dataset_dir / "annotations_detectron2" / name
32
- output_dir.mkdir(parents=True, exist_ok=True)
33
- for file in tqdm.tqdm(list(annotation_dir.iterdir())):
34
- output_file = output_dir / file.name
35
- convert(file, output_file)
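The convert() helper above relies on uint8 wrap-around: label 0 (the ignore class in ADEChallengeData2016) becomes 255, and every real class id shifts down by one. A toy check of that convention:

import numpy as np

toy = np.array([[0, 1, 150]], dtype=np.uint8)
print(toy - 1)  # -> [[255   0 149]]: 0 wraps to 255 (ignore), the rest shift down by one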
 
datasets/prepare_coco_stuff_sem_seg.py DELETED
@@ -1,219 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
- # Modified by Feng Liang from
4
- # https://github.com/MendelXu/zsseg.baseline/blob/master/datasets/prepare_coco_stuff_164k_sem_seg.py
5
-
6
- import os
7
- import os.path as osp
8
- from pathlib import Path
9
- import tqdm
10
- from glob import glob
11
-
12
- import numpy as np
13
- from PIL import Image
14
-
15
-
16
- full_clsID_to_trID = {
17
- 0: 0,
18
- 1: 1,
19
- 2: 2,
20
- 3: 3,
21
- 4: 4,
22
- 5: 5,
23
- 6: 6,
24
- 7: 7,
25
- 8: 8,
26
- 9: 9,
27
- 10: 10,
28
- 12: 11,
29
- 13: 12,
30
- 14: 13,
31
- 15: 14,
32
- 16: 15,
33
- 17: 16,
34
- 18: 17,
35
- 19: 18,
36
- 20: 19,
37
- 21: 20,
38
- 22: 21,
39
- 23: 22,
40
- 24: 23,
41
- 26: 24,
42
- 27: 25,
43
- 30: 26,
44
- 31: 27,
45
- 32: 28,
46
- 33: 29,
47
- 34: 30,
48
- 35: 31,
49
- 36: 32,
50
- 37: 33,
51
- 38: 34,
52
- 39: 35,
53
- 40: 36,
54
- 41: 37,
55
- 42: 38,
56
- 43: 39,
57
- 45: 40,
58
- 46: 41,
59
- 47: 42,
60
- 48: 43,
61
- 49: 44,
62
- 50: 45,
63
- 51: 46,
64
- 52: 47,
65
- 53: 48,
66
- 54: 49,
67
- 55: 50,
68
- 56: 51,
69
- 57: 52,
70
- 58: 53,
71
- 59: 54,
72
- 60: 55,
73
- 61: 56,
74
- 62: 57,
75
- 63: 58,
76
- 64: 59,
77
- 66: 60,
78
- 69: 61,
79
- 71: 62,
80
- 72: 63,
81
- 73: 64,
82
- 74: 65,
83
- 75: 66,
84
- 76: 67,
85
- 77: 68,
86
- 78: 69,
87
- 79: 70,
88
- 80: 71,
89
- 81: 72,
90
- 83: 73,
91
- 84: 74,
92
- 85: 75,
93
- 86: 76,
94
- 87: 77,
95
- 88: 78,
96
- 89: 79,
97
- 91: 80,
98
- 92: 81,
99
- 93: 82,
100
- 94: 83,
101
- 95: 84,
102
- 96: 85,
103
- 97: 86,
104
- 98: 87,
105
- 99: 88,
106
- 100: 89,
107
- 101: 90,
108
- 102: 91,
109
- 103: 92,
110
- 104: 93,
111
- 105: 94,
112
- 106: 95,
113
- 107: 96,
114
- 108: 97,
115
- 109: 98,
116
- 110: 99,
117
- 111: 100,
118
- 112: 101,
119
- 113: 102,
120
- 114: 103,
121
- 115: 104,
122
- 116: 105,
123
- 117: 106,
124
- 118: 107,
125
- 119: 108,
126
- 120: 109,
127
- 121: 110,
128
- 122: 111,
129
- 123: 112,
130
- 124: 113,
131
- 125: 114,
132
- 126: 115,
133
- 127: 116,
134
- 128: 117,
135
- 129: 118,
136
- 130: 119,
137
- 131: 120,
138
- 132: 121,
139
- 133: 122,
140
- 134: 123,
141
- 135: 124,
142
- 136: 125,
143
- 137: 126,
144
- 138: 127,
145
- 139: 128,
146
- 140: 129,
147
- 141: 130,
148
- 142: 131,
149
- 143: 132,
150
- 144: 133,
151
- 145: 134,
152
- 146: 135,
153
- 147: 136,
154
- 148: 137,
155
- 149: 138,
156
- 150: 139,
157
- 151: 140,
158
- 152: 141,
159
- 153: 142,
160
- 154: 143,
161
- 155: 144,
162
- 156: 145,
163
- 157: 146,
164
- 158: 147,
165
- 159: 148,
166
- 160: 149,
167
- 161: 150,
168
- 162: 151,
169
- 163: 152,
170
- 164: 153,
171
- 165: 154,
172
- 166: 155,
173
- 167: 156,
174
- 168: 157,
175
- 169: 158,
176
- 170: 159,
177
- 171: 160,
178
- 172: 161,
179
- 173: 162,
180
- 174: 163,
181
- 175: 164,
182
- 176: 165,
183
- 177: 166,
184
- 178: 167,
185
- 179: 168,
186
- 180: 169,
187
- 181: 170,
188
- 255: 255,
189
- }
190
-
191
- def convert_to_trainID(
192
- maskpath, out_mask_dir, is_train, clsID_to_trID=full_clsID_to_trID, suffix=""
193
- ):
194
- mask = np.array(Image.open(maskpath))
195
- mask_copy = np.ones_like(mask, dtype=np.uint8) * 255
196
- for clsID, trID in clsID_to_trID.items():
197
- mask_copy[mask == clsID] = trID
198
- seg_filename = (
199
- osp.join(out_mask_dir, "train2017" + suffix, osp.basename(maskpath))
200
- if is_train
201
- else osp.join(out_mask_dir, "val2017" + suffix, osp.basename(maskpath))
202
- )
203
- if len(np.unique(mask_copy)) == 1 and np.unique(mask_copy)[0] == 255:
204
- return
205
- Image.fromarray(mask_copy).save(seg_filename, "PNG")
206
-
207
-
208
-
209
- if __name__ == "__main__":
210
- dataset_dir = Path(os.getenv("DETECTRON2_DATASETS", "datasets"))
211
- print('Caution: we only generate the training set!')
212
- coco_path = dataset_dir / "coco"
213
- mask_dir = coco_path / "stuffthingmaps"
214
- out_mask_dir = coco_path / "stuffthingmaps_detectron2"
215
- for name in ["train2017"]:
216
- os.makedirs((out_mask_dir / name), exist_ok=True)
217
- train_list = glob(osp.join(mask_dir, "train2017", "*.png"))
218
- for file in tqdm.tqdm(train_list):
219
- convert_to_trainID(file, out_mask_dir, is_train=True)
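full_clsID_to_trID above compresses the COCO-Stuff label ids (which skip a handful of unused ids) into contiguous train ids 0-170, with 255 kept as ignore. The per-class loop in convert_to_trainID could equally be written as a 256-entry lookup table; the sketch below uses a toy mapping and is not the script's implementation:

import numpy as np

# Toy stand-in for full_clsID_to_trID; the real table above has 182 entries.
clsID_to_trID = {0: 0, 1: 1, 12: 11, 255: 255}
lut = np.full(256, 255, dtype=np.uint8)              # default: ignore
for cls_id, train_id in clsID_to_trID.items():
    lut[cls_id] = train_id
mask = np.array([[0, 1, 12, 11]], dtype=np.uint8)    # toy label image
print(lut[mask])                                     # -> [[  0   1  11 255]]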
 
datasets/prepare_pascal_context.py DELETED
@@ -1,69 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import tqdm
5
- import os
6
- import os.path as osp
7
- from pathlib import Path
8
-
9
- import numpy as np
10
- from PIL import Image
11
- import scipy.io
12
-
13
- def convert_pc59(mask_path, new_mask_path, pc59_dict):
14
- mat = scipy.io.loadmat(mask_path)
15
- mask = mat['LabelMap']
16
-
17
- mask_copy = np.ones_like(mask, dtype=np.uint8) * 255
18
- for trID, clsID in pc59_dict.items():
19
- mask_copy[mask == clsID] = trID
20
-
21
- min_value = np.amin(mask_copy)
22
- assert min_value >= 0, min_value
23
- Image.fromarray(mask_copy).save(new_mask_path, "PNG")
24
-
25
- def convert_pc459(mask_path, new_mask_path):
26
- mat = scipy.io.loadmat(mask_path)
27
- mask = mat['LabelMap']
28
- mask = mask - 1
29
- min_value = np.amin(mask)
30
- assert min_value >= 0, min_value
31
- Image.fromarray(mask).save(new_mask_path, "TIFF")
32
-
33
-
34
- if __name__ == "__main__":
35
- dataset_dir = Path(os.getenv("DETECTRON2_DATASETS", "datasets"))
36
- print('Caution: we only generate the validation set!')
37
- pc_path = dataset_dir / "VOCdevkit/VOC2010"
38
-
39
- val_list = open(pc_path / "pascalcontext_val.txt", "r")
40
- pc459_labels = open(pc_path / "labels.txt", "r")
41
- pc59_labels = open(pc_path / "59_labels.txt", "r")
42
-
43
- pc459_dict = {}
44
- for line in pc459_labels.readlines():
45
- if ':' in line:
46
- idx, name = line.split(':')
47
- idx = int(idx.strip())
48
- name = name.strip()
49
- pc459_dict[name] = idx
50
-
51
- pc59_dict = {}
52
- for i, line in enumerate(pc59_labels.readlines()):
53
- name = line.split(':')[-1].strip()
54
- if name != '':
55
- pc59_dict[i] = pc459_dict[name]
56
-
57
- pc459_dir = pc_path / "annotations_detectron2" / "pc459_val"
58
- pc459_dir.mkdir(parents=True, exist_ok=True)
59
- pc59_dir = pc_path / "annotations_detectron2" / "pc59_val"
60
- pc59_dir.mkdir(parents=True, exist_ok=True)
61
-
62
- for line in tqdm.tqdm(val_list.readlines()):
63
- fileid = line.strip()
64
- ori_mask = f'{pc_path}/trainval/{fileid}.mat'
65
- pc459_dst = f'{pc459_dir}/{fileid}.tif'
66
- pc59_dst = f'{pc59_dir}/{fileid}.png'
67
- if osp.exists(ori_mask):
68
- convert_pc459(ori_mask, pc459_dst)
69
- convert_pc59(ori_mask, pc59_dst, pc59_dict)
datasets/prepare_voc_sem_seg.py DELETED
@@ -1,71 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
- # Modified by Feng Liang from https://github.com/MendelXu/zsseg.baseline/blob/master/datasets/prepare_voc_sem_seg.py
4
-
5
- import os
6
- import os.path as osp
7
- from pathlib import Path
8
- import tqdm
9
-
10
- import numpy as np
11
- from PIL import Image
12
-
13
-
14
- clsID_to_trID = {
15
- 0: 255,
16
- 1: 0,
17
- 2: 1,
18
- 3: 2,
19
- 4: 3,
20
- 5: 4,
21
- 6: 5,
22
- 7: 6,
23
- 8: 7,
24
- 9: 8,
25
- 10: 9,
26
- 11: 10,
27
- 12: 11,
28
- 13: 12,
29
- 14: 13,
30
- 15: 14,
31
- 16: 15,
32
- 17: 16,
33
- 18: 17,
34
- 19: 18,
35
- 20: 19,
36
- 255: 255,
37
- }
38
-
39
- def convert_to_trainID(
40
- maskpath, out_mask_dir, is_train, clsID_to_trID=clsID_to_trID, suffix=""
41
- ):
42
- mask = np.array(Image.open(maskpath))
43
- mask_copy = np.ones_like(mask, dtype=np.uint8) * 255
44
- for clsID, trID in clsID_to_trID.items():
45
- mask_copy[mask == clsID] = trID
46
- seg_filename = (
47
- osp.join(out_mask_dir, "train" + suffix, osp.basename(maskpath))
48
- if is_train
49
- else osp.join(out_mask_dir, "val" + suffix, osp.basename(maskpath))
50
- )
51
- if len(np.unique(mask_copy)) == 1 and np.unique(mask_copy)[0] == 255:
52
- return
53
- Image.fromarray(mask_copy).save(seg_filename, "PNG")
54
-
55
-
56
-
57
- if __name__ == "__main__":
58
- dataset_dir = Path(os.getenv("DETECTRON2_DATASETS", "datasets"))
59
- print('Caution: we only generate the validation set!')
60
- voc_path = dataset_dir / "VOCdevkit" / "VOC2012"
61
- out_mask_dir = voc_path / "annotations_detectron2"
62
- out_image_dir = voc_path / "images_detectron2"
63
- for name in ["val"]:
64
- os.makedirs((out_mask_dir / name), exist_ok=True)
65
- os.makedirs((out_image_dir / name), exist_ok=True)
66
- val_list = [
67
- osp.join(voc_path, "SegmentationClassAug", f + ".png")
68
- for f in np.loadtxt(osp.join(voc_path, "ImageSets/Segmentation/val.txt"), dtype=str).tolist()
69
- ]
70
- for file in tqdm.tqdm(val_list):
71
- convert_to_trainID(file, out_mask_dir, is_train=False)
open_vocab_seg/.DS_Store CHANGED
Binary files a/open_vocab_seg/.DS_Store and b/open_vocab_seg/.DS_Store differ
open_vocab_seg/modeling/.DS_Store CHANGED
Binary files a/open_vocab_seg/modeling/.DS_Store and b/open_vocab_seg/modeling/.DS_Store differ
open_vocab_seg/modeling/clip_adapter/__init__.py CHANGED
@@ -21,3 +21,5 @@ def build_text_prompt(cfg):
21
  "Prompt learner {} is not supported".format(cfg.TEXT_TEMPLATES)
22
  )
23
  return text_templates
 
 
21
  "Prompt learner {} is not supported".format(cfg.TEXT_TEMPLATES)
22
  )
23
  return text_templates
24
+
25
+ from .clip import tokenize
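This change re-exports tokenize from the vendored CLIP copy so callers no longer depend on the external clip package. A minimal sketch of the resulting import path, assuming the package is importable as laid out in this commit; the prompt string is only an illustration:

# Hypothetical usage of the re-exported tokenizer, not part of the commit.
from open_vocab_seg.modeling.clip_adapter import tokenize

tokens = tokenize(["a photo of a cat"])
print(tokens.shape)  # torch.Size([1, 77]), the fixed CLIP context length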
open_vocab_seg/modeling/clip_adapter/clip/__init__.py ADDED
@@ -0,0 +1 @@
 
1
+ from .clip import *
open_vocab_seg/modeling/clip_adapter/clip/bpe_simple_vocab_16e6.txt.gz ADDED
@@ -0,0 +1,3 @@
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
3
+ size 1356917
open_vocab_seg/modeling/clip_adapter/clip/clip.py ADDED
@@ -0,0 +1,285 @@
1
+ import hashlib
2
+ import os
3
+ import urllib
4
+ import warnings
5
+ from collections import OrderedDict
6
+ from typing import Union, List
7
+
8
+ import torch
9
+ from PIL import Image
10
+ from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
11
+ from tqdm import tqdm
12
+
13
+ from .model import build_model
14
+ from .simple_tokenizer import SimpleTokenizer as _Tokenizer
15
+
16
+ try:
17
+ from torchvision.transforms import InterpolationMode
18
+
19
+ BICUBIC = InterpolationMode.BICUBIC
20
+ except ImportError:
21
+ BICUBIC = Image.BICUBIC
22
+
23
+
24
+ if torch.__version__.split(".") < ["1", "7", "1"]:
25
+ warnings.warn("PyTorch version 1.7.1 or higher is recommended")
26
+
27
+
28
+ __all__ = ["available_models", "load", "tokenize"]
29
+ _tokenizer = _Tokenizer()
30
+
31
+ _MODELS = {
32
+ "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
33
+ "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
34
+ "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
35
+ "RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
36
+ "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
37
+ "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
38
+ "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
39
+ "ViT-L/14@336px": "https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt",
40
+ }
41
+
42
+
43
+ def _download(url: str, root: str = os.path.expanduser("~/.cache/clip")):
44
+ os.makedirs(root, exist_ok=True)
45
+ filename = os.path.basename(url)
46
+
47
+ expected_sha256 = url.split("/")[-2]
48
+ download_target = os.path.join(root, filename)
49
+
50
+ if os.path.exists(download_target) and not os.path.isfile(download_target):
51
+ raise RuntimeError(f"{download_target} exists and is not a regular file")
52
+
53
+ if os.path.isfile(download_target):
54
+ if (
55
+ hashlib.sha256(open(download_target, "rb").read()).hexdigest()
56
+ == expected_sha256
57
+ ):
58
+ return download_target
59
+ else:
60
+ warnings.warn(
61
+ f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file"
62
+ )
63
+
64
+ with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
65
+ with tqdm(
66
+ total=int(source.info().get("Content-Length")),
67
+ ncols=80,
68
+ unit="iB",
69
+ unit_scale=True,
70
+ ) as loop:
71
+ while True:
72
+ buffer = source.read(8192)
73
+ if not buffer:
74
+ break
75
+
76
+ output.write(buffer)
77
+ loop.update(len(buffer))
78
+
79
+ if (
80
+ hashlib.sha256(open(download_target, "rb").read()).hexdigest()
81
+ != expected_sha256
82
+ ):
83
+ raise RuntimeError(
84
+ f"Model has been downloaded but the SHA256 checksum does not match"
85
+ )
86
+
87
+ return download_target
88
+
89
+
90
+ def _transform(n_px):
91
+ return Compose(
92
+ [
93
+ Resize(n_px, interpolation=BICUBIC),
94
+ CenterCrop(n_px),
95
+ lambda image: image.convert("RGB"),
96
+ ToTensor(),
97
+ Normalize(
98
+ (0.48145466, 0.4578275, 0.40821073),
99
+ (0.26862954, 0.26130258, 0.27577711),
100
+ ),
101
+ ]
102
+ )
103
+
104
+
105
+ def available_models() -> List[str]:
106
+ """Returns the names of available CLIP models"""
107
+ return list(_MODELS.keys())
108
+
109
+
110
+ def load(
111
+ name: str,
112
+ mask_prompt_depth: int = 0,
113
+ device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu",
114
+ jit=False,
115
+ ):
116
+ """Load a CLIP model
117
+
118
+ Parameters
119
+ ----------
120
+ name : str
121
+ A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict
122
+
123
+ device : Union[str, torch.device]
124
+ The device to put the loaded model
125
+
126
+ jit : bool
127
+ Whether to load the optimized JIT model or more hackable non-JIT model (default).
128
+
129
+ Returns
130
+ -------
131
+ model : torch.nn.Module
132
+ The CLIP model
133
+
134
+ preprocess : Callable[[PIL.Image], torch.Tensor]
135
+ A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input
136
+ """
137
+ if name in _MODELS:
138
+ model_path = _download(_MODELS[name])
139
+ elif os.path.isfile(name):
140
+ model_path = name
141
+ else:
142
+ raise RuntimeError(
143
+ f"Model {name} not found; available models = {available_models()}"
144
+ )
145
+
146
+ try:
147
+ # loading JIT archive
148
+ model = torch.jit.load(model_path, map_location=device if jit else "cpu").eval()
149
+ state_dict = None
150
+ except RuntimeError:
151
+ # loading saved state dict
152
+ if jit:
153
+ warnings.warn(
154
+ f"File {model_path} is not a JIT archive. Loading as a state dict instead"
155
+ )
156
+ jit = False
157
+ state_dict = torch.load(model_path, map_location="cpu")
158
+ if 'state_dict' in state_dict:
159
+ new_state_dict = OrderedDict()
160
+ for k, v in state_dict['state_dict'].items():
161
+ if k.startswith('module.'):
162
+ name = k[7:] # remove `module.`
163
+ new_state_dict[name] = v
164
+ state_dict = new_state_dict
165
+
166
+ if not jit:
167
+ model = build_model(state_dict or model.state_dict(), mask_prompt_depth).to(device)
168
+ if str(device) == "cpu":
169
+ model.float()
170
+ return model, _transform(model.visual.input_resolution)
171
+
172
+ # patch the device names
173
+ device_holder = torch.jit.trace(
174
+ lambda: torch.ones([]).to(torch.device(device)), example_inputs=[]
175
+ )
176
+ device_node = [
177
+ n
178
+ for n in device_holder.graph.findAllNodes("prim::Constant")
179
+ if "Device" in repr(n)
180
+ ][-1]
181
+
182
+ def patch_device(module):
183
+ try:
184
+ graphs = [module.graph] if hasattr(module, "graph") else []
185
+ except RuntimeError:
186
+ graphs = []
187
+
188
+ if hasattr(module, "forward1"):
189
+ graphs.append(module.forward1.graph)
190
+
191
+ for graph in graphs:
192
+ for node in graph.findAllNodes("prim::Constant"):
193
+ if "value" in node.attributeNames() and str(node["value"]).startswith(
194
+ "cuda"
195
+ ):
196
+ node.copyAttributes(device_node)
197
+
198
+ model.apply(patch_device)
199
+ patch_device(model.encode_image)
200
+ patch_device(model.encode_text)
201
+
202
+ # patch dtype to float32 on CPU
203
+ if str(device) == "cpu":
204
+ float_holder = torch.jit.trace(
205
+ lambda: torch.ones([]).float(), example_inputs=[]
206
+ )
207
+ float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
208
+ float_node = float_input.node()
209
+
210
+ def patch_float(module):
211
+ try:
212
+ graphs = [module.graph] if hasattr(module, "graph") else []
213
+ except RuntimeError:
214
+ graphs = []
215
+
216
+ if hasattr(module, "forward1"):
217
+ graphs.append(module.forward1.graph)
218
+
219
+ for graph in graphs:
220
+ for node in graph.findAllNodes("aten::to"):
221
+ inputs = list(node.inputs())
222
+ for i in [
223
+ 1,
224
+ 2,
225
+ ]: # dtype can be the second or third argument to aten::to()
226
+ if inputs[i].node()["value"] == 5:
227
+ inputs[i].node().copyAttributes(float_node)
228
+
229
+ model.apply(patch_float)
230
+ patch_float(model.encode_image)
231
+ patch_float(model.encode_text)
232
+
233
+ model.float()
234
+
235
+ return model, _transform(model.input_resolution.item())
236
+
237
+
238
+ def tokenize(
239
+ texts: Union[str, List[str]],
240
+ context_length: int = 77,
241
+ truncate: bool = False,
242
+ return_length: bool = False,
243
+ ) -> torch.LongTensor:
244
+ """
245
+ Returns the tokenized representation of given input string(s)
246
+
247
+ Parameters
248
+ ----------
249
+ texts : Union[str, List[str]]
250
+ An input string or a list of input strings to tokenize
251
+
252
+ context_length : int
253
+ The context length to use; all CLIP models use 77 as the context length
254
+
255
+ truncate: bool
256
+ Whether to truncate the text in case its encoding is longer than the context length
257
+
258
+ Returns
259
+ -------
260
+ A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
261
+ """
262
+ if isinstance(texts, str):
263
+ texts = [texts]
264
+
265
+ sot_token = _tokenizer.encoder["<|startoftext|>"]
266
+ eot_token = _tokenizer.encoder["<|endoftext|>"]
267
+ all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
268
+ result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
269
+ length = []
270
+ for i, tokens in enumerate(all_tokens):
271
+ if len(tokens) > context_length:
272
+ if truncate:
273
+ tokens = tokens[:context_length]
274
+ tokens[-1] = eot_token
275
+ length.append(context_length)
276
+ else:
277
+ raise RuntimeError(
278
+ f"Input {texts[i]} is too long for context length {context_length}"
279
+ )
280
+ else:
281
+ length.append(len(tokens))
282
+ result[i, : len(tokens)] = torch.tensor(tokens)
283
+ if return_length:
284
+ return result, length
285
+ return result
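The vendored clip.py keeps the familiar OpenAI-style interface (available_models, load, tokenize) and adds a mask_prompt_depth argument that is forwarded to build_model. A minimal smoke-test sketch, assuming the vendored package path from this commit and network access to download the ViT-B/16 checkpoint listed in _MODELS:

# Hypothetical smoke test for the vendored loader, not part of the commit.
import torch
from open_vocab_seg.modeling.clip_adapter.clip import available_models, load, tokenize

print(available_models())  # the keys of _MODELS above
model, preprocess = load("ViT-B/16", mask_prompt_depth=3, device="cpu")
text = tokenize(["a photo of a dog", "a photo of a cat"])
with torch.no_grad():
    text_features = model.encode_text(text)  # shape [2, embed_dim]
print(text_features.shape)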
open_vocab_seg/modeling/clip_adapter/clip/model.py ADDED
@@ -0,0 +1,613 @@
1
+ # Copyright (c) Facebook, Inc. and its affiliates.
2
+ # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
+ # Modified by Feng Liang from https://github.com/openai/CLIP/blob/main/clip/model.py
4
+
5
+ from collections import OrderedDict
6
+ from typing import Tuple, Union
7
+
8
+ import numpy as np
9
+ import torch
10
+ import torch.nn.functional as F
11
+ from torch import nn
12
+
13
+
14
+ class Bottleneck(nn.Module):
15
+ expansion = 4
16
+
17
+ def __init__(self, inplanes, planes, stride=1):
18
+ super().__init__()
19
+
20
+ # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
21
+ self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
22
+ self.bn1 = nn.BatchNorm2d(planes)
23
+
24
+ self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
25
+ self.bn2 = nn.BatchNorm2d(planes)
26
+
27
+ self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()
28
+
29
+ self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
30
+ self.bn3 = nn.BatchNorm2d(planes * self.expansion)
31
+
32
+ self.relu = nn.ReLU(inplace=True)
33
+ self.downsample = None
34
+ self.stride = stride
35
+
36
+ if stride > 1 or inplanes != planes * Bottleneck.expansion:
37
+ # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
38
+ self.downsample = nn.Sequential(
39
+ OrderedDict(
40
+ [
41
+ ("-1", nn.AvgPool2d(stride)),
42
+ (
43
+ "0",
44
+ nn.Conv2d(
45
+ inplanes,
46
+ planes * self.expansion,
47
+ 1,
48
+ stride=1,
49
+ bias=False,
50
+ ),
51
+ ),
52
+ ("1", nn.BatchNorm2d(planes * self.expansion)),
53
+ ]
54
+ )
55
+ )
56
+
57
+ def forward(self, x: torch.Tensor):
58
+ identity = x
59
+
60
+ out = self.relu(self.bn1(self.conv1(x)))
61
+ out = self.relu(self.bn2(self.conv2(out)))
62
+ out = self.avgpool(out)
63
+ out = self.bn3(self.conv3(out))
64
+
65
+ if self.downsample is not None:
66
+ identity = self.downsample(x)
67
+
68
+ out += identity
69
+ out = self.relu(out)
70
+ return out
71
+
72
+
73
+ class AttentionPool2d(nn.Module):
74
+ def __init__(
75
+ self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None
76
+ ):
77
+ super().__init__()
78
+ self.positional_embedding = nn.Parameter(
79
+ torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5
80
+ )
81
+ self.k_proj = nn.Linear(embed_dim, embed_dim)
82
+ self.q_proj = nn.Linear(embed_dim, embed_dim)
83
+ self.v_proj = nn.Linear(embed_dim, embed_dim)
84
+ self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
85
+ self.num_heads = num_heads
86
+ self.grid_size = spacial_dim
87
+
88
+ def forward(self, x, mask=None, return_cls=True):
89
+ b, c, gh, gw = x.shape
90
+ # remove irrelevant features outside the mask
91
+ if mask is not None:
92
+ mask = F.interpolate(mask[:, None, ...], size=(gh, gw)).squeeze(
93
+ 1
94
+ ) # [N,H,W] -> [N,grid,grid]
95
+ mask = (mask > 0.5).reshape(mask.shape[0], -1)
96
+ mask = torch.cat([mask, mask.new_ones(mask.shape[0], 1)], dim=1)
97
+ if x.size()[0] == 1:
98
+ x = x.expand(mask.shape[0], c, gh, gw)
99
+
100
+ x = x.reshape(x.shape[0], c, gh * gw).permute(2, 0, 1) # NCHW -> (HW)NC
101
+
102
+ x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0) # (HW+1)NC
103
+ positional_embedding = self.positional_embedding
104
+ if not (self.positional_embedding.shape[0] == x.shape[0]):
105
+ cls_pos = positional_embedding[0:1, :]
106
+ per_pos_embedding = (
107
+ F.interpolate(
108
+ positional_embedding[1:, :]
109
+ .permute(1, 0)
110
+ .view(1, -1, self.grid_size, self.grid_size),
111
+ size=(gh, gw),
112
+ mode="bicubic",
113
+ )
114
+ .reshape(-1, gh * gw)
115
+ .permute(1, 0)
116
+ )
117
+ positional_embedding = torch.cat([cls_pos, per_pos_embedding])
118
+
119
+ x = x + positional_embedding[:, None, :].to(x.dtype) # (HW+1)NC
120
+ x, _ = F.multi_head_attention_forward(
121
+ query=x,
122
+ key=x,
123
+ value=x,
124
+ embed_dim_to_check=x.shape[-1],
125
+ num_heads=self.num_heads,
126
+ q_proj_weight=self.q_proj.weight,
127
+ k_proj_weight=self.k_proj.weight,
128
+ v_proj_weight=self.v_proj.weight,
129
+ in_proj_weight=None,
130
+ in_proj_bias=torch.cat(
131
+ [self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]
132
+ ),
133
+ bias_k=None,
134
+ bias_v=None,
135
+ add_zero_attn=False,
136
+ dropout_p=0,
137
+ out_proj_weight=self.c_proj.weight,
138
+ out_proj_bias=self.c_proj.bias,
139
+ use_separate_proj_weight=True,
140
+ training=self.training,
141
+ need_weights=False,
142
+ key_padding_mask=mask,
143
+ )
144
+
145
+ if return_cls:
146
+ return x[0]
147
+ else:
148
+ return x
149
+
150
+
151
+ class ModifiedResNet(nn.Module):
152
+ """
153
+ A ResNet class that is similar to torchvision's but contains the following changes:
154
+ - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
155
+ - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
156
+ - The final pooling layer is a QKV attention instead of an average pool
157
+ """
158
+
159
+ def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
160
+ super().__init__()
161
+ self.output_dim = output_dim
162
+ self.input_resolution = input_resolution
163
+
164
+ # the 3-layer stem
165
+ self.conv1 = nn.Conv2d(
166
+ 3, width // 2, kernel_size=3, stride=2, padding=1, bias=False
167
+ )
168
+ self.bn1 = nn.BatchNorm2d(width // 2)
169
+ self.conv2 = nn.Conv2d(
170
+ width // 2, width // 2, kernel_size=3, padding=1, bias=False
171
+ )
172
+ self.bn2 = nn.BatchNorm2d(width // 2)
173
+ self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
174
+ self.bn3 = nn.BatchNorm2d(width)
175
+ self.avgpool = nn.AvgPool2d(2)
176
+ self.relu = nn.ReLU(inplace=True)
177
+
178
+ # residual layers
179
+ self._inplanes = width # this is a *mutable* variable used during construction
180
+ self.layer1 = self._make_layer(width, layers[0])
181
+ self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
182
+ self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
183
+ self.layer4 = self._make_layer(width * 8, layers[3], stride=2)
184
+
185
+ embed_dim = width * 32 # the ResNet feature dimension
186
+ self.attnpool = AttentionPool2d(
187
+ input_resolution // 32, embed_dim, heads, output_dim
188
+ )
189
+
190
+ def _make_layer(self, planes, blocks, stride=1):
191
+ layers = [Bottleneck(self._inplanes, planes, stride)]
192
+
193
+ self._inplanes = planes * Bottleneck.expansion
194
+ for _ in range(1, blocks):
195
+ layers.append(Bottleneck(self._inplanes, planes))
196
+
197
+ return nn.Sequential(*layers)
198
+
199
+ def forward(self, x, mask: torch.Tensor = None, return_cls=True):
200
+ def stem(x):
201
+ for conv, bn in [
202
+ (self.conv1, self.bn1),
203
+ (self.conv2, self.bn2),
204
+ (self.conv3, self.bn3),
205
+ ]:
206
+ x = self.relu(bn(conv(x)))
207
+ x = self.avgpool(x)
208
+ return x
209
+
210
+ x = x.type(self.conv1.weight.dtype)
211
+ x = stem(x) # 1/4,1/4
212
+ x = self.layer1(x)
213
+ x = self.layer2(x) # 1/8,1/8
214
+ x = self.layer3(x) # 1/16,1/16
215
+ x = self.layer4(x) # 1/32,1/32
216
+ b, c, gh, gw = x.shape
217
+ x = self.attnpool(x, mask, return_cls)
218
+ if not return_cls:
219
+ return x[1:].permute(1, 0, 2).reshape(b, gh, gw, x.shape[-1]) # N,L,C
220
+ return x
221
+
222
+
223
+ class LayerNorm(nn.LayerNorm):
224
+ """Subclass torch's LayerNorm to handle fp16."""
225
+
226
+ def forward(self, x: torch.Tensor):
227
+ orig_type = x.dtype
228
+ ret = super().forward(x.type(torch.float32))
229
+ return ret.type(orig_type)
230
+
231
+
232
+ class QuickGELU(nn.Module):
233
+ def forward(self, x: torch.Tensor):
234
+ return x * torch.sigmoid(1.702 * x)
235
+
236
+
237
+ class ResidualAttentionBlock(nn.Module):
238
+ def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
239
+ super().__init__()
240
+
241
+ self.attn = nn.MultiheadAttention(d_model, n_head)
242
+ self.ln_1 = LayerNorm(d_model)
243
+ self.mlp = nn.Sequential(
244
+ OrderedDict(
245
+ [
246
+ ("c_fc", nn.Linear(d_model, d_model * 4)),
247
+ ("gelu", QuickGELU()),
248
+ ("c_proj", nn.Linear(d_model * 4, d_model)),
249
+ ]
250
+ )
251
+ )
252
+ self.ln_2 = LayerNorm(d_model)
253
+ self.attn_mask = attn_mask
254
+
255
+ def attention(self, x: torch.Tensor, **kwargs):
256
+ self.attn_mask = (
257
+ self.attn_mask.to(dtype=x.dtype, device=x.device)
258
+ if self.attn_mask is not None
259
+ else None
260
+ )
261
+ return self.attn(
262
+ x, x, x, need_weights=False, attn_mask=self.attn_mask, **kwargs
263
+ )[0]
264
+
265
+ def forward(self, x: torch.Tensor, **kwargs):
266
+ x = x + self.attention(self.ln_1(x), **kwargs)
267
+ x = x + self.mlp(self.ln_2(x))
268
+ return x
269
+
270
+
271
+ class Transformer(nn.Module):
272
+ def __init__(
273
+ self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None
274
+ ):
275
+ super().__init__()
276
+ self.width = width
277
+ self.layers = layers
278
+ self.resblocks = nn.Sequential(
279
+ *[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)]
280
+ )
281
+
282
+ def forward(self, x: torch.Tensor, **kwargs):
283
+ for block in self.resblocks:
284
+ x = block(x, **kwargs)
285
+ return x
286
+
287
+
288
+ class VisionTransformer(nn.Module):
289
+ def __init__(
290
+ self,
291
+ input_resolution: int,
292
+ patch_size: int,
293
+ mask_prompt_depth: int,
294
+ width: int,
295
+ layers: int,
296
+ heads: int,
297
+ output_dim: int,
298
+ ):
299
+ super().__init__()
300
+ self.input_resolution = input_resolution
301
+ self.output_dim = output_dim
302
+ self.conv1 = nn.Conv2d(
303
+ in_channels=3,
304
+ out_channels=width,
305
+ kernel_size=patch_size,
306
+ stride=patch_size,
307
+ bias=False,
308
+ )
309
+
310
+ scale = width ** -0.5
311
+ self.class_embedding = nn.Parameter(scale * torch.randn(width))
312
+ self.positional_embedding = nn.Parameter(
313
+ scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
314
+ )
315
+ self.grid_size = input_resolution // patch_size
316
+ self.ln_pre = LayerNorm(width)
317
+
318
+ self.transformer = Transformer(width, layers, heads)
319
+
320
+ self.ln_post = LayerNorm(width)
321
+ self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
322
+
323
+ self.mask_pool = nn.AvgPool2d(patch_size, stride=patch_size)
324
+ self.mask_prompt_depth = mask_prompt_depth
325
+ self.mask_embedding = nn.Parameter(torch.zeros(self.mask_prompt_depth, self.grid_size * self.grid_size, width))
326
+
327
+ def forward(self, x: torch.Tensor, m: torch.Tensor = None):
328
+ x = self.conv1(x) # shape = [*, width, grid, grid]
329
+ x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
330
+ x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
331
+ if m is not None:
332
+ m = self.mask_pool(m.to(torch.float).squeeze()).reshape(m.shape[0], -1).unsqueeze(-1)
333
+ m = torch.ceil(m)
334
+ if self.mask_embedding.shape[1] == 1:
335
+ mask_embedding = self.mask_embedding.to(x.dtype).repeat(1, x.shape[1], 1)
336
+ else:
337
+ mask_embedding = self.mask_embedding.to(x.dtype)
338
+ x = x * m + mask_embedding[0].unsqueeze(0) * (1 - m)
339
+
340
+ x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width]
341
+ x = x + self.positional_embedding.to(x.dtype)
342
+ x = self.ln_pre(x)
343
+
344
+ x = x.permute(1, 0, 2) # NLD -> LND
345
+ if m is not None:
346
+ for i, blk in enumerate(self.transformer.resblocks):
347
+ d = i + 1
348
+ x = blk(x)
349
+ if d < self.mask_prompt_depth:
350
+ masked_x = x[1:, :, :] * m.permute(1, 0, 2) + \
351
+ mask_embedding[d].unsqueeze(0).permute(1, 0, 2) * (1 - m.permute(1, 0, 2))
352
+ x = torch.cat([x[:1, :, :], masked_x], dim=0)
353
+ else:
354
+ x = self.transformer(x)
355
+ x = x.permute(1, 0, 2) # LND -> NLD
356
+
357
+ x = self.ln_post(x[:, 0, :])
358
+
359
+ if self.proj is not None:
360
+ x = x @ self.proj
361
+
362
+ return x
363
+
364
+
365
+
366
+ class CLIP(nn.Module):
367
+ def __init__(
368
+ self,
369
+ embed_dim: int,
370
+ # vision
371
+ image_resolution: int,
372
+ vision_layers: Union[Tuple[int, int, int, int], int],
373
+ vision_width: int,
374
+ vision_patch_size: int,
375
+ mask_prompt_depth: int,
376
+ # text
377
+ context_length: int,
378
+ vocab_size: int,
379
+ transformer_width: int,
380
+ transformer_heads: int,
381
+ transformer_layers: int,
382
+ ):
383
+ super().__init__()
384
+
385
+ self.context_length = context_length
386
+
387
+ if isinstance(vision_layers, (tuple, list)):
388
+ vision_heads = vision_width * 32 // 64
389
+ self.visual = ModifiedResNet(
390
+ layers=vision_layers,
391
+ output_dim=embed_dim,
392
+ heads=vision_heads,
393
+ input_resolution=image_resolution,
394
+ width=vision_width,
395
+ )
396
+ else:
397
+ vision_heads = vision_width // 64
398
+ self.visual = VisionTransformer(
399
+ input_resolution=image_resolution,
400
+ patch_size=vision_patch_size,
401
+ mask_prompt_depth=mask_prompt_depth,
402
+ width=vision_width,
403
+ layers=vision_layers,
404
+ heads=vision_heads,
405
+ output_dim=embed_dim,
406
+ )
407
+
408
+ self.transformer = Transformer(
409
+ width=transformer_width,
410
+ layers=transformer_layers,
411
+ heads=transformer_heads,
412
+ attn_mask=self.build_attention_mask(),
413
+ )
414
+
415
+ self.vocab_size = vocab_size
416
+ self.token_embedding = nn.Embedding(vocab_size, transformer_width)
417
+ self.positional_embedding = nn.Parameter(
418
+ torch.empty(self.context_length, transformer_width)
419
+ )
420
+ self.ln_final = LayerNorm(transformer_width)
421
+
422
+ self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
423
+ self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
424
+
425
+ self.initialize_parameters()
426
+
427
+ def initialize_parameters(self):
428
+ nn.init.normal_(self.token_embedding.weight, std=0.02)
429
+ nn.init.normal_(self.positional_embedding, std=0.01)
430
+
431
+ if isinstance(self.visual, ModifiedResNet):
432
+ if self.visual.attnpool is not None:
433
+ std = self.visual.attnpool.c_proj.in_features ** -0.5
434
+ nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
435
+ nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
436
+ nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
437
+ nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)
438
+
439
+ for resnet_block in [
440
+ self.visual.layer1,
441
+ self.visual.layer2,
442
+ self.visual.layer3,
443
+ self.visual.layer4,
444
+ ]:
445
+ for name, param in resnet_block.named_parameters():
446
+ if name.endswith("bn3.weight"):
447
+ nn.init.zeros_(param)
448
+
449
+ proj_std = (self.transformer.width ** -0.5) * (
450
+ (2 * self.transformer.layers) ** -0.5
451
+ )
452
+ attn_std = self.transformer.width ** -0.5
453
+ fc_std = (2 * self.transformer.width) ** -0.5
454
+ for block in self.transformer.resblocks:
455
+ nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
456
+ nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
457
+ nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
458
+ nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)
459
+
460
+ if self.text_projection is not None:
461
+ nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)
462
+
463
+ def build_attention_mask(self):
464
+ # lazily create causal attention mask, with full attention between the vision tokens
465
+ # pytorch uses additive attention mask; fill with -inf
466
+ mask = torch.empty(self.context_length, self.context_length)
467
+ mask.fill_(float("-inf"))
468
+ mask.triu_(1) # zero out the lower diagonal
469
+ return mask
470
+
471
+ @property
472
+ def dtype(self):
473
+ return self.visual.conv1.weight.dtype
474
+
475
+ def encode_image(self, image, **kwargs):
476
+ return self.visual(image.type(self.dtype), **kwargs)
477
+
478
+ def encode_text(self, text):
479
+ x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]
480
+
481
+ x = x + self.positional_embedding.type(self.dtype)
482
+ x = x.permute(1, 0, 2) # NLD -> LND
483
+ x = self.transformer(x)
484
+ x = x.permute(1, 0, 2) # LND -> NLD
485
+ x = self.ln_final(x).type(self.dtype)
486
+
487
+ # x.shape = [batch_size, n_ctx, transformer.width]
488
+ # take features from the eot embedding (eot_token is the highest number in each sequence)
489
+ x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
490
+
491
+ return x
492
+
493
+ def forward(self, image, text):
494
+ image_features = self.encode_image(image)
495
+ text_features = self.encode_text(text)
496
+
497
+ # normalized features
498
+ image_features = image_features / image_features.norm(dim=-1, keepdim=True)
499
+ text_features = text_features / text_features.norm(dim=-1, keepdim=True)
500
+
501
+ # cosine similarity as logits
502
+ logit_scale = self.logit_scale.exp()
503
+ logits_per_image = logit_scale * image_features @ text_features.t()
504
+ logits_per_text = logit_scale * text_features @ image_features.t()
505
+
506
+ # shape = [global_batch_size, global_batch_size]
507
+ return logits_per_image, logits_per_text
508
+
509
+
510
+ def convert_weights(model: nn.Module):
511
+ """Convert applicable model parameters to fp16"""
512
+
513
+ def _convert_weights_to_fp16(l):
514
+ if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
515
+ l.weight.data = l.weight.data.half()
516
+ if l.bias is not None:
517
+ l.bias.data = l.bias.data.half()
518
+
519
+ if isinstance(l, nn.MultiheadAttention):
520
+ for attr in [
521
+ *[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]],
522
+ "in_proj_bias",
523
+ "bias_k",
524
+ "bias_v",
525
+ ]:
526
+ tensor = getattr(l, attr)
527
+ if tensor is not None:
528
+ tensor.data = tensor.data.half()
529
+
530
+ for name in ["text_projection", "proj"]:
531
+ if hasattr(l, name):
532
+ attr = getattr(l, name)
533
+ if attr is not None:
534
+ attr.data = attr.data.half()
535
+
536
+ model.apply(_convert_weights_to_fp16)
537
+
538
+
539
+ def build_model(state_dict: dict, mask_prompt_depth: int = 0):
540
+ vit = "visual.proj" in state_dict
541
+
542
+ if vit:
543
+ vision_width = state_dict["visual.conv1.weight"].shape[0]
544
+ vision_layers = len(
545
+ [
546
+ k
547
+ for k in state_dict.keys()
548
+ if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")
549
+ ]
550
+ )
551
+ vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
552
+ grid_size = round(
553
+ (state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5
554
+ )
555
+ image_resolution = vision_patch_size * grid_size
556
+ else:
557
+ assert mask_prompt_depth == 0, 'ResNets do not support mask prompt tuning'
558
+ counts: list = [
559
+ len(
560
+ set(
561
+ k.split(".")[2]
562
+ for k in state_dict
563
+ if k.startswith(f"visual.layer{b}")
564
+ )
565
+ )
566
+ for b in [1, 2, 3, 4]
567
+ ]
568
+ vision_layers = tuple(counts)
569
+ vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
570
+ output_width = round(
571
+ (state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5
572
+ )
573
+ vision_patch_size = None
574
+ assert (
575
+ output_width ** 2 + 1
576
+ == state_dict["visual.attnpool.positional_embedding"].shape[0]
577
+ )
578
+ image_resolution = output_width * 32
579
+
580
+ embed_dim = state_dict["text_projection"].shape[1]
581
+ context_length = state_dict["positional_embedding"].shape[0]
582
+ vocab_size = state_dict["token_embedding.weight"].shape[0]
583
+ transformer_width = state_dict["ln_final.weight"].shape[0]
584
+ transformer_heads = transformer_width // 64
585
+ transformer_layers = len(
586
+ set(
587
+ k.split(".")[2]
588
+ for k in state_dict
589
+ if k.startswith(f"transformer.resblocks")
590
+ )
591
+ )
592
+
593
+ model = CLIP(
594
+ embed_dim,
595
+ image_resolution,
596
+ vision_layers,
597
+ vision_width,
598
+ vision_patch_size,
599
+ mask_prompt_depth,
600
+ context_length,
601
+ vocab_size,
602
+ transformer_width,
603
+ transformer_heads,
604
+ transformer_layers,
605
+ )
606
+
607
+ for key in ["input_resolution", "context_length", "vocab_size"]:
608
+ if key in state_dict:
609
+ del state_dict[key]
610
+
611
+ convert_weights(model)
612
+ model.load_state_dict(state_dict, strict=False)
613
+ return model.eval()
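build_model infers the architecture (ResNet vs. ViT) from the shapes in the state dict, and VisionTransformer uses mask_prompt_depth to substitute learned mask_embedding rows for masked-out patch tokens in the first few transformer blocks. A small sketch of that forward path with random inputs; the constructor arguments mirror a ViT-B/16 layout but are assumptions, not values read from a real checkpoint:

# Hypothetical forward pass exercising the mask-prompt branch, not part of the commit.
import torch
from open_vocab_seg.modeling.clip_adapter.clip.model import VisionTransformer

vit = VisionTransformer(
    input_resolution=224, patch_size=16, mask_prompt_depth=3,
    width=768, layers=12, heads=12, output_dim=512,
)
images = torch.randn(2, 3, 224, 224)
masks = (torch.rand(2, 1, 224, 224) > 0.5).float()  # binary region masks
feats = vit(images, masks)  # CLS features, shape [2, 512]
print(feats.shape)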
open_vocab_seg/modeling/clip_adapter/clip/simple_tokenizer.py ADDED
@@ -0,0 +1,150 @@
1
+ import gzip
2
+ import html
3
+ import os
4
+ from functools import lru_cache
5
+
6
+ import ftfy
7
+ import regex as re
8
+
9
+
10
+ @lru_cache()
11
+ def default_bpe():
12
+ return os.path.join(
13
+ os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz"
14
+ )
15
+
16
+
17
+ @lru_cache()
18
+ def bytes_to_unicode():
19
+ """
20
+ Returns a mapping from utf-8 bytes to corresponding unicode strings.
21
+ The reversible bpe codes work on unicode strings.
22
+ This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
23
+ When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
24
+ This is a significant percentage of your normal, say, 32K bpe vocab.
25
+ To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
26
+ It also avoids mapping to whitespace/control characters that the bpe code barfs on.
27
+ """
28
+ bs = (
29
+ list(range(ord("!"), ord("~") + 1))
30
+ + list(range(ord("¡"), ord("¬") + 1))
31
+ + list(range(ord("®"), ord("ÿ") + 1))
32
+ )
33
+ cs = bs[:]
34
+ n = 0
35
+ for b in range(2 ** 8):
36
+ if b not in bs:
37
+ bs.append(b)
38
+ cs.append(2 ** 8 + n)
39
+ n += 1
40
+ cs = [chr(n) for n in cs]
41
+ return dict(zip(bs, cs))
42
+
43
+
44
+ def get_pairs(word):
45
+ """Return set of symbol pairs in a word.
46
+ Word is represented as tuple of symbols (symbols being variable-length strings).
47
+ """
48
+ pairs = set()
49
+ prev_char = word[0]
50
+ for char in word[1:]:
51
+ pairs.add((prev_char, char))
52
+ prev_char = char
53
+ return pairs
54
+
55
+
56
+ def basic_clean(text):
57
+ text = ftfy.fix_text(text)
58
+ text = html.unescape(html.unescape(text))
59
+ return text.strip()
60
+
61
+
62
+ def whitespace_clean(text):
63
+ text = re.sub(r"\s+", " ", text)
64
+ text = text.strip()
65
+ return text
66
+
67
+
68
+ class SimpleTokenizer(object):
69
+ def __init__(self, bpe_path: str = default_bpe()):
70
+ self.byte_encoder = bytes_to_unicode()
71
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
72
+ merges = gzip.open(bpe_path).read().decode("utf-8").split("\n")
73
+ merges = merges[1 : 49152 - 256 - 2 + 1]
74
+ merges = [tuple(merge.split()) for merge in merges]
75
+ vocab = list(bytes_to_unicode().values())
76
+ vocab = vocab + [v + "</w>" for v in vocab]
77
+ for merge in merges:
78
+ vocab.append("".join(merge))
79
+ vocab.extend(["<|startoftext|>", "<|endoftext|>"])
80
+ self.encoder = dict(zip(vocab, range(len(vocab))))
81
+ self.decoder = {v: k for k, v in self.encoder.items()}
82
+ self.bpe_ranks = dict(zip(merges, range(len(merges))))
83
+ self.cache = {
84
+ "<|startoftext|>": "<|startoftext|>",
85
+ "<|endoftext|>": "<|endoftext|>",
86
+ }
87
+ self.pat = re.compile(
88
+ r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
89
+ re.IGNORECASE,
90
+ )
91
+
92
+ def bpe(self, token):
93
+ if token in self.cache:
94
+ return self.cache[token]
95
+ word = tuple(token[:-1]) + (token[-1] + "</w>",)
96
+ pairs = get_pairs(word)
97
+
98
+ if not pairs:
99
+ return token + "</w>"
100
+
101
+ while True:
102
+ bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
103
+ if bigram not in self.bpe_ranks:
104
+ break
105
+ first, second = bigram
106
+ new_word = []
107
+ i = 0
108
+ while i < len(word):
109
+ try:
110
+ j = word.index(first, i)
111
+ new_word.extend(word[i:j])
112
+ i = j
113
+ except:
114
+ new_word.extend(word[i:])
115
+ break
116
+
117
+ if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
118
+ new_word.append(first + second)
119
+ i += 2
120
+ else:
121
+ new_word.append(word[i])
122
+ i += 1
123
+ new_word = tuple(new_word)
124
+ word = new_word
125
+ if len(word) == 1:
126
+ break
127
+ else:
128
+ pairs = get_pairs(word)
129
+ word = " ".join(word)
130
+ self.cache[token] = word
131
+ return word
132
+
133
+ def encode(self, text):
134
+ bpe_tokens = []
135
+ text = whitespace_clean(basic_clean(text)).lower()
136
+ for token in re.findall(self.pat, text):
137
+ token = "".join(self.byte_encoder[b] for b in token.encode("utf-8"))
138
+ bpe_tokens.extend(
139
+ self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
140
+ )
141
+ return bpe_tokens
142
+
143
+ def decode(self, tokens):
144
+ text = "".join([self.decoder[token] for token in tokens])
145
+ text = (
146
+ bytearray([self.byte_decoder[c] for c in text])
147
+ .decode("utf-8", errors="replace")
148
+ .replace("</w>", " ")
149
+ )
150
+ return text
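SimpleTokenizer is the standard CLIP BPE tokenizer; it reads its merges from the bundled bpe_simple_vocab_16e6.txt.gz and requires ftfy and regex, which are added to requirements.txt later in this commit. A quick round-trip sketch under those assumptions:

# Hypothetical encode/decode round trip, not part of the commit.
from open_vocab_seg.modeling.clip_adapter.clip.simple_tokenizer import SimpleTokenizer

tok = SimpleTokenizer()
ids = tok.encode("A photo of a cat.")
print(ids)              # BPE ids without start/end tokens
print(tok.decode(ids))  # roughly "a photo of a cat ." (lower-cased, re-spaced)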
open_vocab_seg/modeling/clip_adapter/text_template.py CHANGED
@@ -6,7 +6,8 @@
6
 
7
  from typing import List
8
 
9
- import clip
 
10
  import torch
11
  from torch import nn
12
 
@@ -130,7 +131,7 @@ class PredefinedPromptExtractor(PromptExtractor):
130
  def forward(self, noun_list: List[str], clip_model: nn.Module):
131
  text_features_bucket = []
132
  for template in self.templates:
133
- noun_tokens = [clip.tokenize(template.format(noun)) for noun in noun_list]
134
  text_inputs = torch.cat(noun_tokens).to(
135
  clip_model.text_projection.data.device
136
  )
6
 
7
  from typing import List
8
 
9
+ # import clip
10
+ from .clip import tokenize
11
  import torch
12
  from torch import nn
13
 
131
  def forward(self, noun_list: List[str], clip_model: nn.Module):
132
  text_features_bucket = []
133
  for template in self.templates:
134
+ noun_tokens = [tokenize(template.format(noun)) for noun in noun_list]
135
  text_inputs = torch.cat(noun_tokens).to(
136
  clip_model.text_projection.data.device
137
  )
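The updated forward tokenizes with the vendored tokenize instead of the external clip package; the per-template expansion it performs amounts to the loop below. The template and noun lists here are illustrative stand-ins, not the repository's defaults:

# Hypothetical illustration of the template-by-noun tokenization, not part of the commit.
import torch
from open_vocab_seg.modeling.clip_adapter.clip import tokenize

templates = ["a photo of a {}.", "a painting of a {}."]
nouns = ["cat", "dog", "tree"]
for template in templates:
    noun_tokens = [tokenize(template.format(noun)) for noun in nouns]
    text_inputs = torch.cat(noun_tokens)  # shape [len(nouns), 77]
    print(template, text_inputs.shape)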
open_vocab_seg/modeling/clip_adapter/utils.py CHANGED
@@ -4,7 +4,7 @@
4
  from typing import Tuple
5
  import numpy as np
6
  import torch
7
- import clip
8
  from detectron2.utils.comm import get_local_rank, synchronize
9
 
10
 
@@ -70,10 +70,10 @@ def build_clip_model(model: str, mask_prompt_depth: int = 0, frozen: bool = True
70
  rank = get_local_rank()
71
  if rank == 0:
72
  # download on rank 0 only
73
- model, _ = clip.load(model, mask_prompt_depth=mask_prompt_depth, device="cpu")
74
  synchronize()
75
  if rank != 0:
76
- model, _ = clip.load(model, mask_prompt_depth=mask_prompt_depth, device="cpu")
77
  synchronize()
78
  if frozen:
79
  for param in model.parameters():
4
  from typing import Tuple
5
  import numpy as np
6
  import torch
7
+ from .clip import load as clip_load
8
  from detectron2.utils.comm import get_local_rank, synchronize
9
 
10
 
70
  rank = get_local_rank()
71
  if rank == 0:
72
  # download on rank 0 only
73
+ model, _ = clip_load(model, mask_prompt_depth=mask_prompt_depth, device="cpu")
74
  synchronize()
75
  if rank != 0:
76
+ model, _ = clip_load(model, mask_prompt_depth=mask_prompt_depth, device="cpu")
77
  synchronize()
78
  if frozen:
79
  for param in model.parameters():
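build_clip_model keeps its rank-aware download order but now routes through the vendored loader: local rank 0 downloads (or hits the cache) first, every other rank waits on synchronize(), then loads from the shared cache. A generic sketch of that pattern using clip_load directly; the wrapper function name is hypothetical and the function above remains the authoritative version:

# Hypothetical rank-aware loading pattern, not part of the commit.
from detectron2.utils.comm import get_local_rank, synchronize
from open_vocab_seg.modeling.clip_adapter.clip import load as clip_load

def load_clip_on_every_rank(name="ViT-L/14", mask_prompt_depth=0):
    model = None
    if get_local_rank() == 0:
        # download (or hit the local cache) on rank 0 only
        model, _ = clip_load(name, mask_prompt_depth=mask_prompt_depth, device="cpu")
    synchronize()
    if get_local_rank() != 0:
        model, _ = clip_load(name, mask_prompt_depth=mask_prompt_depth, device="cpu")
    synchronize()
    return model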
configs/ovseg_swinB_vitL_demo.yaml → ovseg_swinB_vitL_demo.yaml RENAMED
@@ -12,7 +12,7 @@ MODEL:
12
  DROP_PATH_RATE: 0.3
13
  PATCH_NORM: True
14
  PRETRAIN_IMG_SIZE: 384
15
- WEIGHTS: "swin_base_patch4_window12_384_22k.pkl"
16
  PIXEL_MEAN: [123.675, 116.280, 103.530]
17
  PIXEL_STD: [58.395, 57.120, 57.375]
18
  SEM_SEG_HEAD:
12
  DROP_PATH_RATE: 0.3
13
  PATCH_NORM: True
14
  PRETRAIN_IMG_SIZE: 384
15
+ WEIGHTS: "./ovseg_swinbase_vitL14_ft_mpt.pth"
16
  PIXEL_MEAN: [123.675, 116.280, 103.530]
17
  PIXEL_STD: [58.395, 57.120, 57.375]
18
  SEM_SEG_HEAD:
requirements.txt CHANGED
@@ -7,8 +7,14 @@ wandb
7
  fire
8
  opencv-python
9
  pandas
10
- torch==1.10.1
11
- torchvision==0.11.2
 
 
 
 
 
 
12
 
13
  # Detectron
14
  --find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
7
  fire
8
  opencv-python
9
  pandas
10
+ ftfy
11
+ regex
12
+ tqdm
13
+ gdown
14
+ # Torch
15
+ --find-links https://download.pytorch.org/whl/cu113/torch_stable.html
16
+ torch==1.10.1+cu113
17
+ torchvision==0.11.2+cu113
18
 
19
  # Detectron
20
  --find-links https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
resources/demo_samples/sample_01.jpeg ADDED

Git LFS Details

  • SHA256: 154943906b5ed394b620da62124c4421dfa96f858f014839eb346678aaa71fc3
  • Pointer size: 132 Bytes
  • Size of remote file: 4.32 MB
resources/demo_samples/sample_02.jpeg ADDED

Git LFS Details

  • SHA256: 591c2bf26a843a62881d89dbd7f4e9a6f90dda9fb8786c9b6e5172a28623d1b0
  • Pointer size: 132 Bytes
  • Size of remote file: 1.84 MB
tools/convert-pretrained-clip-model-to-d2.py DELETED
@@ -1,69 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import pickle as pkl
5
- import sys
6
-
7
- import torch
8
-
9
- """
10
- Usage:
11
- # download pretrained swin model:
12
- wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
13
- # run the conversion
14
- ./convert-pretrained-model-to-d2.py swin_tiny_patch4_window7_224.pth swin_tiny_patch4_window7_224.pkl
15
- # Then, use swin_tiny_patch4_window7_224.pkl with the following changes in config:
16
- MODEL:
17
- WEIGHTS: "/path/to/swin_tiny_patch4_window7_224.pkl"
18
- INPUT:
19
- FORMAT: "RGB"
20
- """
21
-
22
-
23
- def transform(path):
24
- model = torch.load(path, map_location="cpu")
25
- print(f"loading {path}......")
26
- state_dict = model["model"]
27
- state_dict = {
28
- k.replace("visual_model.", ""): v
29
- for k, v in state_dict.items()
30
- if k.startswith("visual_model")
31
- }
32
- source_keys = [k for k in state_dict.keys() if "relative_coords" in k]
33
- for k in source_keys:
34
- state_dict[
35
- k.replace("relative_coords", "relative_position_index")
36
- ] = state_dict[k]
37
- del state_dict[k]
38
-
39
- source_keys = [k for k in state_dict.keys() if "atten_mask_matrix" in k]
40
- for k in source_keys:
41
- state_dict[k.replace("atten_mask_matrix", "attn_mask")] = state_dict[k]
42
- del state_dict[k]
43
-
44
- source_keys = [k for k in state_dict.keys() if "rel_pos_embed_table" in k]
45
- for k in source_keys:
46
- state_dict[
47
- k.replace("rel_pos_embed_table", "relative_position_bias_table")
48
- ] = state_dict[k]
49
- del state_dict[k]
50
-
51
- source_keys = [k for k in state_dict.keys() if "channel_reduction" in k]
52
- for k in source_keys:
53
- state_dict[k.replace("channel_reduction", "reduction")] = state_dict[k]
54
- del state_dict[k]
55
- return {
56
- k if k.startswith("backbone.") else "backbone." + k: v
57
- for k, v in state_dict.items()
58
- }
59
-
60
-
61
- if __name__ == "__main__":
62
- input = sys.argv[1]
63
- res = {
64
- "model": transform(input),
65
- "__author__": "third_party",
66
- "matching_heuristics": True,
67
- }
68
- with open(sys.argv[2], "wb") as f:
69
- pkl.dump(res, f)
tools/convert-pretrained-swin-model-to-d2.py DELETED
@@ -1,30 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import pickle as pkl
5
- import sys
6
-
7
- import torch
8
-
9
- """
10
- Usage:
11
- # download pretrained swin model:
12
- wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
13
- # run the conversion
14
- ./convert-pretrained-model-to-d2.py swin_tiny_patch4_window7_224.pth swin_tiny_patch4_window7_224.pkl
15
- # Then, use swin_tiny_patch4_window7_224.pkl with the following changes in config:
16
- MODEL:
17
- WEIGHTS: "/path/to/swin_tiny_patch4_window7_224.pkl"
18
- INPUT:
19
- FORMAT: "RGB"
20
- """
21
-
22
- if __name__ == "__main__":
23
- input = sys.argv[1]
24
-
25
- obj = torch.load(input, map_location="cpu")["model"]
26
-
27
- res = {"model": obj, "__author__": "third_party", "matching_heuristics": True}
28
-
29
- with open(sys.argv[2], "wb") as f:
30
- pkl.dump(res, f)
tools/convert-torchvision-to-d2.py DELETED
@@ -1,54 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import pickle as pkl
5
- import sys
6
-
7
- import torch
8
-
9
- """
10
- Usage:
11
- # download one of the ResNet{18,34,50,101,152} models from torchvision:
12
- wget https://download.pytorch.org/models/resnet50-19c8e357.pth -O r50.pth
13
- # run the conversion
14
- ./convert-torchvision-to-d2.py r50.pth r50.pkl
15
- # Then, use r50.pkl with the following changes in config:
16
- MODEL:
17
- WEIGHTS: "/path/to/r50.pkl"
18
- PIXEL_MEAN: [123.675, 116.280, 103.530]
19
- PIXEL_STD: [58.395, 57.120, 57.375]
20
- RESNETS:
21
- DEPTH: 50
22
- STRIDE_IN_1X1: False
23
- INPUT:
24
- FORMAT: "RGB"
25
- These models typically produce slightly worse results than the
26
- pre-trained ResNets we use in official configs, which are the
27
- original ResNet models released by MSRA.
28
- """
29
-
30
- if __name__ == "__main__":
31
- input = sys.argv[1]
32
-
33
- obj = torch.load(input, map_location="cpu")
34
-
35
- newmodel = {}
36
- for k in list(obj.keys()):
37
- old_k = k
38
- if "layer" not in k:
39
- k = "stem." + k
40
- for t in [1, 2, 3, 4]:
41
- k = k.replace("layer{}".format(t), "res{}".format(t + 1))
42
- for t in [1, 2, 3]:
43
- k = k.replace("bn{}".format(t), "conv{}.norm".format(t))
44
- k = k.replace("downsample.0", "shortcut")
45
- k = k.replace("downsample.1", "shortcut.norm")
46
- print(old_k, "->", k)
47
- newmodel[k] = obj.pop(old_k).detach().numpy()
48
-
49
- res = {"model": newmodel, "__author__": "torchvision", "matching_heuristics": True}
50
-
51
- with open(sys.argv[2], "wb") as f:
52
- pkl.dump(res, f)
53
- if obj:
54
- print("Unconverted keys:", obj.keys())
tools/ovseg_replace_clip.py DELETED
@@ -1,30 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import torch
5
- from collections import OrderedDict
6
-
7
-
8
- # PATH to new clip model
9
- clip_ckpt = torch.load('xx/open_clip/src/logs/2022_xx/checkpoints/epoch_x.pt')
10
-
11
- new_model = OrderedDict()
12
- state_dict = clip_ckpt['state_dict']
13
-
14
- for k, v in state_dict.items():
15
- new_key = k.replace('module.','')
16
- new_model[new_key] = v
17
-
18
- # PATH to trained ovseg model
19
- ovseg_model = torch.load('xx/ovseg/output/model_final.pth', 'cpu')
20
-
21
- for k, v in new_model.items():
22
- new_k = 'clip_adapter.clip_model.' + k
23
- if new_k in ovseg_model['model'].keys():
24
- ovseg_model['model'][new_k] = v
25
- else:
26
- print(f'{new_k} does not exist in ckpt')
27
-
28
- # ovseg_model['model']['clip_adapter.clip_model.visual.mask_embedding'] = new_model['visual.mask_embedding']
29
-
30
- torch.save(ovseg_model, 'xx/ovseg/output/ovseg_ft_mpt.pth')
tools/search_thr_ensemble_w.sh DELETED
@@ -1,11 +0,0 @@
1
- for MASK_THR in 0.35 0.4 0.45
2
- do
3
- for ENSEMBLE_WEIGHT in 0.6 0.65 0.7 0.75 0.8
4
- do
5
- python train_net.py --num-gpu 8 --eval-only --config-file configs/ovseg_swinB_vitL_bs32_120k.yaml \
6
- MODEL.WEIGHTS #PATH_of_ovseg_swinbase_vitL14_ft_mpt.pth DATASETS.TEST \(\"ade20k_sem_seg_val\"\) \
7
- MODEL.CLIP_ADAPTER.CLIP_ENSEMBLE_WEIGHT $ENSEMBLE_WEIGHT MODEL.CLIP_ADAPTER.MASK_THR $MASK_THR
8
- done
9
- done
10
-
11
-
tools/web_demo.py DELETED
@@ -1,76 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
-
4
- import multiprocessing as mp
5
-
6
- import numpy as np
7
- from PIL import Image
8
-
9
- from detectron2.config import get_cfg
10
-
11
- from detectron2.projects.deeplab import add_deeplab_config
12
- from detectron2.data.detection_utils import read_image
13
- from open_vocab_seg import add_ovseg_config
14
- from open_vocab_seg.utils import VisualizationDemo
15
-
16
- import gradio as gr
17
-
18
- def setup_cfg(config_file):
19
- # load config from file and command-line arguments
20
- cfg = get_cfg()
21
- add_deeplab_config(cfg)
22
- add_ovseg_config(cfg)
23
- cfg.merge_from_file(config_file)
24
- cfg.freeze()
25
- return cfg
26
-
27
-
28
- def inference(class_names, input_img):
29
- mp.set_start_method("spawn", force=True)
30
- config_file = './configs/ovseg_swinB_vitL_demo.yaml'
31
- cfg = setup_cfg(config_file)
32
-
33
- demo = VisualizationDemo(cfg)
34
-
35
- class_names = class_names.split(',')
36
- img = read_image(input_img, format="BGR")
37
- _, visualized_output = demo.run_on_image(img, class_names)
38
-
39
- return Image.fromarray(np.uint8(visualized_output.get_image())).convert('RGB')
40
-
41
- # demo = gr.Interface(fn=greet, inputs="text", outputs="text")
42
- # demo.launch()
43
-
44
-
45
- examples = [['Oculus, Ukulele', './resources/demo_samples/sample_03.jpeg'],]
46
- output_labels = ['segmentation map']
47
-
48
- title = 'OVSeg'
49
-
50
- description = """
51
- Gradio Demo for Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP \n
52
- You may click one of the examples or upload your own image. \n
53
- OVSeg can perform open-vocabulary segmentation; you may input more classes (separated by commas).
54
- """
55
-
56
- article = """
57
- <p style='text-align: center'>
58
- <a href='https://arxiv.org/abs/2210.04150' target='_blank'>
59
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
60
- </a>
61
- |
62
- <a href='https://github.com' target='_blank'>Github Repo</a></p>
63
- """
64
-
65
- gr.Interface(
66
- inference,
67
- inputs=[
68
- gr.inputs.Textbox(
69
- lines=1, placeholder=None, default='', label='class names'),
70
- gr.inputs.Image(type='filepath')
71
- ],
72
- outputs=gr.outputs.Image(label='segmentation map'),
73
- title=title,
74
- description=description,
75
- article=article,
76
- examples=examples).launch(enable_queue=True)
train_net.py DELETED
@@ -1,309 +0,0 @@
1
- # Copyright (c) Facebook, Inc. and its affiliates.
2
- # Copyright (c) Meta Platforms, Inc. All Rights Reserved
3
- # Modified by Feng Liang from https://github.com/MendelXu/zsseg.baseline/blob/master/train_net.py
4
-
5
- """
6
- OVSeg Training Script.
7
-
8
- This script is a simplified version of the training script in detectron2/tools.
9
- """
10
- import copy
11
- import itertools
12
- import logging
13
- import os
14
- from collections import OrderedDict
15
- from typing import Any, Dict, List, Set
16
-
17
- import detectron2.utils.comm as comm
18
- import torch
19
- from detectron2.checkpoint import DetectionCheckpointer
20
- from detectron2.config import get_cfg
21
- from detectron2.data import MetadataCatalog
22
- from detectron2.engine import (
23
- DefaultTrainer,
24
- default_argument_parser,
25
- default_setup,
26
- launch,
27
- )
28
- from detectron2.evaluation import (
29
- DatasetEvaluator,
30
- CityscapesSemSegEvaluator,
31
- COCOEvaluator,
32
- DatasetEvaluators,
33
- verify_results,
34
- )
35
- from detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler
36
- from detectron2.solver.build import maybe_add_gradient_clipping
37
- from detectron2.utils.logger import setup_logger
38
- from detectron2.utils.events import CommonMetricPrinter, JSONWriter
39
-
40
- # OVSeg
41
- from open_vocab_seg import SemanticSegmentorWithTTA, add_ovseg_config
42
- from open_vocab_seg.data import (
43
- MaskFormerSemanticDatasetMapper,
44
- )
45
-
46
- from open_vocab_seg.data import (
47
- build_detection_test_loader,
48
- build_detection_train_loader,
49
- )
50
- from open_vocab_seg.evaluation import (
51
- GeneralizedSemSegEvaluator,
52
- )
53
- from open_vocab_seg.utils.events import WandbWriter, setup_wandb
54
- from open_vocab_seg.utils.post_process_utils import dense_crf_post_process
55
-
56
-
- class Trainer(DefaultTrainer):
-     """
-     Extension of the DefaultTrainer class adapted for OVSeg training.
-     """
-
-     @classmethod
-     def build_evaluator(cls, cfg, dataset_name, output_folder=None):
-         """
-         Create evaluator(s) for a given dataset.
-         This uses the special metadata "evaluator_type" associated with each
-         builtin dataset. For your own dataset, you can simply create an
-         evaluator manually in your script and do not have to worry about the
-         hacky if-else logic here.
-         """
-         if output_folder is None:
-             output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
-         evaluator_list = []
-         evaluator_type = MetadataCatalog.get(dataset_name).evaluator_type
-         if evaluator_type in ["sem_seg"]:
-             evaluator = GeneralizedSemSegEvaluator
-             evaluator_list.append(
-                 evaluator(
-                     dataset_name,
-                     distributed=True,
-                     output_dir=output_folder,
-                     post_process_func=dense_crf_post_process
-                     if cfg.TEST.DENSE_CRF
-                     else None,
-                 )
-             )
-
-         if len(evaluator_list) == 0:
-             raise NotImplementedError(
-                 "no Evaluator for the dataset {} with the type {}".format(
-                     dataset_name, evaluator_type
-                 )
-             )
-         elif len(evaluator_list) == 1:
-             return evaluator_list[0]
-         return DatasetEvaluators(evaluator_list)
-
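As the docstring above notes, a custom dataset does not need this if/else dispatch; the evaluator can be built directly. A minimal sketch, assuming a registered semantic-segmentation dataset named `my_sem_seg_val` (hypothetical) and an already-built `cfg` and `model`:

```python
# "my_sem_seg_val" is a stand-in for your own registered dataset name.
evaluator = GeneralizedSemSegEvaluator(
    "my_sem_seg_val",
    distributed=True,
    output_dir=os.path.join(cfg.OUTPUT_DIR, "inference"),
    post_process_func=None,  # or dense_crf_post_process, mirroring cfg.TEST.DENSE_CRF above
)
results = Trainer.test(cfg, model, evaluators=[evaluator])
```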
-     @classmethod
-     def build_train_loader(cls, cfg):
-         dataset = None
-         # Semantic segmentation dataset mapper
-         if cfg.INPUT.DATASET_MAPPER_NAME == "mask_former_semantic":
-             mapper = MaskFormerSemanticDatasetMapper(cfg, True)
-         else:
-             raise NotImplementedError
-         return build_detection_train_loader(cfg, mapper=mapper, dataset=dataset)
-
-     @classmethod
-     def build_test_loader(cls, cfg, dataset_name):
-         """
-         Returns:
-             iterable
-         It now calls :func:`open_vocab_seg.data.build_detection_test_loader`.
-         Overwrite it if you'd like a different data loader.
-         """
-         return build_detection_test_loader(cfg, dataset_name, mapper=None)
-
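The test loader above passes `mapper=None`, so the loader picks its default mapper. If a different test-time mapper were needed, the method could be overridden as its docstring suggests; a sketch inside a `Trainer` subclass, using detectron2's generic `DatasetMapper` as an illustrative (not shipped) choice:

```python
from detectron2.data import DatasetMapper

class TrainerWithCustomTestMapper(Trainer):
    @classmethod
    def build_test_loader(cls, cfg, dataset_name):
        # Illustrative override: force detectron2's default test-time mapper
        # instead of whatever the loader would pick for mapper=None.
        mapper = DatasetMapper(cfg, is_train=False)
        return build_detection_test_loader(cfg, dataset_name, mapper=mapper)
```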
-     def build_writers(self):
-         """
-         Build a list of writers to be used. By default it contains
-         writers that write metrics to the screen,
-         a json file, and Weights & Biases, respectively.
-         If you'd like a different list of writers, you can overwrite it in
-         your trainer.
-
-         Returns:
-             list[EventWriter]: a list of :class:`EventWriter` objects.
-
-         It is now implemented by:
-         ::
-             return [
-                 CommonMetricPrinter(self.max_iter),
-                 JSONWriter(os.path.join(self.cfg.OUTPUT_DIR, "metrics.json")),
-                 WandbWriter(),
-             ]
-
-         """
-         # Here the default print/log frequency of each writer is used.
-         return [
-             # It may not always print what you want to see, since it prints "common" metrics only.
-             CommonMetricPrinter(self.max_iter),
-             JSONWriter(os.path.join(self.cfg.OUTPUT_DIR, "metrics.json")),
-             WandbWriter(),
-         ]
-
-     @classmethod
-     def build_lr_scheduler(cls, cfg, optimizer):
-         """
-         It now calls :func:`detectron2.projects.deeplab.build_lr_scheduler`,
-         which adds support for the poly schedule.
-         Overwrite it if you'd like a different scheduler.
-         """
-         return build_lr_scheduler(cfg, optimizer)
-
-     @classmethod
-     def build_optimizer(cls, cfg, model):
-         weight_decay_norm = cfg.SOLVER.WEIGHT_DECAY_NORM
-         weight_decay_embed = cfg.SOLVER.WEIGHT_DECAY_EMBED
-
-         defaults = {}
-         defaults["lr"] = cfg.SOLVER.BASE_LR
-         defaults["weight_decay"] = cfg.SOLVER.WEIGHT_DECAY
-
-         norm_module_types = (
-             torch.nn.BatchNorm1d,
-             torch.nn.BatchNorm2d,
-             torch.nn.BatchNorm3d,
-             torch.nn.SyncBatchNorm,
-             # NaiveSyncBatchNorm inherits from BatchNorm2d
-             torch.nn.GroupNorm,
-             torch.nn.InstanceNorm1d,
-             torch.nn.InstanceNorm2d,
-             torch.nn.InstanceNorm3d,
-             torch.nn.LayerNorm,
-             torch.nn.LocalResponseNorm,
-         )
-
-         params: List[Dict[str, Any]] = []
-         memo: Set[torch.nn.parameter.Parameter] = set()
-         for module_name, module in model.named_modules():
-             for module_param_name, value in module.named_parameters(recurse=False):
-                 if not value.requires_grad:
-                     continue
-                 # Avoid duplicating parameters
-                 if value in memo:
-                     continue
-                 memo.add(value)
-
-                 hyperparams = copy.copy(defaults)
-                 if "backbone" in module_name:
-                     hyperparams["lr"] = (
-                         hyperparams["lr"] * cfg.SOLVER.BACKBONE_MULTIPLIER
-                     )
-                 if (
-                     "relative_position_bias_table" in module_param_name
-                     or "absolute_pos_embed" in module_param_name
-                 ):
-                     print(module_param_name)
-                     hyperparams["weight_decay"] = 0.0
-                 if isinstance(module, norm_module_types):
-                     hyperparams["weight_decay"] = weight_decay_norm
-                 if isinstance(module, torch.nn.Embedding):
-                     hyperparams["weight_decay"] = weight_decay_embed
-                 params.append({"params": [value], **hyperparams})
-
-         def maybe_add_full_model_gradient_clipping(optim):
-             # detectron2 doesn't have full model gradient clipping now
-             clip_norm_val = cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE
-             enable = (
-                 cfg.SOLVER.CLIP_GRADIENTS.ENABLED
-                 and cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == "full_model"
-                 and clip_norm_val > 0.0
-             )
-
-             class FullModelGradientClippingOptimizer(optim):
-                 def step(self, closure=None):
-                     all_params = itertools.chain(
-                         *[x["params"] for x in self.param_groups]
-                     )
-                     torch.nn.utils.clip_grad_norm_(all_params, clip_norm_val)
-                     super().step(closure=closure)
-
-             return FullModelGradientClippingOptimizer if enable else optim
-
-         optimizer_type = cfg.SOLVER.OPTIMIZER
-         if optimizer_type == "SGD":
-             optimizer = maybe_add_full_model_gradient_clipping(torch.optim.SGD)(
-                 params, cfg.SOLVER.BASE_LR, momentum=cfg.SOLVER.MOMENTUM
-             )
-         elif optimizer_type == "ADAMW":
-             optimizer = maybe_add_full_model_gradient_clipping(torch.optim.AdamW)(
-                 params, cfg.SOLVER.BASE_LR
-             )
-         else:
-             raise NotImplementedError(f"no optimizer type {optimizer_type}")
-         if not cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == "full_model":
-             optimizer = maybe_add_gradient_clipping(cfg, optimizer)
-         return optimizer
-
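For orientation, these are the solver keys the optimizer builder above reads. A minimal sketch of setting them via yacs `merge_from_list`, with purely illustrative values (not the project's defaults), assuming an unfrozen `cfg` produced by `get_cfg()`/`add_ovseg_config`:

```python
# Illustrative values only -- the real defaults live in the OVSeg/detectron2 configs.
cfg.merge_from_list([
    "SOLVER.OPTIMIZER", "ADAMW",              # "SGD" is the other supported choice
    "SOLVER.BASE_LR", "0.0001",
    "SOLVER.WEIGHT_DECAY", "0.05",
    "SOLVER.WEIGHT_DECAY_NORM", "0.0",        # applied to norm-layer parameters
    "SOLVER.WEIGHT_DECAY_EMBED", "0.0",       # applied to nn.Embedding parameters
    "SOLVER.BACKBONE_MULTIPLIER", "0.1",      # scales the backbone learning rate
    "SOLVER.CLIP_GRADIENTS.ENABLED", "True",
    "SOLVER.CLIP_GRADIENTS.CLIP_TYPE", "full_model",
    "SOLVER.CLIP_GRADIENTS.CLIP_VALUE", "1.0",
])
```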
-     @classmethod
-     def test_with_TTA(cls, cfg, model):
-         logger = logging.getLogger("detectron2.trainer")
-         # At the end of training, run an evaluation with TTA.
-         logger.info("Running inference with test-time augmentation ...")
-         model = SemanticSegmentorWithTTA(cfg, model)
-         evaluators = [
-             cls.build_evaluator(
-                 cfg, name, output_folder=os.path.join(cfg.OUTPUT_DIR, "inference_TTA")
-             )
-             for name in cfg.DATASETS.TEST
-         ]
-         res = cls.test(cfg, model, evaluators)
-         res = OrderedDict({k + "_TTA": v for k, v in res.items()})
-         return res
-
-
- def setup(args):
-     """
-     Create configs and perform basic setups.
-     """
-     cfg = get_cfg()
-     # for poly lr schedule
-     add_deeplab_config(cfg)
-     add_ovseg_config(cfg)
-     cfg.merge_from_file(args.config_file)
-     cfg.merge_from_list(args.opts)
-     cfg.freeze()
-     default_setup(cfg, args)
-     # Setup logger for "ovseg" module
-     if not args.eval_only:
-         setup_wandb(cfg, args)
-     setup_logger(
-         output=cfg.OUTPUT_DIR, distributed_rank=comm.get_rank(), name="ovseg"
-     )
-     return cfg
-
-
- def main(args):
-     cfg = setup(args)
-
-     if args.eval_only:
-         model = Trainer.build_model(cfg)
-         DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
-             cfg.MODEL.WEIGHTS, resume=args.resume
-         )
-
-         if cfg.TEST.AUG.ENABLED:
-             res = Trainer.test_with_TTA(cfg, model)
-         else:
-             res = Trainer.test(cfg, model)
-         if comm.is_main_process():
-             verify_results(cfg, res)
-         return res
-
-     trainer = Trainer(cfg)
-     trainer.resume_or_load(resume=args.resume)
-     return trainer.train()
-
-
- if __name__ == "__main__":
-     args = default_argument_parser().parse_args()
-     print("Command Line Args:", args)
-     launch(
-         main,
-         args.num_gpus,
-         num_machines=args.num_machines,
-         machine_rank=args.machine_rank,
-         dist_url=args.dist_url,
-         args=(args,),
-     )
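For completeness: the script follows detectron2's launcher conventions. A typical training run would look like `python train_net.py --config-file <path/to/ovseg_config.yaml> --num-gpus 8`, and evaluation like `python train_net.py --config-file <path/to/ovseg_config.yaml> --num-gpus 8 --eval-only MODEL.WEIGHTS <path/to/checkpoint.pth>` (append `TEST.AUG.ENABLED True` to exercise the TTA branch above). The config and checkpoint paths are placeholders, not files shipped in this commit.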