---
license: cc-by-nc-4.0
pipeline_tag: robotics
---

# Model Card: VC-1 (Visual Cortex ViT-Large)

Last updated: 2023-04-07

Version: 1.0

Code: https://github.com/facebookresearch/eai-vc

Other Links: VC-1 Website, VC-1 Blogpost, VC-1 Paper, VC-1 Demo

The VC-1 model is a vision transformer (ViT) pre-trained on over 4,000 hours of egocentric videos from 7 different sources, together with ImageNet. The model is trained using Masked Auto-Encoding (MAE) and is available in two sizes: ViT-B and ViT-L. The model is intended for use in EmbodiedAI tasks, such as object manipulation and indoor navigation.

* VC-1 (ViT-L): Our best model, which uses a ViT-L backbone; also known simply as `VC-1` | [Download](https://dl.fbaipublicfiles.com/eai-vc/vc1_vitl.pth)
* VC-1-base (ViT-B): pre-trained on the same data as VC-1 but with a smaller backbone (ViT-B) | [Download](https://dl.fbaipublicfiles.com/eai-vc/vc1_vitb.pth)
16

## Model Details

- Model Name: VC-1 (Vision Transformer-based model)
- Architecture:
  - Patch size: 16x16
  - Embedding dimension: 1024
  - Number of layers: 24
  - Number of heads: 16
  - MLP ratio: 4
  - QKV bias: True
  - Layer normalization: eps=1e-6
- Inputs: Images presented as 224x224x3.
- Outputs: 1024x1 embedding.
- Image Size: 224
- Use of Classification Token: True
- Dropout Rate: 0.0
- Algorithm: MAE
- Epochs trained: 182
- Model authors: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier.
- Person of Contact: Oleksandr Maksymets (FAIR)
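The input and output shapes follow directly from these hyperparameters: a 224x224 image split into 16x16 patches yields 14x14 = 196 patch tokens, plus one classification token, each embedded in 1024 dimensions for ViT-L. A quick sanity sketch in plain Python (numbers taken from the list above):

```python
# Shapes implied by the ViT-L hyperparameters listed above.
image_size = 224
patch_size = 16
embed_dim = 1024  # ViT-L embedding dimension

patches_per_side = image_size // patch_size   # 14
num_patches = patches_per_side ** 2           # 196
num_tokens = num_patches + 1                  # plus one classification token

print(num_tokens, embed_dim)  # 197 1024
```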

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vc2023,
  title={Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?},
  author={Arjun Majumdar and Karmesh Yadav and Sergio Arnaud and Yecheng Jason Ma and Claire Chen and Sneha Silwal and Aryan Jain and Vincent-Pierre Berges and Pieter Abbeel and Jitendra Malik and Dhruv Batra and Yixin Lin and Oleksandr Maksymets and Aravind Rajeswaran and Franziska Meier},
  year={2023},
  eprint={2303.18240},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
53

## Model Data

Training data:
The VC-1 model was trained on a large-scale dataset of egocentric videos, consisting of over 5.6 million frames. The dataset includes three modalities: manipulation, navigation, and object recognition. The manipulation modality includes videos of people performing various manipulation tasks, such as cooking, cleaning, and tool use. The navigation modality includes videos of people moving around indoor environments, such as homes and offices. The object recognition modality includes images from the ImageNet dataset, which contains over 1.2 million images of objects in various categories.

This table provides an overview of the assembled datasets used for the scaling hypothesis experiments, including the total number of frames and the frames used from each dataset:
| Dataset | Contains | Total Frames | Frames Used |
|---------|----------|-------------:|------------:|
| Ego4D | Ego4D | 418,578,043 | 2,790,520 |
| EgoM (Manipulation) | Ego4D | 418,578,043 | 2,790,520 |
| | 100DOH | 99,899 | 99,899 |
| | SS-v2 | 25,209,271 | 315,115 |
| | Epic Kitchens | 19,965,439 | 332,757 |
| | Total | | 3,538,291 |
| EgoO (OpenHouse24) | Ego4D | 418,578,043 | 2,790,520 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | Total | | 3,289,962 |
| EgoN (Navigation) | Ego4D | 418,578,043 | 2,790,520 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | RealEstate10K | 10,000,000 | 303,087 |
| | Total | | 3,289,962 |
| EgoMN (Manipulation, Navigation) | Ego4D+M | 3,538,291 | 3,538,291 |
| | OpenHouse24 | 27,806,971 | 499,442 |
| | RealEstate10K | 10,000,000 | 303,087 |
| | Total | | 4,340,820 |
| EgoMNI (Manipulation, Navigation, ImageNet) | Ego4D+MN | 4,340,820 | 4,340,820 |
| | ImageNet | 1,281,167 | 1,281,167 |
| | Total | | 5,621,987 |

91
The VC-1 models were trained on the EgoMNI (Manipulation, Navigation, ImageNet) assembled dataset.
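The assembled totals above can be cross-checked by summing the per-source "Frames Used" counts; for example, for EgoMN and EgoMNI (a quick sanity check, figures copied from the table):

```python
# Frames-used counts for the assembled datasets (values from the table above).
ego4d_plus_m = 3_538_291   # EgoM total, reused as Ego4D+M
openhouse24 = 499_442
realestate10k = 303_087
imagenet = 1_281_167

egomn = ego4d_plus_m + openhouse24 + realestate10k   # EgoMN total
egomni = egomn + imagenet                            # EgoMNI total

print(egomn, egomni)  # 4340820 5621987
```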

Evaluation data (see also section [Evaluation Results](#performance) below):
The model was evaluated on CortexBench, which includes 17 tasks from 7 benchmarks, described below:

| Benchmark | Tasks |
|-----------|-------|
| Adroit | Relocate, Reorient-Pen |
| MetaWorld | Assembly, Bin-Picking, Button-Press, Drawer-Open, Hammer |
| DeepMind Control | Finger-Spin, Reacher-Hard, Cheetah-Run, Walker-Stand, Walker-Walk |
| TriFinger | Reach-Cube, Push-Cube |
| Habitat | Image-Goal Navigation (ImageNav), Object-Goal Navigation (ObjectNav) |
| Habitat 2.0 | Mobile Pick |

## Model Creation & Maintenance

The VC-1 model was created by pre-training ViT-B and ViT-L on a combination of egocentric videos and ImageNet using Masked Auto-Encoding (MAE). The model is maintained by the authors and is available for open-source use.

## Model Usage

The VC-1 model is intended for EmbodiedAI tasks, such as object manipulation and indoor navigation. The model outputs an embedding for each image frame, which can be used as a feature for downstream tasks:

```python
from vc_models.models.vit import model_utils

model, embd_size, model_transforms, model_info = model_utils.load_model(model_utils.VC1_BASE_NAME)

img = your_function_here ...

transformed_img = model_transforms(img)
# embedding will be 1x768 for the ViT-B model
embedding = model(transformed_img)
```
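Downstream policies typically consume the embedding as a frozen feature vector. As a minimal illustration of that pattern, here is a linear policy head on top of a 768-dimensional feature (a NumPy stand-in; the 7-dimensional action space and random weights are hypothetical, not part of the VC-1 API):

```python
import numpy as np

rng = np.random.default_rng(0)

embd_size = 768    # VC-1-base (ViT-B) embedding size
action_dim = 7     # hypothetical action space for a manipulation task

# Stand-in for the 1x768 embedding produced by the model above.
embedding = rng.standard_normal((1, embd_size))

# A linear policy head mapping frozen features to action logits.
W = 0.01 * rng.standard_normal((embd_size, action_dim))
b = np.zeros(action_dim)
action_logits = embedding @ W + b

print(action_logits.shape)  # (1, 7)
```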

## Performance

The performance of the models on CortexBench:

| Model | Adroit | Meta-World | DMControl | TriFinger | ObjectNav | ImageNav | Mobile Pick | Mean Rank | Mean Success |
|-------|--------|------------|-----------|-----------|-----------|----------|-------------|-----------|--------------|
| Ego4D (ViT-B) | 48.7 ± 1.3 | 86.1 ± 2.1 | 64.1 ± 2.3 | 68.3 ± 1.1 | 46.8 ± 1.1 | 64.0 ± 0.7 | 57.4 ± 2.2 | 8.6 | 62.2 |
| Ego4D (ViT-L) | 50.0 ± 1.2 | 92.9 ± 2.4 | 60.8 ± 3.3 | 69.7 ± 0.5 | 47.6 ± 1.1 | 55.8 ± 0.8 | 67.6 ± 2.1 | 5.9 | 63.5 |
| Ego4D+N (ViT-B) | 50.0 ± 2.4 | 86.4 ± 2.9 | 59.5 ± 2.4 | 67.8 ± 1.3 | 54.7 ± 1.1 | 68.7 ± 0.7 | 59.4 ± 2.2 | 7.2 | 63.8 |
| Ego4D+N (ViT-L) | 54.0 ± 1.2 | 89.1 ± 2.9 | 66.4 ± 1.7 | 66.9 ± 0.4 | 57.4 ± 1.1 | 70.5 ± 0.7 | 65.2 ± 2.1 | 3.5 | 67.1 |
| Ego4D+M (ViT-B) | 51.3 ± 2.4 | 83.5 ± 2.6 | 64.3 ± 1.8 | 69.1 ± 0.4 | 47.3 ± 1.1 | 65.8 ± 0.7 | 59.8 ± 2.2 | 7.0 | 63.0 |
| Ego4D+M (ViT-L) | 52.0 ± 1.3 | 88.3 ± 3.2 | 64.7 ± 2.4 | 64.7 ± 0.9 | 47.3 ± 1.1 | 65.5 ± 0.7 | 68.6 ± 2.1 | 6.0 | 64.4 |
| VC-1: Ego4D+MN (ViT-B) | 48.7 ± 2.4 | 85.3 ± 5.2 | 64.2 ± 1.9 | 70.3 ± 0.5 | 52.8 ± 1.1 | 68.9 ± 0.7 | 58.6 ± 2.2 | 6.9 | 64.1 |
| VC-1: Ego4D+MNI (ViT-L) | 59.3 ± 5.2 | 88.8 ± 2.2 | 66.9 ± 1.4 | 71.7 ± 0.4 | 60.3 ± 1.1 | 70.3 ± 0.7 | 63.2 ± 2.2 | 2.4 | 68.7 |
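Mean Success appears to be the unweighted average of the seven per-benchmark success rates; for example, for the Ego4D (ViT-B) row:

```python
# Ego4D (ViT-B) per-benchmark success rates from the table above.
scores = [48.7, 86.1, 64.1, 68.3, 46.8, 64.0, 57.4]

mean_success = sum(scores) / len(scores)
print(round(mean_success, 1))  # 62.2
```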

## Limitations

The VC-1 model has been evaluated on a limited set of benchmarks and may not perform as well on other tasks. While we have focused on masked auto-encoders as the pre-training objective and ViT as the architecture in our study, there may be other SSL algorithms that exhibit different scaling behaviors or superior performance on the proposed datasets in our benchmark.

Additionally, the VC-1 model is computationally expensive to train and may not be practical for all use cases. The large size of the model may also pose challenges for deployment on resource-constrained devices.