jonathanzkoch committed on
Commit
41c6668
1 Parent(s): a74a033

Upload 4 files

Files changed (4)
  1. README.md +78 -0
  2. demo_jepa_encoder.py +14 -0
  3. jepa-latest.pth.tar +3 -0
  4. params-encoder.yaml +89 -0
README.md ADDED
@@ -0,0 +1,78 @@
# VJEPA Encoder

The VJEPA Encoder is a JEPA model fine-tuned on the [High Speed and High Dynamic Range Video with an Event Camera (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019)](https://rpg.ifi.uzh.ch/event_driving_datasets.html) dataset. This package is an adaptation of `facebookresearch/jepa` that makes it easy to use the JEPA architecture built with Vision Transformers.

## Installation

To install the VJEPA Encoder package, use pip:

```
pip install vjepa_encoder
```

## Usage

To use the VJEPA Encoder in your Python code, import it as follows:

```python
from vjepa_encoder.vision_encoder import JepaEncoder
```

### Loading the Encoder

To load the pre-trained encoder, use the `load_model` function:

```python
encoder = JepaEncoder.load_model(config_file_path, devices)
```

- `config_file_path`: Path to the configuration file (YAML) containing the model settings.
- `devices`: List of devices (e.g., `['cuda:0']`) to use for distributed training. If not provided, the model is loaded on the CPU.
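
A minimal sketch, assuming the `params-encoder.yaml` shipped with this repository sits in your working directory and a single GPU is available:

```python
from vjepa_encoder.vision_encoder import JepaEncoder

# The config path and device list are assumptions about your local setup; adjust as needed.
encoder = JepaEncoder.load_model("params-encoder.yaml", ["cuda:0"])
```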

### Preprocessing Data

The VJEPA Encoder provides a `preprocess_data` function to preprocess input data before feeding it to the encoder:

```python
preprocessed_data = encoder.preprocess_data(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor.
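
A short sketch of the accepted input types (the file name and image sizes below are placeholders):

```python
import numpy as np
from PIL import Image

# Each of these is a valid `input_data` per the list above.
x_from_path  = encoder.preprocess_data("frame_000.png")
x_from_array = encoder.preprocess_data(np.random.random((360, 480, 3)))
x_from_pil   = encoder.preprocess_data(Image.new("RGB", (480, 360)))
```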

### Embedding Images

To obtain the embeddings for an image, use the `embed_image` function:

```python
embeddings = encoder.embed_image(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor.

The function returns the embeddings generated by the encoder.
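
For example, mirroring the included `demo_jepa_encoder.py`, which feeds a random array in place of a real frame:

```python
import numpy as np

# Random H x W x C image standing in for real data, as in demo_jepa_encoder.py.
img = np.random.random(size=(360, 480, 3))

embeddings = encoder.embed_image(img)
print(embeddings.shape)  # exact shape depends on the model settings in the config
```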

## Configuration

The VJEPA Encoder requires a configuration file in YAML format to specify the model settings. The configuration file should include the following sections:

- `meta`: General settings such as the checkpoint file path, random seed, etc.
- `mask`: Settings related to masking.
- `model`: Model architecture settings.
- `data`: Data-related settings such as crop size, patch size, etc.
- `logging`: Logging settings.

Please refer to the provided configuration file template (`params-encoder.yaml`) for more details.
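
As a quick sanity check (a sketch, assuming PyYAML is installed and the repository's `params-encoder.yaml` is in your working directory), you can confirm these sections are present before loading the model:

```python
import yaml

# Top-level sections the encoder expects, per the list above; params-encoder.yaml
# additionally carries data_aug, loss, and optimization settings.
REQUIRED_SECTIONS = {"meta", "mask", "model", "data", "logging"}

with open("params-encoder.yaml") as f:
    config = yaml.safe_load(f)

missing = REQUIRED_SECTIONS - set(config)
if missing:
    raise ValueError(f"Config is missing sections: {sorted(missing)}")
```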

## License

The VJEPA Encoder is released under the [MIT License](LICENSE).

## Acknowledgments

The VJEPA Encoder is based on research conducted by Facebook AI Research. We would like to acknowledge their contributions to the field of computer vision and representation learning.

## Contact

If you have any questions or suggestions regarding the VJEPA Encoder, please feel free to contact me at johnnykoch02@gmail.com.

---
demo_jepa_encoder.py ADDED
@@ -0,0 +1,14 @@
from vjepa_encoder.vision_encoder import JepaEncoder

import numpy

# Load the encoder from the repository's config file (adjust the path to your setup).
encoder = JepaEncoder.load_model(
    "logs/params-encoder.yaml"
)

# Random H x W x C image used as a stand-in for a real frame.
img = numpy.random.random(size=(360, 480, 3))

print("Input Img:", img.shape)
embedding = encoder.embed_image(img)

print(embedding)
print(embedding.shape)
jepa-latest.pth.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8aea9275bcb92ec575e1e58f2a2068377d0885f884973f3b31281fa9e6db30ae
size 5143102207
params-encoder.yaml ADDED
@@ -0,0 +1,89 @@
app: vjepa
data:
  batch_size: 8
  clip_duration: null
  crop_size: 224
  dataset_type: VideoDataset
  datasets:
  - /path/to/dataset.csv
  decode_one_clip: true
  filter_short_videos: false
  num_clips: 1
  num_frames: 16
  num_workers: 4
  patch_size: 16
  pin_mem: true
  sampling_rate: 4
  tubelet_size: 2
data_aug:
  auto_augment: false
  motion_shift: false
  random_resize_aspect_ratio:
  - 0.75
  - 1.35
  random_resize_scale:
  - 0.3
  - 1.0
  reprob: 0.0
logging:
  folder: /path/to/logs
  write_tag: jepa
loss:
  loss_exp: 1.0
  reg_coeff: 0.0
mask:
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 8
  spatial_scale:
  - 0.15
  - 0.15
  temporal_scale:
  - 1.0
  - 1.0
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 2
  spatial_scale:
  - 0.7
  - 0.7
  temporal_scale:
  - 1.0
  - 1.0
meta:
  dtype: bfloat16
  eval_freq: 100
  load_checkpoint: true
  read_checkpoint: /path/to/vitl16.pth.tar
  save_every_freq: 5
  seed: 234
  use_sdpa: true
model:
  model_name: vit_large
  pred_depth: 12
  pred_embed_dim: 384
  uniform_power: true
  use_mask_tokens: true
  zero_init_mask_tokens: true
nodes: 16
optimization:
  clip_grad: 10.0
  ema:
  - 0.998
  - 1.0
  epochs: 25
  final_lr: 1.0e-06
  final_weight_decay: 0.4
  ipe: 300
  ipe_scale: 1.25
  lr: 0.000625
  start_lr: 0.0002
  warmup: 40
  weight_decay: 0.04
tasks_per_node: 8