jonathanzkoch committed
Commit 41c6668
1 Parent(s): a74a033

Upload 4 files

Browse files:
- README.md +78 -0
- demo_jepa_encoder.py +14 -0
- jepa-latest.pth.tar +3 -0
- params-encoder.yaml +89 -0
README.md
ADDED
@@ -0,0 +1,78 @@
# VJEPA Encoder

The VJEPA Encoder is a fine-tuned JEPA model trained on [High Speed and High Dynamic Range Video with an Event Camera (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019)](https://rpg.ifi.uzh.ch/event_driving_datasets.html). This package is an adaptation of `facebookresearch/jepa` that makes the JEPA architecture, built with Vision Transformers, easier to use.

## Installation

To install the VJEPA Encoder package, you can use pip:

```
pip install vjepa_encoder
```

## Usage

To use the VJEPA Encoder in your Python code, you can import it as follows:

```python
from vjepa_encoder.vision_encoder import JepaEncoder
```

### Loading the Encoder

To load the pre-trained encoder, you can use the `load_model` function:

```python
encoder = JepaEncoder.load_model(config_file_path, devices)
```

- `config_file_path`: Path to the configuration file (YAML) containing the model settings.
- `devices`: List of devices (e.g., `['cuda:0']`) to use for distributed training. If not provided, the model will be loaded on the CPU. A concrete example follows below.
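
For example, using the `params-encoder.yaml` shipped with this repository and a single GPU (the exact path and device name are placeholders; omit the device list to load on CPU):

```python
from vjepa_encoder.vision_encoder import JepaEncoder

# Load the encoder from the bundled config on the first GPU.
encoder = JepaEncoder.load_model("params-encoder.yaml", ["cuda:0"])
```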

### Preprocessing Data

The VJEPA Encoder provides a `preprocess_data` function to preprocess input data before feeding it to the encoder:

```python
preprocessed_data = encoder.preprocess_data(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor (see the sketch below).
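
A minimal sketch, assuming a NumPy array as input (any of the supported types listed above can be passed the same way):

```python
import numpy as np

# A random H x W x C array stands in for a real image.
frame = np.random.random(size=(360, 480, 3))
preprocessed = encoder.preprocess_data(frame)
```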

### Embedding Images

To obtain the embeddings for an image, you can use the `embed_image` function:

```python
embeddings = encoder.embed_image(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor.

The function returns the embeddings generated by the encoder.
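
The bundled `demo_jepa_encoder.py` exercises this end to end with a random image; a condensed version:

```python
import numpy as np

img = np.random.random(size=(360, 480, 3))  # stand-in for a real frame
embedding = encoder.embed_image(img)
print(embedding.shape)
```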

## Configuration

The VJEPA Encoder requires a configuration file in YAML format to specify the model settings. The configuration file should include the following sections:

- `meta`: General settings such as the checkpoint file path, random seed, etc.
- `mask`: Settings related to masking.
- `model`: Model architecture settings.
- `data`: Data-related settings such as crop size, patch size, etc.
- `logging`: Logging settings.

Please refer to the provided configuration file template for more details.
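
For reference, the top-level layout of the `params-encoder.yaml` included with this model looks like the following (several sections and most values are omitted here; see the full file):

```yaml
app: vjepa
meta:
  read_checkpoint: /path/to/vitl16.pth.tar
  load_checkpoint: true
  seed: 234
  dtype: bfloat16
model:
  model_name: vit_large
  pred_depth: 12
  pred_embed_dim: 384
data:
  crop_size: 224
  patch_size: 16
  num_frames: 16
  tubelet_size: 2
logging:
  folder: /path/to/logs
  write_tag: jepa
```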

## License

The VJEPA Encoder is released under the [MIT License](LICENSE).

## Acknowledgments

The VJEPA Encoder is based on the research work conducted by Facebook AI Research. We would like to acknowledge their contributions to the field of computer vision and representation learning.

## Contact

If you have any questions or suggestions regarding the VJEPA Encoder, please feel free to contact me at johnnykoch02@gmail.com.

---
demo_jepa_encoder.py
ADDED
@@ -0,0 +1,14 @@
import numpy as np

from vjepa_encoder.vision_encoder import JepaEncoder

# Load the encoder using the bundled configuration file.
encoder = JepaEncoder.load_model("logs/params-encoder.yaml")

# A random H x W x C image stands in for a real frame.
img = np.random.random(size=(360, 480, 3))
print("Input Img:", img.shape)

# Embed the image and inspect the result.
embedding = encoder.embed_image(img)
print(embedding)
print(embedding.shape)
jepa-latest.pth.tar
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8aea9275bcb92ec575e1e58f2a2068377d0885f884973f3b31281fa9e6db30ae
size 5143102207
params-encoder.yaml
ADDED
@@ -0,0 +1,89 @@
app: vjepa
data:
  batch_size: 8
  clip_duration: null
  crop_size: 224
  dataset_type: VideoDataset
  datasets:
  - /path/to/dataset.csv
  decode_one_clip: true
  filter_short_videos: false
  num_clips: 1
  num_frames: 16
  num_workers: 4
  patch_size: 16
  pin_mem: true
  sampling_rate: 4
  tubelet_size: 2
data_aug:
  auto_augment: false
  motion_shift: false
  random_resize_aspect_ratio:
  - 0.75
  - 1.35
  random_resize_scale:
  - 0.3
  - 1.0
  reprob: 0.0
logging:
  folder: /path/to/logs
  write_tag: jepa
loss:
  loss_exp: 1.0
  reg_coeff: 0.0
mask:
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 8
  spatial_scale:
  - 0.15
  - 0.15
  temporal_scale:
  - 1.0
  - 1.0
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 2
  spatial_scale:
  - 0.7
  - 0.7
  temporal_scale:
  - 1.0
  - 1.0
meta:
  dtype: bfloat16
  eval_freq: 100
  load_checkpoint: true
  read_checkpoint: /path/to/vitl16.pth.tar
  save_every_freq: 5
  seed: 234
  use_sdpa: true
model:
  model_name: vit_large
  pred_depth: 12
  pred_embed_dim: 384
  uniform_power: true
  use_mask_tokens: true
  zero_init_mask_tokens: true
nodes: 16
optimization:
  clip_grad: 10.0
  ema:
  - 0.998
  - 1.0
  epochs: 25
  final_lr: 1.0e-06
  final_weight_decay: 0.4
  ipe: 300
  ipe_scale: 1.25
  lr: 0.000625
  start_lr: 0.0002
  warmup: 40
  weight_decay: 0.04
tasks_per_node: 8