jonathanzkoch committed on
Commit
41c6668
1 Parent(s): a74a033

Upload 4 files

Files changed (4)
  1. README.md +78 -0
  2. demo_jepa_encoder.py +14 -0
  3. jepa-latest.pth.tar +3 -0
  4. params-encoder.yaml +89 -0
README.md ADDED
@@ -0,0 +1,78 @@
# VJEPA Encoder

The VJEPA Encoder is a JEPA model fine-tuned on the [High Speed and High Dynamic Range Video with an Event Camera (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019)](https://rpg.ifi.uzh.ch/event_driving_datasets.html) dataset. This package is an adaptation of `facebookresearch/jepa` that makes it easy to use the JEPA architecture built with Vision Transformers.

## Installation

To install the VJEPA Encoder package, use pip:

```
pip install vjepa_encoder
```

## Usage

To use the VJEPA Encoder in your Python code, import it as follows:

```python
from vjepa_encoder.vision_encoder import JepaEncoder
```

### Loading the Encoder

To load the pre-trained encoder, use the `load_model` function:

```python
encoder = JepaEncoder.load_model(config_file_path, devices)
```

- `config_file_path`: Path to the configuration file (YAML) containing the model settings.
- `devices`: List of devices (e.g., `['cuda:0']`) to use for distributed training. If not provided, the model is loaded on the CPU.
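
A minimal sketch, assuming the `params-encoder.yaml` shipped with this repository sits in your working directory and a single GPU is available:

```python
from vjepa_encoder.vision_encoder import JepaEncoder

# The config path and device list are assumptions about your local setup; adjust as needed.
encoder = JepaEncoder.load_model("params-encoder.yaml", ["cuda:0"])
```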

### Preprocessing Data

The VJEPA Encoder provides a `preprocess_data` function to preprocess input data before feeding it to the encoder:

```python
preprocessed_data = encoder.preprocess_data(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor.
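
A short sketch of the accepted input types (the file name and image sizes below are placeholders):

```python
import numpy as np
from PIL import Image

# Each of these is a valid `input_data` per the list above.
x_from_path  = encoder.preprocess_data("frame_000.png")
x_from_array = encoder.preprocess_data(np.random.random((360, 480, 3)))
x_from_pil   = encoder.preprocess_data(Image.new("RGB", (480, 360)))
```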

### Embedding Images

To obtain the embeddings for an image, use the `embed_image` function:

```python
embeddings = encoder.embed_image(input_data)
```

- `input_data`: Input data, which can be an image path, image array, PIL Image, or PyTorch tensor.

The function returns the embeddings generated by the encoder.
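
For example, mirroring the included `demo_jepa_encoder.py`, which feeds a random array in place of a real frame:

```python
import numpy as np

# Random H x W x C image standing in for real data, as in demo_jepa_encoder.py.
img = np.random.random(size=(360, 480, 3))

embeddings = encoder.embed_image(img)
print(embeddings.shape)  # exact shape depends on the model settings in the config
```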

## Configuration

The VJEPA Encoder requires a configuration file in YAML format to specify the model settings. The configuration file should include the following sections:

- `meta`: General settings such as the checkpoint file path, random seed, etc.
- `mask`: Settings related to masking.
- `model`: Model architecture settings.
- `data`: Data-related settings such as crop size, patch size, etc.
- `logging`: Logging settings.

Please refer to the provided configuration file template (`params-encoder.yaml`) for more details.
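
As a quick sanity check (a sketch, assuming PyYAML is installed and the repository's `params-encoder.yaml` is in your working directory), you can confirm these sections are present before loading the model:

```python
import yaml

# Top-level sections the encoder expects, per the list above; params-encoder.yaml
# additionally carries data_aug, loss, and optimization settings.
REQUIRED_SECTIONS = {"meta", "mask", "model", "data", "logging"}

with open("params-encoder.yaml") as f:
    config = yaml.safe_load(f)

missing = REQUIRED_SECTIONS - set(config)
if missing:
    raise ValueError(f"Config is missing sections: {sorted(missing)}")
```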

## License

The VJEPA Encoder is released under the [MIT License](LICENSE).

## Acknowledgments

The VJEPA Encoder is based on research conducted by Facebook AI Research. We would like to acknowledge their contributions to the field of computer vision and representation learning.

## Contact

If you have any questions or suggestions regarding the VJEPA Encoder, please feel free to contact me at johnnykoch02@gmail.com.

---
demo_jepa_encoder.py ADDED
@@ -0,0 +1,14 @@
from vjepa_encoder.vision_encoder import JepaEncoder

import numpy

# Load the encoder from the repository's config file (adjust the path to your setup).
encoder = JepaEncoder.load_model(
    "logs/params-encoder.yaml"
)

# Random H x W x C image used as a stand-in for a real frame.
img = numpy.random.random(size=(360, 480, 3))

print("Input Img:", img.shape)
embedding = encoder.embed_image(img)

print(embedding)
print(embedding.shape)
jepa-latest.pth.tar ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8aea9275bcb92ec575e1e58f2a2068377d0885f884973f3b31281fa9e6db30ae
size 5143102207
params-encoder.yaml ADDED
@@ -0,0 +1,89 @@
app: vjepa
data:
  batch_size: 8
  clip_duration: null
  crop_size: 224
  dataset_type: VideoDataset
  datasets:
  - /path/to/dataset.csv
  decode_one_clip: true
  filter_short_videos: false
  num_clips: 1
  num_frames: 16
  num_workers: 4
  patch_size: 16
  pin_mem: true
  sampling_rate: 4
  tubelet_size: 2
data_aug:
  auto_augment: false
  motion_shift: false
  random_resize_aspect_ratio:
  - 0.75
  - 1.35
  random_resize_scale:
  - 0.3
  - 1.0
  reprob: 0.0
logging:
  folder: /path/to/logs
  write_tag: jepa
loss:
  loss_exp: 1.0
  reg_coeff: 0.0
mask:
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 8
  spatial_scale:
  - 0.15
  - 0.15
  temporal_scale:
  - 1.0
  - 1.0
- aspect_ratio:
  - 0.75
  - 1.5
  max_keep: null
  max_temporal_keep: 1.0
  num_blocks: 2
  spatial_scale:
  - 0.7
  - 0.7
  temporal_scale:
  - 1.0
  - 1.0
meta:
  dtype: bfloat16
  eval_freq: 100
  load_checkpoint: true
  read_checkpoint: /path/to/vitl16.pth.tar
  save_every_freq: 5
  seed: 234
  use_sdpa: true
model:
  model_name: vit_large
  pred_depth: 12
  pred_embed_dim: 384
  uniform_power: true
  use_mask_tokens: true
  zero_init_mask_tokens: true
nodes: 16
optimization:
  clip_grad: 10.0
  ema:
  - 0.998
  - 1.0
  epochs: 25
  final_lr: 1.0e-06
  final_weight_decay: 0.4
  ipe: 300
  ipe_scale: 1.25
  lr: 0.000625
  start_lr: 0.0002
  warmup: 40
  weight_decay: 0.04
tasks_per_node: 8