Create lv.md (#1)
Create lv.md (e91243e869791f9e7d7d536a81f30fbdf888cb5a)
lv.md
ADDED
@@ -0,0 +1,17 @@
Image Encoder
1. A ViT-H/14 variant of the vision transformer with 630M parameters, trained on 2.5B image-text pairs
2. Processes images of resolution 224x224, divided into a 16x16 grid of patches of 14x14 pixels each (256 patches in total)
3. Uses multi-layer feature extraction: features from the 4th, 8th, 16th, 24th, and 31st layers are kept in addition to the final-layer features
4. Adds 8 gated self-attention layers on top of the ViT's 32 layers, for a total of 40 transformer blocks and 850M total parameters
5. The encoder generates a 7680-dimensional representation for each of the 256 patches (arithmetic sketched below)
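
A quick sanity check of the numbers in items 2, 3, and 5. The ViT-H/14 hidden size of 1280 is an assumption on my part (the list above only gives the final 7680-dimensional output):

```python
# Sanity-check sketch for the patch and feature dimensions listed above.
# Assumption: ViT-H/14 channel width of 1280 (not stated in the list).
image_size, patch_size = 224, 14
grid = image_size // patch_size              # 16 patches per side
num_patches = grid * grid                    # 16 x 16 = 256 patches

hidden_size = 1280                           # assumed ViT-H hidden size
tap_layers = [4, 8, 16, 24, 31]              # intermediate layers whose features are kept
feature_levels = len(tap_layers) + 1         # plus the final-layer features -> 6
patch_dim = feature_levels * hidden_size     # 6 * 1280 = 7680

print(num_patches, patch_dim)                # 256 7680
```
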
Image Adapter
1. Cross-attention layers between visual and language-model token representations, applied after every fourth self-attention layer in the language model and using GQA (see the sketch after this list)
2. Initial pre-training: Trained on ≈6B image-text pairs, with images resized to fit within four tiles of 336x336 pixels, arranged to accommodate various aspect ratios.
3. Annealing: Training continues on ≈500M images, increasing per-tile resolution to enhance performance on downstream tasks
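
A minimal PyTorch sketch of the interleaving pattern in item 1: a gated cross-attention block is inserted after every fourth self-attention layer, letting text tokens attend to projected image-patch embeddings. The class names, dimensions, zero-initialized tanh gate, and the use of plain `nn.MultiheadAttention` (instead of GQA) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8   # toy sizes; GQA and other details are omitted for brevity

class DecoderLayer(nn.Module):
    """Stand-in for one language-model self-attention block."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)

class ImageCrossAttention(nn.Module):
    """Text tokens attend to image-patch embeddings. The tanh gate starts at zero,
    so the block is initially an identity (a common adapter trick, assumed here)."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, image_tokens):
        out, _ = self.attn(x, image_tokens, image_tokens)
        return self.norm(x + torch.tanh(self.gate) * out)

class TinyVisionLanguageModel(nn.Module):
    def __init__(self, n_layers=8, cross_every=4):
        super().__init__()
        self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])
        # one cross-attention block after every `cross_every`-th self-attention layer
        self.cross = nn.ModuleDict({
            str(i): ImageCrossAttention()
            for i in range(n_layers) if (i + 1) % cross_every == 0
        })

    def forward(self, text_states, image_tokens):
        x = text_states
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if str(i) in self.cross:
                x = self.cross[str(i)](x, image_tokens)
        return x

text = torch.randn(1, 10, d_model)    # fake text-token hidden states
image = torch.randn(1, 256, d_model)  # fake projected patch embeddings
print(TinyVisionLanguageModel()(text, image).shape)  # torch.Size([1, 10, 512])
```
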
Video Adapter
1. Input: Up to 64 video frames, each encoded with the image encoder
2. Temporal modeling: A temporal aggregator merges each run of 32 consecutive frames into one representation, and video cross-attention layers are added
3. Aggregator: Implemented as a perceiver resampler (see the sketch after this list)
4. Parameters: The video aggregator and cross-attention layers add 0.6B and 4.6B parameters for Llama 3 7B and 70B, respectively
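
A rough sketch of the temporal aggregation in items 2 and 3: a perceiver-resampler-style module whose learned latent queries cross-attend to the patch features of each group of 32 consecutive frames, shrinking 64 frames to two compact sets of video tokens. The sizes, depth, and latent count below are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

d_model, n_latents, group = 512, 64, 32   # placeholder sizes; `group` follows item 2

class PerceiverResampler(nn.Module):
    """Learned latent queries cross-attend to the frame-patch features of one group."""
    def __init__(self, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, 8, batch_first=True) for _ in range(depth)]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frame_feats):                 # (batch, tokens, d_model)
        b = frame_feats.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(x, frame_feats, frame_feats)
            x = self.norm(x + out)
        return x                                    # (batch, n_latents, d_model)

# 64 frames -> 2 groups of 32 consecutive frames, each reduced to n_latents tokens.
frames = torch.randn(1, 64, 256, d_model)           # (batch, frames, patches, d_model)
groups = frames.view(1, 64 // group, group * 256, d_model)
agg = PerceiverResampler()
video_tokens = torch.cat([agg(groups[:, g]) for g in range(groups.size(1))], dim=1)
print(video_tokens.shape)                            # torch.Size([1, 128, 512])
```
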