Commit c1e2cd1 (verified) by gheinrich · 1 Parent(s): e23431b

Update README.md

Files changed (1): README.md +37 -3
README.md CHANGED
@@ -12,8 +12,6 @@ library_name: transformers
This model performs visual feature extraction.
For instance, RADIO generates image embeddings that can be used by a downstream model to classify images.

- This model is for research and development only.
-
### License/Terms of Use

[License] This model is governed by the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
@@ -27,7 +25,7 @@ This model is for research and development only.

## Input:
**Input Type(s):** Image <br>
- **Input Format(s):** Red, Green, Blue (RGB) <br>
+ **Input Format(s):** Red, Green, Blue (RGB) pixel values in the [0, 1] range <br>
**Input Parameters:** Two Dimensional (2D) <br>
**Other Properties Related to Input:** Image resolutions up to 2048x2048 in increments of 16 pixels <br>

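To make these constraints concrete, the sketch below builds a compliant input by hand: RGB values scaled to `[0, 1]` and spatial dimensions cropped to multiples of 16. This is only an illustrative sketch that assumes `torchvision` is available and uses a hypothetical image path; the usage example further down relies on `CLIPImageProcessor` instead.

```python
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Load an image (hypothetical path) and convert it to an RGB float tensor
# with values in [0, 1] and shape (3, H, W).
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = to_tensor(image)

# Crop height and width down to the nearest multiple of 16 pixels.
h, w = pixel_values.shape[-2:]
pixel_values = pixel_values[:, : (h // 16) * 16, : (w // 16) * 16]

# Add a batch dimension -> (1, 3, H, W).
pixel_values = pixel_values.unsqueeze(0)
```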
@@ -37,6 +35,42 @@ This model is for research and development only.
**Output Parameters:** 2D <br>
**Other Properties Related to Output:** Downstream model required to leverage image features <br>

+ ## Usage:
+
+ RADIO returns a tuple of two tensors.
+ The `summary` is similar to the `cls_token` in ViT and is meant to represent the general concept of the entire image.
+ It has shape `(B,C)`, with `B` being the batch dimension and `C` being some number of channels.
+ The `spatial_features` represent more localized content, which should be suitable for dense tasks such as semantic segmentation, or for integration into an LLM.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModel, CLIPImageProcessor
+
+ hf_repo = "nvidia/C-RADIO"
+
+ # Load the image processor and the model (with custom code) from the Hub.
+ image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
+ model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
+ model.eval().cuda()
+
+ # Preprocess the image into pixel values and move them to the GPU.
+ image = Image.open('./assets/radio.png').convert('RGB')
+ pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
+ pixel_values = pixel_values.cuda()
+
+ # summary has shape (B,C); spatial_features has shape (B,T,D).
+ summary, spatial_features = model(pixel_values)
+ ```
+
+ Spatial features have shape `(B,T,D)`, with `T` being the number of flattened spatial tokens and `D` being the number of channels of the spatial features. Note that `C != D` in general.
+ Converting to a spatial tensor format can be done using the downsampling size of the model combined with the input tensor shape. For RADIO, the patch size is 16.
+
+ ```python
+ from einops import rearrange
+
+ patch_size = 16  # RADIO downsamples the input by a factor of 16
+ spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w',
+                              h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)
+ ```
+
+ The resulting tensor will have shape `(B,D,H,W)`, as is typically seen with computer vision models.
+
+
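As the model description above notes, these embeddings are intended to be consumed by a downstream model, for example an image classifier built on the `summary` vector. The sketch below is only illustrative: it assumes the `summary` tensor produced by the usage example above, and the linear head and number of classes are hypothetical, not part of the model card.

```python
import torch

# Hypothetical downstream head: a linear classifier on top of the summary embedding.
num_classes = 1000                                       # illustrative value
classifier = torch.nn.Linear(summary.shape[-1], num_classes).cuda()

with torch.no_grad():
    logits = classifier(summary)                         # shape (B, num_classes)
    predicted_class = logits.argmax(dim=-1)              # most likely class per image
```

In practice such a head would be trained on task-specific labels while RADIO stays frozen, but the wiring is the same.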
## Software Integration:
**Runtime Engine(s):**
* TAO- 24.10 <br>