---
{}
---
# AM-RADIO: Reduce All Domains Into One
Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov
[NVIDIA Research](https://www.nvidia.com/en-us/research/)
\[[AM-RADIO Paper](https://arxiv.org/abs/2312.06709)\]
\[[PHI-S Paper](https://arxiv.org/abs/2410.01680)\]
\[[BibTex](#citing-radio)\]\[[GitHub examples](https://github.com/NVlabs/RADIO)\]
\[[Tech report on v2.5](https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_tech_report.md)\]
### HuggingFace Hub
You can pull the model from a Python script:
```Python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/RADIO-H"

# Load the preprocessor and the model (the model code is fetched from the Hub).
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()

# Preprocess a single RGB image into a batched tensor on the GPU.
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# Inference only, so skip gradient tracking.
with torch.no_grad():
    summary, spatial_features = model(pixel_values)
```
### Usage
RADIO returns a tuple of two tensors. The `summary` is similar to the `cls_token` in ViT and is meant to represent the general concept of the entire image. It has shape $(B,C)$, where $B$ is the batch dimension and $C$ is the number of summary channels. The `spatial_features` represent more localized content and should be suitable for dense tasks such as semantic segmentation, or for integration into an LLM. They have shape $(B,T,D)$, where $T$ is the number of flattened spatial tokens and $D$ is the number of channels for spatial features. Note that $C \neq D$ in general.
To convert the spatial features into a spatial tensor layout, combine the model's downsampling factor (its patch size) with the input tensor shape. For 'radio_v1', the patch size is 14.
```Python
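from einops import rearrange
patch_size = 14  # value for 'radio_v1'; set this to match the model you loaded
h, w = pixel_values.shape[-2] // patch_size, pixel_values.shape[-1] // patch_size  # pixel_values: the model input
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=h, w=w)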
```
The resulting tensor will have shape $(B,D,H,W)$, as is typically seen with computer vision models.
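The reshape can be checked without loading the model. The following is a minimal sketch that assumes the tokens are stored in row-major order, which is what the einops pattern `'b (h w) d -> b d h w'` implies. It uses NumPy on a dummy array so it runs without a GPU. The helper `tokens_to_grid`, the 224x224 input size, and the channel count of 8 are illustrative assumptions, not part of the RADIO API.

```Python
import numpy as np


def tokens_to_grid(spatial_features, image_hw, patch_size):
    """Reshape (B, T, D) tokens into a (B, D, H, W) feature map.

    Hypothetical helper for illustration; not part of the RADIO API.
    """
    img_h, img_w = image_hw
    if img_h % patch_size or img_w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    h, w = img_h // patch_size, img_w // patch_size

    b, t, d = spatial_features.shape
    if t != h * w:
        raise ValueError(f"expected {h * w} tokens, got {t}")

    # Assumes row-major token order (h outer, w inner), matching the
    # einops pattern 'b (h w) d -> b d h w'.
    grid = spatial_features.reshape(b, h, w, d)
    return grid.transpose(0, 3, 1, 2)


# Dummy stand-in for the model output: batch of 2, 224x224 input,
# patch size 14 (so a 16x16 token grid), and D = 8 chosen arbitrarily.
patch_size = 14
dummy = np.arange(2 * 16 * 16 * 8, dtype=np.float32).reshape(2, 256, 8)
grid = tokens_to_grid(dummy, image_hw=(224, 224), patch_size=patch_size)
print(grid.shape)  # (2, 8, 16, 16)
```

The two checks catch the most likely mistakes: an input size that is not a multiple of the patch size, and a `patch_size` that does not match the loaded model, which shows up as a token-count mismatch.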
### RADIOv2.5 Notes
See the [RADIOv2.5 technical report](https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_tech_report.md).
## License
RADIO code and weights are released under the [NSCLv1 License](LICENSE).
## Citing RADIO
If you find this repository useful, please consider giving it a star and citing our work:
```
@InProceedings{Ranzinger_2024_CVPR,
author = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
title = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {12490-12500}
}
```
```
@misc{ranzinger2024phisdistributionbalancinglabelfree,
title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation},
author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
year={2024},
eprint={2410.01680},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.01680},
}
```