Disclaimer: The team releasing Van did not write a model card for this model, so this model card has been written by the Hugging Face team.
The Visual Attention Network (VAN) paper introduces a new attention layer, built from convolution operations, that captures both local and distant relationships. It does this by combining normal and large-kernel convolution layers; the large-kernel layer uses a dilated convolution to capture distant correlations.
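To make the description above concrete, here is a minimal PyTorch sketch of an attention layer built only from convolutions. It is an illustration under stated assumptions, not the released implementation: the class name `LargeKernelAttentionSketch`, the 5x5 and 7x7 kernel sizes, the dilation of 3, the depth-wise grouping, and the final 1x1 channel-mixing convolution are choices made for this example.

```python
import torch
from torch import nn


class LargeKernelAttentionSketch(nn.Module):
    """Illustrative convolution-only attention layer (hypothetical name)."""

    def __init__(self, channels: int):
        super().__init__()
        # Normal (non-dilated) depth-wise convolution: captures local context.
        # Kernel size 5 is an assumption made for this sketch.
        self.local_conv = nn.Conv2d(
            channels, channels, kernel_size=5, padding=2, groups=channels
        )
        # Dilated depth-wise convolution: captures distant context.
        # With kernel 7 and dilation 3 the kernel spans 19 pixels,
        # so padding 9 keeps the spatial size unchanged.
        self.distant_conv = nn.Conv2d(
            channels, channels, kernel_size=7, padding=9, dilation=3, groups=channels
        )
        # 1x1 convolution mixes information across channels.
        self.channel_mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The stacked convolutions produce an attention map that
        # re-weights the input element-wise.
        attention = self.channel_mix(self.distant_conv(self.local_conv(x)))
        return x * attention


layer = LargeKernelAttentionSketch(channels=8)
features = torch.randn(2, 8, 32, 32)  # (batch, channels, height, width)
out = layer(features)
print(out.shape)  # torch.Size([2, 8, 32, 32])
```

Stacking the two depth-wise convolutions gives each output position a 23x23 view of the input (5 + 19 - 1) with far fewer parameters than a single dense kernel of that size.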
You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
Here is how to use this model:
    from transformers import AutoFeatureExtractor, VanForImageClassification
    import torch
    from datasets import load_dataset

    dataset = load_dataset("huggingface/cats-image")
    image = dataset["test"]["image"]

    feature_extractor = AutoFeatureExtractor.from_pretrained("Visual-Attention-Network/van-base")
    model = VanForImageClassification.from_pretrained("Visual-Attention-Network/van-base")

    inputs = feature_extractor(image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # model predicts one of the 1000 ImageNet classes
    predicted_label = logits.argmax(-1).item()
    print(model.config.id2label[predicted_label])
    # tabby, tabby cat
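The example above prints only the single most likely class. As a hedged sketch, the helper below turns logits into the top few labels with their probabilities. The name `top_k_labels` is hypothetical, and the dummy logits and label mapping are stand-ins so the snippet runs without downloading the checkpoint.

```python
import torch


def top_k_labels(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs for one image.

    Assumes `logits` has shape (1, num_classes) and `id2label` maps
    integer class ids to label strings.
    """
    probs = torch.softmax(logits, dim=-1)[0]
    top = torch.topk(probs, k)
    return [
        (id2label[index], prob)
        for prob, index in zip(top.values.tolist(), top.indices.tolist())
    ]


# Stand-in values for illustration only; not real model output.
dummy_logits = torch.tensor([[0.1, 2.0, 0.3, 1.5]])
dummy_id2label = {0: "a", 1: "b", 2: "c", 3: "d"}

for label, prob in top_k_labels(dummy_logits, dummy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the call would be `top_k_labels(logits, model.config.id2label)`, using the `logits` and `model` variables from the example above.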
For more code examples, see the documentation.