Transformers documentation
SuperGlue
SuperGlue
Overview
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
The abstract from the paper is the following:
This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems. The code and trained weights are publicly available at this URL.
How to use
Here is a quick example of using the model. Since this model is an image matching model, it requires pairs of images to be matched. The raw outputs contain the list of keypoints detected by the keypoint detector as well as the list of matches with their corresponding matching scores.
from transformers import AutoImageProcessor, AutoModel
import torch
from PIL import Image
import requests
url_image1 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg"
image1 = Image.open(requests.get(url_image1, stream=True).raw)
url_image2 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg"
image_2 = Image.open(requests.get(url_image2, stream=True).raw)
images = [image1, image2]
processor = AutoImageProcessor.from_pretrained("magic-leap-community/superglue_outdoor")
model = AutoModel.from_pretrained("magic-leap-community/superglue_outdoor")
inputs = processor(images, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)You can use the post_process_keypoint_matching method from the SuperGlueImageProcessor to get the keypoints and matches in a more readable format:
image_sizes = [[(image.height, image.width) for image in images]]
outputs = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
for i, output in enumerate(outputs):
    print("For the image pair", i)
    for keypoint0, keypoint1, matching_score in zip(
            output["keypoints0"], output["keypoints1"], output["matching_scores"]
    ):
        print(
            f"Keypoint at coordinate {keypoint0.numpy()} in the first image matches with keypoint at coordinate {keypoint1.numpy()} in the second image with a score of {matching_score}."
        )
From the outputs, you can visualize the matches between the two images using the following code:
import matplotlib.pyplot as plt
import numpy as np
# Create side by side image
merged_image = np.zeros((max(image1.height, image2.height), image1.width + image2.width, 3))
merged_image[: image1.height, : image1.width] = np.array(image1) / 255.0
merged_image[: image2.height, image1.width :] = np.array(image2) / 255.0
plt.imshow(merged_image)
plt.axis("off")
# Retrieve the keypoints and matches
output = outputs[0]
keypoints0 = output["keypoints0"]
keypoints1 = output["keypoints1"]
matching_scores = output["matching_scores"]
keypoints0_x, keypoints0_y = keypoints0[:, 0].numpy(), keypoints0[:, 1].numpy()
keypoints1_x, keypoints1_y = keypoints1[:, 0].numpy(), keypoints1[:, 1].numpy()
# Plot the matches
for keypoint0_x, keypoint0_y, keypoint1_x, keypoint1_y, matching_score in zip(
        keypoints0_x, keypoints0_y, keypoints1_x, keypoints1_y, matching_scores
):
    plt.plot(
        [keypoint0_x, keypoint1_x + image1.width],
        [keypoint0_y, keypoint1_y],
        color=plt.get_cmap("RdYlGn")(matching_score.item()),
        alpha=0.9,
        linewidth=0.5,
    )
    plt.scatter(keypoint0_x, keypoint0_y, c="black", s=2)
    plt.scatter(keypoint1_x + image1.width, keypoint1_y, c="black", s=2)
# Save the plot
plt.savefig("matched_image.png", dpi=300, bbox_inches='tight')
plt.close()
This model was contributed by stevenbucaille. The original code can be found here.
SuperGlueConfig
class transformers.SuperGlueConfig
< source >( keypoint_detector_config: SuperPointConfig = None hidden_size: int = 256 keypoint_encoder_sizes: typing.List[int] = None gnn_layers_types: typing.List[str] = None num_attention_heads: int = 4 sinkhorn_iterations: int = 100 matching_threshold: float = 0.0 initializer_range: float = 0.02 **kwargs )
Parameters
-  keypoint_detector_config (Union[AutoConfig, dict], optional, defaults toSuperPointConfig) — The config object or dictionary of the keypoint detector.
-  hidden_size (int, optional, defaults to 256) — The dimension of the descriptors.
-  keypoint_encoder_sizes (List[int], optional, defaults to[32, 64, 128, 256]) — The sizes of the keypoint encoder layers.
-  gnn_layers_types (List[str], optional, defaults to['self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross', 'self', 'cross']) — The types of the GNN layers. Must be either ‘self’ or ‘cross’.
-  num_attention_heads (int, optional, defaults to 4) — The number of heads in the GNN layers.
-  sinkhorn_iterations (int, optional, defaults to 100) — The number of Sinkhorn iterations.
-  matching_threshold (float, optional, defaults to 0.0) — The matching threshold.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a SuperGlueModel. It is used to instantiate a
SuperGlue model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the SuperGlue
magic-leap-community/superglue_indoor architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Examples:
>>> from transformers import SuperGlueConfig, SuperGlueModel
>>> # Initializing a SuperGlue superglue style configuration
>>> configuration = SuperGlueConfig()
>>> # Initializing a model from the superglue style configuration
>>> model = SuperGlueModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configSuperGlueImageProcessor
class transformers.SuperGlueImageProcessor
< source >( do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_grayscale: bool = True **kwargs )
Parameters
-  do_resize (bool, optional, defaults toTrue) — Controls whether to resize the image’s (height, width) dimensions to the specifiedsize. Can be overriden bydo_resizein thepreprocessmethod.
-  size (Dict[str, int]optional, defaults to{"height" -- 480, "width": 640}): Resolution of the output image afterresizeis applied. Only has an effect ifdo_resizeis set toTrue. Can be overriden bysizein thepreprocessmethod.
-  resample (PILImageResampling, optional, defaults toResampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overriden byresamplein thepreprocessmethod.
-  do_rescale (bool, optional, defaults toTrue) — Whether to rescale the image by the specified scalerescale_factor. Can be overriden bydo_rescalein thepreprocessmethod.
-  rescale_factor (intorfloat, optional, defaults to1/255) — Scale factor to use if rescaling the image. Can be overriden byrescale_factorin thepreprocessmethod.
-  do_grayscale (bool, optional, defaults toTrue) — Whether to convert the image to grayscale. Can be overriden bydo_grayscalein thepreprocessmethod.
Constructs a SuperGlue image processor.
post_process_keypoint_matching
< source >( outputs: KeypointMatchingOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, typing.List[typing.Tuple]] threshold: float = 0.0  ) → List[Dict]
Parameters
-  outputs (KeypointMatchingOutput) — Raw outputs of the model.
-  target_sizes (torch.TensororList[Tuple[Tuple[int, int]]], optional) — Tensor of shape(batch_size, 2, 2)or list of tuples of tuples (Tuple[int, int]) containing the target size(height, width)of each image in the batch. This must be the original image size (before any processing).
-  threshold (float, optional, defaults to 0.0) — Threshold to filter out the matches with low scores.
Returns
List[Dict]
A list of dictionaries, each dictionary containing the keypoints in the first and second image of the pair, the matching scores and the matching indices.
Converts the raw output of KeypointMatchingOutput into lists of keypoints, scores and descriptors
with coordinates absolute to the original image sizes.
preprocess
< source >( images do_resize: bool = None size: typing.Dict[str, int] = None resample: Resampling = None do_rescale: bool = None rescale_factor: float = None do_grayscale: bool = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
-  images (ImageInput) — Image pairs to preprocess. Expects either a list of 2 images or a list of list of 2 images list with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, setdo_rescale=False.
-  do_resize (bool, optional, defaults toself.do_resize) — Whether to resize the image.
-  size (Dict[str, int], optional, defaults toself.size) — Size of the output image afterresizehas been applied. Ifsize["shortest_edge"]>= 384, the image is resized to(size["shortest_edge"], size["shortest_edge"]). Otherwise, the smaller edge of the image will be matched toint(size["shortest_edge"]/ crop_pct), after which the image is cropped to(size["shortest_edge"], size["shortest_edge"]). Only has an effect ifdo_resizeis set toTrue.
-  resample (PILImageResampling, optional, defaults toself.resample) — Resampling filter to use if resizing the image. This can be one ofPILImageResampling, filters. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toself.do_rescale) — Whether to rescale the image values between [0 - 1].
-  rescale_factor (float, optional, defaults toself.rescale_factor) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_grayscale (bool, optional, defaults toself.do_grayscale) — Whether to convert the image to grayscale.
-  return_tensors (strorTensorType, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOWor- 'tf': Return a batch of type- tf.Tensor.
- TensorType.PYTORCHor- 'pt': Return a batch of type- torch.Tensor.
- TensorType.NUMPYor- 'np': Return a batch of type- np.ndarray.
- TensorType.JAXor- 'jax': Return a batch of type- jax.numpy.ndarray.
 
- Unset: Return a list of 
-  data_format (ChannelDimensionorstr, optional, defaults toChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
Preprocess an image or batch of images.
resize
< source >( image: ndarray size: typing.Dict[str, int] data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
-  image (np.ndarray) — Image to resize.
-  size (Dict[str, int]) — Dictionary of the form{"height": int, "width": int}, specifying the size of the output image.
-  data_format (ChannelDimensionorstr, optional) — The channel dimension format of the output image. If not provided, it will be inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
Resize an image.
- preprocess
SuperGlueForKeypointMatching
class transformers.SuperGlueForKeypointMatching
< source >( config: SuperGlueConfig )
Parameters
- config (SuperGlueConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
SuperGlue model taking images as inputs and outputting the matching of them. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
SuperGlue feature matching middle-end
Given two sets of keypoints and locations, we determine the correspondences by:
- Keypoint Encoding (normalization + visual feature and location fusion)
- Graph Neural Network with multiple self and cross-attention layers
- Final projection layer
- Optimal Transport Layer (a differentiable Hungarian matching algorithm)
- Thresholding matrix based on mutual exclusivity and a match_threshold
The correspondence ids use -1 to indicate non-matching points.
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, 2020. https://arxiv.org/abs/1911.11763
forward
< source >( pixel_values: FloatTensor labels: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
-  pixel_values (torch.FloatTensorof shape(batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using SuperGlueImageProcessor. See SuperGlueImageProcessor.call() for details.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- Examples —
-  ```python —
from transformers import AutoImageProcessor, AutoModel import torch from PIL import Image import requests 
The SuperGlueForKeypointMatching forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
- forward
- post_process_keypoint_matching