SuperPoint
This model was released on 2017-12-20 and added to Hugging Face Transformers on 2024-03-19.
SuperPoint is the result of self-supervised training of a fully-convolutional network for interest point detection and description. The model detects interest points that are repeatable under homographic transformations and provides a descriptor for each point. Usage on its own is limited, but it can be used as a feature extractor for other tasks such as homography estimation and image matching.

You can find all the original SuperPoint checkpoints under the Magic Leap Community organization.
This model was contributed by stevenbucaille.
Click on the SuperPoint models in the right sidebar for more examples of how to apply SuperPoint to different computer vision tasks.
The example below demonstrates how to detect interest points in an image with SuperPointForKeypointDetection and AutoImageProcessor.

```py
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process to get keypoints, scores, and descriptors
image_size = (image.height, image.width)
processed_outputs = processor.post_process_keypoint_detection(outputs, [image_size])
```
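Each entry of `processed_outputs` is a dictionary holding the keypoints, scores, and descriptors for one image. A minimal sketch of inspecting the result, continuing from the example above (shapes in the comments are illustrative):

```py
detections = processed_outputs[0]
print(detections["keypoints"].shape)    # (num_keypoints, 2): absolute (x, y) coordinates
print(detections["scores"].shape)       # (num_keypoints,): confidence score per keypoint
print(detections["descriptors"].shape)  # (num_keypoints, 256): one descriptor per keypoint
```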
Notes
SuperPoint outputs a dynamic number of keypoints per image, which makes it suitable for tasks requiring variable-length feature representations.
```py
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
import torch
from PIL import Image
import requests

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")

url_image_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_1 = Image.open(requests.get(url_image_1, stream=True).raw)
url_image_2 = "http://images.cocodataset.org/test-stuff2017/000000000568.jpg"
image_2 = Image.open(requests.get(url_image_2, stream=True).raw)

images = [image_1, image_2]
inputs = processor(images, return_tensors="pt")

# Example of handling dynamic keypoint output
outputs = model(**inputs)
keypoints = outputs.keypoints      # Shape varies per image
scores = outputs.scores            # Confidence scores for each keypoint
descriptors = outputs.descriptors  # 256-dimensional descriptors
mask = outputs.mask                # Value of 1 corresponds to a keypoint detection
```
The model provides both keypoint coordinates and their corresponding descriptors (256-dimensional vectors) in a single forward pass.
For batch processing with multiple images, use the `mask` attribute to retrieve the information belonging to each image, or use `post_process_keypoint_detection` from `SuperPointImageProcessor` to retrieve the per-image information:

```py
# Batch processing example, reusing the images loaded above
images = [image_1, image_2]
inputs = processor(images, return_tensors="pt")
outputs = model(**inputs)
image_sizes = [(img.height, img.width) for img in images]
processed_outputs = processor.post_process_keypoint_detection(outputs, image_sizes)
```
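Each element of `processed_outputs` then corresponds to one input image. A short sketch of iterating over the batch, continuing from the example above:

```py
for image_index, detections in enumerate(processed_outputs):
    # Keypoints come back in absolute (x, y) pixel coordinates per image
    print(f"Image {image_index}: {len(detections['keypoints'])} keypoints detected")
```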
You can then plot the keypoints on the image of your choice to visualize the result:

```py
import matplotlib.pyplot as plt

plt.axis("off")
plt.imshow(image_1)
plt.scatter(
    processed_outputs[0]["keypoints"][:, 0],
    processed_outputs[0]["keypoints"][:, 1],
    c=processed_outputs[0]["scores"] * 100,
    s=processed_outputs[0]["scores"] * 50,
    alpha=0.8,
)
plt.savefig("output_image.png")
```

Resources
- Refer to this notebook for an inference and visualization example.
SuperPointConfig
class transformers.SuperPointConfig
< source >( encoder_hidden_sizes: list = [64, 64, 128, 128] decoder_hidden_size: int = 256 keypoint_decoder_dim: int = 65 descriptor_decoder_dim: int = 256 keypoint_threshold: float = 0.005 max_keypoints: int = -1 nms_radius: int = 4 border_removal_distance: int = 4 initializer_range = 0.02 **kwargs )
Parameters
- encoder_hidden_sizes (`List`, optional, defaults to `[64, 64, 128, 128]`) — The number of channels in each convolutional layer in the encoder.
- decoder_hidden_size (`int`, optional, defaults to 256) — The hidden size of the decoder.
- keypoint_decoder_dim (`int`, optional, defaults to 65) — The output dimension of the keypoint decoder.
- descriptor_decoder_dim (`int`, optional, defaults to 256) — The output dimension of the descriptor decoder.
- keypoint_threshold (`float`, optional, defaults to 0.005) — The threshold to use for extracting keypoints.
- max_keypoints (`int`, optional, defaults to -1) — The maximum number of keypoints to extract. If `-1`, will extract all keypoints.
- nms_radius (`int`, optional, defaults to 4) — The radius for non-maximum suppression.
- border_removal_distance (`int`, optional, defaults to 4) — The distance from the border to remove keypoints.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a SuperPointForKeypointDetection. It is used to instantiate a SuperPoint model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SuperPoint magic-leap-community/superpoint architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import SuperPointConfig, SuperPointForKeypointDetection
>>> # Initializing a SuperPoint superpoint style configuration
>>> configuration = SuperPointConfig()
>>> # Initializing a model from the superpoint style configuration
>>> model = SuperPointForKeypointDetection(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
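Detection behavior can be tuned through the configuration. A minimal sketch using the documented `max_keypoints` and `keypoint_threshold` parameters (the values are illustrative, not recommended defaults):

```py
from transformers import SuperPointConfig, SuperPointForKeypointDetection

# Keep at most the 512 highest-scoring keypoints and raise the detection threshold
configuration = SuperPointConfig(max_keypoints=512, keypoint_threshold=0.01)
model = SuperPointForKeypointDetection(configuration)
```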
SuperPointImageProcessor
class transformers.SuperPointImageProcessor
< source >( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_grayscale: bool = False **kwargs )
Parameters
- do_resize (`bool`, optional, defaults to `True`) — Controls whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by `do_resize` in the `preprocess` method.
- size (`dict[str, int]`, optional, defaults to `{"height": 480, "width": 640}`) — Resolution of the output image after `resize` is applied. Only has an effect if `do_resize` is set to `True`. Can be overridden by `size` in the `preprocess` method.
- resample (`Resampling`, optional, defaults to `2`) — Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `True`) — Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `1/255`) — Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` method.
- do_grayscale (`bool`, optional, defaults to `False`) — Whether to convert the image to grayscale. Can be overridden by `do_grayscale` in the `preprocess` method.
Constructs a SuperPoint image processor.
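A minimal sketch of instantiating the processor directly with non-default settings (the values are illustrative):

```py
from transformers import SuperPointImageProcessor

# Resize inputs to 480x640 and convert them to grayscale before inference
processor = SuperPointImageProcessor(
    size={"height": 480, "width": 640},
    do_grayscale=True,
)
```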
preprocess
< source >( images do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_grayscale: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )
Parameters
- images (`ImageInput`) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the image.
- size (`dict[str, int]`, optional, defaults to `self.size`) — Size of the output image after `resize` has been applied, as `{"height": int, "width": int}`. Only has an effect if `do_resize` is set to `True`.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the image values to the [0, 1] range.
- rescale_factor (`float`, optional, defaults to `self.rescale_factor`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_grayscale (`bool`, optional, defaults to `self.do_grayscale`) — Whether to convert the image to grayscale.
- return_tensors (`str` or `TensorType`, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of `np.ndarray`.
  - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
  - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Preprocess an image or batch of images.
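A minimal usage sketch, assuming a checkpoint-backed processor and the COCO test image used elsewhere on this page (the printed shape depends on the configured `size`):

```py
from transformers import SuperPointImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = SuperPointImageProcessor.from_pretrained("magic-leap-community/superpoint")
inputs = processor.preprocess(image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 480, 640]) with the default size
```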
SuperPointImageProcessorFast
class transformers.SuperPointImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.superpoint.image_processing_superpoint.SuperPointImageProcessorKwargs] )
Constructs a fast SuperPoint image processor.
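The fast processor can also be requested through `AutoImageProcessor`; a minimal sketch, assuming your installation ships the fast variant:

```py
from transformers import AutoImageProcessor

# use_fast=True selects the fast image processor implementation when available
processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint", use_fast=True)
```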
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
- images (`Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]`) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- do_convert_rgb (`bool`, optional) — Whether to convert the image to RGB.
- do_resize (`bool`, optional) — Whether to resize the image.
- size (`Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]`) — Describes the maximum input dimensions to the model.
- crop_size (`Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]`) — Size of the output image after applying `center_crop`.
- resample (`Annotated[Union[PILImageResampling, int, NoneType], None]`) — Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only has an effect if `do_resize` is set to `True`.
- do_rescale (`bool`, optional) — Whether to rescale the image.
- rescale_factor (`float`, optional) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_normalize (`bool`, optional) — Whether to normalize the image.
- image_mean (`Union[float, list[float], tuple[float, ...], NoneType]`) — Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- image_std (`Union[float, list[float], tuple[float, ...], NoneType]`) — Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- do_pad (`bool`, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image. The exact padding strategy depends on the model.
- pad_size (`Annotated[Union[int, list[int], tuple[int, ...], dict[str, int], NoneType], None]`) — The size in `{"height": int, "width": int}` to pad the images to. Must be larger than any image size provided for preprocessing. If `pad_size` is not provided, images will be padded to the largest height and width in the batch. Applied only when `do_pad=True`.
- do_center_crop (`bool`, optional) — Whether to center crop the image.
- data_format (`Union[str, ~image_utils.ChannelDimension, NoneType]`) — Only `ChannelDimension.FIRST` is supported. Added for compatibility with slow processors.
- input_data_format (`Union[str, ~image_utils.ChannelDimension, NoneType]`) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
- device (`Annotated[str, None]`, optional) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (`Annotated[Union[str, ~utils.generic.TensorType, NoneType], None]`) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- disable_grouping (`bool`, optional) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
Returns
`BatchFeature`

- data (`dict`) — Dictionary of lists/arrays/tensors returned by the call method ('pixel_values', etc.).
- tensor_type (`Union[None, str, TensorType]`, optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
post_process_keypoint_detection
< source >( outputs: SuperPointKeypointDescriptionOutput target_sizes: typing.Union[transformers.utils.generic.TensorType, list[tuple]] ) → List[Dict]
Parameters
- outputs (`SuperPointKeypointDescriptionOutput`) — Raw outputs of the model containing keypoints in a relative (x, y) format, with scores and descriptors.
- target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`) — Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size `(height, width)` of each image in the batch. This must be the original image size (before any processing).
Returns
List[Dict]
A list of dictionaries, each containing the keypoints in absolute coordinates according to `target_sizes`, along with the scores and descriptors for an image in the batch, as predicted by the model.
Converts the raw output of SuperPointForKeypointDetection into lists of keypoints, scores and descriptors with coordinates absolute to the original image sizes.
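A short sketch of where this call sits in the inference flow (continuing from the batch example earlier on this page, where `images`, `outputs`, and `processor` are already defined):

```py
# Map each image's raw relative keypoints back to absolute pixel coordinates
image_sizes = [(img.height, img.width) for img in images]
processed_outputs = processor.post_process_keypoint_detection(outputs, image_sizes)
for detections in processed_outputs:
    print(detections["keypoints"][:3])  # first three keypoints, absolute (x, y) pixels
```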
SuperPointForKeypointDetection
class transformers.SuperPointForKeypointDetection
< source >( config: SuperPointConfig )
Parameters
- config (SuperPointConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
SuperPoint model outputting keypoints and descriptors.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor labels: typing.Optional[torch.LongTensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.superpoint.modeling_superpoint.SuperPointKeypointDescriptionOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`) — The tensors corresponding to the input images. Pixel values can be obtained using SuperPointImageProcessor. See SuperPointImageProcessor.__call__() for details.
- labels (`torch.LongTensor`, optional) — Not used. SuperPoint does not currently support training; passing `labels` raises an error.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
`transformers.models.superpoint.modeling_superpoint.SuperPointKeypointDescriptionOutput` or `tuple(torch.FloatTensor)`

A `transformers.models.superpoint.modeling_superpoint.SuperPointKeypointDescriptionOutput` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (SuperPointConfig) and inputs.

- loss (`torch.FloatTensor` of shape `(1,)`, optional) — Loss computed during training.
- keypoints (`torch.FloatTensor` of shape `(batch_size, num_keypoints, 2)`) — Relative (x, y) coordinates of predicted keypoints in a given image.
- scores (`torch.FloatTensor` of shape `(batch_size, num_keypoints)`) — Scores of predicted keypoints.
- descriptors (`torch.FloatTensor` of shape `(batch_size, num_keypoints, descriptor_size)`) — Descriptors of predicted keypoints.
- mask (`torch.BoolTensor` of shape `(batch_size, num_keypoints)`) — Mask indicating which values in keypoints, scores and descriptors are keypoint information.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape `(batch_size, sequence_length, hidden_size)`. Hidden states (also called feature maps) of the model at the output of each stage.
The SuperPointForKeypointDetection forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, SuperPointForKeypointDetection
>>> import torch
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
>>> model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")
>>> inputs = processor(image, return_tensors="pt")
>>> outputs = model(**inputs)
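Continuing the example, a sketch of using the returned `mask` to keep only valid detections for one image (field names are taken from the output description above):

```py
>>> image_mask = outputs.mask[0].bool()
>>> keypoints = outputs.keypoints[0][image_mask]  # relative (x, y) coordinates for the first image
>>> scores = outputs.scores[0][image_mask]
```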