SUPERIX: Super-Resolution Intercomparison Exercise

Introduction

Super-resolution (SR) techniques are increasingly used to improve the spatial resolution of freely available satellite imagery, such as Sentinel-2 and Landsat. SR could significantly improve the accuracy of various remote sensing downstream tasks, including road detection, crop delineation, and object recognition. However, some researchers argue that the benefits of SR are primarily aesthetic, suggesting that its main value lies in creating more visually appealing maps or aiding visual interpretation.

Another criticism of SR is that it can degrade the original input data, potentially leading to incorrect conclusions. However, some SR methods appear more conservative than others in preserving reflectance integrity. Given this, a reliable benchmark is essential for providing quantitative assessments of the current state-of-the-art. Without such benchmarks, it remains difficult to conclusively determine the true impact of SR techniques on remote sensing data.

To establish a reliable framework, we propose the creation of a dedicated working group aimed at intercomparing super-resolution algorithms for Sentinel-2 data (SUPERIX). SR algorithms developed by teams from universities, research centers, industry, and space agencies are encouraged to participate in SUPERIX. This initiative will use OpenSR-test datasets and proposed metrics to evaluate the consistency with the original input data and the reliability of the high-frequency details introduced by the SR models.

In summary, many methods have been developed to address super-resolution in satellite imagery, but very few studies have quantitatively intercompared state-of-the-art methods in this domain.

SUPERIX aims to intercompare SR algorithms for the ESA Sentinel-2 mission.
SUPERIX will involve defining reference datasets, metrics, and an analysis framework.
SUPERIX should help identify the strengths and weaknesses of existing algorithms and potential areas for improvement.

Teams and SR Algorithms

Are you interested? Contact us!

Validation Datasets

Accurate validation datasets will allow a detailed analysis of SR strengths and weaknesses.

Validation datasets might vary in the way they are sampled and generated:

cross-sensor or synthetic
spatial scale factor
geographical distribution

The performance of SR algorithms will also vary depending on the reference dataset, which can be attributed to differences in radiometry, spectral response, spatial alignment, effective spatial resolution, considered landscapes, etc.

About the high-resolution (HR) reference, we are considering:

naip: A set of 62 RGBNIR orthophotos mainly from agricultural and forest regions in the USA.
spot: A set of 10 SPOT images obtained from WorldStrat.
spain_urban: A set of 20 RGBNIR orthophotos, primarily from urban areas of Spain, including roads.
spain_crops: A set of 20 RGBNIR orthophotos, primarily taken from agricultural areas near cities in Spain.
venus: A set of 60 VENµS images obtained from SEN2VENµS.

Each HR reference includes the corresponding Sentinel-2 imagery preprocessed at the L1C and L2A levels. Here is an example of how to load each dataset.

import opensr_test

dataset = opensr_test.load("naip")
lr, hr = dataset["L2A"], dataset["HRharm"]

Quality Metrics

We propose the following metrics to assess the consistency of SR models:

Reflectance: This metric evaluates how SR affects the reflectance of the LR image, utilizing the Mean Absolute Error (MAE) distance by default. Lower values indicate better reflectance consistency. The SR image is downsampled to LR resolution using a triangular anti-aliasing filter and downsampling by the scale factor (bilinear interpolation).
Spectral: This metric measures how SR impacts the spectral signature of the LR image, employing the Spectral Angle Mapper (SAM) distance by default. Lower values indicate better spectral consistency, with angles measured in degrees. The SR image is downsampled to LR resolution using a triangular anti-aliasing filter and downsampling by the scale factor (bilinear interpolation).
Spatial: This metric assesses the spatial alignment between SR and LR images, utilizing the Phase Correlation Coefficient (PCC) by default. Some SR models introduce spatial shifts, which this metric detects. The SR image is downsampled to LR resolution using a triangular anti-aliasing filter and downsampling by the scale factor (bilinear interpolation).
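
The three consistency metrics above can be illustrated with a minimal NumPy sketch. It is not the opensr-test implementation: the function names are hypothetical, the SR image is assumed to be already downsampled to the LR grid, arrays are assumed to be shaped (bands, height, width), and the spatial check only estimates an integer pixel shift through phase correlation.

```python
import numpy as np

def reflectance_mae(sr_down, lr):
    """Mean absolute error between the downsampled SR image and the LR image."""
    return float(np.mean(np.abs(sr_down - lr)))

def spectral_angle_deg(sr_down, lr, eps=1e-12):
    """Mean per-pixel spectral angle, in degrees, computed along the band axis."""
    dot = np.sum(sr_down * lr, axis=0)
    norms = np.linalg.norm(sr_down, axis=0) * np.linalg.norm(lr, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def phase_correlation_shift(a, b):
    """Integer (dy, dx) shift of single-band image a relative to b."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.abs(cross) + 1e-12  # keep only the phase information
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative (wrapped) shifts.
    dy, dx = (p - s if p > s // 2 else p for p, s in zip(peak, corr.shape))
    return int(dy), int(dx)
```

A shift of (0, 0) together with low MAE and low spectral angle would suggest that the SR output stays consistent with its LR input.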

We propose three metrics to evaluate the high-frequency details introduced by SR models. The sum of these metrics always equals 1:

Improvements (im_score): This metric quantifies the similarity between the SR and HR images. A value closer to 1 indicates that the SR model closely corresponds to the HR image (i.e. improves the high-frequency details).
Omissions (om_score): This metric measures the similarity between the SR and LR images. A value closer to 1 suggests that the SR model closely resembles the LR image upsampled with bilinear interpolation (i.e. omits high-frequency details present in HR but not in LR).
Hallucinations (ha_score): This metric evaluates how far the SR image is from both the HR and LR images. A value closer to 1 indicates that the SR model deviates significantly from both references (i.e. hallucinates, introducing high-frequency details not present in HR).
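
One way to see why the three scores sum to 1 is to label every pixel with exactly one category and report the fraction of pixels in each. The sketch below is a hypothetical illustration of that idea, not the scoring used by opensr-test: the function name, the MAE distance, and the decision rule (a pixel is a hallucination when SR is farther from both HR and the upsampled LR than those two are from each other) are all assumptions.

```python
import numpy as np

def detail_scores(sr, hr, lr_up):
    """Return (im_score, om_score, ha_score) as pixel fractions.

    All inputs are shaped (bands, height, width) on the HR grid;
    lr_up is the LR image upsampled to HR resolution.
    """
    d_hr = np.mean(np.abs(sr - hr), axis=0)      # distance from SR to HR
    d_lr = np.mean(np.abs(sr - lr_up), axis=0)   # distance from SR to LR
    d_ref = np.mean(np.abs(lr_up - hr), axis=0)  # distance between the references

    # Assumed rule: far from both references means hallucination.
    hallucination = (d_hr > d_ref) & (d_lr > d_ref)
    improvement = ~hallucination & (d_hr <= d_lr)
    omission = ~hallucination & (d_hr > d_lr)

    n = d_hr.size
    return (
        float(improvement.sum() / n),
        float(omission.sum() / n),
        float(hallucination.sum() / n),
    )
```

Because the three masks are mutually exclusive and cover every pixel, the returned fractions always add up to 1 under this rule.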

Proposed Experiments

We are planning two experiments for both x4 and x2 scale factors. Participants are encouraged to submit their SR models for both scales. Additionally, models designed solely for the x4 scale will be assessed at the x2 scale by downsampling the SR image by a factor of 2.

In each experiment, we will employ two distinct approaches to evaluate the high-frequency details introduced by SR models. The first approach uses the Mean Absolute Error (MAE) as the distance metric for assessing high-frequency details. The second approach uses LPIPS. While MAE is sensitive to the intensity of high-frequency details, LPIPS is more sensitive to their structural differences. Contrasting the outcomes of these two metrics can offer a more complete understanding of the high-frequency details introduced by SR models. LPIPS is always computed on 32x32 patches of the HR image, while MAE is computed on 2x2 patches for x2 scale and 4x4 patches for x4 scale evaluations.
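
As a rough illustration of the patch-wise MAE described above, the following NumPy sketch averages the absolute error over non-overlapping patches whose side equals the scale factor. The function name and the (bands, height, width) layout are assumptions made for this example; the actual evaluation code may differ.

```python
import numpy as np

def patch_mae(a, b, patch):
    """MAE over non-overlapping patch x patch blocks.

    a and b are shaped (bands, height, width); the result is shaped
    (height // patch, width // patch).
    """
    _, h, w = a.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    err = np.abs(a - b).mean(axis=0)  # average over bands -> (height, width)
    blocks = err.reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3))   # average inside each patch
```

With `patch=2` this corresponds to the x2 setting and with `patch=4` to the x4 setting.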

Proposed Protocol

The SUPERIX working group should first agree on the validation datasets appropriate for SR, the definition of the best quality metrics, and how to quantify hallucinations.
Each team will submit their SR models up to the deadline.
We will have two different types of models: open-source and closed-source. To be considered open-source, the code must be available in this repository in a folder named after the model. Keep the code as simple as possible. See examples using the torch, diffusers, and tensorflow libraries here, here, and here. Closed-source models only need to provide the results in GeoTIFF format. See an example here.
The submission will be made through a pull request to this repository. The pull request MUST include the metadata.json file and the results in GeoTIFF format. The results must be at the same resolution as the HR image. We expect the following information in the metadata.json file:

{
  "name": "model_name",
  "authors": ["author1", "author2"],
  "affiliations": ["affiliation1", "affiliation2"],
  "description": "A brief description of the model",
  "code": "open-source" or "closed-source",
  "scale": "x2" or "x4",
  "url": "[OPTIONAL] URL to the model repository if it is open-source",
  "license": "license of the model"
}

The SUPERIX working group will evaluate the SR models after the deadline using the metrics discussed above.
After the metrics have been computed, we will first contact each team independently to share their results. If there are any issues with the submission, we will ask for clarification, and the team will have up to two weeks to provide the necessary corrections.
Questions and discussions will be held in the discussion section of this repository. The progress of the SUPERIX working group will be communicated through the discussion section and by email.
After all the participants have provided the necessary corrections, the results will be published in the discussion section of this repository.
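
Before opening a pull request, the metadata.json fields listed in the protocol can be checked locally. The sketch below is only a convenience example, not an official SUPERIX tool: the function name is hypothetical, and treating every field except "url" as required is an assumption based on the template.

```python
import json

# Assumed to be required; "url" is marked optional in the template.
REQUIRED_FIELDS = (
    "name", "authors", "affiliations", "description", "code", "scale", "license",
)
ALLOWED_CODE = ("open-source", "closed-source")
ALLOWED_SCALE = ("x2", "x4")

def check_metadata(meta):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]
    if "code" in meta and meta["code"] not in ALLOWED_CODE:
        problems.append(f"code must be one of {ALLOWED_CODE}")
    if "scale" in meta and meta["scale"] not in ALLOWED_SCALE:
        problems.append(f"scale must be one of {ALLOWED_SCALE}")
    for key in ("authors", "affiliations"):
        if key in meta and not isinstance(meta[key], list):
            problems.append(f"{key} must be a list")
    return problems

def check_metadata_text(text):
    """Parse the content of a metadata.json file and check it."""
    return check_metadata(json.loads(text))
```

For instance, `check_metadata_text(open("metadata.json").read())` would return an empty list for a file that follows the template with a single value chosen for "code" and "scale".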

Expected Outcomes

No single method is expected to be clearly superior across all metrics.
Validation scenes with major discrepancies between algorithms will be analyzed in detail.
A dedicated website and a technical report will be prepared to present the results and recommendations.
A research publication will be submitted to a remote sensing journal.
The paper will be prepared in Overleaf, and all participants will be invited to contribute to it.
