SDO-FM: A foundation model for the Sun

1. Introduction

SDO-FM is a foundation model built on data from NASA’s Solar Dynamics Observatory (SDO) spacecraft. It integrates three separate instruments to encapsulate the Sun’s complex physical interactions in a multi-modal embedding space. The model can streamline scientific investigations involving SDO by making its enormous datasets more computationally accessible for heliophysics research, and it enables investigations that require instrument fusion.

The overall process for building SDO-FM is composed of four stages: (1) data preparation, (2) large foundation model (FM) training, (3) embedding extraction, and (4) fine-tuning or direct embedding usage for scientific validation cases. We refer to the data preparation collectively as the effort completed under SDOML [4], a machine-learning dataset of SDO.

Our models are based on autoencoders and are trained with an image reconstruction objective on data spanning the period from the satellite’s launch in 2010 to 2023. Once the models are trained, a compressed representation dataset is created from the embeddings by a full pass over the encoder. These compressed representations are called direct embeddings and provide a set of readily available SDO features at around two-thousandths (0.002) of the original size. Lastly, the direct embeddings, as well as standard model fine-tuning, are used to conduct scientific validation through a validation harness, which checks our results against past ML-based heliophysics approaches and compares their computational expense.

2. Quick Start

To run inference with any of the provided models, you can pull a Docker image from Docker Hub or follow the installation instructions below. To run any of the scientific tasks:

python scripts/main.py --config-name=embeddings_nvae_virtualeve

2.1 Installation

SDO-FM can be installed locally by installing the package directly from this GitHub repository. Using the Docker image is advised; however, the dependencies are also listed in the usual requirements.txt.

pip install -e .

2.2 Usage

To run any task, we assume execution inside a container built from the image described in the Dockerfile. Hydra configurations are kept in the experiments directory. The entry point is main.py, and its arguments select a configuration:

python scripts/main.py --config-name=default

CLI overrides are still possible with this selection, but be aware that some shells do not escape quotes or square brackets:

python scripts/main.py --config-name=default experiment.seed=37
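
One way to sidestep shell quoting issues is to assemble the command in Python and let shlex do the quoting. The sketch below is illustrative only: the helper name build_command and the override key data.channels are hypothetical and are not taken from this repository.

```python
import shlex

def build_command(config_name, overrides):
    """Return (argv list, shell-safe string) for a Hydra-style invocation."""
    argv = ["python", "scripts/main.py", f"--config-name={config_name}"]
    argv.extend(overrides)
    # shlex.join quotes each argument so a POSIX shell passes it through unchanged.
    return argv, shlex.join(argv)

argv, shell_line = build_command(
    "default",
    # The second override is a hypothetical key, used only to show bracket quoting.
    ["experiment.seed=37", "data.channels=[171A,193A]"],
)
print(shell_line)
```

The argv list can also be handed to subprocess.run directly, which bypasses the shell and its quoting rules altogether.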

2.3 Pre-training

python scripts/main.py --config-name=pretrain_32.2M_samae_HP

2.4 Notebooks

A series of notebooks is available to explore each of the four downstream tasks described later in this document.

3. Method

SDO-FM is composed of a backbone, an optional neck, and a head. We define the backbone as the model initially trained on the reconstruction task and the neck as the converter between backbone and head, with the head then selected for the downstream application (or validation task). We implement two model families as backbones: one stemming from a Nouveau Variational Autoencoder (NVAE) [11], the other from a Masked Autoencoder (MAE) [12]. Both are adapted to better accommodate our scientific dataset and to allow intermediate export of their latent spaces in the form of “embeddings.” We additionally evaluated various feature engineering options for handling the solar disk. The most effective included a simple lookup from Stonyhurst coordinates, a heliographic coordinate system for a fixed observer on Earth (suitable given the geosynchronous orbit), to pixel space.
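
As a rough illustration of a Stonyhurst-to-pixel lookup, the sketch below maps a heliographic longitude and latitude to image coordinates. It is not the repository's implementation: the function name, the orthographic projection, the absence of image roll, and the convention that y increases toward solar north are all simplifying assumptions.

```python
import math

def stonyhurst_to_pixel(lon_deg, lat_deg, center_xy, radius_px, b0_deg=0.0):
    """Map Stonyhurst heliographic (lon, lat) on the solar surface to image (x, y).

    Assumes an orthographic projection (observer at infinity), solar north up
    (no roll), and image y increasing toward solar north. b0_deg is the
    heliographic latitude of the observer. Returns None for far-side points.
    """
    lon, lat, b0 = (math.radians(v) for v in (lon_deg, lat_deg, b0_deg))
    # Unit-sphere coordinates: x toward solar west, y toward north, z toward observer.
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat) * math.cos(b0) - math.cos(lat) * math.cos(lon) * math.sin(b0)
    z = math.sin(lat) * math.sin(b0) + math.cos(lat) * math.cos(lon) * math.cos(b0)
    if z < 0:
        return None  # behind the limb, not visible from the observer
    cx, cy = center_xy
    return (cx + radius_px * x, cy + radius_px * y)

# Disk centre maps to the image centre when b0 = 0.
print(stonyhurst_to_pixel(0, 0, center_xy=(256, 256), radius_px=200))
```

For arrays whose row index increases downward, the y term would need its sign flipped.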

3.1 Model choice

Model selection was initially determined by the ability to capture solar phenomena, guided by applicability to SDO imagery and the ease of access to the embeddings in the latent space. The autoencoder architecture was selected for the backbone for ease of embedding construction and extraction. By design, autoencoders create a lower-dimensional representation during the encoding process. Other requirements included engineering efficiencies, such as the ability to mask the solar limb for on-disk experiments and to cheaply bias the model by importance sampling of areas of interest (e.g. active regions).
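
The two engineering requirements above, disk masking and importance sampling, can be sketched with NumPy as follows. The function names, the image size, and the weight map standing in for an active region are hypothetical; the repository may implement both quite differently.

```python
import numpy as np

def disk_mask(size, center, radius):
    """Boolean (size, size) array, True for pixels inside the solar disk."""
    rows, cols = np.ogrid[:size, :size]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2

def sample_pixels(weights, mask, n, rng):
    """Draw n on-disk pixel indices with probability proportional to weights."""
    p = np.where(mask, weights, 0.0).ravel()  # zero probability off the disk
    p = p / p.sum()
    flat = rng.choice(p.size, size=n, p=p)
    return np.unravel_index(flat, mask.shape)

mask = disk_mask(64, center=(32, 32), radius=28)
weights = np.ones((64, 64))
weights[20:30, 20:30] = 10.0  # hypothetical active region, sampled 10x more often
rng = np.random.default_rng(0)
rows, cols = sample_pixels(weights, mask, n=100, rng=rng)
```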

3.1.1 Solar-aware Masked Autoencoder

Masked Autoencoders (MAEs) learn to reconstruct images from which random components have been removed [13]. The approach follows the standard ViT patchification common to transformer-based computer vision: the image is deconstructed into patches between which the attention mechanism can learn relationships. The source of this “powerful expressivity” is attributed to “rich hidden representation” [14]. This is of particular interest in our scenario, as we seek to learn which components of solar imagery are of value for our scientific validation cases. This model was later extended to better handle temporal information in remote sensing tasks [15]. We have continued to iterate, adding “solar-awareness” by including the ability to process the nine wavelengths of interest to us from the Atmospheric Imaging Assembly, efficiencies for processing the solar disk, and the ability to optionally bias the model towards learning active regions of scientific interest.
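
The patchification and random masking steps can be sketched in NumPy as below. The patch size of 16, the image size, and the 75% mask ratio are assumptions for illustration, and the function names are hypothetical; only the nine-channel input comes from the text above.

```python
import numpy as np

def patchify(image, patch):
    """Split a (C, H, W) image into (num_patches, C * patch * patch) tokens."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (patch_rows, patch_cols, C, patch, patch)
    return x.reshape(-1, c * patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches kept visible to the encoder."""
    n_keep = int(num_patches * (1 - mask_ratio))
    return np.sort(rng.permutation(num_patches)[:n_keep])

image = np.arange(9 * 64 * 64, dtype=np.float32).reshape(9, 64, 64)  # nine channels
tokens = patchify(image, patch=16)
rng = np.random.default_rng(0)
keep = random_mask(len(tokens), mask_ratio=0.75, rng=rng)
visible_tokens = tokens[keep]  # only these would be passed to the encoder
```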

3.1.2 Nouveau-VAE

The Nouveau Variational Autoencoder (NVAE) is a deep hierarchical VAE created for image generation. Like the MAE, it is able to create a rich latent space; it does so using depth-wise separable convolutions and batch normalization. The NVIDIA team’s codebase was modified to permit access to the hierarchical structure so that embeddings can be extracted.
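
To give a sense of why depth-wise separable convolutions are attractive, the sketch below compares weight counts for a standard convolution and its separable counterpart. The channel counts and kernel size are arbitrary examples, not values taken from the SDO-FM configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise convolution followed by a 1 x 1 point-wise one."""
    return k * k * c_in + c_in * c_out

standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 1))
```

The saving grows with the kernel size, since the k x k term is no longer multiplied by the number of output channels.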

4. Scientific Validation Cases

Predict F10.7: This index is a proxy for solar irradiance that can be measured from the ground, as this frequency is not absorbed by the atmosphere. Can we achieve good agreement with ground measurements? There is limited scientific value in predicting a proxy measure such as F10.7; however, this simple task clearly indicates learned capacity in a single result.

Virtual EVE: In 2014, an instrument malfunction resulted in the loss of the MEGS-A module of SDO/EVE. With four years of overlapping data, [16, 17] used a hybrid CNN/linear regression model to successfully demonstrate the capability of machine learning methods to estimate missing EUV irradiance measurements from MEGS-A (and the degraded MEGS-B components of the EVE instrument). This validation task employs the embeddings constructed from AIA to understand the contributions of solar features to the EUV spectra; the mapping between instruments exists because the narrow-band images (SDO/AIA) and sun-as-a-star spectra (SDO/EVE) observe the same plasma distribution. A linear model accounts for a large portion of the relationship, while a CNN is used to correct for outlier events such as solar flares. There are known concerns regarding the model’s performance post-2020, as AIA instrument performance deviates further from the 2014 baseline. Some of these issues can be addressed by incorporating other sources of irradiance, such as data from sounding rockets, for training over longer periods, although these are sparse. Importantly, this outperforms a physics-based inversion approach [18].
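
The linear part of such a hybrid model can be sketched as an ordinary least-squares fit from embeddings to irradiance values. Everything below is synthetic and hypothetical: the embedding width, the number of spectral lines, and the function names are placeholders, and the CNN correction of [16, 17] is not reproduced.

```python
import numpy as np

def add_bias(x):
    """Append a column of ones so the fit includes an intercept."""
    return np.hstack([x, np.ones((len(x), 1))])

def fit_linear(embeddings, irradiance):
    """Least-squares weights mapping embeddings (plus bias) to irradiance."""
    weights, *_ = np.linalg.lstsq(add_bias(embeddings), irradiance, rcond=None)
    return weights

def predict_linear(weights, embeddings):
    return add_bias(embeddings) @ weights

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))      # synthetic stand-in for AIA embeddings
true_w = rng.normal(size=(9, 3))     # synthetic mapping to 3 spectral lines
eve = add_bias(emb) @ true_w         # synthetic, noise-free "irradiance"
w = fit_linear(emb, eve)
residual = eve - predict_linear(w, emb)  # what a non-linear correction would model
```

On real data the residual would be non-zero, and it is this remainder, dominated by events such as flares, that a non-linear component would be expected to capture.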

Missing Channel Reconstruction: The reconstruction of missing extreme ultraviolet (EUV) images from images at other wavelengths is a crucial task given the often low or unusable quality of image data frames from the Solar Dynamics Observatory (SDO). Currently, there is no effective method to recover these missing time steps. However, the foundation model is capable of reconstructing individual frames by leveraging contextual information available in other wavelength channels. This approach allows for interpolation to provide a best-guess estimate of missing data at any arbitrary time step. As with the Virtual EVE project and differential emission measure analysis [18], the overlapping temperature range covered by different SDO/AIA wavelength channels allows the temperature distribution of the underlying plasma to be reconstructed.

The overlapping range covered by different wavelength channels may enable the inference of properties of different temperature ranges. This overlap can be used within a machine learning model to produce an estimate that replaces data that is missing, corrupted, or otherwise unusable. Our objective is to develop a more robust model that operates with higher computational efficiency while producing results comparable to the current SOTA. Special attention is given to the model’s ability to capture non-linear relationships or rare events, such as intensity values in flaring regions. There are several uncertainties inherent in this process. Some channels may be more readily recreated than others, under the physical assumption that channels in the middle of the temperature/wavelength ranges have the most overlap with other channels and so potentially yield better results. However, this overlap might not always correspond to the actual missing data in the SDO. Addressing these uncertainties requires an understanding of the shortfalls to determine the appropriateness of this reconstruction technique in different scenarios.

Autocalibration: The SDO/AIA EUV channels exhibit degradation due to exposure to the same emissions they are intended to measure. This degradation results in apparent dimming over time across multiple EUV channels with unique characteristics. This poses challenges for long-term studies, as degradation trends within the dataset need to be corrected. Until 2014, SDO utilized EVE to correct this degradation. As discussed, a malfunction of SDO/EVE resulted in the loss of the MEGS-A component, and calibration is currently performed by sounding rocket flights. In response to this, [19] used a CNN to reconstruct the Atmospheric Imaging Assembly (AIA) multi-channel degradation curves.

Data requirements for this study include the SDOML data from AIA as well as older correction tables. The sampling requirement is minimal, with data being required once per day or even less frequently. Traditional SOTA methods, such as those performed by the Lockheed Martin Solar and Astrophysics Laboratory (LMSAL), involve calibration using sounding rocket flights. These methods, while accurate, are expensive and technically demanding. Our goal is to reproduce the results from [19] with greater efficiency in terms of data required and computational resources. This efficiency is evaluated through an examination of the resultant images compared to those produced by SOTA calibration pipelines, alongside intensity histograms, data spike analysis, and other metrics.
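
To make the degradation problem concrete, the sketch below treats dimming as a single multiplicative factor per channel, which is an assumption made here for illustration. The estimator is deliberately naive: it needs an undegraded reference image, which is not available in practice, and it is not the CNN approach of [19]. All names are hypothetical.

```python
import numpy as np

def apply_dimming(image, factor):
    """Simulate degradation by scaling intensities with a factor in (0, 1]."""
    return image * factor

def correct_dimming(observed, factor):
    """Undo a known or estimated dimming factor."""
    return observed / factor

def estimate_factor(observed, reference):
    """Naive estimate from the ratio of mean intensities against a reference."""
    return float(observed.mean() / reference.mean())

rng = np.random.default_rng(0)
clean = rng.uniform(1.0, 2.0, size=(32, 32))  # synthetic single-channel image
dimmed = apply_dimming(clean, 0.7)
est = estimate_factor(dimmed, clean)
restored = correct_dimming(dimmed, est)
```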

5. Results

Overall, our model families were evaluated on their backbone reconstruction task and against our four scientific validation cases. In all but the autocalibration task, they matched or surpassed the accuracy of their classical counterparts in a fraction of the required time. In the autocalibration case, the direct embedding approach matched the accuracy but took additional training time.

5.1 Reconstruction

Loss for the reconstruction task is measured by pixel RMSE within the solar disk. SAMAE results presented in fig. 5 indicate a clear ability to reconstruct most wavelengths with a small embedding dimension (128) and within a small number of training epochs (10). Interestingly, this model struggles to reconstruct 131 and 171 Å, which is likely due to a normalization error we are still investigating. The Nouveau-VAE model on raw pixel intensity performs better, even when including the solar limb.
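
A pixel RMSE restricted to the solar disk can be sketched as follows. The circular mask, the (C, H, W) array layout, and the function name are assumptions; the repository may define the disk and normalise intensities differently.

```python
import numpy as np

def disk_rmse(pred, target, center, radius):
    """RMSE between (C, H, W) arrays over pixels inside a circular disk."""
    _, h, w = target.shape
    rows, cols = np.ogrid[:h, :w]
    mask = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    diff = (pred - target)[:, mask]  # shape (C, n_disk_pixels)
    return float(np.sqrt(np.mean(diff ** 2)))

target = np.zeros((2, 8, 8))
pred = target.copy()
pred[:, 0, 0] = 100.0  # large error outside the disk, ignored by the metric
off_disk_only = disk_rmse(pred, target, center=(4, 4), radius=2)
uniform_error = disk_rmse(target + 2.0, target, center=(4, 4), radius=2)
```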

5.2 Direct Embeddings

Training each scientific validation case on the embeddings directly generally led to much faster training and to matching or surpassing accuracy. An effort was also made to evaluate the embeddings outside of the scientific cases through embedding-to-embedding comparison. The common t-SNE approach was applied to a small one-year sample, and there was separation by solar activity. This approach, however, is still fairly opaque, and hence the validation approaches are considered more appropriate.

Acknowledgements

This work is the research product of the SDO-FM: A Multi-Modal Foundation Model POC for SDO. This has been funded and supported by NASA under Grant award No 80NSSC24K0701. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration (NASA). The research and its outputs have been designed, managed and delivered by Trillium Technologies Inc (https://trillium.tech). Trillium is a research and development company with a focus on intelligent systems and collaborative communities for planetary stewardship, space exploration and human health. Trillium aspires to ensure that the latest tools and techniques in Artificial Intelligence (AI) and Machine Learning (ML) are applied to developing open science for all Humankind.

Authors

James Walsh, University of Cambridge
Daniel Gass, University of Central Lancashire
Raul Ramos Pollan, Universidad Industrial de Santander
Richard Galvez, Pure Storage
Paul Wright, Dublin Institute for Advanced Studies
Atılım Güneş Baydin, University of Oxford
Noah Kasmanoff, AE Studio
Jason Naradowsky, University of Tokyo

PI: Anne Spalding, Trillium Technologies Inc
Co-I: James Parr, Trillium Technologies Inc
