Autoformer
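Overview
The Autoformer model was proposed in Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
The abstract from the paper is the following:
Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.
This model was contributed by elisim and kashif. The original code can be found here.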
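The decomposition idea described above can be illustrated with a small sketch. It assumes the trend is estimated with a centered moving average and the seasonal part is whatever remains; the helper names (moving_average, decompose) and the edge-value padding are illustrative choices for this sketch, not the library's code.

```python
def moving_average(series, kernel_size):
    """Centered moving average that keeps the input length.

    Assumption for this sketch: the series is padded by repeating its
    first and last values so every window has kernel_size entries.
    """
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    padded = [series[0]] * front + list(series) + [series[-1]] * back
    return [
        sum(padded[i : i + kernel_size]) / kernel_size
        for i in range(len(series))
    ]


def decompose(series, kernel_size=25):
    """Split a series into (seasonal, trend), with seasonal = series - trend."""
    trend = moving_average(series, kernel_size)
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


# Toy series: a linear trend plus a repeating period-4 pattern.
series = [0.5 * t + [1.0, 0.0, -1.0, 0.0][t % 4] for t in range(24)]
seasonal, trend = decompose(series, kernel_size=5)
```

By construction the two parts add back up to the input, so the split loses no information; only the smoothing window decides how much variation is assigned to the trend.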
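Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Autoformer. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- Check out the Autoformer blog post on the HuggingFace blog: Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)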
AutoformerConfig
class transformers.AutoformerConfig
( prediction_length: Optional = None context_length: Optional = None distribution_output: str = 'student_t' loss: str = 'nll' input_size: int = 1 lags_sequence: List = [1, 2, 3, 4, 5, 6, 7] scaling: bool = True num_time_features: int = 0 num_dynamic_real_features: int = 0 num_static_categorical_features: int = 0 num_static_real_features: int = 0 cardinality: Optional = None embedding_dimension: Optional = None d_model: int = 64 encoder_attention_heads: int = 2 decoder_attention_heads: int = 2 encoder_layers: int = 2 decoder_layers: int = 2 encoder_ffn_dim: int = 32 decoder_ffn_dim: int = 32 activation_function: str = 'gelu' dropout: float = 0.1 encoder_layerdrop: float = 0.1 decoder_layerdrop: float = 0.1 attention_dropout: float = 0.1 activation_dropout: float = 0.1 num_parallel_samples: int = 100 init_std: float = 0.02 use_cache: bool = True is_encoder_decoder = True label_length: int = 10 moving_average: int = 25 autocorrelation_factor: int = 3 **kwargs )
Parameters
- prediction_length (int) — The prediction length for the decoder. In other words, the prediction horizon of the model.
- context_length (int, optional, defaults to prediction_length) — The context length for the encoder. If unset, the context length will be the same as the prediction_length.
- distribution_output (string, optional, defaults to "student_t") — The distribution emission head for the model. Could be either “student_t”, “normal” or “negative_binomial”.
- loss (string, optional, defaults to "nll") — The loss function for the model corresponding to the distribution_output head. For parametric distributions it is the negative log likelihood (nll) - which currently is the only supported one.
- input_size (int, optional, defaults to 1) — The size of the target variable which by default is 1 for univariate targets. Would be > 1 in case of multivariate targets.
- lags_sequence (list[int], optional, defaults to [1, 2, 3, 4, 5, 6, 7]) — The lags of the input time series as covariates often dictated by the frequency. Default is [1, 2, 3, 4, 5, 6, 7].
- scaling (bool, optional, defaults to True) — Whether to scale the input targets.
- num_time_features (int, optional, defaults to 0) — The number of time features in the input time series.
- num_dynamic_real_features (int, optional, defaults to 0) — The number of dynamic real valued features.
- num_static_categorical_features (int, optional, defaults to 0) — The number of static categorical features.
- num_static_real_features (int, optional, defaults to 0) — The number of static real valued features.
- cardinality (list[int], optional) — The cardinality (number of different values) for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
- embedding_dimension (list[int], optional) — The dimension of the embedding for each of the static categorical features. Should be a list of integers, having the same length as num_static_categorical_features. Cannot be None if num_static_categorical_features is > 0.
- d_model (int, optional, defaults to 64) — Dimensionality of the transformer layers.
- encoder_layers (int, optional, defaults to 2) — Number of encoder layers.
- decoder_layers (int, optional, defaults to 2) — Number of decoder layers.
- encoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer encoder.
- decoder_attention_heads (int, optional, defaults to 2) — Number of attention heads for each attention layer in the Transformer decoder.
- encoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in encoder.
- decoder_ffn_dim (int, optional, defaults to 32) — Dimension of the “intermediate” (often named feed-forward) layer in decoder.
- activation_function (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and decoder. If string, "gelu" and "relu" are supported.
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the encoder, and decoder.
- encoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each encoder layer.
- decoder_layerdrop (float, optional, defaults to 0.1) — The dropout probability for the attention and fully connected layers for each decoder layer.
- attention_dropout (float, optional, defaults to 0.1) — The dropout probability for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.1) — The dropout probability used between the two layers of the feed-forward networks.
- num_parallel_samples (int, optional, defaults to 100) — The number of samples to generate in parallel for each time step of inference.
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
- use_cache (bool, optional, defaults to True) — Whether to use the past key/values attentions (if applicable to the model) to speed up decoding.
- label_length (int, optional, defaults to 10) — Start token length of the Autoformer decoder, which is used for direct multi-step prediction (i.e. non-autoregressive generation).
- moving_average (int, defaults to 25) — The window size of the moving average. In practice, it’s the kernel size in AvgPool1d of the Decomposition Layer.
- autocorrelation_factor (int, defaults to 3) — “Attention” (i.e. AutoCorrelation mechanism) factor which is used to find top k autocorrelations delays. It’s recommended in the paper to set it to a number between 1 and 5.
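To make the role of autocorrelation_factor more concrete, here is a rough sketch of picking the top k delays of a series. It assumes k = int(factor * ln(length)), which mirrors the k = ⌊c × log L⌋ rule from the paper, and it computes a circular autocorrelation directly rather than with an FFT. The function names are hypothetical, and the model's internal implementation may differ in its details.

```python
import math


def autocorrelation(series, lag):
    """Circular autocorrelation of a mean-centered series at one lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    return sum(centered[t] * centered[(t - lag) % n] for t in range(n)) / n


def top_k_delays(series, autocorrelation_factor=3):
    """Return the k most correlated delays, k = int(factor * ln(length)).

    Assumption for this sketch: lag 0 is skipped, since a series is
    always perfectly correlated with itself.
    """
    n = len(series)
    k = int(autocorrelation_factor * math.log(n))
    scores = {lag: autocorrelation(series, lag) for lag in range(1, n)}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# A period-6 sine of length 48: multiples of 6 are the best-matching delays.
series = [math.sin(2 * math.pi * t / 6) for t in range(48)]
delays = top_k_delays(series, autocorrelation_factor=1)
```

With a larger factor more delays are kept, which is why small values such as those in the recommended range keep the aggregation focused on the dominant periods.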
This is the configuration class to store the configuration of an AutoformerModel. It is used to instantiate an Autoformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Autoformer huggingface/autoformer-tourism-monthly architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import AutoformerConfig, AutoformerModel
>>> # Initializing a default Autoformer configuration
>>> configuration = AutoformerConfig()
>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = AutoformerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
AutoformerModel
class transformers.AutoformerModel
( config: AutoformerConfig )
Parameters
- config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Autoformer Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: Optional = None static_real_features: Optional = None future_values: Optional = None future_time_features: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None use_cache: Optional = None return_dict: Optional = None ) → transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)
Parameters
- past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, that serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past which are added in order to serve as “extra context”. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features).
  The sequence length here is equal to context_length + max(config.lags_sequence).
  Missing values need to be replaced with zeros.
- past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.
  These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires the user to provide additional time features.
  The Autoformer only learns additional embeddings for static_categorical_features.
- past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
  Static categorical features are features which have the same value for all time steps (static over time).
  A typical example of a static categorical feature is a time series ID.
- static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
  Static real features are features which have the same value for all time steps (static over time).
  A typical example of a static real feature is promotion information.
- future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.
  See the demo notebook and code snippets for details.
  Missing values need to be replaced with zeros.
- future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.
  These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires the user to provide additional features.
  The Autoformer only learns additional embeddings for static_categorical_features.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
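As a small illustration of the past_values and past_observed_mask descriptions above, the following sketch computes the expected history length (context_length + max(lags_sequence)) and zero-fills missing entries while recording which ones were observed. It uses plain Python lists rather than tensors, and the helper names required_past_length and fill_missing are made up for this example.

```python
import math


def required_past_length(context_length, lags_sequence):
    """Sequence length expected for past_values."""
    return context_length + max(lags_sequence)


def fill_missing(values):
    """Replace NaNs with zeros and return (filled_values, observed_mask).

    The mask holds 1 for observed entries and 0 for entries that were
    missing and have been zero-filled.
    """
    mask = [0 if math.isnan(v) else 1 for v in values]
    filled = [0.0 if math.isnan(v) else v for v in values]
    return filled, mask


lags_sequence = [1, 2, 3, 4, 5, 6, 7]
past_length = required_past_length(context_length=24, lags_sequence=lags_sequence)

raw = [1.0, float("nan"), 3.0, 4.0]
past_values, past_observed_mask = fill_missing(raw)
```

In practice these lists would be converted to tensors of the documented shapes before being passed to the model.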
Returns
transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or tuple(torch.FloatTensor)
A transformers.models.autoformer.modeling_autoformer.AutoformerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the decoder of the model.
  If past_key_values is used only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
- trend (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Trend tensor for each time series.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to shift back to the original magnitude.
- scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window which is used to give the model inputs of the same magnitude and then used to rescale back to the original magnitude.
- static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch which are copied to the covariates at inference time.
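The loc and scale outputs described above can be pictured with a simple shift-and-rescale round trip. The sketch below assumes a mean/standard-deviation scheme computed over observed values only; this is one possible scheme chosen for illustration, and the scaler the model actually applies may differ. All function names are hypothetical.

```python
import math


def loc_and_scale(values, observed_mask, minimum_scale=1e-5):
    """Mean and standard deviation over the observed entries of a window."""
    observed = [v for v, m in zip(values, observed_mask) if m]
    if not observed:
        # Nothing observed: fall back to an identity transform.
        return 0.0, 1.0
    loc = sum(observed) / len(observed)
    variance = sum((v - loc) ** 2 for v in observed) / len(observed)
    # minimum_scale keeps the scale strictly positive for constant windows.
    scale = math.sqrt(variance + minimum_scale)
    return loc, scale


def normalize(values, loc, scale):
    """Bring values to a common magnitude before they enter the model."""
    return [(v - loc) / scale for v in values]


def denormalize(values, loc, scale):
    """Map model-space values back to the original magnitude."""
    return [v * scale + loc for v in values]


context = [100.0, 0.0, 120.0, 140.0]  # second entry was missing and zero-filled
mask = [1, 0, 1, 1]
loc, scale = loc_and_scale(context, mask)
scaled = normalize(context, loc, scale)
restored = denormalize(scaled, loc, scale)
```

Ignoring the zero-filled entry when computing the statistics keeps a missing value from dragging the location toward zero.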
The AutoformerModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerModel
>>> file = hf_hub_download(
... repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> model = AutoformerModel.from_pretrained("huggingface/autoformer-tourism-monthly")
>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
... past_values=batch["past_values"],
... past_time_features=batch["past_time_features"],
... past_observed_mask=batch["past_observed_mask"],
... static_categorical_features=batch["static_categorical_features"],
... future_values=batch["future_values"],
... future_time_features=batch["future_time_features"],
... )
>>> last_hidden_state = outputs.last_hidden_state
AutoformerForPrediction
class transformers.AutoformerForPrediction
( config: AutoformerConfig )
Parameters
- config (AutoformerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Autoformer Model with a distribution head on top for time-series forecasting. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor past_time_features: Tensor past_observed_mask: Tensor static_categorical_features: Optional = None static_real_features: Optional = None future_values: Optional = None future_time_features: Optional = None future_observed_mask: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None use_cache: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)
Parameters
- past_values (torch.FloatTensor of shape (batch_size, sequence_length)) — Past values of the time series, that serve as context in order to predict the future. These values may contain lags, i.e. additional values from the past which are added in order to serve as “extra context”. The past_values is what the Transformer encoder gets as input (with optional additional features, such as static_categorical_features, static_real_features, past_time_features).
  The sequence length here is equal to context_length + max(config.lags_sequence).
  Missing values need to be replaced with zeros.
- past_time_features (torch.FloatTensor of shape (batch_size, sequence_length, num_features), optional) — Optional time features, which the model internally will add to past_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.
  These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires the user to provide additional time features.
  The Autoformer only learns additional embeddings for static_categorical_features.
- past_observed_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Boolean mask to indicate which past_values were observed and which were missing. Mask values selected in [0, 1]:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- static_categorical_features (torch.LongTensor of shape (batch_size, number of static categorical features), optional) — Optional static categorical features for which the model will learn an embedding, which it will add to the values of the time series.
  Static categorical features are features which have the same value for all time steps (static over time).
  A typical example of a static categorical feature is a time series ID.
- static_real_features (torch.FloatTensor of shape (batch_size, number of static real features), optional) — Optional static real features which the model will add to the values of the time series.
  Static real features are features which have the same value for all time steps (static over time).
  A typical example of a static real feature is promotion information.
- future_values (torch.FloatTensor of shape (batch_size, prediction_length)) — Future values of the time series, that serve as labels for the model. The future_values is what the Transformer needs to learn to output, given the past_values.
  See the demo notebook and code snippets for details.
  Missing values need to be replaced with zeros.
- future_time_features (torch.FloatTensor of shape (batch_size, prediction_length, num_features), optional) — Optional time features, which the model internally will add to future_values. These could be things like “month of year”, “day of the month”, etc. encoded as vectors (for instance as Fourier features). These could also be so-called “age” features, which basically help the model know “at which point in life” a time-series is. Age features have small values for distant past time steps and increase monotonically the more we approach the current time step.
  These features serve as the “positional encodings” of the inputs. So contrary to a model like BERT, where the position encodings are learned from scratch internally as parameters of the model, the Time Series Transformer requires the user to provide additional features.
  The Autoformer only learns additional embeddings for static_categorical_features.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on certain token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to make sure the model can only look at previous inputs in order to predict the future.
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- decoder_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- cross_attn_head_mask (torch.Tensor of shape (decoder_layers, decoder_attention_heads), optional) — Mask to nullify selected heads of the cross-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of last_hidden_state, hidden_states (optional) and attentions (optional). last_hidden_state of shape (batch_size, sequence_length, hidden_size) (optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
  If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
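The time features mentioned in the parameter descriptions above are supplied by the caller rather than learned. As a rough sketch of what such features might look like, the snippet below builds a log-scaled "age" feature and a scaled month-of-year feature. The exact formulas (log10(2 + t), mapping months onto [-0.5, 0.5]) and the function names are assumptions made for illustration, not a prescribed encoding.

```python
import math


def age_feature(length, start=0):
    """Log-scaled age: small for distant past steps, growing toward the present."""
    return [math.log10(2.0 + start + t) for t in range(length)]


def month_of_year_feature(months):
    """Map month numbers 1..12 linearly onto [-0.5, 0.5]."""
    return [(m - 1) / 11.0 - 0.5 for m in months]


context_length, prediction_length = 6, 3
past_age = age_feature(context_length)
# The future window continues counting where the past window stopped.
future_age = age_feature(prediction_length, start=context_length)

# One feature vector per time step: [month_of_year, age].
past_months = [7, 8, 9, 10, 11, 12]
past_time_features = [
    [m, a] for m, a in zip(month_of_year_feature(past_months), past_age)
]
```

With two features per step, num_time_features in the configuration would be set to 2 for inputs built this way.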
Returns
transformers.modeling_outputs.Seq2SeqTSPredictionOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqTSPredictionOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AutoformerConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when future_values is provided) — Distributional loss.
- params (torch.FloatTensor of shape (batch_size, num_samples, num_params)) — Parameters of the chosen distribution.
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- loc (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Shift values of each time series’ context window, used to give the model inputs of the same magnitude and then to shift back to the original magnitude.
- scale (torch.FloatTensor of shape (batch_size,) or (batch_size, input_size), optional) — Scaling values of each time series’ context window, used to give the model inputs of the same magnitude and then to rescale back to the original magnitude.
- static_features (torch.FloatTensor of shape (batch_size, feature size), optional) — Static features of each time series in a batch, which are copied to the covariates at inference time.
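The role of loc and scale can be sketched in plain Python. This is a minimal illustration that assumes standardization by the context window's mean and population standard deviation; the scaler the library actually applies depends on the configuration and may differ (for example, scaling by the mean absolute value). The helpers standardize and restore are hypothetical names.

```python
import statistics


def standardize(context):
    """Return (scaled_values, loc, scale) for one context window."""
    loc = statistics.fmean(context)
    # Fall back to 1.0 so a constant window does not divide by zero.
    scale = statistics.pstdev(context) or 1.0
    return [(x - loc) / scale for x in context], loc, scale


def restore(values, loc, scale):
    """Map model-space values back to the original magnitude."""
    return [v * scale + loc for v in values]


context = [100.0, 120.0, 140.0, 160.0]
scaled, loc, scale = standardize(context)
recovered = restore(scaled, loc, scale)
```

The model would see only the scaled values, and its outputs are mapped back with the same loc and scale, so series of very different magnitudes can share one set of weights.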
The AutoformerForPrediction forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of calling this function directly, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
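The difference between calling the instance and calling forward directly can be shown with a much simplified stand-in. TinyModule below is hypothetical and only mimics the idea; the real torch.nn.Module hook mechanism passes different arguments to hooks and handles many more cases.

```python
class TinyModule:
    """Hypothetical stand-in for a module whose __call__ wraps forward."""

    def __init__(self):
        self.pre_hooks = []
        self.post_hooks = []

    def forward(self, x):
        # The "recipe" for the forward pass lives here.
        return x * 2

    def __call__(self, x):
        # Calling the instance runs the hooks around forward.
        for hook in self.pre_hooks:
            x = hook(x)
        out = self.forward(x)
        for hook in self.post_hooks:
            out = hook(out)
        return out


module = TinyModule()
module.pre_hooks.append(lambda x: x + 1)    # pre-processing step
module.post_hooks.append(lambda out: -out)  # post-processing step

via_call = module(3)             # hooks run: -((3 + 1) * 2)
via_forward = module.forward(3)  # hooks skipped: 3 * 2
```

Here via_call and via_forward differ because forward bypasses both hooks, which is the silent behaviour the note above warns about.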
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import AutoformerForPrediction

>>> file = hf_hub_download(
...     repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)

>>> model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly")

>>> # during training, one provides both past and future values
>>> # as well as possible additional features
>>> outputs = model(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_values=batch["future_values"],
...     future_time_features=batch["future_time_features"],
... )

>>> loss = outputs.loss
>>> loss.backward()

>>> # during inference, one only provides past values
>>> # as well as possible additional features
>>> # the model autoregressively generates future values
>>> outputs = model.generate(
...     past_values=batch["past_values"],
...     past_time_features=batch["past_time_features"],
...     past_observed_mask=batch["past_observed_mask"],
...     static_categorical_features=batch["static_categorical_features"],
...     static_real_features=batch["static_real_features"],
...     future_time_features=batch["future_time_features"],
... )

>>> mean_prediction = outputs.sequences.mean(dim=1)
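The final reduction over dim=1 collapses the sample axis into a point forecast. The sketch below reproduces that step in plain Python on made-up numbers, assuming the sampled sequences have shape (batch_size, num_samples, prediction_length), which is consistent with the mean(dim=1) call above. The data and the summarize helper are hypothetical.

```python
import statistics

# Hypothetical sampled forecasts: 1 series, 4 sample paths, 3 future steps.
samples = [
    [
        [10.0, 12.0, 14.0],
        [11.0, 13.0, 15.0],
        [9.0, 11.0, 13.0],
        [14.0, 16.0, 18.0],
    ]
]


def summarize(series_samples):
    """Collapse the sample axis into a mean and a median per future step."""
    per_step = list(zip(*series_samples))  # one tuple of samples per step
    means = [statistics.fmean(step) for step in per_step]
    medians = [statistics.median(step) for step in per_step]
    return means, medians


mean_forecast, median_forecast = summarize(samples[0])
```

The median is included only to show that other summaries of the sample paths are possible; it is less sensitive than the mean to an outlying sample path such as the fourth one here.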