PatchTST
Overview
The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.
At a high level, the model segments a time series into patches of a given size, embeds each patch, and encodes the resulting sequence of vectors with a Transformer encoder; an appropriate head then produces the forecast over the prediction length.
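Concretely, a look-back window of length `context_length` is split into `(context_length - patch_length) // patch_stride + 1` patches, and each patch is embedded into a `d_model`-dimensional vector. A toy sketch of the patching step (this is not the library's internal implementation):

>>> import torch
>>> # split a univariate series of length 32 into non-overlapping patches of length 8
>>> context_length, patch_length, patch_stride = 32, 8, 8
>>> series = torch.randn(1, context_length, 1)  # (batch_size, sequence_length, num_input_channels)
>>> patches = series.transpose(1, 2).unfold(-1, patch_length, patch_stride)
>>> patches.shape  # (32 - 8) // 8 + 1 = 4 patches per channel
torch.Size([1, 1, 4, 8])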
The abstract from the paper is the following:
We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy.
This model was contributed by namctin, gsinthong, diepi, vijaye12, wmgifford, and kashif. The original code can be found here.
Usage tips
The model can also be used for time series classification and time series regression. See the respective PatchTSTForClassification and PatchTSTForRegression classes.
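Both extra heads follow the same pattern as forecasting: instantiate the corresponding class and set `num_targets` in the configuration (the number of classes for classification, or the number of regression targets). A minimal sketch with random weights and hypothetical sizes:

>>> from transformers import PatchTSTConfig, PatchTSTForClassification, PatchTSTForRegression
>>> # hypothetical setup: 3 input channels, look-back of 96 steps, patches of length 12
>>> config = PatchTSTConfig(num_input_channels=3, context_length=96, patch_length=12, patch_stride=12, num_targets=5)
>>> classifier = PatchTSTForClassification(config)  # 5-way classification head
>>> regressor = PatchTSTForRegression(config)  # 5 regression targets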
Resources
- A blog post explaining PatchTST in depth can be found here. The blog can also be opened in Google Colab.
PatchTSTConfig
class transformers.PatchTSTConfig
( num_input_channels: int = 1 context_length: int = 32 distribution_output: str = 'student_t' loss: str = 'mse' patch_length: int = 1 patch_stride: int = 1 num_hidden_layers: int = 3 d_model: int = 128 num_attention_heads: int = 4 share_embedding: bool = True channel_attention: bool = False ffn_dim: int = 512 norm_type: str = 'batchnorm' norm_eps: float = 1e-05 attention_dropout: float = 0.0 dropout: float = 0.0 positional_dropout: float = 0.0 path_dropout: float = 0.0 ff_dropout: float = 0.0 bias: bool = True activation_function: str = 'gelu' pre_norm: bool = True positional_encoding_type: str = 'sincos' use_cls_token: bool = False init_std: float = 0.02 share_projection: bool = True scaling: Union = 'std' do_mask_input: Optional = None mask_type: str = 'random' random_mask_ratio: float = 0.5 num_forecast_mask_patches: Union = [2] channel_consistent_masking: Optional = False unmasked_channel_indices: Optional = None mask_value: int = 0 pooling_type: str = 'mean' head_dropout: float = 0.0 prediction_length: int = 24 num_targets: int = 1 output_range: Optional = None num_parallel_samples: int = 100 **kwargs )
Parameters
- num_input_channels (`int`, optional, defaults to 1) — The size of the target variable, which by default is 1 for univariate targets. Would be > 1 in case of multivariate targets.
- context_length (`int`, optional, defaults to 32) — The context length of the input sequence.
- distribution_output (`str`, optional, defaults to `"student_t"`) — The distribution emission head for the model when the loss is "nll". Could be either "student_t", "normal" or "negative_binomial".
- loss (`str`, optional, defaults to `"mse"`) — The loss function for the model corresponding to the `distribution_output` head. For parametric distributions it is the negative log likelihood ("nll") and for point estimates it is the mean squared error ("mse").
- patch_length (`int`, optional, defaults to 1) — Defines the patch length of the patchification process.
- patch_stride (`int`, optional, defaults to 1) — Defines the stride of the patchification process.
- num_hidden_layers (`int`, optional, defaults to 3) — Number of hidden layers.
- d_model (`int`, optional, defaults to 128) — Dimensionality of the transformer layers.
- num_attention_heads (`int`, optional, defaults to 4) — Number of attention heads for each attention layer in the Transformer encoder.
- share_embedding (`bool`, optional, defaults to `True`) — Whether to share the input embedding across all channels.
- channel_attention (`bool`, optional, defaults to `False`) — Activate the channel attention block in the Transformer to allow channels to attend to each other.
- ffn_dim (`int`, optional, defaults to 512) — Dimension of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
- norm_type (`str`, optional, defaults to `"batchnorm"`) — Normalization at each Transformer layer. Can be `"batchnorm"` or `"layernorm"`.
- norm_eps (`float`, optional, defaults to 1e-05) — A value added to the denominator for numerical stability of normalization.
- attention_dropout (`float`, optional, defaults to 0.0) — The dropout probability for the attention probabilities.
- dropout (`float`, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the Transformer.
- positional_dropout (`float`, optional, defaults to 0.0) — The dropout probability in the positional embedding layer.
- path_dropout (`float`, optional, defaults to 0.0) — The dropout path in the residual block.
- ff_dropout (`float`, optional, defaults to 0.0) — The dropout probability used between the two layers of the feed-forward networks.
- bias (`bool`, optional, defaults to `True`) — Whether to add bias in the feed-forward networks.
- activation_function (`str`, optional, defaults to `"gelu"`) — The non-linear activation function (string) in the Transformer. `"gelu"` and `"relu"` are supported.
- pre_norm (`bool`, optional, defaults to `True`) — Normalization is applied before self-attention if pre_norm is set to `True`. Otherwise, normalization is applied after the residual block.
- positional_encoding_type (`str`, optional, defaults to `"sincos"`) — Positional encodings. Options `"random"` and `"sincos"` are supported.
- use_cls_token (`bool`, optional, defaults to `False`) — Whether a cls token is used.
- init_std (`float`, optional, defaults to 0.02) — The standard deviation of the truncated normal weight initialization distribution.
- share_projection (`bool`, optional, defaults to `True`) — Whether to share the projection layer across different channels in the forecast head.
- scaling (`Union`, optional, defaults to `"std"`) — Whether to scale the input targets via "mean" scaler, "std" scaler, or no scaler if `None`. If `True`, the scaler is set to "mean".
- do_mask_input (`bool`, optional) — Apply masking during the pretraining.
- mask_type (`str`, optional, defaults to `"random"`) — Masking type. Only `"random"` and `"forecast"` are currently supported.
- random_mask_ratio (`float`, optional, defaults to 0.5) — Masking ratio applied to mask the input data during random pretraining.
- num_forecast_mask_patches (`int` or `list`, optional, defaults to `[2]`) — Number of patches to be masked at the end of each batch sample. If it is an integer, all the samples in the batch will have the same number of masked patches. If it is a list, samples in the batch will be randomly masked by numbers defined in the list. This argument is only used for forecast pretraining.
- channel_consistent_masking (`bool`, optional, defaults to `False`) — If channel consistent masking is `True`, all the channels will have the same masking pattern.
- unmasked_channel_indices (`list`, optional) — Indices of channels that are not masked during pretraining. Values in the list are numbers between 1 and `num_input_channels`.
- mask_value (`int`, optional, defaults to 0) — Values in the masked patches will be filled by `mask_value`.
- pooling_type (`str`, optional, defaults to `"mean"`) — Pooling of the embedding. `"mean"`, `"max"` and `None` are supported.
- head_dropout (`float`, optional, defaults to 0.0) — The dropout probability for the head.
- prediction_length (`int`, optional, defaults to 24) — The prediction horizon that the model will output.
- num_targets (`int`, optional, defaults to 1) — Number of targets for regression and classification tasks. For classification, it is the number of classes.
- output_range (`list`, optional) — Output range for the regression task. The range of output values can be set to enforce the model to produce values within a range.
- num_parallel_samples (`int`, optional, defaults to 100) — The number of samples generated in parallel for probabilistic prediction.
This is the configuration class to store the configuration of a PatchTSTModel. It is used to instantiate a PatchTST model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ibm/patchtst architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import PatchTSTConfig, PatchTSTModel
>>> # Initializing a PatchTST configuration with 12 time steps for prediction
>>> configuration = PatchTSTConfig(prediction_length=12)
>>> # Randomly initializing a model (with random weights) from the configuration
>>> model = PatchTSTModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
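For probabilistic forecasting, the point-forecast loss can be replaced by the negative log-likelihood of a parametric head, as described for the `loss` and `distribution_output` parameters above (the values below are illustrative):

>>> # Student's t distributional head instead of "mse" point forecasts
>>> prob_configuration = PatchTSTConfig(
...     prediction_length=12,
...     loss="nll",
...     distribution_output="student_t",
...     num_parallel_samples=100,
... )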
PatchTSTModel
class transformers.PatchTSTModel
( config: PatchTSTConfig )
Parameters
- config (PatchTSTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare PatchTST Model outputting raw hidden-states without any specific head. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor past_observed_mask: Optional = None future_values: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None return_dict: Optional = None )
Parameters
- past_values (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`, required) — Input sequence to the model.
- past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`, optional) — Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in `[0, 1]`:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- future_values (`torch.Tensor` of shape `(batch_size, prediction_length, num_input_channels)`, optional) — Future target values associated with the `past_values`.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers.
- output_attentions (`bool`, optional) — Whether or not to return the attentions of all layers.
- return_dict (`bool`, optional) — Whether or not to return a `ModelOutput` instead of a plain tuple.
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import PatchTSTModel
>>> file = hf_hub_download(
... repo_id="hf-internal-testing/etth1-hourly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> model = PatchTSTModel.from_pretrained("namctin/patchtst_etth1_pretrain")
>>> # during training, one provides both past and future values
>>> outputs = model(
... past_values=batch["past_values"],
... future_values=batch["future_values"],
... )
>>> last_hidden_state = outputs.last_hidden_state
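The `past_observed_mask` argument is useful when the look-back window contains missing values; a small sketch (the ETTh1 batch above has no missing values, so the mask below is all ones):

>>> # replace NaNs by zeros and mark observed (1) vs. missing (0) positions
>>> past_values = batch["past_values"]
>>> past_observed_mask = ~torch.isnan(past_values)
>>> outputs = model(
...     past_values=torch.nan_to_num(past_values, nan=0.0),
...     past_observed_mask=past_observed_mask,
... )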
PatchTSTForPrediction
class transformers.PatchTSTForPrediction
( config: PatchTSTConfig )
Parameters
- config (PatchTSTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PatchTST model for prediction. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor past_observed_mask: Optional = None future_values: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None return_dict: Optional = None )
Parameters
- past_values (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`, required) — Input sequence to the model.
- past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`, optional) — Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in `[0, 1]`:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- future_values (`torch.Tensor` of shape `(batch_size, prediction_length, num_input_channels)`, optional) — Future target values associated with the `past_values`.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers.
- output_attentions (`bool`, optional) — Whether or not to return the attentions of all layers.
- return_dict (`bool`, optional) — Whether or not to return a `ModelOutput` instead of a plain tuple.
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import PatchTSTConfig, PatchTSTForPrediction
>>> file = hf_hub_download(
... repo_id="hf-internal-testing/etth1-hourly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> # Prediction task with 7 input channels and a prediction length of 96
>>> model = PatchTSTForPrediction.from_pretrained("namctin/patchtst_etth1_forecast")
>>> # during training, one provides both past and future values
>>> outputs = model(
... past_values=batch["past_values"],
... future_values=batch["future_values"],
... )
>>> loss = outputs.loss
>>> loss.backward()
>>> # during inference, one only provides past values, the model outputs future values
>>> outputs = model(past_values=batch["past_values"])
>>> prediction_outputs = outputs.prediction_outputs
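PatchTSTForPrediction also exposes a `generate()` method that draws `num_parallel_samples` trajectories from the predicted distribution. This assumes a model configured with `loss="nll"` and a `distribution_output` head rather than the point-forecast ("mse") setting used by the checkpoint above:

>>> # probabilistic inference (assumes loss="nll" with a distribution_output head)
>>> samples = model.generate(past_values=batch["past_values"]).sequences
>>> # shape: (batch_size, num_parallel_samples, prediction_length, num_input_channels)
>>> point_forecast = samples.mean(dim=1)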
PatchTSTForClassification
class transformers.PatchTSTForClassification
( config: PatchTSTConfig )
Parameters
- config (PatchTSTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PatchTST model for classification. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor target_values: Tensor = None past_observed_mask: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None return_dict: Optional = None )
Parameters
- past_values (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`, required) — Input sequence to the model.
- target_values (`torch.Tensor`, optional) — Labels associated with the `past_values`.
- past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`, optional) — Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in `[0, 1]`:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers.
- output_attentions (`bool`, optional) — Whether or not to return the attentions of all layers.
- return_dict (`bool`, optional) — Whether or not to return a `ModelOutput` instead of a plain tuple.
Examples:
>>> from transformers import PatchTSTConfig, PatchTSTForClassification
>>> import torch
>>> # classification task with 2 input channels and 3 classes
>>> config = PatchTSTConfig(
... num_input_channels=2,
... num_targets=3,
... context_length=512,
... patch_length=12,
...     patch_stride=12,
... use_cls_token=True,
... )
>>> model = PatchTSTForClassification(config=config)
>>> # during inference, one only provides past values
>>> past_values = torch.randn(20, 512, 2)
>>> outputs = model(past_values=past_values)
>>> logits = outputs.prediction_logits
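`prediction_logits` holds unnormalized scores over the `num_targets` classes; predicted class indices follow from an argmax:

>>> predicted_classes = torch.argmax(logits, dim=-1)  # shape (batch_size,)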
PatchTSTForPretraining
class transformers.PatchTSTForPretraining
( config: PatchTSTConfig )
Parameters
- config (PatchTSTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PatchTST model for pretraining. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor past_observed_mask: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None return_dict: Optional = None )
Parameters
- past_values (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`, required) — Input sequence to the model.
- past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`, optional) — Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in `[0, 1]`:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers.
- output_attentions (`bool`, optional) — Whether or not to return the attentions of all layers.
- return_dict (`bool`, optional) — Whether or not to return a `ModelOutput` instead of a plain tuple.
Examples:
>>> from huggingface_hub import hf_hub_download
>>> import torch
>>> from transformers import PatchTSTConfig, PatchTSTForPretraining
>>> file = hf_hub_download(
... repo_id="hf-internal-testing/etth1-hourly-batch", filename="train-batch.pt", repo_type="dataset"
... )
>>> batch = torch.load(file)
>>> # Config for random mask pretraining
>>> config = PatchTSTConfig(
... num_input_channels=7,
... context_length=512,
... patch_length=12,
...     patch_stride=12,
... mask_type='random',
... random_mask_ratio=0.4,
... use_cls_token=True,
... )
>>> # Config for forecast mask pretraining
>>> config = PatchTSTConfig(
... num_input_channels=7,
... context_length=512,
... patch_length=12,
...     patch_stride=12,
... mask_type='forecast',
... num_forecast_mask_patches=5,
... use_cls_token=True,
... )
>>> model = PatchTSTForPretraining(config)
>>> # during pretraining, one only provides past values; the model learns to reconstruct the masked patches
>>> outputs = model(past_values=batch["past_values"])
>>> loss = outputs.loss
>>> loss.backward()
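A typical next step (not part of the original example) is to reuse the pretrained encoder for forecasting. Loading the pretraining checkpoint into PatchTSTForPrediction keeps the encoder weights and randomly initializes the new prediction head; the output directory below is hypothetical:

>>> from transformers import PatchTSTForPrediction
>>> model.save_pretrained("patchtst-etth1-pretrained")  # hypothetical local directory
>>> forecaster = PatchTSTForPrediction.from_pretrained(
...     "patchtst-etth1-pretrained", prediction_length=96
... )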
PatchTSTForRegression
class transformers.PatchTSTForRegression
( config: PatchTSTConfig )
Parameters
- config (PatchTSTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PatchTST model for regression. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( past_values: Tensor target_values: Tensor = None past_observed_mask: Optional = None output_hidden_states: Optional = None output_attentions: Optional = None return_dict: Optional = None )
Parameters
- past_values (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`, required) — Input sequence to the model.
- target_values (`torch.Tensor` of shape `(batch_size, num_input_channels)`) — Target values associated with the `past_values`.
- past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`, optional) — Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in `[0, 1]`:
  - 1 for values that are observed,
  - 0 for values that are missing (i.e. NaNs that were replaced by zeros).
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers.
- output_attentions (`bool`, optional) — Whether or not to return the attentions of all layers.
- return_dict (`bool`, optional) — Whether or not to return a `ModelOutput` instead of a plain tuple.
Examples:
>>> from transformers import PatchTSTConfig, PatchTSTForRegression
>>> import torch
>>> # Regression task with 6 input channels, regressing 2 targets
>>> model = PatchTSTForRegression.from_pretrained("namctin/patchtst_etth1_regression")
>>> # during inference, one only provides past values; the model outputs the regression values
>>> past_values = torch.randn(20, 512, 6)
>>> outputs = model(past_values=past_values)
>>> regression_outputs = outputs.regression_outputs