Confusion regarding the lags_sequence argument for the config (and also other size arguments) (Practical question)

#4
by yuvalarbel - opened

Hi! I've been trying for some time to create a working TimeSeriesTransformerForPrediction model (https://huggingface.co/docs/transformers/model_doc/time_series_transformer), but without success: I can't get the config arguments to work for my use case. As the documentation on these arguments is currently fairly sparse, I'd really appreciate any additional explanation and recommendations about what I should use as config arguments.

Each of my examples is a 200-day sequence, in which each day has 5 features (so the tensor shape is torch.Size([200, 5])).
I want to predict 50 days ahead, also with 5 features per day (so torch.Size([50, 5])).

I'm not sure what the lags should be, since I want to use the whole input sequence as context for the Transformer model.
I may also be confused about what input_size, context_length, and prediction_length should be, because the blog post (https://huggingface.co/blog/time-series-transformers) and the documentation both use only single-valued targets.

These are a few of the things I tried:

1:
lags_sequence = [] (empty list)
input_size = 5
context_length = 200
prediction_length = 50
Result: exception when calling max(self.config.lags_sequence) in TimeSeriesTransformerModel._past_length.

2: The same, but with lags_sequence = [0] or [1]
Result: exception "embed_dim must be divisible by num_heads (got `embed_dim`: 17 and `num_heads`: 2)".

3:
lags_sequence = [1,2,3,4,...,200] or [0,1,2,3,...,199]
input_size = 5
context_length = 1
prediction_length = 50
Result: tensor size mismatch in MeanScaler.forward.

Thank you for your help!

Hugging Face org

cc'ing @kashif here

Hugging Face org

Hello!

So to clarify: you have a multivariate forecasting problem where the size of the multivariate dimension is 5, and you have 200 time points in training. Since you want to predict 50 time points into the future and there isn't much data, you might as well set the context length to something small, e.g. 25, and the prediction length to 50 of course.

The lags sequence is an array of indices from which the target lag features are made. For example, if lags_seq = [1, 2, 5], then for each time point in the training data we will concatenate to the target (a 5-dim vector in your case) the value of the target 1 step back, 2 steps back and 5 steps back, meaning that in the end you will end up with a vector of size 5 + len(lags_seq)*5.

In the end, this array also gets concatenated with the time features, and thus you end up with the final input vector to the transformer.
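To make that arithmetic concrete, here is a rough sketch of the per-time-step feature size described above. This is an assumption based on the explanation in this thread, not the exact library internals, and num_time_features is just a hypothetical value:

```python
input_size = 5              # multivariate dimension of the target
lags_sequence = [1, 2, 5]   # example lag indices
num_time_features = 2       # hypothetical: e.g. an age feature plus one calendar feature

# target + its lagged copies + time features
feature_size = input_size + len(lags_sequence) * input_size + num_time_features
print(feature_size)         # 5 + 3*5 + 2 = 22
```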

The issue is that the size of this final input vector has to be divisible by the number of attention heads of the transformer. It is not in your case, so perhaps first try setting the number of heads to 1 in your config to get it working.
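For reference, a minimal, untested config sketch for the 200 x 5 setup in this thread, following the advice above (small context_length, a single attention head to sidestep the divisibility constraint). The lags_sequence and num_time_features here are purely illustrative choices, not recommendations:

```python
from transformers import (
    TimeSeriesTransformerConfig,
    TimeSeriesTransformerForPrediction,
)

config = TimeSeriesTransformerConfig(
    input_size=5,                # multivariate dimension
    prediction_length=50,
    context_length=25,
    lags_sequence=[1, 2, 3, 7],  # illustrative lags only
    num_time_features=1,
    encoder_attention_heads=1,   # 1 head avoids the divisibility issue
    decoder_attention_heads=1,
)
model = TimeSeriesTransformerForPrediction(config)
```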

I have not documented how to do multivariate forecasting yet, but it is essentially the input_size arg as you have it, which will learn a diagonal Student-t distribution.

I was also thinking of removing the need for the vector size to be divisible by the number of heads by adding an initial projection layer. I have to think about it a bit, but perhaps that is a better API; what do you think?

Thanks again!

Hey Kashif!
Thanks so much for the response.

Even though I set the context length to 25, I still don't understand what the lags_sequence should be, so it's still not working.
(AssertionError: (('input length 250 and dynamic feature lengths 274 does not match',), 250, 274))
I know you've explained the idea behind lags_sequence a few times in the past, but for some reason I haven't been able to work it out.

Do you have a specific recommendation for what exactly I should have my lags_sequence be, in the case where my input is 200 x 5, my target is 50 x 5, and my context_length is, let's say, 25?

Regarding the divisibility by the number of heads: I definitely think a projection layer would make the model more easily usable; it seems desirable to be able to use the model with all possible combinations of input sequence length and number of heads. That said, since I'm still not 100% clear on the translation from the size config arguments to the Transformer's input sequence size, it may be that this should remain the programmer's responsibility. Maybe after I understand how to use each config argument I'll have a different opinion :)

Thank you!
Yuval

Hey @kashif / @nielsr , I'm still grappling with this. Any insights?
Thanks,
Yuval

Hugging Face org

Hi @yuvalarbel ,

we definitely would like to get people up and running with the Time Series Transformer as quickly as possible, so we need to make the lags sequence easier to work with.

@kashif is working on adding an input projection layer which will resolve the embed_dim must be divisible by num_heads error.

The current implementation always assumes that a lags sequence is provided - @kashif, we can probably remove that requirement?

I've got a similar issue with a multivariate problem I'm working on.

A multivariate example similar to the univariate one would be really useful, if that's something that is available?

Hugging Face org

@Tom91 Sure, I am writing up a multivariate example notebook to share.

Hi @kashif , a notebook for the multivariate case would be great. Any update on when it will be available?

Thanks,
Gil

Hugging Face org

Sure, I was waiting on PR https://github.com/huggingface/transformers/pull/21020 to get merged, and then I will push the multivariate notebook too.

Hello, I want to know what the output of model.generate means. For example, with num_parallel_samples=100 and a batch size of 32, what does [32, 100] represent?

Hugging Face org

@eddic since this is a probabilistic forecast, num_parallel_samples represents the number of samples drawn from the learned distribution at each time point. These samples are autoregressively passed back to the decoder to sample the next time point, and so on until our prediction horizon given by prediction_length. Once you have your forecast, you can use the samples (100 in this case) per time step to calculate any empirical statistic of interest; point forecasts can be obtained by taking the mean, for example.
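As an illustration, a hedged sketch of what this looks like in code. The keyword arguments follow the model docs; `batch` is a hypothetical dict holding the model inputs, and the exact output shape may differ between the univariate and multivariate cases:

```python
# batch is a hypothetical dict with the tensors the model expects
outputs = model.generate(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    future_time_features=batch["future_time_features"],
)

# roughly (batch_size, num_parallel_samples, prediction_length[, input_size])
samples = outputs.sequences
point_forecast = samples.mean(dim=1)         # mean over the 100 samples per time step
p90_forecast = samples.quantile(0.9, dim=1)  # an empirical 90th-percentile forecast
```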

@kashif Since I just started using Hugging Face, I was wondering whether we still need to call optimizer.zero_grad() and optimizer.step() when using PyTorch for iterative training?

Hi @kashif ,

Does the model require the data to be spaced at regular intervals in time, or can it also handle irregular time intervals?

Hugging Face org
• edited Feb 17, 2023

@giltinde the model does not require inputs to be regular, since you provide the time features. So you can give irregular time features that correspond to the values, and at inference time one can potentially give irregular time features in the future_time_features tensor to condition the forecast on. The blog post showed how to make regular time features, but you can make them from an index of irregular date-times too.
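Purely as an illustration (this is not from the blog post), one could build a time-features tensor by hand from an irregular datetime index, e.g.:

```python
import numpy as np
import pandas as pd

# irregularly spaced observation times (hypothetical)
idx = pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-04", "2023-01-10"])

# a single "age"-style feature: days since the first observation, scaled to roughly [-0.5, 0.5]
age = (idx - idx[0]).days.to_numpy().astype(np.float32)
age = age / age.max() - 0.5

# shape (batch, sequence_length, num_time_features), matching what past_time_features expects
past_time_features = age.reshape(1, -1, 1)
```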

Hi, I use your model to initialize the weights for training. My goal is to predict electricity consumption. When predicting, I average the output samples and use that as a point prediction, but I found a problem: some of the averaged values are negative. As you know, power consumption cannot be negative, and there are no negative numbers in my training data. What is the reason? I use distribution_output='normal'.

Hugging Face org

@irismagicbox so yes, the reason some mean values are negative is that the distribution's std is large, and if the mean is near zero the average of the samples can turn out negative. Can you kindly confirm the negative means are tiny? For your particular needs you can clamp the forecast to be positive, or we can think about a positive distributional output head. Currently we have a negative binomial output head, but that is more for count data... what do you think?
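A tiny sketch of the clamping workaround mentioned above, assuming `outputs.sequences` holds the sample paths as in the earlier generate snippet:

```python
# clamp the sample paths at zero before averaging, so the point forecast
# (and any quantiles) can never be negative
samples = outputs.sequences.clamp(min=0.0)
point_forecast = samples.mean(dim=1)
```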

Hugging Face org

Folks, perhaps it is best to open a new discussion for these issues, since we now have a bunch of things being discussed here that are unrelated to the original question, which is getting confusing for me to keep track of.

@kashif Thank you very much for your reply. I will try it with large-scale training data.

Hey @kashif / @nielsr , do you know when the d_model argument will be available in the main package from pip? It is not yet in the official documentation, and I'm still having trouble running the model.

Hugging Face org

We usually have a release every month, so expect this to be available in March.

Thanks @nielsr , may I ask when in March?
