Any plans on increasing the model's model_max_length?

#41
by lovodkin93 - opened

Hello,
I saw that the model's model_max_length is 512, which makes it difficult to use the model in an in-context-learning setting for tasks whose inputs are longer than a sentence (such as reading-comprehension tasks).
I also saw that the flan-ul2 has been updated to support longer inputs (from 512 tokens to 2048).
Therefore, I was wondering if there are any plans to update the flan-t5-xxl model as well?
Thanks!

@lovodkin93 The Flan-T5 models were trained with 2k-token input windows and 512-token output windows, so they should be able to manage pretty long in-context sequences. I'm not sure if there is a sequence-length limit imposed on this API?

@Shayne There is.
The tokenizer's model_max_length parameter is set by default to 512.
So I wanted to make sure whether updating it to 2048 would be something the model can handle.


Hey @SaraAmd @lovodkin93 @Shayne, any updates regarding the maximum sequence length Flan-T5 XL (3 billion) can handle?

Please correct me if I'm wrong, as I struggle with this as well: the maximum sequence length for the T5 tokenizer is 512, but the maximum sequence length for the model itself is theoretically unlimited, with ideal comprehension at 2048 tokens or fewer.

I can't seem to figure out how to do chunking, but this is what I've gathered so far.
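For what it's worth, the rough idea I have so far (untested, and I'm not sure it's the right approach) is to split the tokenized input into 512-token windows and run the model on each window separately; a minimal sketch:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-xl"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # add device_map / quantization as needed

def summarize_in_chunks(text, chunk_size=512):
    # tokenize once, then split the token ids into fixed-size windows
    ids = tok(text, truncation=False).input_ids
    summaries = []
    for start in range(0, len(ids), chunk_size):
        chunk = tok.decode(ids[start:start + chunk_size], skip_special_tokens=True)
        inputs = tok("Summarize: " + chunk, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=100)
        summaries.append(tok.decode(out[0], skip_special_tokens=True))
    return " ".join(summaries)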

Is there any update on the token size?
Anyone able to use more than 512 tokens?

same question here, thanks!

Guys, this is a very important requirement. Can we get some traction on this, please?

I have the same question here.


Hi everyone,
thanks a lot for your interest and your questions. I think that you can use flan-t5 for much longer sequences out of the box - in the T5 modeling script nothing depends on tokenizer.model_max_length. I think that tokenizer.model_max_length is a code legacy, as the tokenizer was copied over from previous T5 models.
As a rule of thumb, it is recommended not to run with a larger sequence length than the one the authors used for training (which I could not find in the original paper). I have run a summarization example on a long sequence (> 512 tokens) and it seems to work fine:

from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-xxl"

# load the 11B checkpoint in 4-bit to reduce memory usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=quantization_config)
tok = AutoTokenizer.from_pretrained(model_id)


text = """Summarize the following news article in detail:\n
A capsule containing precious samples from an asteroid landed safely on Earth on Sunday, the culmination of a roughly 4-billion-mile journey over the past seven years.

The asteroid samples were collected by NASA's OSIRIS-REx spacecraft, which flew by Earth early Sunday morning and jettisoned the capsule over a designated landing zone in the Utah desert. The unofficial touchdown time was 8:52 a.m. MT, 3 minutes ahead of the predicted landing time.

The dramatic event — which the NASA livestream narrator described as “opening a time capsule to our ancient solar system” — marked a major milestone for the United States: The collected rocks and soil were NASA's first samples brought back to Earth from an asteroid. Experts have said the bounty could help scientists unlock secrets about the solar system and how it came to be, including how life emerged on this planet.

Bruce Betts, chief scientist at The Planetary Society, a nonprofit organization that conducts research, advocacy and outreach to promote space exploration, congratulated the NASA team on what he called an “impressive and very complicated mission,” adding that the asteroid samples are the start of a thrilling new chapter in space history.

“It's exciting because this mission launched in 2016 and so there's a feeling of, 'Wow, this day has finally come,'” he said. “But scientifically, it's exciting because this is an amazing opportunity to study a very complex story that goes way back to the dawn of the solar system.”

The samples were gathered from the surface of a near-Earth asteroid known as Bennu. The space rock, which is roughly as tall as the Empire State Building, is located more than 200 million miles away from Earth but orbits in such a way that it occasionally swings within 4.6 million miles of the planet.

Bennu's main draw owes to its age. The asteroid is estimated to have formed in the first 10 million years of the solar system's existence, making it a pristine remnant from a chaotic time more than 4.5 billion years ago. As such, studying an asteroid's chemical and physical properties is thought to be one of the best ways to understand the earliest days of the solar system.

“They're pretty well untouched from right around 4.5 billion years ago,” Betts said. “To get insights into these rocks gives real power to not just the science of asteroids but to everything in our solar system.”

Researchers are keen to understand what role — if any — asteroids played in the emergence of life on Earth. There are theories, for instance, that asteroids and comets may have delivered water and other building blocks of life to the planet.

Bennu has also been of interest to scientists because, like other near-Earth asteroids, it is classified as a potentially hazardous object. NASA's Planetary Defense Coordination Office previously estimated that there is a 1 in 2,700 chance of Bennu slamming into Earth sometime between the years 2175 and 2199.
"""

inputs = tok(text, return_tensors="pt", padding=True).to(0)  # full BatchEncoding (input_ids + attention_mask), moved to cuda:0

out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tok.batch_decode(out, skip_special_tokens=True))
>>>["The samples were collected by NASA's OSIRIS-REx spacecraft, which flew by Earth early Sunday morning and jettisoned the capsule over a designated landing zone in the Utah desert."]

If you want to disable the warning, you can just set tokenizer.model_max_length = 4096.
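For example, right after loading the tokenizer in the script above:

tok.model_max_length = 4096  # only silences the tokenizer warning; the model itself is unchanged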

@ybelkada Thanks for the pointers. Sorry I was not clear in my comment above: the concern was not the input sequence length but the output sequence length, which is more important IMO. Currently, if we try to generate an output longer than 512 tokens, this model truncates the output after reaching 512 tokens, leaving it cut off. I can see other models out there which support more than 512 output tokens out of the box. Thanks!


I see @umesh-c. If you run my script above with max_new_tokens=1000 and eos_token_id=-1 (to generate unlimited sequences), do you still see this behaviour?
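Concretely, that just means changing the generation call in the script above to something like:

out = model.generate(**inputs, max_new_tokens=1000, eos_token_id=-1, do_sample=False)  # eos_token_id=-1 prevents stopping at the EOS token, so generation runs up to max_new_tokens
print(tok.batch_decode(out, skip_special_tokens=True))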

Chiming in on this thread. I was Googling about this same issue and stumbled upon this discussion. I too am very curious to hear whether anyone has had issues generating outputs > 512 tokens using the inference settings proposed by @ybelkada . Also, @ybelkada , many thanks for taking the time to answer this thread and put this question to rest, because this was sort of a make-or-break for the project I'm currently working on.


Thanks @librehash !
Have you managed to run generation with outputs > 512 tokens? Can you share one of the failure cases you are facing?

Hello all,

Unsure if this thread is dead, but this is from the Flan-T5 paper: "The input and target sequence lengths used in finetuning are 1024 and 256, respectively." Full paragraph from the paper is quoted below:

"Instruction tuning procedure. FLAN is the instruction-tuned version of LaMDA-PT. Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset. To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k.2 We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor Optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5. The input and target sequence lengths used in finetuning are 1024 and 256, respectively. We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token. This instruction tuning takes around 60 hours on a TPUv3 with 128 cores. For all evaluations, we report results on the final checkpoint trained for 30k steps."

@MattBoraske I believe you are referring to this: https://arxiv.org/pdf/2109.01652.pdf
No, that refers to FLAN, which is a 137B-parameter model and is different from Flan-T5.

@tkarthikeyanai you're right, I was definitely wrong to use that as justification. However, I dived further into the T5 paper (https://arxiv.org/abs/1910.10683) and discovered that instead of using absolute position embeddings, it uses relative ones:

"Since self-attention is order-independent (i.e. it is an operation on sets), it is common to provide an explicit position signal to the Transformer. While the original Transformer used a sinusoidal position signal or learned position embeddings, it has recently become more common to use relative position embeddings (Shaw et al., 2018; Huang et al., 2018a). Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism. We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. For efficiency, we also share the position embedding parameters across all layers in our model, though within a given layer each attention head uses a different learned position embedding. Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. In this work, we use 32 embeddings for all of our models with ranges that increase in size logarithmically up to an offset of 128 beyond which we assign all relative positions to the same embedding. Note that a given layer is insensitive to relative position beyond 128 tokens, but subsequent layers can build a sensitivity to larger offsets by combining local information from previous layers."

To break this down into its key points:

  1. Relative Position Embeddings: Essentially, T5 is learning how far apart tokens are rather than their exact positions. It uses relative position embeddings rather than fixed position embeddings. This means that the position information encoded in the embeddings is based on the relative distances (offsets) between pairs of tokens (i.e., the "key" and "query"). This approach contrasts with absolute position embeddings where each position (token index) has a fixed embedding.
  2. Shared and Layer-Specific Position Embeddings: The position embedding parameters are shared across all layers, which enhances parameter efficiency. However, within each layer, each attention head has its own learned embeddings so that it can still create representations of token relationships within the same model layer.
  3. Range and Sensitivity to Position: The model uses a fixed number of embeddings (32) that cover logarithmically increasing ranges up to an offset of 128. Beyond this offset, all positions are treated the same. This design indicates that within a single layer, there is a limit to how much relative position information can influence the model beyond a certain distance between tokens. However, the arrangement of layers in the T5 model allows subsequent ones to build on the positional information from earlier layers. Accordingly, while a single layer might be insensitive beyond 128 tokens, deeper layers might aggregate this information to remain sensitive to broader context.
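
To make that bucketing concrete, here is a simplified, self-contained sketch of the scheme the paper describes (not the exact Hugging Face implementation, which operates on tensors): small offsets each get their own bucket, larger offsets share logarithmically sized buckets, and everything at or beyond 128 collapses into the last bucket.

import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    # Simplified bidirectional bucketing in the spirit of T5's relative position
    # embeddings: half the buckets cover positive offsets, half negative.
    num_buckets //= 2
    bucket = num_buckets if offset > 0 else 0
    offset = abs(offset)

    max_exact = num_buckets // 2  # offsets below this each get an exact bucket
    if offset < max_exact:
        return bucket + offset

    # logarithmic spacing between max_exact and max_distance
    log_bucket = max_exact + int(
        math.log(offset / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
    )
    return bucket + min(log_bucket, num_buckets - 1)

# offsets of 200 and 2000 land in the same bucket, which is why a single layer
# cannot distinguish relative positions much beyond 128 tokens
print([relative_position_bucket(o) for o in (1, 60, 200, 2000)])  # e.g. [17, 29, 31, 31]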

Now, in regard to modifying the context windows of the encoder and decoder during finetuning of Flan-T5: because of T5's use of relative position embeddings and the way those embeddings are handled across layers, there is some flexibility in how it handles different sequence lengths and structures. This suggests that we can finetune the model with increased context windows without fundamentally altering the core model's ability to process positional relationships. The Flan-T5 models that I finetuned on datasets I curated from the Reddit AITA subreddit were configured with an encoder context window of 1024 to handle the long submission texts, and they were successful. You can check them out here - https://huggingface.co/collections/MattBoraske/reddit-aita-finetuning-66038dc9281f16df5a9bab7f. Happy to talk to you more about the implementation if you're curious, @tkarthikeyanai .
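
For anyone curious, a minimal sketch of how the longer encoder window can be configured at preprocessing time might look like the following (the column names and the smaller checkpoint are illustrative placeholders, not my actual training code):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-base"  # smaller checkpoint, purely for illustration

tok = AutoTokenizer.from_pretrained(model_id)
tok.model_max_length = 1024  # lift the legacy 512 default so long inputs are not truncated

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def preprocess(example):
    # 1024-token encoder window, 256-token decoder targets
    model_inputs = tok(example["text"], max_length=1024, truncation=True)
    labels = tok(text_target=example["summary"], max_length=256, truncation=True)  # text_target needs a recent transformers version
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The key point is simply that the encoder inputs are tokenized with max_length=1024; the relative position embeddings themselves need no modification.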

I do want to issue a warning that while you can modify the Flan-T5 context windows, I'm unsure of the degree to which you can extend them. My finetuned Flan-T5 models only doubled the context window from 512 to 1024, so I'm not confident in how well it would perform on anything greater than that. Basically, if the lengths of the tokenized input texts greatly exceed those used for pretraining, Flan-T5 may lose effectiveness for positions beyond what it has learned to handle (e.g., beyond the 128-token relative offset described in the T5 paper).
