posted an update Mar 9
I have just published my first blog post.

While FlashAttention has been readily integrated into HuggingFace transformers, there are much higher gains to be had (at least theoretically) for finetuning models with examples of variable sequence lengths in a batch.
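To illustrate where the gains come from, here is a minimal sketch of padding-free packing in plain Python. It assumes the usual approach of concatenating all examples in a batch into one flat sequence and tracking boundaries with cumulative sequence lengths, which is the kind of offset array that variable-length attention kernels typically consume. The names `pack_batch`, `padded_token_count`, and `cu_seqlens` are hypothetical and chosen for illustration; this is not the code from the blog.

```python
from itertools import accumulate


def pack_batch(sequences):
    """Concatenate variable-length sequences into one flat list, with no padding.

    Returns the flat token list and the cumulative sequence lengths, where
    sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    lengths = [len(seq) for seq in sequences]
    flat = [token for seq in sequences for token in seq]
    cu_seqlens = [0] + list(accumulate(lengths))
    return flat, cu_seqlens


def padded_token_count(sequences):
    """Number of token slots used when every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


# Toy batch of token ids with lengths 5, 2 and 3 (made-up values).
batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]

flat, cu_seqlens = pack_batch(batch)
padded = padded_token_count(batch)
packed = len(flat)

print(f"padded slots: {padded}, packed tokens: {packed}")
print(f"cu_seqlens: {cu_seqlens}")
```

In this toy batch, padding spends 15 token slots on 10 real tokens, so a third of the compute and activation memory goes to padding. The wider the spread of sequence lengths in a batch, the larger that fraction becomes.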

For a deeper dive, please go through my blog at

Interesting! @joaogante and @tomaarsen and @olivierdehaene might be interested in this too!

Nice blog!
@osanseviero we have been doing this in TGI and TEI for a while ;)
Padding-free implementations also make dynamic batching easier to implement and make memory usage more predictable.


Yeah, it's just that people have not been using this for finetuning, where it can give considerable memory savings. I guess the issue is the core design of HF transformers.

I am planning to release the code for this sometime soon :)

Really interesting, can't wait to see the code.


really nice blog


Thanks a lot @julien-c
means a lot coming from you :)