How is packing implemented in your code? Have you tried using a 4D attention mask to avoid the overlap between samples that you mentioned?
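
To make the question concrete, here is a rough sketch of the kind of mask I have in mind. It is only an illustration in plain Python, not a claim about how your code works: the function name `packed_attention_mask` and the `seq_lens` argument are hypothetical, and I am assuming a causal model where each packed row is described by the lengths of the samples it contains. The mask has shape (batch, head, query, key) and is block-diagonal, so tokens never attend across sample boundaries.

```python
def packed_attention_mask(seq_lens):
    """Build a block-diagonal causal mask for one packed row.

    seq_lens: lengths of the samples packed into the row, e.g. [3, 2]
              (hypothetical input format, assumed for this sketch).
    Returns a nested list of shape [1][1][L][L] (batch, head, query, key),
    where True means "this query position may attend to this key position".
    """
    # Give every token position the index of the sample it came from.
    segment_ids = [i for i, n in enumerate(seq_lens) for _ in range(n)]
    total = len(segment_ids)

    # Allow attention only within the same sample, and only to the past.
    mask_2d = [
        [segment_ids[q] == segment_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
    return [[mask_2d]]


mask = packed_attention_mask([3, 2])
for row in mask[0][0]:
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 11100
# 00010
# 00011
```

In a real training loop the nested list would be turned into a tensor and, depending on the framework, possibly converted to an additive mask; that part depends on your setup.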