How does the self-attention mechanism in Transformers enable capturing dependencies in input sequences?
Self-attention lets every position in the input sequence attend directly to every other position: each token is projected into query, key, and value vectors, attention weights are computed as softmax(QKᵀ/√d_k), and the token's new representation is the weighted sum of all value vectors. Because any two positions interact in a single step, rather than through many recurrent steps as in an RNN, dependencies are captured regardless of distance, which is why Transformers excel at tasks requiring long-range context.
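A minimal NumPy sketch of single-head scaled dot-product self-attention may make this concrete. The function name, dimensions, and random projection matrices below are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def scaled_dot_product_self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model).

    W_q, W_k, W_v project the inputs into queries, keys, and values.
    Each output row is a weighted sum of all value vectors, so every
    position can draw on every other position in one step.
    """
    Q = X @ W_q                      # queries, (seq_len, d_k)
    K = X @ W_k                      # keys,    (seq_len, d_k)
    V = X @ W_v                      # values,  (seq_len, d_v)
    d_k = Q.shape[-1]

    # Pairwise compatibility scores between every query and every key,
    # scaled by sqrt(d_k) to keep the softmax well-conditioned.
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)

    # Softmax over keys: each row becomes a distribution of attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output position mixes information from the whole sequence.
    return weights @ V               # (seq_len, d_v)

# Toy usage: 5 tokens, d_model = 8, random projections (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```

Note that the (seq_len, seq_len) score matrix is what gives self-attention its constant path length between any two tokens, at the cost of quadratic memory in sequence length.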