absorb.md

Attention Sinks and Compression Valleys in LLMs Are Two Sides of the Same Coin

1 mention across 1 person

Yann LeCun
paper · 2025-10-07
Recommended

In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle laye
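The claim that a massive activation forces representational compression can be illustrated with a small numerical sketch. It is not the paper's code. It assumes compression is measured as the Shannon entropy of the normalized squared singular values of a token-by-feature matrix, which may differ from the paper's exact definition. The matrix is random Gaussian data rather than real model activations, and the helper names `spectral_entropy` and `with_massive_first_token` are hypothetical.

```python
import numpy as np


def spectral_entropy(X):
    """Shannon entropy (in nats) of the normalized squared singular values of X.

    Assumption: this stands in for the representation entropy discussed in the
    abstract; the paper's own measure may be defined differently.
    """
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]  # drop exact zeros so log() is well defined
    return float(-np.sum(p * np.log(p)))


def with_massive_first_token(X, scale):
    """Return a copy of X whose first row (a stand-in for the BOS token) is scaled up."""
    Y = X.copy()
    Y[0] *= scale
    return Y


# Toy "residual stream": 32 tokens, 64 features, i.i.d. Gaussian entries.
rng = np.random.default_rng(0)
X = rng.standard_normal((32, 64))

baseline = spectral_entropy(X)
compressed = spectral_entropy(with_massive_first_token(X, scale=1000.0))

print(f"entropy without massive activation: {baseline:.4f}")
print(f"entropy with massive first token:   {compressed:.4f}")
```

When one row's norm dwarfs all the others, nearly all of the squared singular-value mass sits in a single direction, so the entropy collapses toward zero. This is the qualitative effect the abstract describes, shown here only on synthetic data.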

Unified Theory for Attention Sinks and Compression Valleys in Large Language Mod