Express Language Modeling

Albert Gong; A. Carrell; Raaz Dwivedi; Lester Mackey

June 2026

arXiv

We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^{2} \log^{2}(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.