Iz Beltagy, Matthew E. Peters, & Arman Cohan (2020). Longformer: The Long-Document Transformer.
arXiv.
DOI: https://doi.org/10.48550/arXiv.2004.05150
Abstract. Introduces Longformer, which combines local sliding-window attention with a small number of globally attending tokens, reducing self-attention's quadratic cost in sequence length to linear and enabling documents thousands of tokens long to be processed (the attention pattern is sketched below).
Tags: transformer attention efficiency longformer
Cited in:
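Sketch. A minimal NumPy illustration of the attention pattern the annotation describes: each token attends within a fixed sliding window, and a handful of global tokens attend to (and are attended by) every position, so per-row work stays bounded and total cost grows linearly with sequence length. The function name, window size, and the choice of token 0 as the single global token are assumptions for illustration, not the authors' implementation.

    import numpy as np

    def longformer_attention_mask(seq_len, window, global_idx):
        """Boolean mask where mask[i, j] == True means token i may attend to j.

        Combines a +/- `window` sliding window with a few global tokens that
        attend to, and are attended by, every position. Illustrative only.
        """
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for i in range(seq_len):
            lo, hi = max(0, i - window), min(seq_len, i + window + 1)
            mask[i, lo:hi] = True      # local sliding-window attention
        mask[global_idx, :] = True     # global tokens attend everywhere
        mask[:, global_idx] = True     # all tokens attend to global tokens
        return mask

    # A typical row allows roughly 2*window + 1 positions plus the global
    # tokens, so attention cost is linear in seq_len, not quadratic.
    m = longformer_attention_mask(seq_len=4096, window=256, global_idx=[0])
    print(int(m[1000].sum()))  # 2*256 + 1 window slots + 1 global token = 514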