Ofir Press, Noah A. Smith, & Mike Lewis (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.
arXiv.
DOI: https://doi.org/10.48550/arxiv.2108.12409
Abstract. Introduces ALiBi (Attention with Linear Biases), which replaces positional embeddings by subtracting from each attention logit a penalty proportional to the distance between the query and key positions, with no learned parameters. ALiBi enables transformers trained on short sequences to extrapolate to sequence lengths far beyond those seen during training.
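A minimal NumPy sketch of the bias described above, under the assumptions that the head count is a power of two (the paper's geometric slope schedule 2^(-8/n), 2^(-16/n), ... applies exactly in that case) and that attention is causal; the function names (`alibi_slopes`, `alibi_bias`, `attention_with_alibi`) are illustrative, not from the paper.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    # Geometric slope schedule: for n heads, slopes are 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (exact when n is a power of two; the paper interpolates otherwise).
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    # Per-head bias matrix: entry [h, i, j] = slope_h * (j - i) for keys j <= i,
    # i.e. a linear penalty that grows with the query-key distance, no learned parameters.
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]      # (seq_len, seq_len), value j - i
    distance = np.minimum(distance, 0)          # keep only past/current keys
    slopes = alibi_slopes(n_heads)              # (n_heads,)
    return slopes[:, None, None] * distance     # (n_heads, seq_len, seq_len)

def attention_with_alibi(q, k, v, bias):
    # q, k, v: (n_heads, seq_len, d_head); bias: (n_heads, seq_len, seq_len).
    d_head = q.shape[-1]
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head) + bias
    # Causal mask: future positions are excluded from the softmax.
    seq_len = q.shape[1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the bias depends only on relative distance, the same `alibi_bias` construction can be applied at inference to sequences longer than any seen during training.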
Tags: transformer positional-encoding alibi