Jimmy Lei Ba, Jamie Ryan Kiros, & Geoffrey E. Hinton (2016)
arXiv.
DOI: https://doi.org/10.48550/arxiv.1607.06450
Abstract. Proposes layer normalisation, which computes the normalisation mean and variance over the hidden units of a layer for a single example rather than over a mini-batch. Because the statistics do not depend on batch size, the same computation applies at training and test time. Layer norm has since become standard in transformer architectures.
Tags: regularisation layer-normalisation
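
Note. The per-example statistics described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `layer_norm`, the `eps` stability constant, and the optional `gain` and `bias` arguments are assumptions chosen for illustration.

```python
import numpy as np

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalise each example over its own feature axis (the last axis).

    x    : array of shape (batch, features)
    gain : optional per-feature scale, shape (features,)  (assumed name)
    bias : optional per-feature shift, shape (features,)  (assumed name)
    eps  : small constant for numerical stability (an assumption here)
    """
    # Statistics are taken per row, so no other example in the batch is involved.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    if gain is not None:
        x_hat = x_hat * gain
    if bias is not None:
        x_hat = x_hat + bias
    return x_hat

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 5.0]])
y = layer_norm(x)

# Normalising the first example alone gives the same result as normalising
# it inside the batch, which is the batch-size independence noted above.
same = np.allclose(layer_norm(x[:1])[0], y[0])
```

Batch normalisation would instead take the mean and variance along axis 0 (across examples), so its output for one example changes with the rest of the batch.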