Dan Hendrycks & Kevin Gimpel (2016). Gaussian Error Linear Units (GELUs).
arXiv.
DOI: https://doi.org/10.48550/arXiv.1606.08415
Abstract. Introduces the Gaussian Error Linear Unit (GELU), a smooth activation that weights its input x by Φ(x), the standard normal CDF. GELU has since become the default activation in transformer architectures, including BERT and GPT.
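For concreteness, a minimal Python sketch of the activation as defined in the paper: the exact form x·Φ(x), plus the tanh approximation the authors give (the constant 0.044715 is theirs). The function names here are my own, not from the paper.

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation from the paper; cheaper than erf and widely
    # used in practice (e.g. in BERT's reference implementation).
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Sanity check: Phi(1) ≈ 0.8413, so gelu(1.0) ≈ 0.8413.
print(gelu(1.0), gelu_tanh(1.0))
```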
Tags: neural-networks activations gelu
Cited in: