As the penalty grows, Lasso sets coefficients to zero one by one; Ridge shrinks them all together.
From the chapter: Chapter 6: ML Fundamentals
Glossary: lasso, ridge regression, regularisation
Transcript
Linear regression with many features. The model overfits. Add a penalty.
Ridge regression adds the L2 penalty. The sum of squared coefficients. Solve, and every coefficient shrinks toward zero.
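A minimal sketch of that solve, assuming nothing beyond NumPy; the toy data (X, y, the true coefficients) is invented here for illustration.

```python
# Ridge regression via its closed form: solve (X^T X + lam*I) beta = X^T y.
# Toy data -- names and sizes are illustrative, not from the transcript.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only a few features truly matter
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 10.0                                 # penalty strength, an arbitrary choice
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.round(3))                 # every entry shrunk, none exactly zero
```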
Plot each coefficient as a function of the penalty strength. As the penalty grows, all coefficients smoothly shrink. None become exactly zero.
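A sketch of that plot, reusing the toy X, y, p from the ridge snippet above and matplotlib for the drawing.

```python
# Refit ridge over a grid of penalties and trace each coefficient.
import numpy as np
import matplotlib.pyplot as plt

lams = np.logspace(-2, 4, 50)
path = np.array([np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
                 for lam in lams])         # shape (50, p)

plt.semilogx(lams, path)                   # one smooth curve per coefficient
plt.xlabel("penalty strength λ")
plt.ylabel("coefficient value")
plt.title("Ridge path: smooth shrinkage, no exact zeros")
plt.show()

print("exact zeros anywhere on the path:", int((path == 0.0).sum()))
```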
Lasso adds the L1 penalty. The sum of absolute values of coefficients. Solve, and the geometry changes.
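One way to see the changed geometry: with an orthonormal design (X^T X = I), the lasso solution has a closed form, soft-thresholding of the least-squares coefficients. A self-contained sketch with made-up numbers:

```python
# Soft-thresholding: the lasso solution when X^T X = I.
# Coefficients shrink by lam and are clipped at zero -- exact zeros appear.
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b_ols = np.array([3.0, -2.0, 1.5, 0.4, -0.2])   # toy least-squares estimates
for lam in (0.0, 0.5, 1.0, 2.0):
    print(f"lam={lam}: {soft_threshold(b_ols, lam)}")
```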
Plot each coefficient as a function of the penalty strength. As the penalty grows, individual coefficients hit zero and stay there. One by one, features drop out of the model. At very high penalty, only a handful of coefficients remain non-zero.
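scikit-learn can trace this path directly; a sketch using lasso_path on the same toy X, y (the plot labels are mine):

```python
# Lasso coefficient paths: watch features drop out one by one.
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

alphas, coefs, _ = lasso_path(X, y)        # coefs: (n_features, n_alphas)
plt.semilogx(alphas, coefs.T)
plt.xlabel("penalty strength α")
plt.ylabel("coefficient value")
plt.title("Lasso path: coefficients hit zero and stay there")
plt.show()

nz = (coefs != 0).sum(axis=0)              # non-zero count at each α
print("features kept, strongest penalty first:", nz[::10])
```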
This is feature selection by optimisation. Lasso prefers sparse solutions because the L1 ball has corners along the axes.
Why use one over the other? Use Ridge when you believe many features contribute small amounts. Use Lasso when you believe only a few features matter and want to identify them.
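In practice the choice can also be cross-validated rather than decided by belief alone; a sketch with scikit-learn's RidgeCV and LassoCV on the toy data above (the α grid is arbitrary):

```python
# Let cross-validation pick the penalty for each method.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

ridge = RidgeCV(alphas=np.logspace(-2, 4, 50)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)

print("ridge α:", ridge.alpha_)
print("lasso α:", lasso.alpha_,
      "| features kept:", int((lasso.coef_ != 0).sum()))
```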
Elastic net combines both penalties. Modern penalised methods, like adaptive Lasso, group Lasso, and L0-relaxations, extend the same idea.
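A sketch of the elastic net blend on the same toy data; l1_ratio sets the mix (1.0 is pure Lasso, 0.0 pure Ridge), and both values here are arbitrary:

```python
# Elastic net: a weighted combination of the L1 and L2 penalties.
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("features kept:", int((enet.coef_ != 0).sum()))
```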