CS229 · Stanford University · 2018

Machine Learning

with Andrew Ng

Official course page →

Your progress in this browser

Lectures · 0 / 15 watched

Quiz · 0 / 8 correct

Progress is stored in this browser only — there is no account, no login, and no database. Clearing your browser data will reset it.

About the course

CS229 is the course that taught a generation of machine-learning practitioners their craft. Andrew Ng's autumn 2018 Stanford lectures are the most widely watched version: twenty hours of whiteboard, plus problem-set discussions, covering supervised learning (linear and logistic regression, GLMs, SVMs, decision trees, neural networks), learning theory (bias-variance, VC dimension, regularisation), unsupervised learning (k-means, mixtures of Gaussians, EM, PCA, ICA), reinforcement learning, and a quick tour of deep learning before it had eaten the field.

The course is mathematical without being daunting. Ng works derivations live, says out loud where each linear-algebra identity comes from, and is honest about which claims are heuristic. If you have read our linear algebra, calculus, probability, and ML-fundamentals chapters and want to see those tools applied end-to-end by a careful expositor, this is the course to watch.

Note: this is the 2018 cohort. Stanford's machine-learning offerings have since split across CS229, CS230 (deep learning), CS231n (vision), and CS236 (generative models), each of which goes deeper into its area. Ng's 2018 version remains the best single starting point because it shows the connective tissue between them.

Watch the lectures

Open the full playlist on YouTube →

Syllabus

Tick lectures as you finish them. Your ticks live in this browser only.

  1. Andrew Ng

    What is supervised learning. Hypothesis class, cost function. Batch and stochastic gradient descent. Normal equations.
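    Both fitting routes are sketched in code after this syllabus.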

  2. Andrew Ng

    Locally weighted regression. Logistic regression and its connection to the Bernoulli distribution. Newton's method for logistic regression.
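    Newton's method for this model is sketched after the syllabus.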

  3. Andrew Ng

    Exponential family distributions, the GLM recipe, softmax regression as multi-class GLM.
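    Softmax regression is sketched in code after the syllabus.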

  4. Andrew Ng

    Gaussian discriminant analysis, naive Bayes, Laplace smoothing. The discriminative vs generative split.

  5. Andrew Ng

    Margin geometry, the optimisation problem, kernels, the kernel trick. SMO.

  6. Andrew Ng

    The bias-variance decomposition, the union bound, VC dimension. Why uniform convergence works.

  7. Andrew Ng

    Cross-validation, $L_1$ and $L_2$ regularisation, feature selection.

  8. Guest lecturer

    CART, random forests, AdaBoost. Why ensembles work.

  9. Andrew Ng

    Feed-forward networks, backpropagation, vanishing gradients, ReLU. Convolutional layers.

  10. Andrew Ng

    Hard clustering vs soft clustering. The EM algorithm — derivation as coordinate ascent on a lower bound.
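    A minimal EM sketch for a mixture of Gaussians follows the syllabus.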

  11. Andrew Ng

    PCA from the variance-maximisation view and from the reconstruction-error view. The connection to the SVD.
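    The SVD route is sketched in code after the syllabus.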

  12. Andrew Ng

    ICA — non-Gaussian sources, cocktail-party problem. Why Gaussian factors are unidentifiable.

  13. Andrew Ng

    Markov decision processes, Bellman equations, value iteration, policy iteration.
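    Value iteration is sketched in code after the syllabus.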

  14. Andrew Ng

    Continuous-state MDPs, value-function approximation. Linear-quadratic regulators.

  15. Andrew Ng

    What to debug when. The bias-variance triage. When to collect more data vs change the model.
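The lecture summaries above name several algorithms that fit in a few lines of NumPy. The sketches that follow are illustrative only: they are not the course's own code, and the function names, learning rates, and synthetic data in them are assumptions made here for brevity. First, lecture 1's two routes to a least-squares fit, batch gradient descent and the normal equations.

```python
import numpy as np

# Synthetic data (assumed here for illustration): y = 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
X = np.column_stack([np.ones(100), x])      # design matrix with an intercept column
y = 1 + 2 * x + 0.1 * rng.standard_normal(100)

# Batch gradient descent on J(theta) = (1/2m) * ||X theta - y||^2.
theta = np.zeros(2)
alpha = 0.5                                  # learning rate (assumed)
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / len(y)    # gradient of the cost
    theta -= alpha * grad                    # stochastic GD would instead step on one example at a time

# Normal equations: theta = (X^T X)^{-1} X^T y, solved without forming the inverse.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

print(theta, theta_exact)                    # both should be close to [1, 2]
```

Both routes should agree on this data; the normal equations are exact but scale roughly as $O(n^3)$ in the number of features, which is why gradient descent takes over on large problems.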
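For lecture 2, a minimal Newton's-method fit of logistic regression. Each step solves the linear system $H\,\Delta = \nabla$ rather than inverting the Hessian; the iteration count is an arbitrary assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=10):
    """Logistic regression by Newton's method (illustrative sketch).

    X: (m, n) design matrix including an intercept column; y: (m,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)                # predicted probabilities
        grad = X.T @ (h - y)                  # gradient of the negative log-likelihood
        W = h * (1.0 - h)                     # per-example Hessian weights
        H = X.T @ (X * W[:, None])            # Hessian: X^T diag(W) X
        theta -= np.linalg.solve(H, grad)     # Newton step
    return theta
```

On linearly separable data the maximum-likelihood weights diverge and $H$ becomes ill-conditioned; adding a small ridge term to $H$ is a common safeguard.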
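For lecture 3, softmax regression, the multi-class GLM, trained by batch gradient descent on the average cross-entropy; the learning rate and iteration count are assumptions.

```python
import numpy as np

def fit_softmax(X, y, k, alpha=0.1, n_iter=500):
    """Softmax (multinomial logistic) regression by batch gradient descent (sketch).

    X: (m, n) design matrix with an intercept column; y: (m,) integer labels in {0, ..., k-1}.
    """
    m, n = X.shape
    Theta = np.zeros((n, k))
    Y = np.eye(k)[y]                           # one-hot targets, shape (m, k)
    for _ in range(n_iter):
        Z = X @ Theta
        Z -= Z.max(axis=1, keepdims=True)      # stabilise the exponentials
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)      # predicted class probabilities
        grad = X.T @ (P - Y) / m               # gradient of the average cross-entropy
        Theta -= alpha * grad
    return Theta
```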
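For lecture 10, EM for a one-dimensional mixture of Gaussians, written so the E-step (soft responsibilities) and M-step (weighted maximum-likelihood updates) are explicit. Hardening the responsibilities to 0/1 assignments gives a k-means-style algorithm. The initialisation and iteration count are assumptions.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=100):
    """EM for a one-dimensional mixture of k Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, size=k, replace=False)   # initial means: k data points (assumed)
    var = np.full(k, x.var())                   # initial variances
    pi = np.full(k, 1.0 / k)                    # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var
```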
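For lecture 11, PCA via the SVD of the centred data matrix: the right singular vectors are the eigenvectors of the sample covariance, which is exactly the connection the lecture draws.

```python
import numpy as np

def pca(X, k):
    """Top-k principal components via the SVD of the centred data (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                     # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                         # rows: directions of maximal variance
    scores = Xc @ components.T                  # projections of the data onto those directions
    explained_var = S[:k] ** 2 / (len(X) - 1)   # eigenvalues of the sample covariance
    return components, scores, explained_var
```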
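For lecture 13, value iteration on a finite MDP, with the reward written as a function of the state only; the discount factor and stopping tolerance are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP (illustrative sketch).

    P: (A, S, S) transition probabilities P[a, s, s']; R: (S,) reward per state;
    gamma: discount factor. Returns the optimal value function and a greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)        # Q[a, s] = R(s) + gamma * sum_s' P[a, s, s'] V(s')
        V_new = Q.max(axis=0)          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)          # greedy policy with respect to V
    return V, policy
```

Policy iteration replaces this repeated max-backup sweep with alternating policy evaluation and greedy policy improvement.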

Self-assessment

A short multiple-choice quiz. Click an option to commit your answer; the correct answer and an explanation then appear. Your answers are remembered in this browser only.

  1. In ordinary least squares with design matrix $X$ and target $\mathbf{y}$, the normal-equation solution is:

  2. The logistic-regression loss is the negative log-likelihood under which assumed distribution for $y \mid \mathbf{x}$?

  3. A discriminative model directly estimates:

  4. In a soft-margin SVM, the slack variable $\xi_i$ represents:

  5. Bias-variance: a high-bias / low-variance model on a fixed dataset typically:

  6. The EM algorithm for a mixture of Gaussians is best described as:

  7. PCA's principal components are the eigenvectors of:

  8. In a Markov decision process, the Bellman equation for the optimal value function $V^*(s)$ is:

This site is currently in Beta. Contact: Chris Paton


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).