Glossary

Federated Learning

Federated learning (FL), introduced by McMahan et al. (2017) at Google, is a machine learning paradigm in which a global model is trained across many decentralised clients (mobile phones, hospitals, IoT devices) without raw data being centralised. Each client computes a local model update on its private data; only the updates are transmitted to a coordinating server, which aggregates them into a new global model. FL underpins privacy-preserving features like Gboard's next-word prediction and is increasingly used in healthcare and finance.

System. A typical FL round proceeds as follows:

  1. Server samples a subset of clients $\mathcal{S}_t$ and broadcasts the current global model $\theta_t$.
  2. Each selected client $k$ runs $E$ local epochs of SGD on its private data $\mathcal{D}_k$ to produce $\theta_t^{(k)}$.
  3. Clients send $\theta_t^{(k)}$ (or the delta $\theta_t^{(k)} - \theta_t$) to the server.
  4. Server aggregates the updates into $\theta_{t+1}$.
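The four steps above can be sketched as a small simulation. This is a minimal illustration, not a production protocol: it assumes a least-squares model, in-process "clients" held as `(X, y)` arrays, and hypothetical helper names (`local_sgd`, `run_round`) and hyperparameters (`lr`, `batch_size`, `num_sampled`) chosen only for the example.

```python
import numpy as np

def local_sgd(theta, X, y, epochs, lr=0.1, batch_size=8, rng=None):
    """Step 2: run `epochs` passes of minibatch SGD on one client's private (X, y)."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = theta.copy()  # never modify the broadcast model in place
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the minibatch loss 0.5 * mean((X theta - y)^2).
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= lr * grad
    return theta

def run_round(theta, clients, num_sampled, epochs, rng):
    """One round: sample clients, train locally, aggregate by dataset size."""
    # Step 1: sample the subset S_t; passing `theta` stands in for the broadcast.
    sampled = rng.choice(len(clients), size=num_sampled, replace=False)
    # Steps 2-3: each sampled client trains locally and returns its model.
    local_models = [local_sgd(theta, *clients[k], epochs=epochs, rng=rng) for k in sampled]
    sizes = [len(clients[k][1]) for k in sampled]
    # Step 4: the server combines the returned models (size-weighted mean here).
    return np.average(local_models, axis=0, weights=sizes)

# Synthetic setup (assumed for illustration): four clients of different sizes.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
clients = []
for n_k in (20, 40, 60, 80):
    X = rng.normal(size=(n_k, 2))
    clients.append((X, X @ true_theta + 0.01 * rng.normal(size=n_k)))

theta = np.zeros(2)
for t in range(30):
    theta = run_round(theta, clients, num_sampled=2, epochs=2, rng=rng)
```

Because every synthetic client here draws from the same distribution, the global model approaches `true_theta` quickly; real deployments rarely enjoy that property.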

FedAvg. The canonical aggregation rule weights each client by dataset size:

$$\theta_{t+1} = \sum_{k \in \mathcal{S}_t} \frac{n_k}{n}\, \theta_t^{(k)}, \qquad n = \sum_{k \in \mathcal{S}_t} n_k.$$
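As a worked instance of the formula above, the following sketch computes the size-weighted mean for three clients. The function name `fedavg` and the toy parameter vectors are assumptions made for the example.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Size-weighted mean of client parameter vectors (the rule above)."""
    params = np.stack(client_params)               # shape: (num_clients, dim)
    sizes = np.asarray(client_sizes, dtype=float)  # n_k for each sampled client
    weights = sizes / sizes.sum()                  # n_k / n
    return weights @ params

theta_next = fedavg(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
    [10, 30, 60],
)
# Weights are 0.1, 0.3, 0.6, so theta_next = [0.1 + 0.6, 0.3 + 0.6] = [0.7, 0.9].
```

The client holding 60 of the 100 examples dominates the result, which is the intended behaviour of weighting by dataset size.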

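With a single full-batch local gradient step per round (the FedSGD special case) this reduces to synchronous distributed SGD; with $E \geq 1$ epochs of minibatch SGD each client takes multiple local steps per round, which reduces the number of communication rounds needed but can cost accuracy when client distributions are non-IID.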

Challenges.

  • Non-IID data. Client distributions differ in features, labels, and quantity. Local updates drift apart, slowing convergence and biasing the aggregate. Remedies: FedProx adds a $\frac{\mu}{2}\|\theta - \theta_t\|^2$ proximal term; SCAFFOLD uses control variates; FedDyn dynamically regularises towards consensus.

  • Communication. Mobile clients have bandwidth and battery constraints. Compression (sketching, quantisation, top-k sparsification) and local updates ($E > 1$) reduce traffic. Federated distillation sends logits instead of weights for very large models.

  • Systems heterogeneity. Clients differ in compute capacity, may drop out mid-round, and respond on varying schedules. Asynchronous and tiered architectures address this.

  • Privacy and security. Although raw data stays local, model updates can leak information about it. Defences include secure aggregation (a cryptographic protocol under which the server sees only the sum of updates), differential privacy (clipping client updates and adding calibrated noise), and homomorphic encryption.

  • Robustness to malicious clients. A few corrupted clients can poison the global model. Byzantine-robust aggregators (Krum, Trimmed Mean, Median) replace the simple weighted mean.
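Several of the remedies listed above are small enough to sketch directly. The functions below are illustrative only, with hypothetical names and NumPy vectors standing in for model updates: the FedProx gradient correction, top-k sparsification, the clip-then-add-noise step used in differentially private FL, and a coordinate-wise median aggregator. In particular, `noise_std` is passed in as a plain parameter; calibrating it to a privacy budget is assumed to happen elsewhere and is not shown.

```python
import numpy as np

def fedprox_grad(loss_grad, theta, theta_global, mu):
    """Local gradient plus the gradient of (mu/2) * ||theta - theta_global||^2."""
    return loss_grad + mu * (theta - theta_global)

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of an update and zero the rest."""
    sparse = np.zeros_like(update)
    keep = np.argpartition(np.abs(update), -k)[-k:]  # indices of the k largest |values|
    sparse[keep] = update[keep]
    return sparse

def clip_and_noise(update, clip_norm, noise_std, rng):
    """Scale the update to L2 norm <= clip_norm, then add Gaussian noise."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    return update * scale + rng.normal(0.0, noise_std, size=update.shape)

def coordinate_median(client_updates):
    """Coordinate-wise median, a simple Byzantine-robust alternative to the mean."""
    return np.median(np.stack(client_updates), axis=0)

rng = np.random.default_rng(0)

prox = fedprox_grad(np.zeros(2), np.array([1.0, 2.0]), np.zeros(2), mu=0.1)
# prox = [0.1, 0.2]: the term pulls theta back towards the global model.

sparse = top_k_sparsify(np.array([0.1, -3.0, 0.5, 2.0]), k=2)
# sparse = [0.0, -3.0, 0.0, 2.0]

clipped = clip_and_noise(np.array([3.0, 4.0]), clip_norm=1.0, noise_std=0.0, rng=rng)
# With zero noise the result is just the clipped vector, approximately [0.6, 0.8].

robust = coordinate_median([
    np.array([1.0, 1.0]),
    np.array([1.2, 0.9]),
    np.array([100.0, -100.0]),  # a poisoned update
])
# robust = [1.2, 0.9]; a plain mean would be pulled far away by the outlier.
```

Krum and trimmed mean follow the same pattern as `coordinate_median`: they replace the weighted mean with a statistic that a minority of clients cannot move arbitrarily.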

Variants.

  • Cross-device FL. Millions of mobile devices with low, transient availability. Example: Gboard.
  • Cross-silo FL. Tens to hundreds of organisations (hospitals, banks) with high availability and larger compute. Example frameworks: NVIDIA Clara, OpenFL.
  • Personalised FL. Clients share a global model but adapt part of it locally, for example a fine-tuned head on top of a shared trunk; addresses statistical heterogeneity.
  • Vertical FL. Clients hold different features about the same entities (linked by ID); typically relies on secure multiparty computation or related cryptographic techniques for the joint forward pass.

Convergence theory. Under bounded heterogeneity, $L$-smooth losses, and suitably chosen learning rates, FedAvg converges at rate $O(1/T)$ for strongly convex problems and $O(1/\sqrt{T})$ for nonconvex ones, where $T$ is the number of iterations and the constants degrade as client distributions diverge (Li et al., 2020). Local-SGD analyses give similar bounds.
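The two rates are usually stated for different quantities. As a hedged sketch of their shape only, with constants and precise assumptions omitted, and writing $F$ for the global objective and $F^*$ for its minimum (notation assumed here for illustration):

$$\mathbb{E}\big[F(\theta_T)\big] - F^* = O\!\left(\frac{1}{T}\right) \ \text{(strongly convex)}, \qquad \min_{0 \le t < T} \mathbb{E}\,\big\|\nabla F(\theta_t)\big\|^2 = O\!\left(\frac{1}{\sqrt{T}}\right) \ \text{(nonconvex)}.$$

The nonconvex statement bounds the gradient norm, so it guarantees approach to a stationary point rather than to a global minimum.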

Healthcare angle. Hospitals often cannot pool patient data for legal reasons but can train collaboratively under FL. Reported deployments include COVID-19 outcome prediction (Dayan et al., 2021) and brain tumour segmentation (the FeTS challenge). Combined with differential privacy, FL can help meet the privacy expectations of regulations such as GDPR and HIPAA.

Related terms: Differential Privacy, Continual Learning, Gradient Descent
