16.21 What an AI engineer can do
Sections 16.1 through 16.20 paint a complex and often unsettling picture: alignment is unsolved, evaluation is partial, fairness theorems collide, privacy guarantees come at a cost, and the policy regime is a half-finished scaffold of voluntary commitments and incomplete statutes. The right reaction to that picture is not paralysis. It is to ask the only question that matters at the bench: given that I am going to be sitting at a keyboard tomorrow morning, building or deploying a machine-learning system, what should I actually do with the next eight hours? §16.20 was the case for restraint. §16.21 distils that case into practice. The work below is concrete, applicable today, and within the gift of the individual engineer or researcher.
Practical contributions
The single most useful frame is to treat safety as ordinary engineering hygiene rather than as a separate moral discipline. Test, document, monitor, and refuse to ship things you cannot defend. The list below is unglamorous on purpose; the field's failures in 2025 were rarely failures of imagination but failures to do the cheap, obvious things. Minimal code sketches for several of these practices follow the list.
- Test your models for fairness before shipping. Disaggregate every headline metric by the demographic groups your deployment will encounter. If you are unwilling to publish the per-group table, you are unwilling to be honest about the model's behaviour. The same applies to language, age, geography, and accessibility cohorts. A two-line groupby in the evaluation script (sketched below) costs nothing; not running it costs the people the model will harm.
- Use strong evaluations, not just accuracy. Headline accuracy is the metric that fooled an entire generation of recommender-system, recidivism-prediction, and triage teams. Pair every primary metric with calibration (Brier score, ECE; one is sketched below), per-group breakdowns, robustness to distribution shift, refusal rate on prompts the system should not answer, and at least one capability or harm benchmark drawn from the public suites (METR, AISI, OpenAI Preparedness).
- Write model cards. The model card proposed by Mitchell et al. (2019) is the highest-leverage cheap artefact in this entire chapter. Record the training data, the intended use, the populations on which it was evaluated, the populations on which it was not, the metric values, and the known failure modes (a minimal skeleton is sketched below). A model without a card is a model whose authors have refused to commit to claims about it.
- Red-team thoroughly. Hire adversarial users, internal or external, whose only job is to break the system. The Anthropic and OpenAI teams have repeatedly shown that determined human testers find capabilities and failures that automated evaluations miss. Budget the time, write up the findings, and act on them before launch rather than after.
- Build kill-switches. Every production model should have a mechanism for an on-call engineer to disable it within minutes (a minimal pattern is sketched below). This is operational hygiene, not paranoia. The systems that have caused the most public harm (automated benefit-fraud classifiers, misconfigured ranking models, runaway content generators) were the ones whose operators had no rehearsed rollback path.
- Use Constitutional AI and RLHF carefully. These are powerful tools, and they have side effects. Reward models drift, principles encoded in natural language are interpreted unpredictably, and human raters import their own biases. Run ablations. Measure the change in capability and refusal jointly, not separately (see the final sketch below), and track the per-group impact of every alignment update.
- Contribute to open safety research. Many concrete projects need help: probing benchmarks, replication of interpretability findings, sparse-autoencoder feature labelling, jailbreak taxonomies, dataset-decontamination tooling, and evaluation-harness contributions to lm-evaluation-harness or BIG-bench. A weekend a month, sustained over a year, is a meaningful body of work.
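A minimal sketch of the per-group table from the first item, assuming a pandas DataFrame of evaluation results with hypothetical column names (y_true, y_pred, y_score, group); a real evaluation script will have its own schema and a richer metric set.

```python
import pandas as pd

def per_group_table(df: pd.DataFrame, group_col: str = "group") -> pd.DataFrame:
    """Disaggregate headline metrics by demographic group."""
    def metrics(g: pd.DataFrame) -> pd.Series:
        return pd.Series({
            "n": len(g),                                          # group size: small groups mean wide error bars
            "accuracy": (g["y_true"] == g["y_pred"]).mean(),      # headline metric, per group
            "positive_rate": g["y_pred"].mean(),                  # selection / flag rate, per group
            "brier": ((g["y_score"] - g["y_true"]) ** 2).mean(),  # crude per-group calibration proxy
        })
    return df.groupby(group_col).apply(metrics)

# Publish this table alongside the headline number, e.g.:
# print(per_group_table(eval_df).round(3))
```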
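A sketch of one calibration metric from the second item: expected calibration error for a binary classifier, using the common equal-width binning scheme. The bin count is a free parameter rather than part of any standard, and other binning choices are equally defensible.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE: sample-weighted average gap between predicted confidence and observed accuracy."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])   # assign each prediction to one of n_bins equal-width bins
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            confidence = y_prob[in_bin].mean()   # mean predicted probability in the bin
            accuracy = y_true[in_bin].mean()     # observed positive rate in the bin
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece
```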
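A minimal model-card skeleton for the third item, written here as a plain Python dictionary so it can be versioned next to the model artefact and rendered to whatever format the team publishes. The field names follow the spirit of the original proposal; every value shown is a hypothetical placeholder.

```python
MODEL_CARD = {
    "model": {"name": "example-classifier", "version": "0.3.0", "date": "2026-01-15"},
    "intended_use": {
        "primary": "Triage of inbound support tickets by topic.",
        "out_of_scope": ["Any decision affecting credit, employment, housing, or benefits."],
    },
    "training_data": {
        "source": "internal ticket archive, 2021-2025",
        "known_gaps": ["non-English tickets under-represented"],
    },
    "evaluation": {
        "populations_covered": ["en", "es"],
        "populations_not_covered": ["all other languages"],
        "metrics": {"accuracy": None, "ece": None, "per_group_table": "see eval/per_group.csv"},
    },
    "known_failure_modes": ["Confuses billing and cancellation intents on short messages."],
}

# Versioned with the model and rendered at release time, e.g.:
# print(json.dumps(MODEL_CARD, indent=2))
```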
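A minimal kill-switch pattern for the fifth item. The flag source here is a caller-supplied callable standing in for whatever the team actually uses; in production it would be a config service or feature-flag store that every replica polls, so an on-call engineer can flip it without a deploy.

```python
import time

class KillSwitch:
    """Checked on every request; flipping the flag takes effect within one poll interval."""
    def __init__(self, fetch_flag, poll_seconds: float = 5.0):
        self._fetch_flag = fetch_flag        # callable returning True if the model may serve
        self._poll_seconds = poll_seconds
        self._cached = True
        self._last_poll = 0.0

    def enabled(self) -> bool:
        now = time.monotonic()
        if now - self._last_poll > self._poll_seconds:
            self._cached = bool(self._fetch_flag())   # e.g. read from a config service or flag store
            self._last_poll = now
        return self._cached

def serve(request, model, switch: KillSwitch, fallback):
    if not switch.enabled():
        return fallback(request)   # the rehearsed rollback path: rules, a cache, or a previous model
    return model.predict(request)
```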
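Finally, a sketch of the joint measurement mentioned in the Constitutional AI and RLHF item: comparing an alignment update against the previous checkpoint on capability and refusal rate together, per group, so a regression on one axis cannot hide behind a gain on the other. The run dictionaries and group keys are hypothetical.

```python
def alignment_update_report(before: dict, after: dict) -> None:
    """Each run maps group -> {"capability": float, "refusal_rate": float}."""
    for group in sorted(before):
        cap_delta = after[group]["capability"] - before[group]["capability"]
        ref_delta = after[group]["refusal_rate"] - before[group]["refusal_rate"]
        # Report the two deltas side by side: a win on one axis does not excuse a loss on the other.
        print(f"{group:>12}  capability {cap_delta:+.3f}  refusal_rate {ref_delta:+.3f}")

# Usage against a hypothetical store of evaluation runs:
# alignment_update_report(eval_runs["rlhf_v7"], eval_runs["rlhf_v8"])
```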
Career paths
For engineers and researchers who want to make safety the centre of their work rather than a side allocation, the field has structured itself enough by 2026 that there are real career paths into it. The choice is not a binary between industry and academia; the landscape is broader than that.
- Frontier safety teams at the major labs. Anthropic's Alignment Science, Interpretability, Frontier Red Team, and Trust and Safety groups; OpenAI's Safety Systems and Preparedness; Google DeepMind's Frontier Safety, Scalable Alignment, and Mechanistic Interpretability; xAI's safety function. These teams sit closest to the systems being built and have the largest direct effect on what ships. They are also the most competitive corner of the field's labour market.
- AI governance and standards bodies. The UK AI Security Institute (AISI), the US AI Safety Institute (US AISI, housed within NIST), the EU AI Office, and NIST's broader AI standards programme all hire technical staff to do model evaluation, standards development, and policy translation. Civil society organisations such as the Centre for the Governance of AI, the Centre for Long-Term Resilience, and the Future of Life Institute provide adjacent paths. These roles trade compensation for consequence.
- Academic alignment research. MIT CSAIL, Berkeley CHAI, the Cambridge Leverhulme Centre for the Future of Intelligence (LCFI), the Mila and Vector institutes, NYU's Alignment Research Group, and a growing list of smaller groups offer the freedom to work on problems further from product deadlines. Academic appointments also give the latitude to be publicly critical of frontier-lab choices, which is a contribution in itself.
- Independent audit and evaluation firms. Apollo Research, METR (formerly ARC Evals), and Redwood Research have established themselves as third-party evaluators whose work feeds into Responsible Scaling Policies and AISI testing. The roles combine engineering with applied alignment research and are particularly suited to people who want to test models without being employed by the labs that build them.
- Civil society and public-interest technology. The AI Now Institute, the Distributed AI Research Institute (DAIR), the Electronic Frontier Foundation (EFF), AlgorithmWatch, and Access Now hire technologists for advocacy, audit, and litigation support. The work is often closer to the social consequences than the lab roles are, and the pay cut relative to industry is real.
Safety is no longer a small subfield. A serious junior engineer who wants to spend a career on it can plausibly do so without leaving the front line of practice.
Think long term
The decisions made between 2025 and 2030 will shape the trajectory of AI for decades. The norms being codified now (transparency, documentation, evaluation, red-teaming, post-deployment monitoring, third-party audit) will either scale or fail to scale, and which of the two happens is settled by the engineers who do or do not follow them when it would be convenient to skip a step.
Look at the history of other safety-critical disciplines. Aviation did not become safe because regulators wrote good rules; it became safe because thousands of working pilots, mechanics, and controllers built a culture in which a near-miss is reported rather than hidden. Medicine did not become safe because hospitals adopted checklists once; it became safe, where it has, because surgeons agreed, individually and collectively, that checklists were not a slight on their judgement. Civil engineering did not become safe because the textbooks improved; it became safe because the profession adopted practices of inspection, peer review, and licensure that engineers held one another to.
AI is at an earlier point in that arc. The artefacts that will determine whether the discipline matures are not the alignment papers cited in this chapter, important as they are. They are the routine engineering practices: a model card written, a per-group metric computed, a red-team finding acted on, a rollback path rehearsed. These are not exciting decisions. They are the decisions that compound. The engineer who treats them as part of the work, every time, is building the discipline. The engineer who skips them when a deadline is tight is borrowing against the same future.
It is worth holding both the urgency of §16.19 and the epistemic humility of §16.20. The systems being built now are powerful enough to merit serious caution and uncertain enough to merit serious humility about what that caution should look like. The right answer is rarely loud. It is usually a documented model card, a calibrated per-group metric, and the willingness to delay a launch by a fortnight because the refusal-rate is wrong.
What you should take away
- Safety is engineering hygiene before it is moral philosophy. The cheap, obvious practices (disaggregated metrics, model cards, red-teaming, kill-switches, staged rollouts) are routinely skipped, and that is where the field's largest near-term wins live.
- The binding constraint on responsible AI in 2026 is not technical knowledge but engineering willingness. We know how to do considerably better than the field's median practice. The question is whether the individual engineer will do it.
- There are real, structured careers in safety now: frontier labs, AISIs, academic alignment groups, independent auditors, civil society. A serious junior practitioner can spend a full career on the problem without leaving the front line.
- The disagreements between the urgency camp and the restraint camp are real and worth understanding from primary sources. A practitioner who has read only one side is not yet calibrated; calibration is itself a contribution.
- The norms codified in this decade will compound. A model card written in 2026, a per-group metric reported honestly, a red-team finding escalated rather than buried: these are the small acts that decide what the discipline looks like in 2040. Treat them accordingly.