
Applying Engineering Discipline to AI Projects
Most AI projects fail for structural reasons, not technical ones. Here's what applying real engineering discipline to AI work actually looks like in practice.
Most AI projects don't fail because the model was wrong. They fail because nobody agreed on what success looked like, or because the experiment that worked in a notebook couldn't be reproduced two weeks later. Applying engineering discipline to AI projects isn't about adding process overhead; it's about building the habits that make the work reliable.
The Problem With How Most AI Projects Are Run
I've been in the meeting where nobody can explain why the numbers changed. The model worked last Thursday. The demo went fine. Three days after deployment, the outputs look different, and the team is staring at a notebook that has been run and re-run in a different order than it was written, with cells missing, with a preprocessing step that might have been updated or might not have been, with no record of which version of the data produced the result everyone agreed was good. The meeting has no resolution. The knowledge is just gone.
That's not bad luck. That's a structural failure, and it's extremely common.
There are three patterns I keep seeing in AI and machine learning projects that hit this kind of wall. The first is the prototype-to-production gap: something works beautifully in a notebook, in a controlled environment, with training data that was quietly hand-cleaned, and it stops working at scale or in a new context, and nobody is sure why because the notebook was never really a reproducible artifact. The second is the absence of shared success criteria: the team optimizes a metric that turns out not to matter, or they can't agree on what "working" looks like, and the project drifts. The third is model selection treated as the whole job: enormous energy goes into comparing architectures or algorithms, while data quality, deployment infrastructure, and monitoring are treated as afterthoughts to get to later.
None of these patterns are caused by incompetence. Smart people, working hard, still produce fragile AI systems when there's no shared engineering culture around how to build them. The data on how many AI projects actually reach production, and stay there, is not encouraging, and the Stanford AI Index has been documenting this gap between adoption intentions and production outcomes for several years. The number of organizations that successfully move machine learning work from pilot to sustained production use is consistently smaller than the number that try.
The fix isn't a new tool or a better model; it's a shift in how the work is organized and understood.
What "Engineering Discipline" Actually Means Here
I want to be careful with the phrase "engineering discipline," because it can land badly. Say it in a room full of people who work in machine learning and a certain percentage will hear: waterfall process, Gantt charts, change request forms, meetings to schedule meetings. The kind of overhead that would crush the fast, iterative, experiment-driven work that makes this field interesting.
That's not what I mean. There's a distinction worth holding onto between engineering rigor and engineering ceremony. Rigor means: you know what you're testing, you can reproduce the result, you have a plan for when things go wrong. Ceremony means: you write documents nobody reads and produce artifacts whose only purpose is to prove you produced them. The argument here is for rigor, and it is as firmly against ceremony as any skeptic in that room would be.
Carnegie Mellon's Software Engineering Institute has been working to define what AI engineering actually means as a discipline, framing it around the application of systematic, disciplined, quantifiable approaches to the development and operation of AI systems. What I appreciate about that framing is that it doesn't pretend AI work is just regular software development with a different library. It acknowledges that artificial intelligence and machine learning introduce a kind of uncertainty that traditional software development doesn't have at its core. You can't fully specify the behavior of a system that learns from data. The model's behavior is an emergent property of the training process, the data distribution, and the deployment context, all of which can shift.
So the engineering habits that work here have to be adapted for that ambiguity, not imported wholesale from embedded systems development or enterprise software processes. Uncertainty is a first-class citizen in this field. The goal isn't to eliminate it. The goal is to build habits and instincts, around problem formulation, experiment hygiene, exit criteria, and designing for failure, that let you work responsibly inside it.
The rest of this is about what those habits look like in practice, and why each one is harder to maintain than it sounds.
Start With the Problem, Not the Model
Here's a scenario that is more real than hypothetical: a team spends three weeks running a careful comparison between gradient-boosted trees and a neural network architecture for a predictive task. The experiment is well-executed. The metrics are tracked. At week three, someone asks what the model's output will actually be used for, who sees it, what decision it informs, what happens operationally if it's wrong. It turns out the team has different, incompatible answers to that question. The model comparison was real work. It was just the wrong work.
The discipline here is to write down the success criteria before touching a model. What does the system need to do? For whom? What does failure cost, financially, operationally, in user trust? What data do you actually have versus what you assumed you'd have? These are engineering requirements management questions, and they feel awkward in artificial intelligence and machine learning contexts because there's always ambiguity at the edges. The honest answer to that awkwardness is not to skip the questions. It's to answer them provisionally, document the assumptions explicitly, and revisit them at defined points rather than letting them drift.
The relationship with traditional software engineering is worth naming here. In classical development, requirements are often treated as something you finalize before building, and the failure modes of that approach are well-documented. In AI development, you're often discovering requirements through prototyping. That's legitimate and necessary. The engineering discipline isn't to eliminate that iteration; it's to ensure that each iteration is testing something specific and documented, not just moving in a direction that feels productive.
Misaligned success metrics are one of the leading causes of AI project failure, not because teams are careless, but because the question of what the model should optimize for is often genuinely hard, and there's always pressure to start building before it's resolved. Sitting with that question longer than feels comfortable is not slowing the project down. It's avoiding the much larger slowdown of rebuilding the problem statement from scratch at week eight.
Getting this right doesn't guarantee success. But skipping it almost guarantees the wrong kind of failure.
Treat Experiments Like Code, Not Scratchpads
You are six weeks into a project. Something that worked in week two has stopped working, and you need to understand why. The notebook exists, but it has been run and re-run in a nonlinear order. Some cells may or may not reflect the current preprocessing logic: a colleague updated the feature engineering step, and you're not certain whether the version that produced the good result came before or after that update. The hyperparameter configuration that worked might be in a comment, or in a Slack message from three weeks ago, or in your memory, and your memory is imperfect. The result is unreproducible. The work of six weeks has partial institutional value at best.
This scenario is not a cautionary tale. It's just Tuesday in a lot of organizations doing machine learning development. What separates teams that escape it from teams that don't is rarely better tools; it's better habits.
What the field now calls MLOps, the operational discipline of managing AI and machine learning systems through their full lifecycle, is essentially the codification of these habits at scale. Google's MLOps architecture guidance captures a lot of the operational thinking here: versioned pipelines, tracked experiments, reproducible training runs, monitored deployments. The infrastructure is real and worth building. But the underlying habits matter more than the tooling, especially in early-stage work.
Three habits carry most of the weight. The first is experiment logging: every run should record inputs, parameters, and outputs, not just the final result, but the context in which you tried it and why. Tools like MLflow, Weights & Biases, and DVC exist in this space; the specific choice matters less than the practice. The second is versioned configurations: treat config as code. If you cannot reconstruct the exact settings that produced a result, the result is not reproducible, and unreproducible results have limited engineering value. The third, the most undervalued, is written reasoning: noting why you tried something, not just what happened. The difference between a notebook as scratchpad and a notebook as institutional record is almost entirely in this habit. The negative results, the dead ends, the "we ruled this out because" notes, these are often more valuable than the positive results, because they prevent the team from running the same experiment three months later and reaching the same dead end.
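As a rough illustration of the three habits together, here is a minimal sketch that uses only the Python standard library. The function names (`config_fingerprint`, `log_run`), the record fields, and the JSON Lines log file are assumptions made for this example, not the API of MLflow, Weights & Biases, or DVC.

```python
import hashlib
import json
import time
from pathlib import Path


def config_fingerprint(config: dict) -> str:
    """Return a short, deterministic hash of a config dict (hypothetical helper)."""
    # sort_keys makes the hash independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_run(log_path: Path, config: dict, metrics: dict, reason: str) -> dict:
    """Append one experiment record (inputs, outputs, reasoning) to a JSONL file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "config_id": config_fingerprint(config),
        "config": config,      # the exact settings, so the run can be reconstructed
        "metrics": metrics,    # what happened
        "reason": reason,      # why the run was tried, including dead ends
    }
    # Append-only: earlier runs are never overwritten.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

A call might look like `log_run(Path("runs.jsonl"), {"lr": 0.1, "depth": 6}, {"auc": 0.81}, "baseline before feature changes")`. If the log file lives in version control next to the code, the config, the result, and the reasoning travel together.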
What you are building, with these habits, is not just a model. You are building institutional knowledge: an account of what the current system is, how you got there, and what you ruled out along the way. Future-you, six months from now, handed a production issue at 2 a.m., will be grateful for that record in a way that is hard to appreciate in the moment when writing it feels like overhead. That's the case for engineering documentation in AI development, adapted for the iterative and probabilistic nature of machine learning work.
Define Done Before You Start Iterating
There is a particular failure mode almost unique to this field: AI projects that never finish, not because the work is abandoned, but because there is always another experiment to run. Another hyperparameter to tune. Another data augmentation strategy that might move the metric two more points. The models and systems built in this space can always be marginally better. Without a definition of done, continuing to iterate isn't laziness or indiscipline; it's a rational response to a system of incentives that has no natural stopping point.
This is genuinely different from traditional software engineering, where done is more legible. Either the feature works or it doesn't. Either the bug is fixed or it isn't. In machine learning development, the output is a distribution, the metric is a continuous variable, and "good enough" is a judgment call that can always be deferred. The infinite-iteration trap is structural, not personal.
The practical fix, and I want to be honest that it requires more organizational will than it sounds, is to set evaluation benchmarks before the first experiment runs. Not as a bureaucratic requirement, but as a navigational tool. What metric, at what threshold, on what evaluation set, would make you confident enough to ship? What constitutes good enough for this specific application, given what failure costs and what the deployment context is? These questions should be answered provisionally before iteration begins, and revisited explicitly at defined intervals rather than quietly abandoned when the pressure to ship increases.
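One way to make "what metric, at what threshold, on what evaluation set" concrete is to write the answers down as data before the first run. The sketch below is only an illustration: `ExitCriterion`, `ready_to_ship`, and the example metric and evaluation-set names are hypothetical, and real thresholds would come out of the stakeholder conversation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExitCriterion:
    """One provisional 'done' condition, agreed before iteration starts."""
    metric: str                    # e.g. "recall"
    threshold: float               # value agreed up front
    eval_set: str                  # name of the frozen evaluation set
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def ready_to_ship(
    criteria: list[ExitCriterion],
    results: dict[tuple[str, str], float],
) -> tuple[bool, list[str]]:
    """Compare measured results to every criterion; return (ok, unmet descriptions)."""
    unmet = []
    for c in criteria:
        value = results.get((c.metric, c.eval_set))
        # A missing measurement counts as unmet rather than silently passing.
        if value is None or not c.is_met(value):
            op = ">=" if c.higher_is_better else "<="
            unmet.append(f"{c.metric} on {c.eval_set}: got {value}, need {op} {c.threshold}")
    return (not unmet, unmet)
```

Because the dataclass is frozen, changing a threshold means creating a new criterion rather than quietly editing the old one, which fits the idea of revisiting the definition explicitly at defined intervals.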
Frameworks like NIST's AI Risk Management Framework treat structured evaluation, knowing what you're testing for and what thresholds constitute acceptable performance, as a foundational engineering practice in responsible AI development. The risk framing is useful here because it connects evaluation rigor to something stakeholders already care about: not just whether the model is technically good, but whether it's safe to deploy and trustworthy in its context of use.
That stakeholder dimension matters. "Done" in AI projects is often a negotiation, not a purely technical decision. A model that satisfies an engineering criterion may fail a business criterion. A model that looks excellent on the evaluation set may encounter distribution shift in its first week of production data. Involving stakeholders in defining done is not a concession to organizational politics; it's a recognition that the success criteria for these systems are legitimately shared between the people who build them and the people who depend on them.
The version of done you define on day one will probably change. That's fine. What's not fine is never having defined it.
Key takeaways
- Engineering discipline in AI is a set of habits and instincts, not a compliance framework. The goal is rigor, not ceremony, and the two are not the same thing.
- Problem formulation matters more than model selection. Most costly failures can be traced back to starting with the model before the success criteria and deployment context were clearly understood.
- Experiments need to be reproducible institutional records, not scratchpads. Logging parameters, versioning configs, and writing down your reasoning are the habits that make AI development sustainable over time.
- Exit criteria should be defined before iteration begins. The infinite-iteration trap is structural, and the only reliable counter to it is committing to a definition of done, provisionally, with explicit revision points, before the work starts.
- Stakeholders are part of the success-criteria conversation, not recipients of a finished product. Done is a negotiation, and deferring that conversation consistently produces the wrong kind of surprise.
FAQ
Isn't engineering discipline at odds with the exploratory nature of AI work?
Not in my experience, but the framing matters. Exploration works better, not worse, when experiments are logged and reasoned about, when you have clear records of what you've ruled out and why. The argument isn't for rigid process; it's for making your exploration legible to yourself and your team six weeks later. Ceremony is the enemy of exploration. Rigor isn't. Natural language processing (NLP) in particular benefits from this kind of documentation, because the gap between what a language model does on a controlled test set and what it does in real social media or product design contexts can be large and nonobvious.
How do you define success criteria when the problem itself is ambiguous?
Provisionally, explicitly, and with a plan to revisit. The common mistake is treating the ambiguity as a reason to skip the question. A better response is to write down your current best answer, including the assumptions it rests on, and schedule a deliberate check-in where you interrogate those assumptions based on what you've learned. Ambiguity isn't a permission slip to avoid commitment; it's an argument for making your commitments revisable rather than permanent. This is especially important in domains like smart manufacturing or supply chain optimization, where the stakes of getting the success criteria wrong are high.
What's the minimum version of experiment tracking that's actually useful?
More minimal than most practitioners assume. A structured README per experiment, a logged config file, and a short note on why you ran it, all committed to version control, will get you most of the value before you touch a dedicated tracking tool. The tools are worth adopting eventually, but the habit of treating each run as a documentable decision is the thing that transfers. You can have MLflow running everywhere and still have no institutional knowledge if the reasoning isn't captured. For data mining and data quality work especially, this discipline prevents the proliferation of conflicting versions of truth about what the data actually contains.
How do you handle the stakeholder negotiation around "done" when stakeholders keep moving the target?
This is genuinely hard and I don't want to minimize it. The most useful thing I've found is to document the agreed evaluation criteria at a point in time and treat any revision as a formal change, not because bureaucracy is useful for its own sake, but because it surfaces the cost of the change and forces a real conversation about tradeoffs. Stakeholders move targets more often when they don't feel the cost of doing so. Making the revision explicit makes the cost visible. In product design contexts, this often means explicitly acknowledging that reworking an application because requirements shifted is legitimate, but it comes at a real time cost.
Where should a team start if they're trying to introduce more engineering discipline into an existing AI project mid-stream?
Experiment logging and reproducibility. Not because they're the most important in theory, but because they're the most immediately recoverable deficit and they affect everything downstream. If you can't reproduce your current results, you can't evaluate whether any change you make is an improvement or noise. Getting that foundation in place, even imperfectly, even retroactively documenting what you can, creates the ground you need to stand on for the rest of the work. This is true whether you're working on generative design systems, smart manufacturing optimization, or any other application of artificial intelligence. You can read more about how I think about the broader practice of structured technical work on the blog, and the framing behind this kind of approach is something I return to often on the main site.
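To make the "improvement or noise" point concrete, here is a small sketch of two checks a team might run first. `train_and_evaluate` is a hypothetical stand-in for a real training run, and the helper names, seeds, and tolerance are assumptions for illustration only.

```python
from __future__ import annotations

import random


def train_and_evaluate(config: dict) -> float:
    """Hypothetical stand-in for a real training run; returns a single metric."""
    rng = random.Random(config["seed"])  # all randomness flows from the seed
    noise = rng.gauss(0.0, 0.01)
    return round(0.80 + 0.05 * config["learning_rate"] + noise, 6)


def is_reproducible(run, config: dict, repeats: int = 3, tolerance: float = 0.0) -> bool:
    """Rerun the same config several times and check the metric stays within tolerance."""
    values = [run(dict(config)) for _ in range(repeats)]
    return max(values) - min(values) <= tolerance


def seed_spread(run, config: dict, seeds: list[int]) -> float:
    """Metric range across seeds: a rough floor for what counts as a real change."""
    values = [run({**config, "seed": s}) for s in seeds]
    return max(values) - min(values)
```

If `is_reproducible` fails for the current best configuration, that is the first thing to fix. Once it passes, a metric gain smaller than `seed_spread` is hard to tell apart from noise.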
Do these principles apply the same way to small projects and large teams?
The principles apply broadly, but the implementation scales. A solo practitioner working on a three-week project doesn't need an MLOps platform; they need the habit of versioning their config and writing a sentence about why each experiment ran. A ten-person team working on a production system for a year needs more infrastructure and more explicit coordination. What doesn't scale down to zero, in my view, is the problem formulation step, the commitment to write down success criteria before iteration starts. That pays off even in a project that runs for a single week. Whether you're building natural language processing systems, doing data mining analysis, or designing a supply chain optimization tool, this discipline translates.