Model Diagnostics

These past few weeks, I’ve been training big neural networks on massive datasets. It’s felt quite nostalgic: the final stretch of my AI degree was a six-month thesis spent building neural networks to predict European heatwaves a couple of weeks in advance. Now I’m back behind my laptop, watching the bumps and dips of my model on the dashboard, caught once again in that familiar rhythm.

The process is rewarding, in the sense that every model you train might do something better than it’s ever been done before. The process is also frustrating, in the way that one misconfigured line can silently derail hours of training, leaving you wondering if the bug is in the code or just a fundamental part of who you are.

Training a Black Box

Neural networks are often described as “black boxes,” which feels fair as far as trained models go. From the outside, they can be remarkably hard to interpret. But during training, the metaphor starts to fall short. The system isn’t static or sealed off; it shifts constantly in response to data and gradients. It reacts, destabilizes, recovers, and sometimes surprises you.

At its core, training isn’t all that mysterious. It’s a process of adjusting the internal parameters of a model so it gets slightly better at a task over time. We track that progress with a loss function: a number that measures how far the model’s predictions are from the correct answers. Lower is better. Each update nudges the parameters in the direction that reduces the loss, so over many steps the model makes fewer mistakes.
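To make that concrete, here is a minimal sketch of the loop, assuming a toy one-parameter model `y = w * x` and a mean squared error loss. The data, the learning rate, and the function names are illustrative assumptions, not the setup I actually train with.

```python
# Toy example: fit y = w * x with gradient descent on mean squared error.
# The data below is made up and was generated with a "true" w of 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_loss(w):
    """How far the predictions w * x are from the correct answers, on average."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(w=0.0, learning_rate=0.01, steps=200):
    """Nudge w against the gradient and record the loss before each update."""
    losses = []
    for _ in range(steps):
        losses.append(mse_loss(w))
        w -= learning_rate * mse_grad(w)
    return w, losses

w, losses = train()
print(f"learned w = {w:.4f}")
print(f"loss went from {losses[0]:.4f} to {losses[-1]:.2e}")
```

A real network has millions of parameters instead of one, but the loop has the same shape: measure the loss, compute a gradient, take a small step.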

Lately, when I try to explain what I do to friends outside the field, I find myself leaning on a different metaphor than the famous black box: instead I talk about diagnostics. At the risk of sounding a little grandiose, I tend to explain that the workflow feels somewhat like treating a patient. You monitor vital signs, form hypotheses about what might be going wrong, run a few tests, and then, hopefully, find a way to fix it. Let me explain.

The Diagnostic Process

Step 1: Check the vitals

We may not be able to look directly inside the model, but we do have instruments to monitor its condition. These are the vitals: training loss, validation accuracy, gradient stability. Depending on the task, we might also inspect outputs more directly by visualizing predicted weather fields, plotting trajectories, or generating attention maps. Together, these signals tell us whether the model is learning something meaningful or quietly drifting into failure.
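As a rough sketch, a vitals check might look like the function below. It assumes you have logged per-epoch training loss, validation accuracy, and gradient norms as plain lists; the name `check_vitals` and both thresholds are hypothetical and would need tuning for any real task.

```python
import math

def check_vitals(train_losses, val_accuracies, grad_norms,
                 grad_norm_limit=100.0, accuracy_tolerance=0.02):
    """Return a list of warnings for one training run (empty list = looks healthy)."""
    warnings = []

    # Training loss: should be finite and lower than where it started.
    if any(not math.isfinite(loss) for loss in train_losses):
        warnings.append("training loss is NaN or infinite")
    elif len(train_losses) >= 2 and train_losses[-1] > train_losses[0]:
        warnings.append("training loss is higher than at the start")

    # Validation accuracy: should not fall far below its best value so far.
    if val_accuracies and val_accuracies[-1] < max(val_accuracies) - accuracy_tolerance:
        warnings.append("validation accuracy has dropped below its best value")

    # Gradient stability: very large or non-finite norms suggest instability.
    if any(not math.isfinite(g) or g > grad_norm_limit for g in grad_norms):
        warnings.append("gradient norm is non-finite or above the limit")

    return warnings

healthy = check_vitals([1.0, 0.5, 0.3], [0.60, 0.70, 0.80], [2.0, 1.5, 1.0])
unwell = check_vitals([1.0, float("nan")], [0.80, 0.60], [5.0, 500.0])
print(healthy)
print(unwell)
```

In practice a dashboard does this job visually, but writing the checks down forces you to say what "healthy" means for a given run.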

Step 2: Form a differential diagnosis

Sometimes something looks off: the loss plateaus, performance on new data drops, or the predictions look subtly wrong. That’s when hypothesis-building begins. Is the learning rate too high? Are the labels noisy? Is the model too small, or too large? This stage is about generating plausible explanations from symptoms, intuition, and prior experience. Like any diagnostic process, it requires careful observation and a willingness to be wrong.
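One of these symptoms, the plateau, is easy to pin down in code. The sketch below compares the average loss over the most recent steps with the average over the steps just before them. The function name, the window size, and the 1% threshold are assumptions chosen for illustration.

```python
def has_plateaued(losses, window=5, min_relative_improvement=0.01):
    """True if the mean loss over the last `window` steps improved by less than
    `min_relative_improvement` relative to the window before it.

    A loss that got worse also counts, since its improvement is negative.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge

    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous == 0:
        return True  # already at zero loss, nothing left to improve

    improvement = (previous - recent) / abs(previous)
    return improvement < min_relative_improvement

still_learning = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
stuck = [1.0] * 10

print(has_plateaued(still_learning))
print(has_plateaued(stuck))
```

A flag like this only names the symptom. Whether the cause is the learning rate, the labels, or the model size is still a hypothesis to test.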

Step 3: Test and verify

With a few hypotheses in hand, the next step is intervention, preferably small and controlled. Adjust the learning rate schedule, add regularization, simplify the architecture, rerun with a different seed. The goal isn’t just to fix the model, but to learn something about how it behaves. Each experiment helps rule in or rule out a possible cause. You don’t jump straight to surgery; you start with the least invasive checks.
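A small, controlled intervention can be as simple as changing one setting while holding everything else fixed. The sketch below sweeps the learning rate on a toy linear model and reuses the same seeds for every setting, so any difference in the final loss comes from the learning rate rather than the data. The model, the synthetic data, and all names here are illustrative assumptions.

```python
import random

def make_data(seed, n=50, true_w=2.0, noise=0.1):
    """Synthetic data y = true_w * x + noise, reproducible for a given seed."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def final_loss(xs, ys, learning_rate, steps=100):
    """Train y = w * x with gradient descent and return the final mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

def sweep(learning_rates, seeds):
    """Run every learning rate on the same seeds so the runs are comparable."""
    results = {}
    for lr in learning_rates:
        results[lr] = [final_loss(*make_data(seed), learning_rate=lr) for seed in seeds]
    return results

results = sweep(learning_rates=[0.001, 0.1, 5.0], seeds=[0, 1, 2])
for lr, losses in results.items():
    print(lr, [f"{loss:.3g}" for loss in losses])
```

On this toy problem the smallest rate barely moves in 100 steps, the middle one converges, and the largest diverges. Repeating the comparison across several seeds helps separate a real effect from a lucky run.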

Step 4: Diagnose and treat

Once a pattern emerges, you can commit to a course of action. Sometimes the fix is obvious: reduce overfitting with dropout, clean up mislabeled data, normalize the inputs. Other times, the treatment is more experimental: architectural changes, different optimizers, or even a reformulation of the problem itself. Even with the right diagnosis, success is rarely immediate. Treatment often needs adjustment before the model stabilizes. And sometimes the diagnosis points to something deeper: maybe the dataset simply doesn’t contain enough signal, or the model is pushing against the limits of what current methods can represent.

From Engineering to Craft

Of course, the metaphor has limits. Training a neural network is not the same as treating a patient; the stakes are lower, and the consequences of a mistake are usually no worse than wasted GPU hours and a bruised ego. Still, the comparison works because it captures the mindset. You’re not just engineering a system; you’re learning to read subtle signals, notice when something is off, and trust the intuition you’ve built through repetition.

Over time, that intuition grows, even as it stays hard to formalize. You start to recognize the early signs of overfitting, the faint trace of a data leak, the particular way a loss curve feels when the learning rate is wrong. These are more than technical skills. They’re diagnostic habits, formed through trial and error.

That’s why this process is closer to a craft than pure engineering. Engineering gives us the tools: optimizers, architectures, libraries. The practice, however, is diagnostic: learning to read symptoms, trust intuition, and intervene before the condition worsens.

By Pim Meerdink, Cofounder, STL
