The rise of qualitative measurement

I have a pet peeve that has followed me through every security job I’ve held, and it gets worse the more senior the room. We measure quality with numbers, and the numbers keep lying to us. Not because someone faked them. Because the thing we care about was never a number in the first place, and we pinned it to one anyway because we had no other option.

Jeff Bezos put it better than I can. In his interview with Lex Fridman, he said, “When the data and the anecdotes disagree, the anecdotes are usually right.” His example was Amazon customer service. The dashboard said callers waited under sixty seconds. The complaints said otherwise. So in a meeting he picked up the phone, dialled support, and the room sat in silence for more than ten minutes. His takeaway is the whole argument: it’s usually not that the data is collected wrong, it’s that you’re measuring the wrong thing.

Hold onto that, because security runs on the exact same mistake.

The dashboard lies at the ticket level

Take mean time to respond, the MTTR clock that runs from an alert firing to the incident being handled. The acronym gets stretched across respond, resolve, remediate, and recover, so pin the meaning down before you argue about it. Measure it and you’ve already smuggled in two assumptions: that there’s an ideal time to handle an incident, and that handling two incidents of the same type should take roughly the same effort. Both are wrong, and any analyst who has worked the queue knows it in their bones.

Here’s a phishing example I’ve watched play out more times than I can count. A plain phishing page is a two-minute takedown. You confirm it, you file the abuse report, done. Now make the payload obfuscated JavaScript that only fires for a specific user agent, or a specific country, or refuses to run inside a sandbox. Suddenly you’re interactively coaxing the page into showing you the real phishing domain before you can do anything about it, and that’s an hour, maybe more. Then the takedown itself: sometimes the domain sits with a registrar who’s a partner and it’s gone in five minutes, sometimes it’s parked behind a foreign NIC that takes a week to answer. Every one of those is the same ticket on the board. Same category, same severity field, same SLA clock. One of them is two minutes of work and one is eight hours, and the dashboard cannot tell them apart.

Run that at a hundred tickets a day across a team and ask the honest question. How mature is this SOC at handling phishing? You don’t actually know. You have a gut feel that analyst X is better than analyst Y, and half the time the reason X looks worse on paper is that X is the one who won’t rush the hard tickets closed just to protect the average. The SLA band you picked, the optimistic ten-to-fifteen-minute window, is mostly outliers wearing a trench coat. And that’s among the teams disciplined enough to measure per-rule at all. Most don’t, and it’s tempting to conclude that the teams who skip the measurement are less mature than the ones who do it, which isn’t true either. The whole edifice is built on sand.
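To make the trench coat visible, here is a minimal sketch with made-up handling times. The minutes below are illustrative assumptions, not measurements from any real SOC, and the "within 50% of the mean" band is an arbitrary choice. It shows how one average for a ticket category can land where almost no ticket actually lives.

```python
from statistics import mean, median

# Hypothetical handling times in minutes for one day of "phishing" tickets.
# Most are routine two-minute takedowns; a few needed hours of manual work.
handling_minutes = [2] * 17 + [60, 90, 480]

def summarize(times):
    """Return the mean, the median, and the share of tickets near the mean."""
    avg = mean(times)
    mid = median(times)
    # "Near" is an arbitrary assumption here: within 50% of the mean.
    near = [t for t in times if 0.5 * avg <= t <= 1.5 * avg]
    return avg, mid, len(near) / len(times)

avg, mid, share_near_mean = summarize(handling_minutes)
print(f"mean: {avg:.1f} min, median: {mid} min")
print(f"share of tickets within 50% of the mean: {share_near_mean:.0%}")
```

With these invented numbers the mean comes out at about 33 minutes, the median at 2, and not a single ticket sits anywhere near the mean. The average describes a ticket that doesn’t exist.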

We built the numbers because we couldn’t read everything

Here’s the part I think people miss when they defend metrics. Every one of these numbers is a sum, an average, a minimum, or a maximum of things underneath it. That structure is a confession. It says, out loud, that nobody could ever look at every ticket, every transaction, every case and form a real opinion about the whole, so we collapsed the pile into one figure and agreed to argue about the figure instead.

That was the right call for its time. Reading everything was genuinely impossible. You’d need to double your headcount to govern the system as closely as the system deserved, and no business was going to fund an army of analysts to re-read closed tickets just to ask whether the team is actually any good. So we built maturity models and graphs and KPIs, and we measured ourselves against them, and we mostly forgot that the number was a workaround for a problem we couldn’t afford to solve properly. It was scaffolding. We started treating it as the building.

And once a number is the scorecard, you get the predictable rot. A middle manager whose performance is the number will optimise the number. Not out of malice, usually. They were never handed the authority to define what good looks like for the whole industry, so they accept the industry-standard metric and chase it, even when the metric is a poor proxy for the work. You end up with people quietly grading themselves on something they half-suspect is wrong, because being measurably on-target beats being unmeasurably excellent.

There’s a name for this. Goodhart’s law: when a measure becomes a target, it stops being a good measure. Grade a SOC on MTTR and within a quarter MTTR stops describing how fast incidents actually get handled and starts describing how the team learned to make the clock look good. Tickets get split, or closed early and reopened, or quietly recategorised into a bucket with a kinder SLA. None of that touches the underlying work. All of it moves the number. The metric doesn’t just fail to capture quality, it actively pulls effort away from quality toward whatever makes the dashboard greener.
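The ticket-splitting move is easy to show with arithmetic. This is a sketch under invented numbers, and `split_ticket` is a hypothetical helper, not a feature of any ticketing system: it replaces one long ticket with several equal sub-tickets and leaves the total work untouched.

```python
from statistics import mean

def split_ticket(times, index, parts):
    """Replace one ticket with `parts` equal sub-tickets; total work is unchanged."""
    piece = times[index] / parts
    return times[:index] + [piece] * parts + times[index + 1:]

# Made-up handling times in minutes: three routine tickets and one hard one.
before = [2, 2, 2, 480]
after = split_ticket(before, index=3, parts=4)

print(f"total work: {sum(before)} -> {sum(after):.0f} minutes")
print(f"mean time per ticket: {mean(before):.1f} -> {mean(after):.1f} minutes")
```

The total minutes are identical before and after. Only the per-ticket mean drops, which is exactly the part the dashboard reports.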

Every field has this disease

This isn’t just a security problem. It’s everywhere we substitute a number for judgment.

Recession is the cleanest example. Everyone “knows” a recession is two consecutive quarters of shrinking GDP. It’s the rule every newspaper reaches for, and it has one real virtue: it’s a number you can check in an afternoon.

The body that actually dates US recessions, the NBER, refuses to use it. They define a recession as “a significant decline in economic activity that is spread across the economy and that lasts more than a few months,” and they weigh depth, diffusion, and duration across employment, income, production, and sales. No single threshold. A committee of economists reads the evidence and makes a call.

The cases are where it bites. The 2001 recession never had two straight down quarters, so the popular rule would have missed it outright. Run it the other way and real GDP slipped in two consecutive quarters of 1947, yet the NBER still didn’t call a recession, because employment and industrial production were climbing the whole time. The shorthand would have invented one downturn and overlooked another.
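The popular rule is simple enough to write in one line, which is part of its appeal. A minimal sketch, with the caveat that the growth figures below are invented to show the two shapes and are not real GDP data for 2001, 1947, or any other year:

```python
def two_quarter_rule(growth):
    """Return True if any two consecutive quarters show negative growth."""
    return any(a < 0 and b < 0 for a, b in zip(growth, growth[1:]))

# Invented quarterly growth rates (percent), chosen only to show the two shapes.
alternating_dips = [1.2, -0.8, 0.9, -1.1, 0.7]   # declines never back to back
brief_double_dip = [2.0, -0.3, -0.2, 1.5, 2.1]   # two shallow declines in a row

print(two_quarter_rule(alternating_dips))  # False
print(two_quarter_rule(brief_double_dip))  # True
```

The function sees one series and one threshold. It has no way to ask what employment or production were doing at the same time, which is the whole of what the committee adds.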

That’s the move I keep coming back to. The official scorekeeper for the most consequential number in economics looked at the clean two-quarter rule, decided it didn’t carry the truth, and threw it out in favour of judgment across a pile of evidence. They got there decades ago. The only reason the two-quarter version still runs in every headline is that judgment doesn’t compress into a chart, and nobody outside the committee was ever going to read the underlying series themselves.

Averages pull the same trick, and most headline economic numbers lean on one somewhere. Put a hundred ordinary people in a room and the average net worth is some normal figure. Now a billionaire walks in. The average net worth in that room is suddenly around ten million dollars or more, and not one of the original hundred got a cent richer. The number went up. The reality didn’t move. Inflation and unemployment break in their own ways rather than this one: a price basket that looks nothing like the one you actually buy, a headline jobless rate that doesn’t count the people who gave up looking. Different pathologies, same underlying problem. One figure standing in for a distribution it cannot represent. Which is Bezos and the sixty-second call all over again. The data wasn’t miscollected. It was measuring the wrong thing.
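The room is easy to check with arithmetic. A minimal sketch, assuming a hundred people worth $100,000 each and a newcomer worth exactly $1 billion. Both figures are illustrative, and a richer newcomer pushes the mean higher still.

```python
from statistics import mean, median

# Illustrative figures only: a hundred people worth $100,000 each,
# then one newcomer worth exactly $1 billion.
room = [100_000] * 100
room_with_billionaire = room + [1_000_000_000]

print(f"mean before:   ${mean(room):,.0f}")
print(f"mean after:    ${mean(room_with_billionaire):,.0f}")
print(f"median before: ${median(room):,.0f}")
print(f"median after:  ${median(room_with_billionaire):,.0f}")
```

Under these assumptions the mean goes from $100,000 to $10,000,000 while the median stays at $100,000. The median is the honest one here, but it is still a single figure, and it would hide a different shape just as well.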

What changed is that reading everything is now cheap

This is the part that genuinely excites me, and it’s why I’m writing this now instead of five years ago.

The entire justification for the metric was a resource constraint. No human can read every ticket. That constraint is dissolving. You can now put a model in front of every incident and have it actually read the thing instead of just timestamping it. Feed it the phishing ticket that took eight hours with the artifacts attached (the obfuscated script, the sandbox-evasion checks, the foreign registrar the domain was parked behind) and it can reason about why eight hours was reasonable rather than flagging an SLA breach. Feed it the two-minute one and it can tell you that was routine.

None of that is automatic, and it’s worth being honest about what it costs. The review is only as good as what you put in front of it: the evidence the analyst actually captured, the prompt and rubric you wrote to score against, and the discipline to spot-check the model’s calls against ground truth instead of trusting them wholesale. Done sloppily it’s a confident narrator inventing a story. Done with care it gets close to what a senior analyst would say on review, at a scale no senior analyst could ever cover by hand.

In practice it looks less like a magic verdict and more like a second reviewer who never runs out of hours. You hand the model the ticket and its artifacts, give it a rubric describing what a good phishing response actually involves, and ask it to score the handling and explain the score in writing. You read a sample of those explanations against the calls you’d have made yourself, and when it drifts you fix the rubric, not the analyst. Do that across the whole queue and the question you’re answering quietly changes. You stop asking how long the tickets took and start asking whether they were handled well, which was the thing you wanted to know in the first place.
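The loop described above can be sketched in a few lines. Everything named here is an assumption for illustration: the rubric wording, the `Review` shape, and above all `judge`, which stands in for whatever model call you actually use. The stub at the bottom exists only so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric: the criteria and wording are placeholders, not a standard.
RUBRIC = [
    "Confirmed the page was phishing before acting",
    "Recovered the real phishing domain despite evasion checks",
    "Filed the takedown with the right registrar or abuse contact",
]

@dataclass
class Review:
    ticket_id: str
    score: int          # number of rubric criteria the judge considers met
    explanation: str    # the judge's written reasoning, kept for spot checks

def build_prompt(ticket_text: str) -> str:
    """Assemble the prompt: rubric first, then the ticket and its artifacts."""
    criteria = "\n".join(f"- {c}" for c in RUBRIC)
    return f"Score this ticket against each criterion:\n{criteria}\n\nTicket:\n{ticket_text}"

def review_ticket(ticket_id: str, ticket_text: str,
                  judge: Callable[[str], tuple[int, str]]) -> Review:
    """`judge` stands in for whatever model call you use; it is an assumption."""
    score, explanation = judge(build_prompt(ticket_text))
    # Clamp so a misbehaving judge cannot report a score outside the rubric.
    score = max(0, min(score, len(RUBRIC)))
    return Review(ticket_id, score, explanation)

def stub_judge(prompt: str) -> tuple[int, str]:
    """Placeholder so the sketch runs without a model; replace with a real call."""
    return 3, "All three criteria evidenced in the attached artifacts."

review = review_ticket("T-1", "Obfuscated JS, geo-gated payload, foreign registrar.", stub_judge)
print(review)
```

Keeping the written explanation next to the score is the point of the design. The score is what you aggregate; the explanation is what you read when you spot-check, and the rubric is what you edit when the two drift apart.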

That’s the shift. Qualitative judgment at the scale that used to force us into averages. And it doesn’t only help the people at the top who used to commission the metrics. It hands a lower or middle manager something they’ve never had: the ability to ask “are we actually good at this” and get an answer grounded in the real work, instead of inheriting an industry-norm number they can only ever game. The level of thinking that used to be reserved for whoever designed the KPI is now available to the person doing the job.

Don’t swap a gameable number for a gameable judge

Now the honest part, because I’ve spent the last year benchmarking exactly this kind of AI judgment and I know where it breaks.

An AI judge is not a clean instrument. The research is blunt about it. The ICLR paper Justice or Prejudice? catalogues twelve distinct biases in LLM-as-a-judge setups, among them position bias, verbosity bias, authority bias, and self-enhancement, and shows that the scores can be manipulated across model families. A judge that prefers the longer, more confident-sounding write-up is its own kind of broken metric. If you replace a number people learned to game with a model people learn to game, you haven’t gained anything except a more expensive illusion.
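One cheap probe for verbosity bias is to pad a write-up with filler that adds no evidence and see whether the score moves. This is a sketch of that idea, not the method from the paper. `judge` is again a stand-in for your own model call, and `length_loving_judge` is a deliberately biased fake so the effect is visible.

```python
from typing import Callable

FILLER = " After careful and thorough analysis, this was handled comprehensively."

def verbosity_shift(write_up: str, judge: Callable[[str], float], repeats: int = 5) -> float:
    """Score a write-up, then the same write-up padded with evidence-free filler.

    A positive result means the judge rewarded length alone.
    """
    padded = write_up + FILLER * repeats
    return judge(padded) - judge(write_up)

def length_loving_judge(text: str) -> float:
    """Deliberately biased stand-in: scores by word count, capped at 10."""
    return min(10.0, len(text.split()) / 5)

shift = verbosity_shift("Confirmed phish. Filed abuse report.", length_loving_judge)
print(f"score shift from padding alone: {shift:+.1f}")
```

A judge you can trust on this axis should return a shift close to zero across a sample of tickets. The same swap-and-compare shape works for position bias: present two write-ups in both orders and check whether the preference flips.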

And it’s Goodhart again, one layer up. Tell your analysts the AI reviewer rewards thorough, well-evidenced write-ups, and the sharp ones will learn to produce thorough, confident-sounding write-ups whether or not the work underneath earned it. A judge with a verbosity bias trains people to pad. You moved the target, you didn’t remove it. The defence is the same thing that makes any judge worth trusting: keep the hard oracle wherever one exists, and audit the judge against reality often enough that gaming it costs more than just doing the work.
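Auditing the judge against reality can start very small: hand-grade a sample, compare, and count the disagreements that hurt most. A minimal sketch with a made-up sample of ten tickets. `audit_judge` is a hypothetical helper, and the pass/fail framing is a simplifying assumption.

```python
def audit_judge(samples):
    """samples: list of (judge_pass, human_pass) booleans for hand-graded tickets."""
    agree = sum(j == h for j, h in samples)
    # Cases where the judge passed work a human reviewer failed: the costly kind.
    false_pass = sum(j and not h for j, h in samples)
    return agree / len(samples), false_pass

# Made-up audit sample of ten tickets.
sample = [(True, True)] * 6 + [(False, False)] * 2 + [(True, False)] * 2
agreement, false_passes = audit_judge(sample)
print(f"agreement: {agreement:.0%}, judge passed but human failed: {false_passes}")
```

The agreement rate alone is just another average, so the false passes get their own count. Those are the tickets where padding or confident prose may have bought a score the work didn’t earn, and they are the ones worth reading in full.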

I’ve run into this in my own benchmarking. On one run the automated oracle was quietly throwing false positives, and I only caught it because I went back and graded the payloads by hand. On another, the only reason I trusted the results was that the oracle wasn’t a judge at all: the binary either crashed or it didn’t. No model scoring it, no vibes. That clean ground truth was the thing that made the numbers worth anything.

So the discipline isn’t “judgment replaces measurement.” It’s narrower than that. Where you have a hard oracle (a crash, a captured flag, a failed assertion), keep it and trust it. Where there is no oracle, which is most of real security operations, that’s where AI judgment earns its place, and you validate the judge the way the good vulnerability-research harnesses already do, with adversarial agents arguing both sides before anything counts. And the dashboard stays. You keep MTTR on the wall. You just demote it from verdict to input.
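That ordering (oracle first, judge second, clock as context) fits in one small function. A sketch under assumptions: `Case`, `assess`, and `stub_judge` are hypothetical names, and a real judge would return far more than a boolean.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Case:
    case_id: str
    minutes_to_handle: float
    oracle_result: Optional[bool] = None  # e.g. crash reproduced; None if no oracle

def assess(case: Case, judge: Callable[[Case], bool]) -> dict:
    """Prefer the hard oracle; fall back to the judge only when none exists."""
    if case.oracle_result is not None:
        verdict, source = case.oracle_result, "oracle"
    else:
        verdict, source = judge(case), "judge"
    # Handling time travels alongside the verdict as context, never as the verdict.
    return {"case": case.case_id, "handled_well": verdict,
            "source": source, "minutes": case.minutes_to_handle}

def stub_judge(case: Case) -> bool:
    """Placeholder for a model-backed review; always approves in this sketch."""
    return True

print(assess(Case("fuzz-7", 30, oracle_result=False), stub_judge))
print(assess(Case("phish-12", 480), stub_judge))
```

Recording the `source` next to every verdict matters more than it looks. It lets you audit the judge-sourced verdicts separately and never mistake one for ground truth.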

So keep the dashboard (for now)

I’m not telling anyone to throw out their metrics. Keep them. They’re cheap, they’re fast, and a trend line still catches things a human reading tickets one by one would miss. The change is what you do when the number and the work disagree. For most of my career the number won that argument by default, because checking it against reality meant re-reading everything, and nobody could afford that. We can afford it now.

The metric was always a proxy for a judgment we couldn’t make at scale. That excuse is gone. When the data and the anecdotes disagree, we finally have something that can sit down and read every anecdote. So read them. And when they tell you the dashboard is lying, believe the anecdotes.