Finance Track · Team Briefing
Before any AI-generated model or memo reaches an Investment Committee, we need a dead-simple way to check it. Here's the whole idea — and why it costs us almost nothing.
The two ideas, in plain English
Eval — "is this any good?"
Write down what "good" looks like, keep a few known examples, run the AI on them, and grade the result against the standard.
Regression — "did a fix break something?"
You tweak a prompt to improve the memo — and now the balance sheet stops balancing. Nobody notices, because everyone was looking at the memo.
Together: the eval is the answer key — and re-running it after every change catches a regression before it reaches the IC.
The part that matters most
Building a clever AI skill is the easy, fun part. But without a way to measure it, every "I made it better" is just opinion — you're trusting gut feel, lap after lap.
Without a measure
Human judgement, deal by deal. No memory, no proof, no way to catch a quiet regression.
Measure what matters first
Decide what "good" means once. After that, every change either moves the number or it doesn't — and the rest improves from there.
So the first thing we build isn't the skill — it's the measure.
The whole loop
Closed deals, with the model you stood behind.
Same inputs in; it regenerates each model + memo.
Only if it passed: how long to make it IC-ready?
Balances? Ties out? Right drivers? Fail = rejected.
↻ Every lap, the AI needs fewer fixes — that's the kit improving.
The good news
Most teams struggle to define "good." We have three advantages a marketing team would envy.
A balance sheet balances or it doesn't. Cash flow ties or it doesn't. Many checks are simply yes / no — fast and certain.
Our "known good examples" are deals we've already closed, sitting in Drive. Nothing to invent.
A wrong number doesn't embarrass us — it misinforms a decision. The check is the trust gate.
What it looks like in practice
What you actually get back
Watch the loop improve
Same deal, four laps. Effort drops, judgment rises — and the gate catches the early break.
Lap 1 → Lap 4 · blocked → ready · 55 → 38 min · usable → good
How the team adopts it — almost no effort
You already have them. That's the entire setup.
Changed a prompt or upgraded the AI? Re-run the 3 deals. A check that used to pass and now fails = a regression, caught early.
Track minutes-to-IC-ready — after a change or before handing the kit to a colleague. Not constantly.
The one-sentence version
“Keep your last three deals. Let the AI redo one. First make sure it's right — balances, ties out, sound drivers. Only then ask how much time it saved.”
Finance Track · Draft for discussion