Finance Track · Team Briefing

How do we know the AI built a good model?

Before any AI-generated model or memo reaches an Investment Committee, we need a dead-simple way to check it. Here's the whole idea — and why it costs us almost nothing.

Quality first, speed second · Uses deals you've already closed · No new tools to buy

The two ideas, in plain English

Eval & Regression

📋

Eval — "is this any good?"

A check, not a gut feeling

Write down what "good" looks like, keep a few known examples, run the AI on them, and grade the result against the standard.

Like an answer key. You can't grade the homework until you've written down the correct answers.
💧

Regression — "did a fix break something?"

The silent backslide

You tweak a prompt to improve the memo — and now the balance sheet stops balancing. Nobody notices, because everyone was looking at the memo.

Like fixing the sink and flooding the dishwasher. You fixed the leak but didn't check downstream.

Together: the eval is the answer key — and re-running it after every change catches a regression before it reaches the IC.

The part that matters most

Set up the measure before you polish the skill

Building a clever AI skill is the easy, fun part. But without a way to measure it, every "I made it better" is just opinion — you're trusting gut feel, lap after lap.

🎲

Without a measure

"Is it improving?" → a guess

Human judgement, deal by deal. No memory, no proof, no way to catch a quiet regression.

🎯

Measure what matters first

"Is it improving?" → a fact

Decide what "good" means once. After that, every change either moves the number or it doesn't — and the rest improves from there.

So the first thing we build isn't the skill — it's the measure.

The whole loop

It's a loop — and each lap gets better

1

Keep 3 past deals

Closed deals, with the model you stood behind.

2

Let the AI rebuild them

Same inputs in; it regenerates each model + memo.

3

Grade quality first

Balances? Ties out? Right drivers? Fail = rejected.

4

Then measure effort

Only if it passed: how long to make it IC-ready?

run after every change

↻ Every lap, the AI needs fewer fixes — that's the kit improving.
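
One lap of the loop can be sketched in a few lines of Python. This is a minimal illustration, not a finished tool: the names `RunResult` and `grade_run` are hypothetical, and it assumes each gate check has already been reduced to pass or fail. The point it shows is the ordering: quality decides the verdict, and minutes are recorded only for a run that passed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    verdict: str                          # "READY FOR IC" or "BLOCKED"
    failed_checks: list[str]              # names of gate checks that failed
    minutes_to_ic_ready: Optional[float]  # None when the gate failed


def grade_run(gate_checks: dict[str, bool], minutes: float) -> RunResult:
    """Quality is the gate; effort is only counted for runs that pass it."""
    failed = [name for name, passed in gate_checks.items() if not passed]
    if failed:
        # No time credit for a rejected model, however fast it ran.
        return RunResult("BLOCKED", failed, None)
    return RunResult("READY FOR IC", [], minutes)


if __name__ == "__main__":
    checks = {
        "Balance sheet balances": True,
        "Statements tie together": True,
        "No hidden plug numbers": True,
        "Loan schedule consistent": True,
    }
    print(grade_run(checks, minutes=38))
```

Returning `None` for the minutes of a blocked run, rather than a number, makes it hard to average a fast-but-wrong run into the effort trend by accident.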

The good news

Finance makes this easier, not harder

Most teams struggle to define "good." We have three advantages a marketing team would envy.

Advantage 1

There's a right answer

A balance sheet balances or it doesn't. Cash flow ties or it doesn't. Many checks are simply yes / no — fast and certain.

Advantage 2

The test set is free

Our "known good examples" are deals we've already closed, sitting in Drive. Nothing to invent.

Advantage 3

The stakes make it worth it

A wrong number doesn't just embarrass us; it misinforms a decision. The check is the trust gate.

What it looks like in practice

Small enough to fit on one page

The quality gate — what "good" means
Balance sheet balances every period: Assets = Liabilities + Equity
The three statements connect: Profit flows to equity; closing cash matches the balance sheet
No hidden "plug" numbers: Every figure traces to a real input or formula
Loan schedule is consistent: Ending balance, interest & principal agree
Right drivers & defensible assumptions: A human scores this 1–5 (the judgement part)
Memo flags real risks, not boilerplate: Scored 1–5
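
The yes / no checks in this list are simple enough to write down as code. The sketch below covers two of them, under stated assumptions: the figures have already been pulled out of the workbook into plain Python values, the dictionary keys (`assets`, `liabilities`, `equity`) and function names are hypothetical, and the one-cent `tolerance` for rounding is an illustrative choice, not a team standard.

```python
def balance_sheet_balances(periods, tolerance=0.01):
    """True when Assets = Liabilities + Equity in every period."""
    return all(
        abs(p["assets"] - (p["liabilities"] + p["equity"])) <= tolerance
        for p in periods
    )


def closing_cash_ties(cash_flow_closing, balance_sheet_cash, tolerance=0.01):
    """True when closing cash on the cash flow matches cash on the balance sheet."""
    return abs(cash_flow_closing - balance_sheet_cash) <= tolerance


if __name__ == "__main__":
    # Hypothetical figures: the second period is off by 4,200.
    periods = [
        {"assets": 1_000_000, "liabilities": 600_000, "equity": 400_000},
        {"assets": 1_104_200, "liabilities": 650_000, "equity": 450_000},
    ]
    print(balance_sheet_balances(periods))       # False: one period is out
    print(closing_cash_ties(250_000, 250_000))   # True
```

The two scored rows (drivers and memo risks) stay with a human reviewer; only the rows with a definite right answer are worth automating this way.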
The "answer key" folder
eval-set/
├── deal-A/            # a closed deal
│   ├── inputs/        # what we fed in
│   ├── model.xlsx     # model we stood behind
│   └── result.md      # 4/4 · 38 min to IC
├── deal-B/
└── deal-C/
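
Before each run it can help to confirm that every deal folder is complete. The following is a minimal sketch that assumes the layout shown above (an `inputs/` folder and a `model.xlsx` per deal); the function name `missing_items` and the `REQUIRED` tuple are hypothetical, and `result.md` is left out because it is written after a run rather than before it.

```python
from pathlib import Path

# Items each deal folder needs before a run, per the folder layout above.
REQUIRED = ("inputs", "model.xlsx")


def missing_items(eval_root):
    """Map each deal folder to the required items it lacks (empty list = complete)."""
    report = {}
    for deal_dir in sorted(Path(eval_root).iterdir()):
        if deal_dir.is_dir():
            report[deal_dir.name] = [
                item for item in REQUIRED if not (deal_dir / item).exists()
            ]
    return report


if __name__ == "__main__":
    root = Path("eval-set")
    if root.is_dir():
        for deal, missing in missing_items(root).items():
            print(deal, "complete" if not missing else f"missing: {missing}")
```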
6/6 ✓ checks pass
QUALITY — the gate. Fails any check → rejected, no matter how fast.
38 min
EFFORT — the dial. Counted only on models that passed · ↓ from 55 min

What you actually get back

One run = one clear verdict

RUN REPORT · deal-A · run #4
✅ READY FOR IC
Quality gate · must all pass
Balance sheet balances
Statements tie together
No hidden plug numbers
Loan schedule consistent
4 / 4 passed
Judgment · scored
Drivers chosen 🟢 good
Assumptions "downside a touch soft" 🟡 usable
Memo risk section 🟢 good
38 min to IC-ready · ↓ from 55 min last run
❌ BLOCKED
Gate failed: balance sheet off by $4,200. Rejected — no time credit, however fast it ran.
A regression, caught. The check stopped a broken model before the IC ever saw it.

Watch the loop improve

Each lap: fewer fixes, higher quality

Same deal, four laps. Effort drops, judgment rises — and the gate catches the early break.

Lap 1 · BLOCKED · 🔴 weak · 55 min
Lap 2 · READY · 🟡 usable · 44 min
Lap 3 · READY · 🟡→🟢 · 38 min
Lap 4 · READY · 🟢 good

Lap 1 → Lap 4 · blocked → ready · 55 → 38 min · weak → good

How the team adopts it — almost no effort

Three moves. No platform, no project.

1

Drop in 3 closed deals

You already have them. That's the entire setup.

2

Re-run the checklist after any change

Changed a prompt or upgraded the AI? Re-run the 3 deals. A check that used to pass and now fails = a regression, caught early.
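
The rule "used to pass and now fails" is a comparison between two sets of check results. A minimal sketch, assuming each run's results are kept as a mapping from check name to pass / fail; the function name `find_regressions` is hypothetical, and treating a check that has disappeared from the new run as a failure is a deliberately cautious assumption.

```python
def find_regressions(previous, current):
    """Checks that passed in the previous run but fail, or are missing, now."""
    return sorted(
        name
        for name, passed in previous.items()
        if passed and not current.get(name, False)
    )


if __name__ == "__main__":
    before = {
        "Balance sheet balances": True,
        "Statements tie together": True,
        "No hidden plug numbers": True,
        "Loan schedule consistent": True,
    }
    # After a prompt tweak, one check that used to pass now fails.
    after = dict(before, **{"Balance sheet balances": False})
    print(find_regressions(before, after))  # ['Balance sheet balances']
```

A check that failed before and still fails is not reported here: it is an open problem, not a regression.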

3

Log one number, at decision points only

Track minutes-to-IC-ready — after a change or before handing the kit to a colleague. Not constantly.

What we will NOT do: build an eval "platform", write a giant rulebook, or invent fake test deals. Pointing at deals we've already closed is exactly what lets us skip all of that.

The one-sentence version

Keep your last three deals. Let the AI redo one. First make sure it's right — balances, ties out, sound drivers. Only then ask how much time it saved.

Finance Track · Draft for discussion
