Finance Track · Team Briefing

How do we know the AI built a good model?

Before any AI-generated model or memo reaches an Investment Committee, we need a dead-simple way to check it. Here's the whole idea — and why it costs us almost nothing.

Quality first, speed second · Uses deals you've already closed · No new tools to buy

The two ideas, in plain English

Eval & Regression

📋

Eval — "is this any good?"

A check, not a gut feeling

Write down what "good" looks like, keep a few known examples, run the AI on them, and grade the result against the standard.

Like an answer key. You can't grade the homework until you've written down the correct answers.

💧

Regression — "did a fix break something?"

The silent backslide

You tweak a prompt to improve the memo — and now the balance sheet stops balancing. Nobody notices, because everyone was looking at the memo.

Like fixing the sink and flooding the dishwasher. You fixed the leak but didn't check downstream.

Together: the eval is the answer key — and re-running it after every change catches a regression before it reaches the IC.

The part that matters most

Set up the measure before you polish the skill

Building a clever AI skill is the easy, fun part. But without a way to measure it, every "I made it better" is just opinion — you're trusting gut feel, lap after lap.

🎲

Without a measure

"Is it improving?" → a guess

Human judgement, deal by deal. No memory, no proof, no way to catch a quiet regression.

🎯

Measure what matters first

"Is it improving?" → a fact

Decide what "good" means once. After that, every change either moves the number or it doesn't — and the rest improves from there.

So the first thing we build isn't the skill — it's the measure.

The whole loop

It's a loop — and each lap gets better

1

Keep 3 past deals

Closed deals, with the model you stood behind.

2

Let the AI rebuild them

Same inputs in; it regenerates each model + memo.

3

Grade quality first

Balances? Ties out? Right drivers? Fail = rejected.

4

Then measure effort

Only if it passed: how long to make it IC-ready?

↻ Run after every change

↻ Every lap, the AI needs fewer fixes — that's the kit improving.
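For anyone who wants to see the loop's "quality first, effort second" rule written down, here is a minimal Python sketch. The names `RunResult` and `verdict` are hypothetical, and the check names and minutes are made-up illustration data, not output from a real run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    deal: str
    gate_checks: dict[str, bool]   # yes/no checks, e.g. {"balances": True}
    minutes_to_ic_ready: float     # human clean-up time, measured by hand


def verdict(run: RunResult) -> tuple[str, Optional[float]]:
    """Quality first: effort is reported only when every gate check passes."""
    passed = bool(run.gate_checks) and all(run.gate_checks.values())
    if passed:
        return "READY FOR IC", run.minutes_to_ic_ready
    return "BLOCKED", None  # no time credit, however fast the run was


# Illustrative data only
good = RunResult("deal-A", {"balances": True, "ties_out": True}, 38)
bad = RunResult("deal-A", {"balances": False, "ties_out": True}, 12)

print(verdict(good))  # ('READY FOR IC', 38)
print(verdict(bad))   # ('BLOCKED', None)
```

The point of the design is that a fast but wrong run can never look good: its minutes are simply not recorded.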

The good news

Finance makes this easier, not harder

Most teams struggle to define "good." We have three advantages a marketing team would envy.

Advantage 1

There's a right answer

A balance sheet balances or it doesn't. Cash flow ties or it doesn't. Many checks are simply yes / no — fast and certain.
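As a rough illustration of how mechanical a yes/no check can be, the sketch below tests the accounting identity Assets = Liabilities + Equity for each period. The function name, the dictionary layout, the rounding tolerance, and all figures are assumptions made for this example; a real check would read the values from the model file.

```python
def balance_sheet_balances(periods, tolerance=0.01):
    """Return (passed, failures); failures holds (period, gap) for each miss."""
    failures = []
    for p in periods:
        gap = p["assets"] - (p["liabilities"] + p["equity"])
        if abs(gap) > tolerance:  # allow for rounding noise only
            failures.append((p["period"], gap))
    return (not failures), failures


# Made-up figures: FY2 is deliberately out of balance
periods = [
    {"period": "FY1", "assets": 1_000_000, "liabilities": 600_000, "equity": 400_000},
    {"period": "FY2", "assets": 1_204_200, "liabilities": 700_000, "equity": 500_000},
]

ok, failures = balance_sheet_balances(periods)
print(ok, failures)  # False [('FY2', 4200)]
```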

Advantage 2

The test set is free

Our "known good examples" are deals we've already closed, sitting in Drive. Nothing to invent.

Advantage 3

The stakes make it worth it

A wrong number doesn't just embarrass us; it misinforms a decision. The check is the trust gate.

What it looks like in practice

Small enough to fit on one page

The quality gate — what "good" means

✓ Balance sheet balances every period: Assets = Liabilities + Equity

✓ The three statements connect: profit flows to equity; closing cash matches the balance sheet

✓ No hidden "plug" numbers: every figure traces to a real input or formula

✓ Loan schedule is consistent: ending balance, interest & principal agree

★ Right drivers & defensible assumptions: a human scores this 1–5 (the judgment part)

★ Memo flags real risks, not boilerplate: scored 1–5
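Two of the yes/no checks above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a definitive implementation: the function names and row layout are hypothetical, interest is assumed to be the opening balance times a periodic rate, there are no new draws, and the figures are invented.

```python
def cash_ties_out(cash_flow_closing, balance_sheet_cash, tolerance=0.01):
    """Closing cash on the cash flow statement must equal balance sheet cash."""
    return abs(cash_flow_closing - balance_sheet_cash) <= tolerance


def loan_schedule_consistent(rows, tolerance=0.01):
    """Check that roll-forward, interest, and principal agree in every period.

    Assumes interest = opening balance x periodic rate, and no new draws.
    """
    previous_closing = None
    for row in rows:
        if previous_closing is not None and abs(row["opening"] - previous_closing) > tolerance:
            return False  # this period does not start where the last one ended
        if abs(row["interest"] - row["opening"] * row["rate"]) > tolerance:
            return False  # interest does not match balance x rate
        if abs(row["closing"] - (row["opening"] - row["principal"])) > tolerance:
            return False  # closing balance does not reflect the repayment
        previous_closing = row["closing"]
    return True


# Made-up schedule: 5% per period, 200 repaid each period
good_rows = [
    {"opening": 1000, "rate": 0.05, "interest": 50, "principal": 200, "closing": 800},
    {"opening": 800, "rate": 0.05, "interest": 40, "principal": 200, "closing": 600},
]
bad_rows = [dict(good_rows[0]), dict(good_rows[1], closing=650)]

print(cash_ties_out(125_000, 125_000))      # True
print(loan_schedule_consistent(good_rows))  # True
print(loan_schedule_consistent(bad_rows))   # False
```

The two ★ items stay with a human scorer; only the ✓ items are candidates for this kind of automation.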

The "answer key" folder

eval-set/
├── deal-A/            # a closed deal
│   ├── inputs/        # what we fed in
│   ├── model.xlsx     # model we stood behind
│   └── result.md      # 4/4 · 38 min to IC
├── deal-B/
└── deal-C/
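If the folder is ever read by a script, finding the test deals could look roughly like the sketch below. It assumes every deal folder follows the deal-A layout (an `inputs/` folder plus `model.xlsx`), which the folder listing only shows for deal-A; `list_eval_deals` is a hypothetical helper, and the demo builds a throwaway copy of the layout rather than touching real files.

```python
from pathlib import Path
import tempfile


def list_eval_deals(root: Path) -> list[str]:
    """Return deal folder names that contain both inputs/ and model.xlsx."""
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and (d / "inputs").is_dir() and (d / "model.xlsx").is_file()
    )


# Demo on a temporary copy of the layout (empty placeholder files)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "eval-set"
    for name in ("deal-A", "deal-B", "deal-C"):
        (root / name / "inputs").mkdir(parents=True)
        (root / name / "model.xlsx").touch()
    (root / "notes").mkdir()  # not a deal folder, so it is skipped
    deals = list_eval_deals(root)

print(deals)  # ['deal-A', 'deal-B', 'deal-C']
```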

4/4 ✓ checks pass

QUALITY — the gate. Fails any check → rejected, no matter how fast.

38 min

EFFORT — the dial. Counted only on models that passed · ↓ from 55 min

What you actually get back

One run = one clear verdict

RUN REPORT · deal-A · run #4

✅ READY FOR IC

Quality gate · must all pass

✓ Balance sheet balances

✓ Statements tie together

✓ No hidden plug numbers

✓ Loan schedule consistent

4 / 4 passed

Judgment · scored

Drivers chosen 🟢 good

Assumptions "downside a touch soft" 🟡 usable

Memo risk section 🟢 good

38 min to IC-ready · ↓ from 55 min last run

❌ BLOCKED

Gate failed: balance sheet off by $4,200. Rejected — no time credit, however fast it ran.

A regression, caught. The check stopped a broken model before the IC ever saw it.

Watch the loop improve

Each lap: fewer fixes, higher quality

Same deal, four laps. Effort drops, judgment rises — and the gate catches the early break.

Lap 1 · BLOCKED · 🔴 weak · no time credit

Lap 2 · READY · 🟡 usable · 55 min

Lap 3 · READY · 🟡→🟢 · 44 min

Lap 4 · READY · 🟢 good · 38 min

Lap 1 → Lap 4 · blocked → ready · 55 → 38 min · usable → good

How the team adopts it — almost no effort

Three moves. No platform, no project.

1

Drop in 3 closed deals

You already have them. That's the entire setup.

2

Re-run the checklist after any change

Changed a prompt or upgraded the AI? Re-run the 3 deals. A check that used to pass and now fails = a regression, caught early.

3

Log one number, at decision points only

Track minutes-to-IC-ready — after a change or before handing the kit to a colleague. Not constantly.
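The "used to pass, now fails" rule in move 2 is simple enough to sketch in Python. The function name `find_regressions`, the check names, and the results below are all hypothetical illustration data; in practice the two dictionaries would come from the previous and current runs over the three deals.

```python
def find_regressions(previous, current):
    """List (deal, check) pairs that passed in the last run and fail now."""
    regressions = []
    for deal, checks in current.items():
        for name, passed in checks.items():
            passed_before = previous.get(deal, {}).get(name) is True
            if passed_before and not passed:
                regressions.append((deal, name))
    return sorted(regressions)


# Made-up results from two runs
previous = {
    "deal-A": {"balances": True, "ties_out": True},
    "deal-B": {"balances": True, "ties_out": False},
}
current = {
    "deal-A": {"balances": False, "ties_out": True},  # broke after the change
    "deal-B": {"balances": True, "ties_out": True},   # improved
}

regressions = find_regressions(previous, current)
print(regressions)  # [('deal-A', 'balances')]
```

A check that was already failing is not flagged here; it is an open problem, not a regression.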

What we will NOT do: build an eval "platform", write a giant rulebook, or invent fake test deals. Pointing at deals we've already closed is exactly what lets us skip all of that.

The one-sentence version

“Keep your last three deals. Let the AI redo one. First make sure it's right — balances, ties out, sound drivers. Only then ask how much time it saved.”

Finance Track · Draft for discussion
