Meet the Council: Giving AI Agents Real Identities, an ELO Leaderboard, and the Ability to Learn

Blog post #21


An Agent Council leaderboard with ELO bars and a reflection feedback loop.

I’ve been building Questbox for a while now.

The core idea: AI generates high-quality educational products — treasure hunts, quizzes, diplomas — for kids. Multiple AI agents run in a “council” and compete to produce the best content. The winner gets saved. The losers give feedback.

It worked. But the agents were anonymous. Interchangeable. Disposable.

Today I gave them names.


Seven Agents. Three Tiers. One Council.

Up until now, every agent was just a model plus a prompt. Nothing distinguished them except maybe a temperature setting. The council ran, content was scored, and that was that.

Today I built a proper identity system.

Economy tier — fast, cheap, good enough for high volume:

  • Spark ⚡ — sharp and punchy
  • Echo 🎵 — rhythmic and pattern-focused
  • Blink 💨 — fast takes, no fluff

Premium tier — slower, richer, used when quality matters most:

  • Forge 🔥 — strong structure, built to last
  • Sage 🦉 — nuanced, contextual, wise
  • Nova ✨ — creative leaps, unexpected angles

Image tier:

  • Lens 📷 — visual language, product imagery

Each agent lives in the database. Each has an editable system prompt. Each has a tier. Each has an ELO rating that started at 1600.
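As a rough sketch, the agent record described above might look like this. The field names and the `Agent` class itself are illustrative guesses, not the actual Questbox schema:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Hypothetical shape of one council agent as stored in the database."""
    name: str                  # e.g. "Forge"
    tier: str                  # "economy" | "premium" | "image"
    system_prompt: str         # editable in the admin panel
    elo: int = 1600            # every agent starts at 1600
    reflection_notes: list[str] = field(default_factory=list)

# A couple of council members, using the identities above
council = [
    Agent("Spark", "economy", "Be sharp and punchy."),
    Agent("Forge", "premium", "Favor strong, lasting structure."),
]
```

The editable `system_prompt` is the hook everything else hangs on: identity, tier behavior, and (later) injected reflections all flow through it.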


An ELO Leaderboard for Generated Content

The council has always produced a winner. But now that outcome actually means something.

After every council run, the winning agent gains ELO points — calculated pairwise against every agent it beat, with K=32. Lose? You drop. Win consistently? You climb.

The leaderboard lives at /admin/agents. You can see each agent’s current ELO, win rate, tier, and a full activity log of every council run they’ve participated in. Click on an agent and you get their full feedback history — every strength and improvement noted, every time they competed.

ELO only updates when three or more agents compete; smaller runs don’t move the ratings. I didn’t want noise polluting the rankings.
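The pairwise scheme can be sketched in a few lines. This is a minimal illustration of the update described above (winner plays one notional "game" against each agent it beat, K=32, skipping runs with fewer than three agents), not the exact Questbox code:

```python
def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for A against B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_council_elo(ratings: dict, winner: str, losers: list, k: int = 32) -> dict:
    """Pairwise update against every agent the winner beat.
    Runs with fewer than three agents leave ratings untouched."""
    if len(losers) + 1 < 3:
        return ratings
    new = dict(ratings)
    for loser in losers:
        e = expected(ratings[winner], ratings[loser])  # winner's expected score
        new[winner] += k * (1 - e)
        new[loser] -= k * (1 - e)  # loser's expected score is 1 - e, actual is 0
    return {agent: round(r) for agent, r in new.items()}
```

With three fresh agents at 1600, the winner takes 16 points off each opponent and climbs to 1632 in a single run, which is why consistent winners pull away quickly.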


The Part I’m Most Excited About: Agents That Learn

This is the piece that changes everything.

After each council run, the agents in the losing positions give peer feedback to the winner — strengths, improvements, a quality score from 1–100. That feedback used to disappear after the run.

Now it doesn’t.

Every piece of feedback an agent receives gets appended to their reflection_notes — a running log of what other agents have said about their work. The last 10 entries are kept. When that agent runs next time, those notes are injected into their system prompt as “Learnings from previous council runs.”
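The append-and-inject loop is simple enough to sketch. The function names (`record_feedback`, `build_prompt`) and the note format are hypothetical; only the mechanics (keep the last 10 entries, prepend a "Learnings" section to the system prompt) come from the post:

```python
MAX_NOTES = 10  # only the last 10 reflection entries are kept

def record_feedback(agent: dict, strengths: str, improvements: str, score: int) -> None:
    """Append one council run's peer feedback to the agent's reflection notes,
    trimming the log to the most recent MAX_NOTES entries."""
    note = f"score {score}/100; strengths: {strengths}; improve: {improvements}"
    agent["reflection_notes"] = (agent["reflection_notes"] + [note])[-MAX_NOTES:]

def build_prompt(agent: dict) -> str:
    """Inject accumulated reflections into the system prompt for the next run."""
    prompt = agent["system_prompt"]
    if agent["reflection_notes"]:
        bullets = "\n".join(f"- {n}" for n in agent["reflection_notes"])
        prompt += "\n\nLearnings from previous council runs:\n" + bullets
    return prompt
```

The trim matters: without a cap, the reflection log would eventually crowd out the actual task in the context window.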

The agent reads its own history. It knows what it got criticized for. It knows what landed well. And it adjusts.

This isn’t fine-tuning. It’s not retraining weights. It’s something simpler and maybe more interesting: structured self-reflection baked into the context. The agent gets better not because we changed the model — but because we’re giving it memory of its own mistakes.


What This Looks Like in Practice

The admin panel now shows each agent’s detail page. You can see:

  • Their current ELO and win rate
  • Every council run they participated in
  • The feedback they received per run (strengths + improvements from peers)
  • Their aggregated strengths over time
  • The reflection notes they carry into every new run

Forge might have a note that says: “Your structures are strong but transitions feel abrupt — the other agents flagged this three times in a row.” Next time Forge generates content, that context is in the room.


The Bigger Picture

What I’m building here is a system where quality compounds over time.

Not through more compute. Not through better base models (though those matter). But through accumulated context — agents that have seen their own weaknesses and carry that forward.

It’s a small thing. But it mirrors something real: the best people I’ve worked with weren’t the smartest in the room. They were the ones who actually internalized feedback and changed.

I want Questbox’s agents to do the same.


Next up: I want to see what happens after 50–100 council runs. Which agents climb? Which plateau? Does the reflection actually change the content in measurable ways?

I’ll report back when the data is interesting.

— Stefan