The question is not whether AI can assist with bug bounty hunting. It can. The more interesting question — the one nobody talks about honestly — is how badly it fails first, and what you have to do to make it useful.

I spent three weeks building, running, scoring, and rebuilding a team of AI agents for bug bounty hunting using OpenClaw, a self-hosted agent framework I run locally. This is the account of what the data actually showed.

Why a Team of Agents, Not a Single AI

The first instinct when applying AI to bug bounty hunting is to prompt a capable model and ask it to find vulnerabilities. The problem is not intelligence — modern LLMs reason well. The problem is scope, attention, and accountability.

A single AI agent trying to cover reconnaissance, web application testing, API analysis, infrastructure assessment, and result triage simultaneously does none of them well. It context-switches constantly, loses depth, and hallucinates findings it cannot substantiate because it is trying to be everything at once.

The solution I settled on was specialisation — the same principle that makes a real penetration testing team effective. You do not ask your network specialist to also review source code and write the final report. You separate roles, define handoffs, and create accountability at each step.

This is the architecture I built using OpenClaw. Six agents, one chain of command.

The Architecture: Six Agents, One Chain of Command

The team runs in a strict hierarchy. Nothing flows upstream without going through the coordinator first.

HuntMaster is the coordinator. Every run begins and ends here. It receives scope, allocates tasks to specialists, and is responsible for the final output. No specialist communicates with another directly. No finding reaches me without passing through HuntMaster.

GhostRecon handles reconnaissance exclusively — attack surface mapping, subdomain enumeration, service fingerprinting. It feeds structured data to the other specialists. It does not interpret or assess; it observes and reports.

PixelPoke covers web application testing. Every UI-facing endpoint, form, and authentication flow falls within its scope.

APIAlchemist focuses on APIs — REST, GraphQL, authentication mechanisms, and business logic at the API layer.

NetNinja handles infrastructure and network-level analysis — open ports, misconfigurations, exposure.

TruthSeeker is the triage layer. Every finding from every specialist passes through TruthSeeker before HuntMaster sees it. Its only job: is this real, is it reproducible, is there evidence, does it meet the threshold for reporting?

The team runs 24 hours a day, seven days a week. Reports surface through my primary OpenClaw agent directly to my Telegram. I wake up to findings, not a blank screen.


The First Two Weeks: What the Scores Actually Said

I built a scoring system from day one. Every week, each agent is evaluated on a 1-to-10 scale covering output quality, signal-to-noise ratio, and finding validity. Without measurement, improvement is guesswork.

After two weeks, the scores looked like this:

Agent Performance Scores · Week 2 Review

NetNinja · 8/10
GhostRecon · 7/10
TruthSeeker · 7/10
HuntMaster · 6/10
APIAlchemist · 4/10
PixelPoke · 3/10

Scores evaluate output quality, signal-to-noise ratio, and finding validity across each agent's domain.
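For transparency on how a single number is produced from three dimensions: the scores in the review above were assigned by hand, but the bookkeeping behind them amounts to something like the following sketch (`weekly_score` is a hypothetical helper, not part of any tool).

```python
# Hypothetical helper showing how the weekly 1-to-10 review combines
# its three dimensions. The judgment is manual; only the arithmetic
# is shown here.
def weekly_score(output_quality: float, signal_to_noise: float,
                 finding_validity: float) -> float:
    """Average the three review dimensions into a single 1-10 score."""
    for dim in (output_quality, signal_to_noise, finding_validity):
        if not 1 <= dim <= 10:
            raise ValueError("each dimension is scored on a 1-10 scale")
    return round((output_quality + signal_to_noise + finding_validity) / 3, 1)
```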

The pattern is visible immediately. Infrastructure tasks (NetNinja) and structured data gathering (GhostRecon) suit AI agents well — the output is binary, the scope is narrow, and there is little room for creative hallucination. The more reasoning-intensive the task, the worse the initial scores.

PixelPoke at 3/10 is a failing grade. But more importantly, it was a pipeline problem — not just a PixelPoke problem.

The Triage Problem — and Why It Changed Everything

In my initial architecture, findings from specialists flowed directly to HuntMaster. That meant low-quality output from PixelPoke and APIAlchemist was being absorbed by the coordinator, processed, and occasionally surfaced as plausible-looking reports. HuntMaster’s 6/10 score was partly explained by the noise it was receiving from downstream.

TruthSeeker existed from the beginning, but I had positioned it as optional — an agent that ran only on findings HuntMaster flagged as uncertain. That was the architectural mistake.

I made one change: TruthSeeker became mandatory. Every single finding from every specialist, regardless of the agent’s confidence, now passes through TruthSeeker before HuntMaster sees it. No exceptions.
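The before/after difference is a one-line change in where the triage gate sits. As a sketch — function names here are illustrative, not OpenClaw's API — the mandatory pipeline looks like this:

```python
# Sketch of the mandatory-triage pipeline: triage is a hard gate,
# not a confidence-based option. Names are assumptions for
# illustration, not real OpenClaw calls.
def submit_finding(finding: dict, triage, coordinator) -> bool:
    """Every finding passes through triage before the coordinator
    sees it, regardless of the specialist's confidence."""
    verdict = triage(finding)   # TruthSeeker: real? reproducible? evidenced?
    if not verdict:
        return False            # noise is dropped before HuntMaster
    coordinator(finding)        # only validated findings reach HuntMaster
    return True
```

In the original architecture, the `triage` call only ran when the coordinator flagged a finding as uncertain; moving it in front of the coordinator for every finding is the entire fix.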

Signal Quality Before vs After Mandatory Triage

Before — Optional Triage · 38% Signal / 62% Noise

After — Mandatory Triage · 79% Signal / 21% Noise

Signal = findings with evidence, reproducibility, and reporting threshold met. Noise = unsubstantiated, duplicated, or out-of-scope output eliminated before reaching HuntMaster.

The effect was immediate. HuntMaster stopped processing noise. The coordinator’s decision quality improved because the inputs improved. And TruthSeeker, now handling real volume rather than occasional edge cases, got sharper as its memory accumulated patterns.

The triage layer is not optional infrastructure. It is the most critical component in the entire chain.

What “Improving an AI Agent” Actually Means

People assume that improving an AI agent means retraining the model or switching to a more capable one. That is rarely the lever that matters.

In OpenClaw, each agent has a set of core files: a SOUL.md that defines its identity and operating principles, a MEMORY.md that accumulates learnings from each run, and structured task definitions that govern output format. When PixelPoke scored 3/10, I rewrote its SOUL.md to narrow scope — fewer, higher-confidence findings rather than exhaustive low-confidence enumeration. I updated MEMORY.md with explicit examples of valid output versus noise. I restructured the task format to require evidence fields before a finding could be submitted.
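The "evidence fields before submission" rule is the most mechanical of those changes, so it is worth making concrete. The sketch below is a hypothetical validator — the field names are my assumptions, not PixelPoke's actual task schema:

```python
# Hypothetical check behind "require evidence fields before a finding
# can be submitted". REQUIRED_EVIDENCE is an assumed schema, chosen
# to illustrate the constraint, not the real task format.
REQUIRED_EVIDENCE = ("endpoint", "request", "response_excerpt", "repro_steps")

def accepts(finding: dict) -> bool:
    """Reject any finding that is missing an evidence field or
    leaves one empty. No evidence, no submission."""
    return all(finding.get(field) for field in REQUIRED_EVIDENCE)
```

A constraint like this is cheap to enforce and does most of the disciplining: an agent that cannot produce a request/response pair and reproduction steps simply cannot submit the finding.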

The agent did not get smarter. It got more disciplined.

Disciplined output from a less capable model consistently outperforms undisciplined output from a more capable one. This pattern held across every agent I tuned.

The quality of the constraint determines the quality of the output.

What I Would Do Differently From Day One

Make TruthSeeker mandatory from the first run. Not optional, not conditional. Every finding goes through triage. The pipeline cost is worth it at every scale.

Score weekly, not monthly. Weekly scoring surfaces problems before they compound. An agent drifting in the wrong direction for a month is much harder to correct than one you catch in week one.

Start narrow. The temptation is to give each agent broad scope. Broad scope produces broad, shallow output. Narrow and constrained — even artificially so — produces focused, useful output. Widen it only once the agent performs consistently in the constrained environment.

Write MEMORY.md before the first run, not after. I seeded each agent’s memory reactively — after a bad run, I would add what I had observed. Starting with opinionated memory entries from your own domain knowledge produces better first runs and faster improvement curves.

Where It Goes From Here

The team is running. Three weeks of iteration have moved the weakest performers toward something usable. The triage layer is now producing structured, evidence-backed findings that I can actually act on.

The next phase is improving HuntMaster’s coordination logic — specifically, how it allocates work based on prior specialist performance rather than static role assignment. If a specialist has underperformed on a particular class of problem, HuntMaster should weight future allocations for that class accordingly.
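This is planned work, not something built yet, but the idea can be sketched in a few lines. Everything here is hypothetical — a minimal illustration of score-weighted allocation, assuming per-class score history is available:

```python
# Hypothetical sketch of performance-weighted allocation for
# HuntMaster: pick the specialist with the best recent score for a
# given problem class instead of a static role assignment.
def pick_specialist(scores_by_class: dict[str, dict[str, float]],
                    problem_class: str) -> str:
    """Return the specialist whose recent score for this problem
    class is highest."""
    candidates = scores_by_class[problem_class]
    return max(candidates, key=candidates.get)
```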

Another performance review is scheduled shortly. The target benchmark is 8/10 across the board. We are not there yet.

The Honest Takeaway

AI agents for bug bounty hunting are not a shortcut. They are a multiplier — but only if you build the architecture correctly, measure relentlessly, and treat every low score as a structural problem rather than a model problem.

The triage layer is not optional. The scoring system is not optional. The iteration — rewriting SOUL.md because an agent spent an entire run generating noise — is the actual work.

An AI agent team that runs 24/7 while you sleep is genuinely valuable. Getting it to that point takes the same rigour you would apply to any security tooling: build it, measure it, and fix what the data tells you to fix.