Method

The Choir Protocol.

A working method for pre-deployment ensemble assurance of multi-agent systems. Early version 0.1 — published openly so design partners can challenge it.

§ 01

Why multi-agent systems need ensemble assurance

Single-agent evaluation measures the output of one model on one task. Multi-agent systems route, escalate, vote, override and re-phrase. Their failure surface is the orchestration, not the language. The thing that breaks is rarely the prose. It's the handoff, the missed entry, the smoothed-over disagreement.

§ 02

Why output quality is not enough

A coherent customer-facing reply tells you almost nothing about whether the ensemble respected its own Score. Output evals are necessary but not sufficient: they reward the voice that talks last, not the voice that should have entered earlier.

§ 03

What false harmony means

False harmony is a smooth, confident, well-phrased ensemble output that conceals unresolved internal disagreement, suppressed dissent, role failure, authority breach, or missed escalation. It is the most expensive failure mode in multi-agent systems because it is invisible at the output layer.

§ 04

Why useful dissonance should be preserved, not erased

Disagreement between voices is signal. A Policy Voice that warns and is ignored is more valuable than a Policy Voice that goes quiet. The Choir Protocol distinguishes dissonance that was used, dissonance that was ignored, and dissonance that was suppressed — and treats suppression as the most serious of the three.

§ 05

How rehearsal works

A rehearsal runs an ensemble against a defined scenario set with explicit role, authority and escalation expectations. It measures per-voice contribution, pairwise dissonance, masking, missed entries, dominant-voice patterns, and the gap between internal state and customer-facing output.

§ 06

What a Choir Receipt proves

A Choir Receipt is the structured artefact of a rehearsal: the score, the scenario set, per-voice findings, dissonance disposition, and a readiness verdict. It is forwardable, auditable, and version-stamped against a published Choir Protocol version.

§ 07

What it does not prove

A Choir Receipt does not prove the system is safe. It does not certify the agents. It does not predict every failure mode in production. It is pre-deployment evidence of how the ensemble behaved under defined scenarios — an assurance aid, not a guarantee.

§ 08

How verification works

Every Choir Receipt is hashed at the moment it is issued. The hash is SHA-256 over the canonical JSON of the receipt payload: object keys sorted recursively, tight separators, and volatile metadata (receipt id, generated_at, the hash itself) excluded from the canonical form. Anyone can recompute the hash from the canonical payload and confirm the receipt has not been altered. Public verification is available at /verify/<receipt-id> and as JSON at /api/public/verify/<receipt-id>.
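The hashing step above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the key names `receipt_id`, `generated_at` and `hash` are assumptions (the text names these fields only in prose), volatile keys are assumed to live at the top level of the payload, and a Python serialiser may differ from the production canonicaliser in edge cases such as number formatting or non-ASCII escaping.

```python
import hashlib
import json

# Assumed key names for the volatile metadata; the text names these fields
# only in prose (receipt id, generated_at, the hash itself).
VOLATILE_KEYS = {"receipt_id", "generated_at", "hash"}


def canonical_json(payload: dict) -> str:
    """Serialise with recursively sorted keys and tight separators."""
    # Assumption: volatile keys appear only at the top level of the payload.
    stable = {k: v for k, v in payload.items() if k not in VOLATILE_KEYS}
    # sort_keys also sorts nested objects; the separators remove all whitespace.
    return json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def receipt_hash(payload: dict) -> str:
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()


def verify_receipt(payload: dict) -> bool:
    """Recompute the hash and compare it with the one stored on the receipt."""
    return payload.get("hash") == receipt_hash(payload)
```

Because the volatile fields are excluded before hashing, re-issuing the same findings under a new receipt id yields the same digest, while any change to the findings themselves does not.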

§ 09

Structural findings vs Observed findings

Structural findings are produced by Rehearsal: deterministic risks inferred from the declared Score, its Voices and the selected scenarios. They tell you what is fragile by design — voices with no escalation triggers, missing authority limits, undeclared evidence requirements, or coverage gaps against high-severity scenarios. Observed findings come from Observed Performance: the events the user asserts actually happened during a run — Missed Entry, Wrong Note, ignored warnings, suppressed dissent, authority breach. The two layers serve different questions: 'is the Score sound on paper?' vs. 'did the choir actually sing the Score?'
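To make the structural layer concrete, here is a hedged Python sketch of a deterministic check over a declared Score. The `Voice` and `Scenario` shapes, their field names and the wording of the findings are all hypothetical; only the four kinds of fragility come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Voice:
    # Hypothetical shape of a declared voice; field names are illustrative.
    name: str
    escalation_triggers: list[str] = field(default_factory=list)
    authority_limits: list[str] = field(default_factory=list)
    evidence_requirements: list[str] = field(default_factory=list)
    covers: set[str] = field(default_factory=set)  # scenario ids this voice handles


@dataclass
class Scenario:
    id: str
    severity: str  # assumed values: "low", "medium", "high"


def structural_findings(voices: list[Voice], scenarios: list[Scenario]) -> list[str]:
    """Report what is fragile by design, using only the declared Score."""
    findings = []
    for voice in voices:
        if not voice.escalation_triggers:
            findings.append(f"{voice.name}: no escalation triggers")
        if not voice.authority_limits:
            findings.append(f"{voice.name}: missing authority limits")
        if not voice.evidence_requirements:
            findings.append(f"{voice.name}: undeclared evidence requirements")
    covered = set().union(*(voice.covers for voice in voices))
    for scenario in scenarios:
        if scenario.severity == "high" and scenario.id not in covered:
            findings.append(f"coverage gap: high-severity scenario {scenario.id}")
    return findings
```

Nothing here looks at what happened in a run. The same declared Score and scenario set always yield the same findings, which is what separates this layer from Observed findings.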

§ 10

Why observed evidence changes scores and verdicts

A Score that looks disciplined on paper but is contradicted in practice deserves a worse verdict than a Score that is merely incomplete. When observed evidence shows a voice breaching its declared authority, a warning that did not reach the final output, or a Missed Entry on a high-severity scenario, the engine surfaces a Declared-vs-Observed Mismatch as a headline finding and adjusts the relevant dimension (role discipline, escalation fidelity, dissonance preservation, handoff reliability, coverage). False harmony risk and the readiness verdict move accordingly, and the Receipt is re-labelled as an Observed Choir Receipt to distinguish it from a Draft.

§ 11

How evidence-aware adjustments are bounded and audited

Every adjustment is a record — { rule_id, dimension, delta, reason, source_event_indices } — appended to the Receipt. Per-dimension deltas are capped at ±25 so a single pasted event cannot swing a score from 90 to 5, and dimensions are clamped to [0, 100]. The Mismatch findings, dominant-voice warnings, coverage gaps and Retune suggestions are all traceable back to the events that produced them. There is no hidden weighting and no model in the loop.
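The bounding rules can be illustrated with a short Python sketch. It reuses the record fields named above; the function name, the example dimension keys and the reading that the ±25 cap applies to the summed delta per dimension (rather than to each record separately) are assumptions.

```python
from dataclasses import dataclass

MAX_DELTA = 25  # per-dimension cap stated in the text


@dataclass(frozen=True)
class Adjustment:
    rule_id: str
    dimension: str
    delta: int
    reason: str
    source_event_indices: tuple[int, ...]


def apply_adjustments(scores: dict[str, int], adjustments: list[Adjustment]) -> dict[str, int]:
    """Return adjusted scores; the input dict is left untouched."""
    totals: dict[str, int] = {}
    for adj in adjustments:
        totals[adj.dimension] = totals.get(adj.dimension, 0) + adj.delta
    result = dict(scores)
    for dimension, total in totals.items():
        # Assumption: the cap applies to the summed delta for the dimension.
        capped = max(-MAX_DELTA, min(MAX_DELTA, total))
        # Clamp to [0, 100]; an undeclared dimension raises KeyError here.
        result[dimension] = max(0, min(100, result[dimension] + capped))
    return result
```

Under these assumptions, two -20 adjustments against a role discipline score of 90 produce 65 rather than 50, and each record still carries the event indices that justify it.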

§ 12

Why observed evidence is user-asserted today

Phase 3 of Agentic Choir accepts observed evidence as a JSON payload supplied by the user. That payload is validated and normalised, but it is not independently verified. We surface it as Observed Performance the user has declared — not as a verified trace. A run can therefore be over-stated or under-stated by whoever writes the JSON, and the Receipt records that the evidence is user-asserted.

§ 13

Why verified trace ingestion is a later phase

Verified trace ingestion — pulling events directly from an orchestration framework (CrewAI, LangGraph, your own runtime) and producing the Observed Performance section without manual JSON — is deliberately deferred. It requires authenticated adapters, an evidence-bundle signing scheme, and per-framework event schemas. Doing it well changes Agentic Choir from 'declared assurance with user-asserted observations' into 'declared assurance with verified observations' — a meaningful jump in claim strength, and the right place to draw a phase boundary.

§ 14

How accounts and ownership work

A Score is a private artefact owned by the account that created it. Earlier in development, possession of an unguessable edit link was the only proof of authority — a 'capability URL' model. That has been replaced by magic-link sign-in: creating a Score now requires an account, and a Score's writes (issuing receipts, adding observed evidence, creating compare links) are gated on the signed-in owner. Legacy Scores held only by an edit link can be claimed once, irreversibly, into an account; after claiming, the edit link stops being a write credential, so a leaked or shared link can no longer issue new receipts. Receipts, verification pages, and compare links remain public on their unguessable slugs — that is the whole point of a shareable artefact — but Scores themselves never are.

§ 15

Verified observed evidence

There are now two ways observed evidence can reach a Score.

Pasting JSON in the UI saves it as user-asserted: the schema is validated and normalised, but Agentic Choir cannot tell whether the run described actually happened. The Receipt clearly labels this.

The second way is HMAC-signed submission to a per-Score public ingest URL. The Score owner creates a signing secret (shown once; the server stores only what it needs to verify) and sends each evidence bundle with X-Agentic-Choir-Signature, X-Agentic-Choir-Timestamp, and X-Agentic-Choir-Secret-Prefix headers. The signature is HMAC-SHA256 over a canonical signing string `${timestamp}.${canonical_payload_json}`. Accepted submissions store verification_status=signature_valid, and the next Receipt that includes them is labelled 'Verified Observed Choir Receipt'.

What this proves: payload integrity at submission time and possession of the Score's signing secret. What it does not prove: that the underlying agent run occurred as described, that the agents are safe, or anything about the model weights. It is a verified-payload-integrity layer, not independent ground truth.

This trust layer is intentionally built before framework adapters (CrewAI, LangGraph, OpenAI). When those adapters arrive, they will produce the same signed evidence bundle format, so the verification semantics on Receipts do not change; only the convenience of bundle creation does.
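A minimal Python sketch of the signing step follows, under stated assumptions: the signature is hex-encoded, the timestamp is a Unix-seconds string, the secret prefix is the first eight characters of the secret, and the verifier applies a five-minute freshness window. None of those details are specified above. The verification function simply re-signs with the full secret to show the check; it makes no claim about what the server actually stores.

```python
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> str:
    """Sorted keys, tight separators (stand-in for the canonical form)."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sign_bundle(payload: dict, secret: str, timestamp: str) -> dict[str, str]:
    """Build the three headers for one evidence bundle."""
    signing_string = f"{timestamp}.{canonical_json(payload)}"
    signature = hmac.new(
        secret.encode("utf-8"), signing_string.encode("utf-8"), hashlib.sha256
    ).hexdigest()  # hex encoding is an assumption
    return {
        "X-Agentic-Choir-Signature": signature,
        "X-Agentic-Choir-Timestamp": timestamp,
        "X-Agentic-Choir-Secret-Prefix": secret[:8],  # prefix length is an assumption
    }


def verify_bundle(payload: dict, headers: dict[str, str], secret: str,
                  now: int, max_skew_seconds: int = 300) -> bool:
    """Recompute the signature and compare in constant time."""
    timestamp = headers["X-Agentic-Choir-Timestamp"]
    # The freshness window is an assumption, not a documented rule.
    if abs(now - int(timestamp)) > max_skew_seconds:
        return False
    expected = sign_bundle(payload, secret, timestamp)["X-Agentic-Choir-Signature"]
    return hmac.compare_digest(expected, headers["X-Agentic-Choir-Signature"])
```

A valid signature shows only that the holder of the secret sent exactly these bytes at roughly that time, which matches the integrity-and-possession claim above and nothing stronger.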

§ 16

Trace adapters: how external runs become signed evidence

Adapters are pure functions that convert a framework-specific run trace into the same signed evidence bundle format the ingest endpoint already accepts. They do not call models, do not talk to the network, and do not change Receipt semantics: an adapter is a translator, not an authority.

The first adapter, openai.responses.v1, maps an OpenAI Responses output array conservatively: function/tool/file_search/web_search/computer/code_interpreter calls become tool_call events; refusal content becomes a warning; file_citation and url_citation annotations become evidence_cited; reasoning items are ignored by design; output_text without citations is not an event (a message body alone is not an incident).

The adapter never fabricates Missed Entry, Authority Breach, Unsupported Claim, Dissent or Escalation findings from a trace. Those remain the engine's job, derived structurally from the Score or from explicit events the caller already attests to. Anything the adapter cannot confidently map becomes an adapter_warning the caller can review; nothing is silently dropped and nothing is invented. A legacy Assistants run-step mapper is included separately and labelled as deprecated; new integrations should use the Responses adapter.

The pipeline end-to-end is: run trace → adapter → ObservedEvidenceBundle → canonicalStringify → HMAC-SHA256 with the Score's signing secret → POST to /api/public/evidence/$ingest with the three Agentic Choir headers → Verified Observed Choir Receipt on the next Receipt generation. Because the adapter output is a plain bundle and signing happens after, the existing rate-limit, idempotency and verification guarantees apply unchanged.
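The conservative mapping can be sketched as a pure Python function. This is an illustration only: the function name, the event fields (`voice`, `tool`, `detail`, `source`) and the bundle layout are hypothetical. The item and annotation type strings follow the OpenAI Responses output format; the generic "tool" call mentioned above is left out because its wire name is not given.

```python
TOOL_CALL_TYPES = {
    "function_call",
    "file_search_call",
    "web_search_call",
    "computer_call",
    "code_interpreter_call",
}
CITATION_TYPES = {"file_citation", "url_citation"}


def responses_output_to_evidence(output: list, voice: str) -> dict:
    """Map a Responses-style output array to events plus adapter warnings."""
    events, warnings = [], []
    for index, item in enumerate(output):
        kind = item.get("type") if isinstance(item, dict) else None
        if kind in TOOL_CALL_TYPES:
            events.append({"event_type": "tool_call", "voice": voice,
                           "tool": item.get("name", kind)})
        elif kind == "reasoning":
            continue  # ignored by design
        elif kind == "message":
            for part in item.get("content", []):
                part_type = part.get("type") if isinstance(part, dict) else None
                if part_type == "refusal":
                    events.append({"event_type": "warning", "voice": voice,
                                   "detail": part.get("refusal", "")})
                elif part_type == "output_text":
                    # Plain text yields no event; only citations are evidence.
                    for note in part.get("annotations", []):
                        if isinstance(note, dict) and note.get("type") in CITATION_TYPES:
                            events.append({"event_type": "evidence_cited", "voice": voice,
                                           "source": note.get("url") or note.get("file_id")})
                        else:
                            warnings.append(f"item {index}: unmapped annotation")
                else:
                    warnings.append(f"item {index}: unmapped content part {part_type!r}")
        else:
            warnings.append(f"item {index}: unmapped item type {kind!r}")
    return {"adapter": "openai.responses.v1", "events": events,
            "adapter_warnings": warnings}
```

The function has no branch that emits Missed Entry, Authority Breach or similar findings. Anything it does not recognise lands in `adapter_warnings` for the caller to review.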

§ 17

LangGraph adapter (langgraph.stream.v1)

The second adapter validates that the Agentic Choir adapter contract is genuinely framework-neutral. langgraph.stream.v1 is a pure deterministic mapper over JSON-shaped LangGraph stream/update output: there is no runtime dependency on langchain or langgraph, only a structural subset of the shapes the runtime emits.

Mapping is conservative and structural: a node returning tool_calls (directly or inside a message) becomes a tool_call event; a node whose name matches a guardrail pattern (guard/safety/moderation/policy/filter) AND whose output is flagged/blocked/refused becomes a warning; a node whose name matches a retrieval pattern AND whose output carries documents/citations/sources/context becomes one evidence_cited event per item (capped to keep bundles bounded); an escalation/handoff/interrupt node that fired becomes an escalation; a critic/reviewer node whose output records disagree/objection/rejected becomes a dissent. A reviewer that approves is not dissent. A message body alone is never an incident. Unknown or malformed items become adapter_warnings, never fake events. The adapter refuses to infer Missed Entry, Authority Breach, Unsupported Claim, False Harmony or role failure; those remain engine findings.

Because the bundle output is the same wire format as the OpenAI adapter, signing and submission are unchanged: trace → langGraphStreamToEvidence → optionally mergeBundles for multi-node or multi-segment runs → canonicalStringify → HMAC-SHA256 with the Score's signing secret → POST to /api/public/evidence/$ingest → Verified Observed Choir Receipt. mergeBundles is deterministic, preserves event order, never mutates inputs, and only collapses exact structural duplicates so semantically distinct events are never lost.
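The merge guarantees listed above are easy to state in code. The Python sketch below is a hypothetical stand-in for mergeBundles, assuming a bundle is a dict with `events` and `adapter_warnings` lists and that "exact structural duplicate" means equal after key-order-insensitive serialisation.

```python
import copy
import json


def merge_bundles(*bundles: dict) -> dict:
    """Concatenate bundles in order, dropping exact structural duplicates."""
    seen = set()
    events, warnings = [], []
    for bundle in bundles:
        for event in bundle.get("events", []):
            # Two events are duplicates only if every field matches exactly.
            key = json.dumps(event, sort_keys=True, separators=(",", ":"))
            if key in seen:
                continue
            seen.add(key)
            events.append(copy.deepcopy(event))  # never hand back input objects
        warnings.extend(bundle.get("adapter_warnings", []))
    return {"events": events, "adapter_warnings": warnings}
```

Deduplicating on the full serialised event rather than on a subset of fields is what keeps two tool calls that differ only in one argument from being collapsed into one.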

§ 18

CrewAI adapter (crewai.kickoff.v1)

Three frameworks, one evidence bundle. crewai.kickoff.v1 is a pure deterministic mapper over JSON-shaped CrewAI output: kickoff result, task outputs, event-listener events, tracing records, or step logs. There is no runtime dependency on the crewai package; the input is a structural subset of the shapes CrewAI emits.

Mapping is conservative: a tool_call / tool_use / tool_execution record becomes a tool_call event; a delegation, handoff, manager-allocation or human-in-the-loop event becomes an escalation (including the CrewAI-specific case where delegation surfaces as a tool call named "Delegate work to coworker", which is classified as escalation rather than tool_call so the coordination signal is not lost); a task output with citations/sources/knowledge/references becomes one evidence_cited event per item (capped); an explicit safety/guardrail/refusal/blocked/unsafe/policy event becomes a warning; an explicit critic/reviewer record with disagree, objection or rejected becomes a dissent, while an approved reviewer is not dissent; caller-attested event_type records pass through if they match the existing enum; plain final-answer text alone produces no event.

Voice resolution walks voice_name, agent_name, agent.role / agent.name, role, crew_agent, task.agent and task.name through the score-voice map, case-insensitively. Unknown voices are surfaced in adapter_warnings, not silently dropped. The adapter refuses to infer Missed Entry, Authority Breach, Unsupported Claim, False Harmony or role failure; those remain engine findings derived from the Score and explicit events.

Because the bundle is the same wire format as the OpenAI and LangGraph adapters, signing and submission are unchanged: trace → crewAITraceToEvidence → optionally mergeBundles across OpenAI/LangGraph/CrewAI segments → canonicalStringify → HMAC-SHA256 with the Score's signing secret → POST to /api/public/evidence/$ingest → Verified Observed Choir Receipt.
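Two of the rules above, voice resolution and the delegation special case, can be sketched in Python. The field order comes from the text; the function names, the `tool_name` key and the choice to return the first field that matches the score-voice map are assumptions.

```python
# Lookup order taken from the text; dotted names are nested fields.
VOICE_FIELDS = (
    "voice_name", "agent_name", "agent.role", "agent.name",
    "role", "crew_agent", "task.agent", "task.name",
)
DELEGATION_TOOL = "delegate work to coworker"


def _lookup(record: dict, dotted: str):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    value = record
    for key in dotted.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def resolve_voice(record: dict, voice_map: dict[str, str]):
    """Return the Score voice for a record, or None if no field matches."""
    lowered = {name.lower(): voice for name, voice in voice_map.items()}
    for field_name in VOICE_FIELDS:
        candidate = _lookup(record, field_name)
        if isinstance(candidate, str) and candidate.strip().lower() in lowered:
            return lowered[candidate.strip().lower()]
    return None  # the caller should add an adapter_warning, not drop the record


def classify_tool_record(record: dict) -> str:
    """Treat CrewAI's delegation tool as an escalation, not a tool call."""
    # "tool_name" is an assumed key for the tool's display name.
    name = str(record.get("tool_name", "")).strip().lower()
    return "escalation" if name == DELEGATION_TOOL else "tool_call"
```

Returning None for an unmatched voice, instead of guessing, leaves the decision to surface a warning with the caller, in line with the rule that unknown voices are never silently dropped.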

Important

Agentic Choir does not claim that a rehearsal proves a system is safe.

It provides structured pre-deployment evidence of how a multi-agent workflow behaved under defined scenarios. It is an assurance aid, not a guarantee. The Choir Receipt documents what was rehearsed and what was found — nothing more, nothing less.
