
Grounded in peer-reviewed work on multi-agent reasoning, QPlural runs a panel of vendor-distinct frontier models over the same question — with live web retrieval, blind cross-critique, atomic claim verification, and a synthesis paired with sceptic dissent. Or watch the panel debate a dilemma aloud.
The problem
Every frontier language model was trained on overlapping data, tuned with similar techniques, and optimised for similar benchmarks. Ask the same hard question to any one of them and you get a fluent, assertive answer that often sounds more certain than it has any right to be. Hallucinations, stale facts, blind spots, subtle bias — all of it comes out wearing the same confident voice.
The standard fixes — better prompting, more retrieval, bigger models — reduce mistakes but don’t surface the ones that remain. If the model is wrong, you don’t usually find out until you act on the answer.
The research answer
The last three years of multi-agent debate research — at ICML, ICLR, ACL and EMNLP — have substantially sharpened the picture. The foundational result (Du et al., 2023) [1] showed the basic mechanism: independent models that cross-examine each other’s reasoning catch errors a single model would defend. One model alone will assert a wrong answer confidently; a panel reading each other’s working will often surface the flaw.
Since then the programme has tightened. Heterogeneity — models from different labs, not copies of the same one [2] — matters more than sheer agent count. Handing each agent a different slice of the retrieved evidence beats letting all of them anchor on the same sources [9]. Hiding peer confidence prevents over-confidence cascades [7]. Auditing disagreement points in the transcript recovers correct minority answers that majority voting loses entirely [8]. And factuality is best assessed atomically — long-form answers are tangles of supported and unsupported sub-claims, and the sub-claims must be checked against the cited evidence one by one [6].
QPlural implements these findings together — and adds one more. A recent preprint [5] names this vendor-distinct design “architectural heterogeneity” and argues it is what prevents consensus collapse, the failure mode where a panel of models from the same lab confidently converges on the same wrong answer because they inherited the same biases in training.
What we do
When you ask QPlural a hard question, a small preflight sets the research contract; the evidence ledger is built from the URLs you named plus targeted live retrieval; vendor-distinct frontier models answer in parallel; they critique each other anonymously; the critique surfaces gaps that drive a second round of targeted research and revision; every factual claim in every revised brief is decomposed and graded against its cited sources; a primary synthesiser writes the answer and an independent sceptic writes a dissent; and a final citation audit gates what reaches you. Every stage is visible in the UI — you can audit any of it — but what you read is the synthesis, not a transcript to reconcile yourself.
Set the research contract
A small preflight model reads your question and configures the rest of the pipeline — what kind of question this is, which source types matter, how strict the citation gate should be, and any URLs you named that should be fetched directly. It does not answer the question; it sets the rules.
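As a rough sketch, the research contract the preflight emits can be thought of as a small typed config. The field names and values below are illustrative assumptions, not QPlural’s actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchContract:
    """Illustrative shape of the preflight output (hypothetical field names)."""
    question_type: str                              # e.g. "empirical", "forecast"
    source_types: tuple = ("primary", "news", "academic")
    citation_gate: str = "strict"                   # how hard the final audit fails claims
    pinned_urls: tuple = ()                         # URLs the user named, fetched verbatim

contract = ResearchContract(
    question_type="empirical",
    citation_gate="strict",
    pinned_urls=("https://example.org/report",),
)
```

The point of freezing the dataclass is that the contract is set once, before any model answers, and every downstream stage reads the same rules.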
Acquire evidence
If you named specific URLs, those pages are fetched first and added as primary sources. Then live web search runs in three angled framings — neutral, supportive, and challenging — and the results are deduplicated, scored for authority and recency, and assembled into a controlled evidence ledger that every downstream stage cites by source ID.
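A minimal sketch of that assembly step, assuming each fetched result is a record with a URL and authority/recency scores (the scoring weights here are invented for illustration; dedup keeps the first occurrence, so pinned pages win ties):

```python
def build_ledger(pinned, search_results):
    """Dedupe by URL, score for authority and recency, assign source IDs S1, S2, ..."""
    seen, entries = set(), []
    for rec in pinned + search_results:      # pinned pages are considered first
        if rec["url"] in seen:
            continue                          # drop duplicates from the search framings
        seen.add(rec["url"])
        entries.append(dict(rec, score=0.6 * rec["authority"] + 0.4 * rec["recency"]))
    entries.sort(key=lambda r: -r["score"])   # highest-scoring sources get low IDs
    return {f"S{i + 1}": r for i, r in enumerate(entries)}

ledger = build_ledger(
    pinned=[{"url": "https://example.org/primary", "authority": 0.9, "recency": 0.5}],
    search_results=[
        {"url": "https://example.com/a", "authority": 0.4, "recency": 0.9},
        {"url": "https://example.org/primary", "authority": 0.9, "recency": 0.5},  # dup
    ],
)
```

Downstream stages then cite `S1`, `S2`, … rather than raw URLs, which is what makes the later entailment checks and citation audit mechanical.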
Independent briefs
Vendor-distinct frontier models answer in parallel against the ledger. Each takes a deterministic lens — empiricist, sceptic, theorist, pragmatist, risk auditor — so the panel covers complementary angles by construction rather than coincidence. No model sees any other model’s answer at this stage.
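“Deterministic lens” here just means the pairing of model to role is fixed, not sampled. A sketch under assumed model names (the lens list is the one above):

```python
LENSES = ["empiricist", "sceptic", "theorist", "pragmatist", "risk auditor"]

def assign_lenses(models):
    """Pair each vendor-distinct model with a fixed lens, in a stable order."""
    if len(models) > len(LENSES):
        raise ValueError("more models than lenses")
    return dict(zip(sorted(models), LENSES))      # sorted() makes the pairing repeatable

panel = assign_lenses(["model-a", "model-b", "model-c"])
```

Because the assignment is a pure function of the model list, the same panel always covers the same angles — by construction, as the text says, rather than coincidence.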
Blind cross-critique
Each model is then shown the other models’ briefs with their authors hidden and labels shuffled. They write structured attacks: unsupported claims, missing primary sources, citation mismatches, the strongest counterargument. Anonymising the briefs is what keeps the critique honest — the model can’t defer to a “famous” lab.
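The anonymisation step is essentially a seeded shuffle of author labels. A sketch (the “Brief A/B” label format is an assumption; the key mapping stays server-side so critiques can later be routed back to their targets):

```python
import random

def anonymise(briefs, seed=0):
    """Replace author names with shuffled neutral labels; keep the key private."""
    authors = list(briefs)
    random.Random(seed).shuffle(authors)          # seeded, so the run is reproducible
    labelled = {f"Brief {chr(65 + i)}": briefs[a] for i, a in enumerate(authors)}
    key = {f"Brief {chr(65 + i)}": a for i, a in enumerate(authors)}
    return labelled, key

labelled, key = anonymise({"vendor-x": "text1", "vendor-y": "text2"}, seed=7)
```

What the critics see is `labelled` only; no author name survives into the prompt, which is what prevents deference to a “famous” lab.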
Evidence repair + revision
Critique gaps are clustered into targeted retrieval tasks. The ledger grows. Every model then revises its brief with sight of its peers’ first-round work, the critiques aimed at it, and the augmented evidence. This is the debate literature’s core loop running live: independent proposal, peer review, revise.
Atomic claim verification
Each revised brief is decomposed into atomic factual claims. Two independent verifier models grade entailment for every claim against its cited sources. Deterministic checks run alongside — URL liveness, date sanity, quote match. Each claim ends up with a support grade: high, medium, low, or unsupported.
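One way to fuse the two verifier verdicts and the deterministic checks into a single grade. The verdict vocabulary and thresholds below are assumptions for illustration, not QPlural’s actual rubric:

```python
def support_grade(verdict_a, verdict_b, checks_pass):
    """Combine two independent entailment verdicts ('entailed' / 'neutral' /
    'contradicted') with the deterministic checks (URL liveness, date sanity,
    quote match) into high / medium / low / unsupported."""
    if not checks_pass or "contradicted" in (verdict_a, verdict_b):
        return "unsupported"                       # hard failures dominate
    entailed = [verdict_a, verdict_b].count("entailed")
    return {2: "high", 1: "medium", 0: "low"}[entailed]
```

Requiring both verifiers to agree for a “high” grade is the conservative choice: a claim only counts as well supported when two independent models read the cited source the same way.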
Synthesis + sceptic dissent
Two parallel calls read the verified claims and full transcript. A primary synthesiser writes the user-facing decision memo. A separate sceptic call writes a structured dissent. They are blind to each other — when they disagree materially that is an honest disagreement, and we surface both. Disagreement is information, not a failure mode.
Citation audit
Before the answer reaches you, a final auditor reads the synthesised memo and checks every factual claim against the ledger. Invented citations, weak entailment, internal contradictions, and stale model names are flagged. Claims that don’t pass are removed, downgraded, or kept with an explicit “unsupported judgement” disclaimer.
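The gate itself reduces to a per-claim policy over the support grades from the verification stage. The policy table below is illustrative, not QPlural’s exact one:

```python
def audit(claims):
    """Keep, downgrade, or drop each claim based on its support grade."""
    kept = []
    for claim in claims:
        grade = claim["grade"]
        if grade == "high":
            kept.append(claim)                                   # passes untouched
        elif grade in ("medium", "low"):
            kept.append(dict(claim, note="weakly supported"))    # kept, downgraded
        elif claim.get("judgement"):                             # unsupported but explicit
            kept.append(dict(claim, note="unsupported judgement"))
        # otherwise: unsupported factual claim, removed entirely
    return kept

memo = audit([
    {"text": "A", "grade": "high"},
    {"text": "B", "grade": "low"},
    {"text": "C", "grade": "unsupported"},
    {"text": "D", "grade": "unsupported", "judgement": True},
])
```

The asymmetry is deliberate: an unsupported factual claim is removed, but an unsupported judgement call survives with its disclaimer attached, so the reader always knows which is which.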
Cross-critique is the debate literature’s core loop. Atomic verification turns “sounds supported” into “is supported by the cited source”. Dissent shown alongside the answer turns disagreement from a failure mode into a signal. Every claim in the final answer lands with a citation back into the evidence ledger.
Why it matters
When the panel converges — from different priors, looking at different sources, reading each other’s strongest objections — that is much stronger evidence than a single model’s confident assertion. When it does not converge, the second round of research is aimed precisely where the disagreement lives, and any unresolved disagreement is preserved in the answer rather than smoothed away.
QPlural is for the questions where you’d rather know the panel is uncertain than be told a confident wrong thing.
References
Peer-reviewed here means accepted at ICML / ICLR / ACL / EMNLP / NeurIPS — not “on arXiv.” arXiv preprints that haven’t cleared a conference are labelled emerging.
[1] ICML 2024 · peer-reviewed
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Li, Torralba, Tenenbaum & Mordatch
The foundational result: multiple models reading each other’s reasoning catch errors any single model would defend. Establishes that factuality and reasoning improve when independent models cross-examine each other rather than answer alone.
Read on arXiv
[2] ACL 2024 · peer-reviewed
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
Chen, Saha & Bansal
Shows consensus quality is higher when agents are drawn from different model families rather than repeated instances of the same model, and that a transcript-level judge outperforms majority voting. Underwrites the heterogeneous-panel design.
Read on arXiv
[3] EMNLP 2024 · peer-reviewed
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang et al.
Motivates debate as the corrective for the Degeneration-of-Thought problem that emerges when a single model becomes locked into its initial reasoning path.
Read on arXiv
[4] arXiv 2026 · emerging
Demystifying Multi-Agent Debate
Zhu et al.
Shows performance improves when the initial debate pool is made more diverse and when agents communicate calibrated confidence during revision. Influences the QPlural design choice that lens roles are deterministic and confidence is verified, not asserted.
Read on arXiv
[5] arXiv 2026 · emerging
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
HDE paper
Argues that architectural heterogeneity — models from different labs — prevents “consensus collapse”, where homogeneous panels share the same training biases and confidently converge on the same wrong answer.
Read on arXiv
[6] EMNLP 2023 · peer-reviewed
FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long-Form Generation
Min et al.
Shows that long-form answers which look supported are often a tangle of supported and unsupported atomic facts. Justifies decomposing each brief into atomic claims and verifying entailment claim-by-claim — the basis of QPlural’s Stage 5.
Read on arXiv
[7] arXiv 2025 · emerging
Enhancing Multi-Agent Debate System Performance via Confidence Expression
Wu et al.
Finds that when debating agents see each other’s confidence scores the panel drifts toward over-confidence and loses signal. Informs the QPlural design choice that cross-critique turns on reasoning and disconfirmation conditions, not assertiveness.
Read on arXiv
[8] arXiv 2026 · emerging
Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
AgentAuditor paper
Shows that adjudicating at divergence points — by comparing localised branch evidence — beats both majority vote and generic LLM-as-judge, recovering correct minority answers where voting loses them entirely. Underpins the QPlural design choice to surface dissent alongside the synthesis instead of voting it away.
Read on arXiv
[9] arXiv 2025 · emerging
Retrieval-Augmented Generation with Conflicting Evidence (MADAM-RAG)
Wang, Prasad, Stengel-Eskin & Bansal
Assigns each agent a different subset of the retrieved evidence, then lets them debate. Reports factuality gains of 11–16 percentage points on benchmarks with ambiguous or conflicting documents. Basis for the per-analyst evidence partitioning: agreement reached by analysts reading different sources is much stronger evidence than agreement when everyone read the same article.
Read on arXiv
Independent models from OpenAI, Anthropic, Google DeepMind, xAI and DeepSeek.
Your questions are never shared. Your answers are private to you.