๐Ÿ† Accepted to ACL 2026 (Findings)

ACL 2026 ยท Findings Code-Driven Multimodal & Structured-Artifact Generation

PairCoder++: Pair Programming as a Universal Paradigm for Verified
Code-Driven Multimodal and Structured-Artifact Generation

Junhao Chen1 Xiang Li2 Mingjin Chen3 Boran Zhang4 Henghaofan Zhang5 Yibin Xu6 Yuehan Cui7 Fangsheng Weng8 Fei Ma9 Qi Tian9 Ruqi Huang1 Hao Zhao1,10
1Tsinghua University  ·  2Peking University  ·  3The Hong Kong Polytechnic University
4University of Science and Technology of China  ·  5University of Electronic Science and Technology of China
6Tongji University  ·  7Tianjin University  ·  8Independent Researcher
9Guangming Lab  ·  10BAAI
PairCoder improves code generation across artifact domains
PairCoder improves code generation across every artifact domain, not only Python program synthesis. Left: single model (light) versus PairCoder (dark) on all 17 public benchmarks, with the absolute gain in percentage points. Right: cases where the baseline also renders an artifact, so the comparison isolates artifact quality — the baseline drifts from the target while PairCoder matches it closely.
17
public benchmarks
7
models, 3 vendors
0.20→0.78
Blender executability
+10–30
TikZ compile points
2
agents: Driver + Navigator

Abstract

Code is the medium through which large language models generate structured artifacts: charts, scientific figures, vector graphics, CAD models, 3D scenes, and hardware designs are all produced by writing programs. In this regime single-pass inference is brittle, because the compiler, renderer, or simulator that decides whether the artifact exists is invisible to the model. We present PairCoder, which grounds review in the toolchain and realizes it as two-agent pair programming: a Driver agent writes the program, a Navigator agent reviews it against verification evidence (diagnostics, execution results, and renderings of the current artifact beside the target), and the two switch roles when errors persist. Across 17 public benchmarks and seven models from three vendors, PairCoder improves essentially every benchmark whose artifact is verifiable, on full official metric suites rather than execution alone (for example, Blender scene executability 0.20→0.78; TikZ compile rate up 10 to 30 points on every model), at 2.9 to 9.2 times single-model cost (about 7 times overall). The improvements concentrate where the toolchain provides an informative oracle and the baseline leaves headroom, and the method ties or mildly regresses where the oracle is weak, so we frame pair programming as a reliable recipe for code-driven generation of structured data and multimodal content wherever a trustworthy verification oracle exists.

The PairCoder Framework

A minimal two-agent loop that turns code-driven generation into a toolchain-verified dialogue. The Driver writes and revises the program; the Navigator reviews each candidate against concrete verification evidence; control of the keyboard follows the diagnosis.

PairCoder framework: Driver and Navigator collaboration with role switching
Iterative collaboration between the Driver (code generation) and Navigator (review and feedback) agents, with the automatic role-switching mechanism.
1

Driver & Navigator with Self-Mirror prompting

The Driver generates and revises while the Navigator reviews, each under a role-specific system prompt over the shared task and memory, preserving two genuinely different perspectives on the same code.

2

Verification-grounded review

Each candidate is compiled, executed, or rendered by the benchmark's own toolchain. The evidence — diagnostics, test outcomes, and a rendering of the artifact beside the target — is attached to the review. The Navigator must cite a concrete error or accept with [NOERROR], which prevents churn on already-correct programs.

3

Error-triggered role switching

Persistent errors swap the roles so the agent that diagnosed the fault repairs it. An exhausted budget returns the candidate with the best verified quality rather than the last one.

Multimodal Results

Each cell is single model → PairCoder at gpt-5.4 (thinking disabled). PairCoder improves every official metric on all seven multimodal benchmarks — executability, visual similarity, and geometry alike.

BenchmarkMetricSinglePairCoder
DaTikZ (60)valid render ↑0.7170.967
SSIM / CLIP / DINO ↑.363/.566/.530.512/.770/.698
Plot2Code (132)execution rate ↑0.9620.977
SSIM / CLIP ↑.476/.916.495/.920
PandasPlotBench (175)execution rate ↑0.9660.994
SSIM / CLIP ↑.567/.909.610/.944
ChartMimic (60)execution rate ↑0.9170.983
SSIM ↑.533.582
StarVector (60)render rate ↑0.9001.000
SSIM / CLIP / DINO ↑.776/.859/.820.873/.954/.906
GenCAD-Code (60)execution rate ↑0.8670.983
Chamfer (agg.) ↓0.2590.155
3DCodeBench (60)executability ↑0.4330.783
SigLIP-2 / DINO (agg.) ↑.263/.129.842/.512
Chamfer (agg.) ↓4.661.66

Aggregate scores count non-executing generations as worst case. Green marks a PairCoder improvement.

Cost versus benefit per benchmark
Cost vs. benefit at gpt-5.4-mini. Per-benchmark token multiplier against gain on the primary metric. Spend concentrates where it converts — the largest multipliers coincide with the largest gains.
Role switching policy comparison
Role-switching policy. Official metric (top) and PairCoder token cost (bottom) for four policies. Switching on each error is best or tied-best on all four benchmarks at a modest token premium.

Explore All Results

The full main table, interactive: 17 benchmarks × 7 models from 3 vendors, single model versus PairCoder. The default view shows every number at once, color-coded by improvement; switch to a chart to drill into one benchmark or one model.

improves regresses tie n/a

Cross-Model Generality

The same tasks rendered across models from three vendors, baseline versus PairCoder. The headline transfers: domains with a strong toolchain signal improve on nearly every applicable model.

Cross-model Plot2Code
Plot2Code across models.
Cross-model PandasPlotBench
PandasPlotBench across models.
Cross-model ChartMimic
ChartMimic across models.
Cross-model StarVector
StarVector across models.
Cross-model 3DCodeBench
3DCodeBench across models.

BibTeX

PairCoder++ extends our ACL 2026 Findings paper, PairCoder: Pair Programming-Inspired Two-Agent Collaboration for Code Generation. Please cite the published paper:

@inproceedings{chen2026paircoder,
  title     = {PairCoder: Pair Programming-Inspired Two-Agent
               Collaboration for Code Generation},
  author    = {Chen, Junhao and Li, Xiang and Xu, Yibin and Cui, Yuehan and
               Weng, Fangsheng and Zhao, Hao and Ma, Fei and Tian, Qi},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  pages     = {3043--3058},
  year      = {2026}
}