Training Versatile Coding Agents in
Synthetic Environments

Yiqi Zhu¹

Apurva Gandhi²

Graham Neubig²

¹Tsinghua University ²Carnegie Mellon University

Figure 1: Overview. Comparison of our SWE-Playground against previous methods on diverse coding benchmarks. As the results indicate, though our model falls slightly short on SWE-bench Verified (using substantially fewer training trajectories against R2E-Gym and SWE-smith), it outperforms all baselines on the other benchmarks and metrics, demonstrating that our proposed SWE-Playground is capable of training versatile coding agents.

Abstract

Prior works on training software engineering agents have explored utilizing existing resources such as issues on GitHub repositories to construct software engineering tasks and corresponding test suites. These approaches face key limitations: their reliance on static repositories limits flexibility, and their primary focus on issue resolution restricts the learning of versatile skills required for real-world engineering, such as reproducing issues or building libraries.

To overcome these challenges, we introduce SWE-Playground, a pipeline for generating environments and trajectories which supports the training of versatile coding agents. Unlike prior efforts, SWE-Playground synthetically generates projects and tasks from scratch with strong language models and agents, eliminating reliance on external data sources. This allows us to tackle a much wider variety of coding tasks. We demonstrate the effectiveness of this approach on three distinct benchmarks, and results indicate that SWE-Playground produces trajectories with dense training signal, enabling agents to reach comparable performance with significantly fewer trajectories than previous works.

How It Works

Main Data Generation Pipeline

Our fully automated pipeline constructs software engineering environments from scratch.

Figure 2: The Main Generation Pipeline. From proposal to validated code. (1) Project Proposal; (2) Task Decomposition; (3) Repository Setup; (4) Unit Test Generation; (5) Implementation and Mutual Verification.

1. Project Proposal: An LLM generates a comprehensive proposal steered by strict constraints—such as CLI-based interaction, high algorithmic density, and the exclusion of simple CRUD applications—to ensure diversity and complexity.

2. Task Proposal: The project is decomposed hierarchically into phases, modules, and executable tasks. A detailed checklist explicitly specifies requisite unit tests, standard cases, and assertions to serve as a reliable reward signal.

3. Repository Setup: An agent establishes the foundational code structure, delineating necessary files, utilities, function stubs, and Docker environments without implementing core logic, preventing disorganized development.

4. Unit Test Generation: An active agent (OpenHands) generates strict unit tests based on the checklist. It executes its own tests to verify imports and dependencies, ensuring that tests are functionally valid before implementation begins.

5. Functionality Implementation: A model implements the functionality using the generated unit tests as a guide. This forms a mutual verification loop: an issue in either the test or the implementation leads to failure, adding robustness to the process. To prevent reward hacking, the final verification uses the original, unmodified test suite.

Capability-Specific Adaption

We adapt the pipeline to simulate three distinct coding benchmarks, demonstrating extensibility.

Figure 3: Task Adaptation. Adapting the synthetic environment for issue resolution (SWE-bench), issue reproduction (SWT-Bench), and library generation from scratch (Commit-0).

Issue Resolution (SWE-bench): We inject specific bugs into the functional repository using an agent. The existing test suites directly verify the correctness of the fix, where a failing test serves as the success signal for the injection.

Issue Reproduction (SWT-Bench): Beyond injecting a bug, we require the agent to modify the test suites to hide the bug. The training agent must then write a new test script to expose the faulty behavior, effectively reproducing the issue.

Library Generation from Scratch (Commit-0): We simulate building a library from scratch by replacing all function bodies in the generated repository with pass statements. The agent attempts the full task, providing trajectories for pure distillation.

Versatile Performance

We compare our model against Qwen2.5-Coder (Base), SWE-Gym, R2E-Gym, and SWE-smith. While prior methods focus heavily on SWE-bench, our results highlight the importance of versatility.
Values in parentheses represent the delta relative to the base model.

Model	Data Size	SWE-bench Verified		SWT-Bench		Commit-0
Model	Data Size	Resolved %	Empty Patch %	Resolved %	Cov Delta %	Resolved %
7B Models
Qwen2.5-Coder-7B	-	1.8	45.8	0.72	7.55	1.82
SWE-Gym-7B	491	10.6 (+8.8)	33.8 (-12.0)	1.45 (+0.73)	5.19 (-2.36)	2.51 (+0.69)
R2E-Gym-7B	3.3k	19.0 (+17.2)	--	0.72 (+0.0)	2.66 (-4.89)	2.54 (+0.72)
SWE-smith-7B	2.0k	15.2 (+13.4)	--	0.00 (-0.72)	12.30 (+4.75)	2.62 (+0.80)
SWE-Play-mix-7B	704	17.0 (+15.2)	6.4 (-39.4)	3.26 (+2.54)	30.69 (+23.1)	2.95 (+1.13)
32B Models
Qwen2.5-Coder-32B	-	7.0	9.5	9.42	24.81	2.31
SWE-Gym-32B	491	20.6 (+13.6)	13.8 (+4.3)	3.26 (-6.16)	10.34 (-14.5)	2.51 (+0.20)
R2E-Gym-32B	3.3k	34.4 (+27.4)	--	3.26 (-6.16)	15.04 (-9.77)	3.30 (+0.99)
SWE-smith-32B	5.0k	40.2 (+33.2)	--	13.77 (+4.35)	42.15 (+17.3)	2.62 (+0.31)
SWE-Play-mix-32B	704	31.2 (+24.2)	6.8 (-2.7)	18.12 (+8.70)	44.24 (+19.4)	3.64 (+1.33)

Table 1: Main Results. Performance comparison of our SWE-Play-mix against baselines.

Key Findings
                    Strong Performance Across Benchmarks: SWE-Play-mix consistently surpasses base models on diverse coding benchmarks. It secures top performance on 4/5 metrics for 7B models and achieves similar gains for 32B models, despite using a significantly smaller dataset.
                
                    Limited Generalization of Baselines: Agents trained on prior environments (SWE-Gym, R2E-Gym, SWE-smith) often fail to generalize. While they perform competitively on SWE-bench, they show limited or degraded performance on SWT-Bench (Issue Reproduction) and Commit-0 (Library Generation).
                
                    Inconsistent Scaling: 7B and 32B models exhibit distinct behaviors. For instance, while prior methods might slightly help 7B models on SWT-Bench, they actively degrade the performance of 32B models, suggesting that these methods may hurt the capabilities of larger models on complex tasks such as issue reproduction.

Discussions

Impact of Trajectory Composition

To investigate how different trajectory types influence model performance, we conducted an ablation study with two additional 7B models: SWE-Play-general (with 280 general-purpose trajectories) and SWE-Play-swe (with 213 issue resolution trajectories).

Models	SWE-bench Verified		SWT-Bench		Commit-0
Models	Resolved	Empty	Resolved	Cov Delta	Resolved
Qwen2.5-Coder	1.8	45.8	0.72	7.55	1.82
SWE-Play-mix	17.0	6.4	3.26	30.69	2.95
SWE-Play-general	8.6	22.6	2.54	7.55	2.62
SWE-Play-swe	11.0	10.6	0.72	4.80	2.38

Table 2: Ablation Study. Performance of different trajectory compositions.

Results indicate that generalist training effectively develops core coding abilities. We attribute this to the intrinsic design of the SWE-Playground pipeline: models build projects from blank templates (similar to Commit-0) and naturally resolve bugs during implementation. The mixture of all trajectory types (SWE-Play-mix) yields the highest performance, confirming the effectiveness of cross-scenario transferability.

Investigation into Data Efficiency

Our results demonstrate remarkable data efficiency. SWE-Play-mix is only outperformed by R2E-Gym on SWE-bench Verified, despite R2E-Gym using a dataset nearly five times larger.

As shown in Table 3, our trajectories contain 2-3x more tokens and messages on average. We characterize this as higher learning density. The significantly higher proportion of bash execution actions (41.7%) indicates that our agents learn a robust development methodology based on execution and iterative verification, rather than simple code generation.

Metric	SWE-Gym	R2E-Gym	SWE-Play-mix
Total Count	491	3.3k	704
Message Count	40.2	33.2	72.9
Token Count	18,377	13,795	39,121
Lines Edited	17.5	38.6	103.5
Bash Proportion	27.8%	26.7%	41.7%

Table 3: Trajectory Statistics. Comparison of dataset density.

Future Work

Broader Coding Capabilities

Incorporating a broader range of coding capabilities and benchmarks, such as SWE-bench Multimodal and SWE-perf, would further validate the extensibility of our framework.

Reinforcement Learning

Our pipeline creates a scenario where unit test generation and code implementation mutually verify each other. Training an agent to master both skills through RL will allow us to explore self-verifying and self-improving coding agents.

Citation

@misc{zhu2025trainingversatilecodingagents, title={Training Versatile Coding Agents in Synthetic Environments}, author={Yiqi Zhu and Apurva Gandhi and Graham Neubig}, year={2025}, eprint={2512.12216}, archivePrefix={arXiv}, primaryClass={cs.SE}, url={https://arxiv.org/abs/2512.12216}, }

Training Versatile Coding Agents inSynthetic Environments