Large language models (LLMs) for code generation and understanding can be boosted by training on search-like reasoning traces and by self-correction mechanisms. Recent work has framed code tasks as iterative decision processes, using fine-tuning or reinforcement learning (RL) to teach models to identify errors and explore alternative solutions. For example, Stream-of-Search (SoS) and Tree-of-Thoughts (ToT) reasoning organize intermediate “thoughts” into a search tree, allowing LMs to branch on different solution paths and backtrack when needed[1][4]. In the ToT scheme, multiple partial-solution “thoughts” are generated and evaluated, greatly improving performance on tasks requiring planning or search. Similarly, code models can be trained on annotated search traces, e.g. execution logs or error-flagged drafts, to internalize debugging strategies.
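To make the branch-and-evaluate loop concrete, here is a minimal Python sketch of ToT-style search. The `propose` and `score` callables stand in for LLM calls; they, and the breadth-first beam formulation, are illustrative assumptions rather than the exact algorithm of [1] or [4].

```python
from typing import Callable

def tree_of_thoughts(
    problem: str,
    propose: Callable[[str], list[str]],  # hypothetical LLM call: candidate next "thoughts"
    score: Callable[[str], float],        # hypothetical evaluator: value of a partial solution
    beam_width: int = 3,
    depth: int = 4,
) -> str:
    """Breadth-first beam search over partial solutions ("thoughts")."""
    frontier = [problem]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(state):  # branch into alternative solution paths
                candidates.append(state + "\n" + thought)
        # Keep only the most promising branches; low-value paths are
        # abandoned, i.e. the search implicitly backtracks away from them.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```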
Our team, the Research & Innovation team in Palo Alto led by Yaad Oren and Alexander Schaefer, collaborates with Dr. Noah Goodman's CoCoLab at Stanford HAI, exploring this pioneering work on enhancing the foundational reasoning capabilities of LLMs and bringing it to SAP. Recently, our paper "Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards" (arxiv link) was accepted at the NeurIPS 2025 Foundations of Reasoning workshop. The abstract follows as a teaser:
Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards.
Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
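As a rough illustration of the value-head construction sketched in the abstract (not our exact training code), the following attaches a scalar regression head to a decoder-only backbone via Hugging Face Transformers; the checkpoint name is a placeholder, and the pooling and loss choices are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ValueHeadRewardModel(nn.Module):
    """Decoder-only LM with a scalar regression head (illustrative sketch).
    The head maps the hidden state of the final token to an estimated
    success probability for the code sample."""

    def __init__(self, base_name: str = "microsoft/phi-2"):  # placeholder checkpoint
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                              # (batch, seq, hidden)
        # Pool at the last non-padded token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        rows = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[rows, last_idx]
        return torch.sigmoid(self.value_head(pooled)).squeeze(-1)

# Supervised fine-tuning against binary correctness labels:
# loss = nn.functional.binary_cross_entropy(model(ids, mask), labels)
```

At inference time, a critic like this can score N sampled candidates and keep the highest-scoring one (best-of-N reranking), in the spirit of selecting the most accurate code out of multiple generations described above.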
A key idea is to supervise models on entire search trajectories instead of only correct solutions. For example, Qin et al. train a backtracking model on DFS-generated reasoning traces and a direct model on final solutions[2]. They find this trace-based SFT can bias models into particular strategies: on CountDown puzzles the backtracking model overfits the training search path and underperforms, while on Sudoku it generalizes better. In general, learning from explicit annotated chains-of-thought (CoT) can encourage rich reasoning, but may also lock in suboptimal habits (e.g. verbosity or fixed search biases). Qin et al. show two pitfalls: (1) Prescribed search bias – fixed traces can constrain exploration, and (2) Excessive verbosity – writing out every step can discourage internal reasoning[2]. Notably, they observe that reinforcement learning fine-tuning can overcome these issues: a backtracking model fine-tuned with RL discovers new strategies and improves substantially, whereas a direct model improves one-shot accuracy but fares worse under parallel sampling[2].
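For intuition, here is a hypothetical sketch of how a DFS run can be linearized into a training trace with explicit backtracking markers; the trace format and the `expand`/`render` helpers are our own illustration, not Qin et al.'s actual data format[2].

```python
def dfs_trace(state, goal, expand, render, trace=None):
    """Linearize a depth-first search into an SFT trace (illustrative).
    `expand` yields child states; `render` pretty-prints a state."""
    if trace is None:
        trace = []
    trace.append(f"try: {render(state)}")
    if state == goal:
        trace.append("solution found")
        return trace, True
    for child in expand(state):
        trace, done = dfs_trace(child, goal, expand, render, trace)
        if done:
            return trace, True
    trace.append(f"backtrack from: {render(state)}")  # explicit backtracking step
    return trace, False

# The backtracking model is trained on the full trace (failed branches
# included); the direct model sees only the final solution path.
```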
Other supervised approaches include synthetically augmenting data with error annotations or ‘hints’. For instance, Sahni et al. train models with error-infused examples (marking lines with potential bugs) to improve debugging[3]. In code repair tasks, models have been fine-tuned on (input, buggy code, fix) triplets to internalize common corrections. However, most of these rely on proprietary data or strong verifiers. Self-contained supervised traces for code are still scarce; researchers often generate them on-the-fly (e.g. solving puzzles to get traces) or use execution feedback. In our work, we also used Claude and Phi to generate similar reasoning hints to augment our training data.
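As a hypothetical example of such an error-annotated record (the field names and the hint marker are our own illustration, not any paper's schema):

```python
# One repair-style training record: the buggy draft carries an inline
# error hint so the model learns to localize the bug before fixing it.
repair_example = {
    "problem": "Return the sum of the even numbers in xs.",
    "buggy": (
        "def sum_even(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        if x % 2 == 1:  # <BUG?> wrong parity check\n"
        "            total += x\n"
        "    return total\n"
    ),
    "fixed": (
        "def sum_even(xs):\n"
        "    total = 0\n"
        "    for x in xs:\n"
        "        if x % 2 == 0:\n"
        "            total += x\n"
        "    return total\n"
    ),
}
# SFT target: given problem + buggy (with hint markers), generate fixed.
```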
To leverage dynamic feedback, many recent works use reinforcement learning on self-generated data. One line of work (e.g. Stream-of-Search[4], SCoRe[5], CoCoS[6]) explicitly frames code generation as a multi-turn Markov decision process where the model can iteratively revise its code. These methods train the code model end-to-end using rewards that encourage fixing mistakes.
These RL methods exemplify online self-improvement: the model generates its own traces (code attempts, feedback, or self-reflections), evaluates them via an automatic reward (tests or a proxy reward model), and uses that signal to update its policy. In all cases, models receive reward signals across multiple turns – for example, RLEF awards only a final reward but uses intermediate test feedback as observations[8], while CoCoS and SCoRe explicitly shape rewards at each revision step. This multi-turn RL contrasts with one-off prompting: the model learns that reflecting on errors and branching can yield higher expected reward.
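Schematically, one episode of such a multi-turn loop might look like the sketch below; `model.generate` and `run_tests` are assumed interfaces, and the reward shaping shown is illustrative rather than any specific paper's scheme.

```python
def rollout_episode(model, problem, tests, max_turns=3):
    """One multi-turn code-revision episode (schematic)."""
    context, transitions = problem, []
    for _ in range(max_turns):
        code = model.generate(context)             # policy action: a code attempt
        passed, feedback = run_tests(code, tests)  # automatic reward signal
        # Shaped per-turn reward: full credit for passing, a small penalty
        # otherwise, so revisions that fix mistakes earn positive advantage.
        reward = 1.0 if passed else -0.1
        transitions.append((context, code, reward))
        if passed:
            break
        context += "\n# Test feedback:\n" + feedback + "\n# Revise:"
    return transitions  # consumed by a policy-gradient update (e.g. PPO)
```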
Overall, trace-based training and self-correction in the works discussed above yield significant empirical improvements across diverse benchmarks.
However, the results also reveal nuances and limitations. The “backtracking vs. direct” study shows that task structure matters: on some problems (CountDown), naive backtracking hurts performance[2]. Many methods still rely on test or verifier oracles (e.g. to score branches), which makes the whole approach break down when such oracles don’t exist. RL training can also be sample-inefficient and sensitive to reward hacking.
Annotating search and correction steps enables deep reasoning in code LMs, but challenges remain. Key gaps include robust reward design, scaling to very large models, and generalizing to tasks without clear oracles. Most existing methods focus on code generation; we hope the community will devote more effort to code understanding (e.g. program analysis, code review, translation, documentation). It is also important to study the trade-offs: extensive backtracking can be quadratically expensive, and some tasks favor direct answering. Moreover, as the “when to backtrack” analysis shows[2], blindly encouraging self-correction is not always beneficial.
Nonetheless, models that learn from their own mistakes via annotated traces or RL often outperform “single-shot” models. By combining supervision on tree-based reasoning traces with RL, systems can achieve autoregressive search capabilities without human prompting[9]. Overall, these advances point toward autonomous code agents that can plan, check, and correct – closing the gap toward human-like programming workflows.
[1] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
https://ar5iv.labs.arxiv.org/html/2305.10601v2
[2] To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning
https://arxiv.org/html/2504.07052v1
[3] Effective Large Language Model Debugging with Best-first Tree Search
https://arxiv.org/html/2407.19055v1
[4] Stream of Search (SoS): Learning to Search in Language
https://arxiv.org/abs/2404.03683
[5] Training Language Models to Self-Correct via Reinforcement Learning
https://arxiv.org/pdf/2409.12917
[6] Self-Correcting Code Generation Using Small Language Models
https://arxiv.org/html/2505.23060v1
[7] CoCoS: Official Implementation of “Self-Correcting Code Generation Using Small Language Models” (GitHub)
https://github.com/jeonghun3572/CoCoS
[8] RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
https://ar5iv.labs.arxiv.org/html/2410.02089v2
[9] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (GitHub, ICML 2025)
https://github.com/satori-reasoning/Satori
[10] Data × LLM: From Principles to Practices
https://arxiv.org/html/2505.18458v2
[11] Evaluating Large Language Models Trained on Code
https://arxiv.org/abs/2107.03374
[12] Measuring Coding Challenge Competence With APPS
https://arxiv.org/abs/2105.09938
[13] TACO: Topics in Algorithmic COde generation dataset
https://arxiv.org/abs/2312.14852
[14] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
https://arxiv.org/abs/2310.06770
[15] MATH-500
https://huggingface.co/datasets/HuggingFaceH4/MATH-500
[16] Training Verifiers to Solve Math Word Problems
https://arxiv.org/abs/2110.14168