GraceQin
Associate

Large language models (LLMs) for code generation and understanding can be boosted by training on search-like reasoning traces and by self-correction mechanisms. Recent work frames code tasks as iterative decision processes, using fine-tuning or reinforcement learning (RL) to teach models to identify errors and explore alternative solutions. For example, Stream-of-Search (SoS) and Tree-of-Thoughts (ToT) reasoning organize intermediate “thoughts” into a search tree, allowing LMs to branch into different solution paths and backtrack when needed[1][4]. As shown in the ToT schematic design, multiple partial-solution “thoughts” are generated and evaluated, greatly improving performance on tasks that require planning or search. Similarly, code models can be trained on annotated search traces, e.g. execution logs or error-flagged drafts, to internalize debugging strategies.
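To make the branch-and-backtrack idea concrete, here is a minimal, self-contained sketch (not taken from the cited papers): a toy ToT-style depth-first search that scores candidate “thoughts” with a heuristic, explores only the most promising branches, and backtracks from dead ends. The toy problem, the `propose` generator, and the `score` heuristic are all illustrative stand-ins for an LLM's thought proposals and evaluations.

```python
# Toy sketch of Tree-of-Thoughts-style search: expand partial "thoughts",
# score them, explore only promising branches, and backtrack from dead ends.
# The problem is a stand-in: find digits (1-9) that sum to a target.

def propose(state):
    """Candidate next thoughts: append one digit to the partial solution."""
    return [state + [d] for d in range(1, 10)]

def score(state, target):
    """Heuristic value of a partial thought (closer to target is better)."""
    return -abs(target - sum(state))

def tot_search(target, max_depth=4, beam=3):
    """ToT-style DFS: evaluate candidate thoughts, explore the top `beam`,
    and backtrack when a branch is exhausted or overshoots the target."""
    def dfs(state, depth):
        if state and sum(state) == target:
            return state                          # solved
        if depth == max_depth or sum(state) > target:
            return None                           # dead end: backtrack
        ranked = sorted(propose(state), key=lambda s: score(s, target),
                        reverse=True)
        for nxt in ranked[:beam]:                 # branch on best candidates
            found = dfs(nxt, depth + 1)
            if found is not None:
                return found
        return None                               # all branches failed
    return dfs([], 0)

print(tot_search(17))  # prints [9, 8]
```

With a real LM, `propose` would sample continuations and `score` would be the model's (or a critic's) evaluation of each partial thought; the control flow stays the same.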

 

Our Work

Our team, the Research & Innovation team in Palo Alto led by Yaad Oren and Alexander Schaefer, collaborates with Dr. Noah Goodman's CoCoLab at the Stanford HAI institute, exploring this pioneering work on enhancing the foundational reasoning capabilities of LLMs and bringing it to SAP. Recently we had a paper accepted at the NeurIPS 2025 Foundations of Reasoning workshop, titled "Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards" (arxiv link). The abstract follows as a teaser:

Generating high-quality code remains a challenge for Large Language Models (LLMs). For the evolution of reasoning models on this task, reward models are a necessary intermediate step. These models judge outcomes or intermediate steps. Decoder-only transformer models can be turned into reward models by introducing a regression layer and supervised fine-tuning. While it is known that reflection capabilities generally increase with the size of a model, we want to investigate whether state-of-the-art small language models like the Phi-4 family can be turned into usable reward models blending the consideration of process rewards and outcome rewards.
Targeting this goal, we construct a dataset of code samples with correctness labels derived from the APPS coding challenge benchmark. We then train a value-head model to estimate the success probability of intermediate outputs. Our evaluation shows that small LLMs are capable of serving as effective reward models or code evaluation critics, successfully identifying correct solutions among multiple candidates. Using this critic, we achieve over a 20% improvement in the search capability of the most accurate code out of multiple generations.
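The value-head idea from the abstract can be sketched as follows, with the decoder itself stubbed out: a linear regression head plus a sigmoid maps a candidate solution's final hidden state to a success probability, and a best-of-n step keeps the candidate the critic scores highest. All names (`last_hidden_state`, `ValueHead`, `best_of_n`) and the random features are illustrative assumptions, not our actual implementation.

```python
import math
import random

HIDDEN = 16  # toy hidden-state width

def last_hidden_state(code):
    """Stand-in for a decoder's final hidden state for `code` (stubbed)."""
    r = random.Random(code)  # deterministic per input string
    return [r.gauss(0, 1) for _ in range(HIDDEN)]

class ValueHead:
    """Regression head on a frozen decoder: p(success) = sigmoid(w . h + b).
    Training would fit (w, b) on correctness labels; here we show inference."""
    def __init__(self, seed=0):
        r = random.Random(seed)
        self.w = [r.gauss(0, 1) / math.sqrt(HIDDEN) for _ in range(HIDDEN)]
        self.b = 0.0

    def success_prob(self, code):
        h = last_hidden_state(code)
        z = sum(wi * hi for wi, hi in zip(self.w, h)) + self.b
        return 1.0 / (1.0 + math.exp(-z))  # in (0, 1)

def best_of_n(critic, candidates):
    """Best-of-n selection: keep the candidate the critic scores highest."""
    return max(candidates, key=critic.success_prob)

critic = ValueHead()
samples = [
    "def f(x): return x + 1",
    "def f(x): return x - 1",
    "def f(x): return 2 * x",
]
print(best_of_n(critic, samples))  # the critic's preferred candidate
```

In the real setup, the hidden state comes from the fine-tuned Phi-4 model and the head is trained on APPS-derived correctness labels; the selection logic is the part this sketch shows.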

 

Supervised Fine-Tuning on Search Traces

A key idea is to supervise models on entire search trajectories instead of only correct solutions. For example, Qin et al. train a backtracking model on DFS-generated reasoning traces and a direct model on final solutions[2]. They find this trace-based SFT can bias models toward particular strategies: on Countdown puzzles the backtracking model overfits the training search path and underperforms, while on Sudoku it generalizes better. In general, learning from explicit, annotated chains of thought (CoT) can encourage rich reasoning, but may also lock in suboptimal habits (e.g. verbosity or fixed search biases). Qin et al. show two pitfalls: (1) Prescribed search bias – fixed traces can constrain exploration, and (2) Excessive verbosity – writing out every step can discourage internal reasoning[2]. Notably, they observe that reinforcement learning fine-tuning can overcome these issues: a backtracking model fine-tuned with RL discovers new strategies and improves substantially, whereas a direct model improves one-shot accuracy but fares worse under parallel sampling[2].
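A toy version of such DFS trace generation, for a simplified Countdown-style task (three operations, no division), might look like the sketch below: the backtracking model would be trained on the full serialized trace, dead ends included, while the direct model sees only the final solution line. The trace format and helper names are illustrative, not the paper's actual format.

```python
from itertools import combinations

# Toy generator of Countdown-style DFS traces for supervised fine-tuning.
# The backtracking model trains on the full trace (dead ends included);
# the direct model trains only on the final solution line.

def dfs_trace(numbers, target, trace):
    """Depth-first search that serializes every step, including backtracks."""
    if target in numbers:
        trace.append(f"solution: reached {target}")
        return True
    if len(numbers) == 1:
        return False
    for a, b in combinations(numbers, 2):
        rest = list(numbers)
        rest.remove(a)
        rest.remove(b)
        for label, val in ((f"{a}+{b}", a + b),
                           (f"{a}*{b}", a * b),
                           (f"{max(a, b)}-{min(a, b)}", max(a, b) - min(a, b))):
            trace.append(f"try {label}={val}, remaining {rest}")
            if dfs_trace(rest + [val], target, trace):
                return True
            trace.append(f"backtrack from {label}={val}")
    return False

trace = []
solved = dfs_trace([3, 5, 2], 13, trace)
full_trace = "\n".join(trace)  # SFT text for the backtracking model
final_only = trace[-1]         # SFT text for the direct model
print(full_trace)
```

Because failed branches and explicit `backtrack` markers appear in `full_trace`, a model fine-tuned on it sees when and why search retreats, which is exactly the signal the direct model never receives.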

Other supervised approaches include synthetically augmenting data with error annotations or ‘hints’. For instance, Sahni et al. train models with error-infused examples (marking lines with potential bugs) to improve debugging[3]. In code repair tasks, models have been fine-tuned on (input, buggy code, fix) triplets to internalize common corrections. However, most of these rely on proprietary data or strong verifiers. Self-contained supervised traces for code are still scarce; researchers often generate them on-the-fly (e.g. solving puzzles to get traces) or use execution feedback. In our work, we also used Claude and Phi to generate similar reasoning hints to augment our training data.
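A minimal sketch of how such error-annotated (task, buggy code, fix) training examples could be assembled is shown below; the `# BUG?` hint format and the helper names are hypothetical illustrations, not taken from any of the cited papers.

```python
# Sketch of assembling (task, buggy code, fix) training triplets with an
# inline error hint, in the spirit of error-infused fine-tuning data.
# The "# BUG?" marker format is illustrative, not from a specific paper.

def annotate_buggy_line(buggy_code: str, bug_lineno: int) -> str:
    """Append a hint marker to the line suspected to contain the bug."""
    lines = buggy_code.splitlines()
    lines[bug_lineno - 1] += "  # BUG? off-by-one suspected"
    return "\n".join(lines)

def make_triplet(task, buggy, fixed, bug_lineno):
    """One supervised example: prompt = task + annotated bug, target = fix."""
    prompt = (f"{task}\n\nBuggy code:\n"
              f"{annotate_buggy_line(buggy, bug_lineno)}\n\nFixed code:")
    return {"prompt": prompt, "target": fixed}

example = make_triplet(
    task="Return the sum of the first n positive integers.",
    buggy="def tri(n):\n    return sum(range(n))",
    fixed="def tri(n):\n    return sum(range(n + 1))",
    bug_lineno=2,
)
print(example["prompt"])
```

In practice the hint line would come from a verifier, execution feedback, or (as in our work) a model like Claude or Phi proposing where the bug likely sits.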

 

Reinforcement Learning for Self-Correction

To leverage dynamic feedback, many recent works use reinforcement learning on self-generated data. One line of work (e.g. Stream-of-Search[4], SCoRe[5], CoCoS[6]) explicitly frames code generation as a multi-turn Markov decision process where the model can iteratively revise its code. These methods train the code model end-to-end using rewards that encourage fixing mistakes.

  • Stream-of-Search (Gandhi et al. 2024) proposes a unified search-trace format, demonstrated on the game of Countdown. The stream of search logs the state transitions, actions, and reward scores of each computation step in the game. The highlight is that this scoring always accounts for the impact of different branching choices through backtracking. Pre-training a transformer-based language model on streams of search and then fine-tuning it with policy-improvement methods led to a significant increase in correct solutions.
  • SCoRe (Kumar et al. 2024) introduces a two-stage RL process for code and math. A base LLM first learns to produce correction traces by iterative self-evaluation (often over multiple turns) and then undergoes policy optimization using its own feedback. Simply fine-tuning on offline “correction” traces does not work well, so SCoRe employs online PPO with shaped rewards for partial correctness[5]. They report substantial pass@1 gains on MATH and HumanEval after RL. This shows that small LMs can learn to “self-correct” when trained under their own error distribution.
  • CoCoS (Cho et al. 2025) focuses on small open-source models. They propose an RL fine-tuning objective where each turn’s reward reflects cumulative progress: the model is encouraged to maintain correct outputs and also to improve any incorrect outputs incrementally[6]. Specifically, CoCoS uses an accumulated reward (discounted sum over turns) and a progressive reward for fine-grained improvements, rather than just binary success. In experiments on MBPP and HumanEval with 1B models, CoCoS yields large improvements over baselines. The authors release code and RL data (on GitHub) enabling reproducibility[7].
  • RLEF (Gehring et al. 2024) frames code synthesis as an interactive RL problem. The model repeatedly proposes code, receives execution feedback (unit-test results) as the environment response, and is updated via PPO to maximize the number of passing tests. This end-to-end RL grounds models in test feedback rather than letting them ignore it[8]. RLEF training on the CodeContests benchmark produces state-of-the-art solve rates with 8B and 70B models while drastically reducing the number of samples needed. Critically, RLEF models learn to adapt code based on errors – e.g. using caches or fixing off-by-one errors – without human hints. The gains generalize: after RLEF, models also improve on HumanEval+ and MBPP+ by leveraging feedback.
  • Satori (Shen et al. 2025) combines supervised format-tuning with RL. It first imitates a “Chain-of-Action-Thought” (COAT) format by training on synthetic multi-agent demonstrations, teaching the model to use special tokens like <|reflect|> or <|explore|>[9]. Then it applies RL with a Restart-and-Explore (RAE) strategy: the model is periodically reset to intermediate states (including failed ones) and trained to explore new branches, with rewards for eventual correctness and penalties for repeating mistakes[9][10]. The final 7B Satori model (based on Qwen) achieves SOTA on challenging math reasoning tasks, showing that an LLM can “internalize search”[9]. The full Satori code, model checkpoints, and RL data (including rollouts from RAE) are publicly released for community use.

These RL methods exemplify online self-improvement: the model generates its own traces (code attempts, feedback, or self-reflections), evaluates them via an automatic reward (tests or proxy reward models), and uses that signal to update its policy. In all cases, models receive reward signals across multiple turns – for example, RLEF awards only a final reward but uses intermediate test feedback as observations, while CoCoS and SCoRe explicitly shape rewards at each revision step. This multi-turn RL contrasts with one-off prompting: the model learns that reflecting on errors and branching can yield higher expected reward.
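A stripped-down version of this multi-turn loop can be sketched as follows, with unit-test execution as the environment and a CoCoS-style discounted sum of per-turn pass rates as the episode reward. The “model” here is a stub that fixes its own off-by-one bug on the second turn; a real system would sample an LLM conditioned on the feedback.

```python
# Sketch of the multi-turn self-correction loop: propose code, run unit
# tests as the environment, and accumulate a discounted per-turn reward
# (CoCoS-style shaping). The "model" is a stub, not an actual LLM.

def run_tests(code: str, tests) -> float:
    """Execute candidate code and return the fraction of passing tests."""
    ns = {}
    try:
        exec(code, ns)
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if ns["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

def stub_model(turn: int, feedback: float) -> str:
    """Stand-in for an LLM: a buggy first draft, then a corrected revision."""
    if turn == 0:
        return "def solve(n):\n    return sum(range(n))"       # off by one
    return "def solve(n):\n    return sum(range(n + 1))"        # fixed

def episode(tests, max_turns=3, gamma=0.9):
    """Multi-turn episode; reward is the discounted sum of pass rates."""
    total, history, feedback = 0.0, [], 0.0
    for t in range(max_turns):
        code = stub_model(t, feedback)
        feedback = run_tests(code, tests)       # environment response
        total += (gamma ** t) * feedback        # accumulated, shaped reward
        history.append(feedback)
        if feedback == 1.0:
            break
    return total, history

tests = [((3,), 6), ((5,), 15), ((1,), 1)]
reward, pass_rates = episode(tests)
print(pass_rates)  # prints [0.0, 1.0]
```

The discounting rewards fixing mistakes early, and a per-turn pass rate (rather than a single binary outcome) gives credit for partial progress – the shaping idea behind CoCoS and SCoRe.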

 

Empirical Gains and Benchmarks

Overall, trace-based training and self-correction in the work discussed above yield significant empirical improvements across diverse benchmarks.

However, results also reveal nuances and limitations. The “backtracking vs. direct” study shows that task structure matters: on some problems (Countdown), naive backtracking hurt performance[2]. Many methods still rely on test or verifier oracles (e.g. to score branches), which makes them inapplicable when such oracles don't exist. Also, RL training can be sample-inefficient and sensitive to reward hacking.

 

Discussion and Outlook

Annotating search and correction steps enables deep reasoning in code LMs, but challenges remain. Key gaps include robust reward design, scaling to very large models, and generalizing to tasks without clear oracles. Most existing methods focus on code generation; we hope future work in the community will devote more effort to code understanding (e.g. program analysis, code review, translation, documentation). It is also important to study the trade-offs: extensive backtracking can be quadratically expensive, and some tasks favor direct answering. Moreover, as the “when to backtrack” analysis shows[2], blindly encouraging self-correction is not always beneficial.
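The quadratic-cost remark can be illustrated with a stylized worst case (an assumption for illustration, not a measurement from any paper): if a trace re-emits the partial path each time it backtracks from one of n steps, emitted length grows on the order of n², versus n for a direct answer.

```python
# Stylized token-count illustration of the backtracking cost trade-off:
# re-emitting the prefix after each of n steps gives ~n^2 total tokens,
# versus n tokens for a direct answer.

def direct_tokens(n):
    """Direct answer: emit each of the n solution steps once."""
    return n

def backtracking_tokens(n):
    """Worst case: after step i, backtrack and re-emit the i-step prefix."""
    return sum(2 * i for i in range(1, n + 1))  # step + re-emitted prefix

for n in (10, 100):
    print(n, direct_tokens(n), backtracking_tokens(n))
```

At n = 100 the stylized backtracking trace is over 100x longer than the direct one, which is why “always backtrack” is not a free win.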

Nonetheless, models that learn from their own mistakes via annotated traces or RL often outperform “single-shot” models. By combining supervision on tree-based reasoning traces with RL, systems can achieve autoregressive search capabilities without human prompting[9]. Overall, these advances point toward autonomous code agents that can plan, check, and correct – closing the gap toward human-like programming workflows.

 

References

[1] Tree of Thoughts: Deliberate Problem Solving with Large Language Models

https://ar5iv.labs.arxiv.org/html/2305.10601v2

[2] To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning

https://arxiv.org/html/2504.07052v1

[3] Effective Large Language Model Debugging with Best-first Tree Search

https://arxiv.org/html/2407.19055v1

[4] Stream of Search (SoS): Learning to Search in Language

https://arxiv.org/abs/2404.03683

[5] Training Language Models to Self-Correct via Reinforcement Learning 

https://arxiv.org/pdf/2409.12917

[6] Self-Correcting Code Generation Using Small Language Models

https://arxiv.org/html/2505.23060v1

[7] GitHub - jeonghun3572/CoCoS: This is the official implementation of the paper "Self-Correcting Code Generation Using Small Language Models"

https://github.com/jeonghun3572/CoCoS

[8] RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

https://ar5iv.labs.arxiv.org/html/2410.02089v2

[9] GitHub - satori-reasoning/Satori: [ICML 2025] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

https://github.com/satori-reasoning/Satori

[10] Data × LLM: From Principles to Practices

https://arxiv.org/html/2505.18458v2

[11] Evaluating Large Language Models Trained on Code

https://arxiv.org/abs/2107.03374

[12] Measuring Coding Challenge Competence With APPS

https://arxiv.org/abs/2105.09938

[13] TACO: Topics in Algorithmic COde generation dataset

https://arxiv.org/abs/2312.14852

[14] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

https://arxiv.org/abs/2310.06770

[15] MATH-500

https://huggingface.co/datasets/HuggingFaceH4/MATH-500

[16] Training Verifiers to Solve Math Word Problems

https://arxiv.org/abs/2110.14168

 
