santhosini_K
Product and Topic Expert

Teaching AI applications to Self-Critique

My Experiments with Self-Refining Prompts

Most of us treat a prompt like a one-shot instruction. You ask, the model answers, and… that’s it. But in practice, good work is iterative: we write a draft, critique it, and then refine. Self-Refine applies that same loop to AI — the model acts as writer, critic, and editor in quick succession until the result feels right.

In this post, I’ll share what clicked for me, a simple three-step pattern, and a few mini-experiments you can try (and screenshot) to see the lift for yourself.

The 3-Step Loop (that changed how I prompt)

1) Generate – Ask for the initial output.
2) Feedback – Ask the model to critique its own output with specific, actionable notes.
3) Refine – Tell it to rewrite using only that feedback.
Repeat until it says, “no further refinement.”

This is simple. The magic is in good feedback prompts (“point out tone, missing details, structure” vs. vague “improve it”).
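
To make that concrete, here is the kind of contrast I mean; the exact wording is only illustrative:

```python
# Vague feedback prompt: tends to produce generic praise and hand-waving
vague_critique = "Review this draft and improve it."

# Structured feedback prompt: names the dimensions to check, so the critique becomes actionable
structured_critique = """Review the draft against these dimensions:
1) Tone: is it right for the audience?
2) Missing details: list any facts, steps, or parameters that are absent.
3) Structure: note sections that should be reordered, merged, or split.
Return exactly 3 concrete issues and a numbered fix-it plan."""
```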

How it works (in 10 seconds)

  • Generate: Draft the answer
  • Critique: The model reviews itself (JSON feedback + score)
  • Refine: It rewrites using that feedback
  • Stop: when the score plateaus or hits your bar

Task-specific effectiveness: Self-refinement works better for some task types (code generation, technical documentation) than for others (creative writing), and the right quality measure varies by task category.

Mini-Experiments

Building a Web Search Agent with Self-Improving Prompts
The foundation combines SAP AI Core's Generative AI Hub with LangChain's Google Search wrapper:
```python
# SAP AI Core client setup
# (imports assume the ai-core-sdk, generative-ai-hub-sdk, and LangChain packages)
from ai_core_sdk.ai_core_v2_client import AICoreV2Client
from gen_ai_hub.proxy.langchain.init_models import init_llm
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.utilities import GoogleSearchAPIWrapper

ai_core_client = AICoreV2Client(
    base_url=BASE_URL, auth_url=AUTH_URL,
    client_id=CLIENT_ID, client_secret=CLIENT_SECRET,
)
llm = init_llm(MODEL_NAME, temperature=0.0, max_tokens=512)

# Google Search integration
search = GoogleSearchAPIWrapper(k=5)
google_tool = Tool.from_function(
    name="google_search",
    description="Search Google and return the first results",
    func=search.run
)
agent = initialize_agent(tools=[google_tool], llm=llm, agent=AgentType.OPENAI_FUNCTIONS)
```
This gives us both a base LLM and an agent that can search the web when needed.

The Three-Prompt Pattern

The self-refinement uses three distinct prompts with clear roles.

Generator Prompt: Creates the initial draft without second-guessing
```python
GEN_PROMPT = """You are a careful, precise assistant.
TASK: {task}
Produce the best possible answer. Do not include any self-critique here."""
```
Critic Prompt: Evaluates the draft with structured feedback
```python
CRITIC_PROMPT = """You are a meticulous reviewer. Given the TASK and the DRAFT answer,
1) list EXACTLY 3 concrete issues or improvement opportunities,
2) give a numbered improvement plan,
3) rate current quality from 0–100,
4) decide if refinement is still needed.

TASK: {task}

DRAFT:
{draft}

Respond with a SINGLE JSON object and NOTHING ELSE:
{{"issues": ["...", "...", "..."], "plan": ["step 1", "step 2", "step 3"], "score": 0, "stop": false}}"""
```
Refiner Prompt: Applies the improvements systematically
```python
REFINE_PROMPT = """You are a senior editor. Apply the PLAN below to improve the DRAFT.
Keep all correct content; fix issues; increase clarity, correctness, and usefulness.

TASK: {task}

DRAFT:
{draft}

PLAN:
{plan}

Return ONLY the improved answer (no commentary)."""
```
The key is separating generation from critique to avoid the model sabotaging its own initial work.
The Self-Refine Loop
```python
import math
from typing import Any, Dict, List

def self_refine(
    task: str,
    max_iters: int = 3,
    min_delta: int = 2,
    use_agent_for_initial: bool = False
) -> Dict[str, Any]:
    """
    Self-refine loop: GENERATE -> CRITIQUE -> REFINE until stop or plateau.
    Returns final answer and iteration history.
    """
    history: List[Dict[str, Any]] = []

    # 1) Initial draft
    draft = call_llm_text(GEN_PROMPT.format(task=task), use_agent=use_agent_for_initial)
    history.append({"iteration": 0, "draft": draft, "score": None, "feedback": None})

    best = draft
    best_score = -math.inf

    for i in range(1, max_iters + 1):
        # 2) Critique
        critic_raw = call_llm_text(CRITIC_PROMPT.format(task=task, draft=draft))
        critic = extract_json_block(critic_raw) or {}

        # Safety coerce
        issues = critic.get("issues", [])[:3]
        plan = critic.get("plan", [])
        score = int(critic.get("score", 50))
        stop_flag = bool(critic.get("stop", False))

        history.append({
            "iteration": i,
            "draft": draft,
            "score": score,
            "feedback": {"issues": issues, "plan": plan, "raw": critic_raw}
        })

        # Plateau / stop condition
        if stop_flag and i > 1:
            break
        if best_score != -math.inf and (score - best_score) < min_delta and i > 1:
            # No meaningful improvement
            break

        # Track best
        if score > best_score:
            best = draft
            best_score = score

        # 3) Refine
        plan_str = "\n".join(f"- {p}" for p in plan) if plan else "- Improve clarity and correctness."
        refined = call_llm_text(REFINE_PROMPT.format(task=task, draft=draft, plan=plan_str))
        draft = refined

    # Final scoring pass to attach a last score
    final_critic_raw = call_llm_text(CRITIC_PROMPT.format(task=task, draft=draft))
    final_critic = extract_json_block(final_critic_raw) or {}
    final_score = int(final_critic.get("score", best_score if best_score != -math.inf else 0))

    return {
        "task": task,
        "final": draft,
        "final_score": final_score,
        "history": history,
        "final_critic_raw": final_critic_raw
    }
```
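
The loop leans on a few helpers (call_llm_text, extract_json_block, pretty_print_blocks) that the snippets above don't show. Here is a minimal sketch of what they might look like; it assumes the llm and agent objects from the setup block, and your own versions may differ:

```python
import json
import re

def call_llm_text(prompt: str, use_agent: bool = False) -> str:
    """Route the prompt to either the plain LLM or the web-search agent and return text."""
    if use_agent:
        return agent.run(prompt)
    return llm.invoke(prompt).content

def extract_json_block(text: str):
    """Pull the first {...} block out of the critic's reply and parse it; None if parsing fails."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def pretty_print_blocks(title: str, body: str) -> None:
    """Simple console formatting for the iteration output."""
    print(f"\n=== {title} ===\n{body}")
```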
Usage Example
```python
# runner + example usage
import json

def run_task_self_refine(
    task: str,
    max_iters: int = 3,
    min_delta: int = 2,
    use_agent_for_initial: bool = False,
    verbose: bool = True
) -> Dict[str, Any]:
    result = self_refine(
        task=task,
        max_iters=max_iters,
        min_delta=min_delta,
        use_agent_for_initial=use_agent_for_initial
    )
    if verbose:
        for rec in result["history"]:
            it = rec["iteration"]
            pretty_print_blocks(f"Iteration {it} — Draft", rec["draft"])
            if rec["feedback"]:
                fb = rec["feedback"]
                pretty_print_blocks(f"Iteration {it} — Feedback (score={rec['score']})", json.dumps({
                    "issues": fb["issues"],
                    "plan": fb["plan"]
                }, indent=2))
        pretty_print_blocks(f"FINAL (score={result['final_score']})", result["final"])
    return result


# Example task (swap in anything you like)
search_query = "How to integrate SAP Shopping Assistant with SAP Commerce Cloud?"

result = run_task_self_refine(
    search_query,
    max_iters=3,
    min_delta=2,
    use_agent_for_initial=True  # Use web search for initial draft
)
```
Setting `use_agent_for_initial=True` leverages the Google Search agent for the first draft when current information might be needed.
Using an LLM, I compared the three outputs I got. Here is the result:
| Item | Iteration 0 | Iteration 1 (Feedback) | Iteration 2 | Iteration 2 (Feedback) | FINAL |
|---|---|---|---|---|---|
| UI Navigation | Generic | Ask for detailed paths | Adds named tabs/panels | Request more explicit steps | Clear, step-by-step paths |
| Performance | Separate checklist | Ask to integrate in steps | Still separate | Ask for actionable guidance | Embedded, actionable tips |
| Validation | Brief & vague | Ask to expand per component | Incomplete | Ask for full, concrete checks | 10 concrete, testable checks |
| Pitfalls | Short list | Same | – | Ask for fixes | Pitfalls paired with fixes |
| Completeness | OK | – | Cuts off | Ask to finish | Complete & ready to run |
Key Benefits

  • Quality Improvement: Typical 15–25 point score increases over 2–3 iterations
  • Cost Control: Plateau detection prevents wasteful refinement cycles
  • Flexibility: Can use a plain LLM or the web-search agent based on task needs
  • Transparency: Full iteration history shows the improvement process
Quality Improvement Analysis (the best score I achieved so far)

Score Progress

| Iteration | Score | Improvement |
|---|---|---|
| 0 | 45/100 | Basic initial draft |
| 1 | 68/100 | +23 points |
| 2 | 87/100 | +19 points |

Total Improvement: +42 points (a 93.3% increase). The answer transformed from vague steps to specific, executable instructions.

The simulation shows the typical self-refinement pattern where the biggest jump occurs in iteration 1 (+23 points), with diminishing but still valuable returns in iteration 2 (+19 points).

But here's the reality check: This represents a best-case scenario. During my experiments, I also encountered:

  • Complete refinement failures, where the system produced identical content across iterations despite clear feedback (a small guard for this is sketched below)
  • Content truncation issues, where later iterations became incomplete and cut off mid-sentence
  • Score plateaus, where quality metrics stayed flat (75/100) even after visible improvements
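
One cheap guard against that first failure mode (not part of the loop above, just an add-on you can try) is to compare consecutive drafts and stop early when the refiner barely changed anything:

```python
from difflib import SequenceMatcher

def barely_changed(old: str, new: str, threshold: float = 0.98) -> bool:
    """Treat near-identical consecutive drafts as a refinement failure."""
    return SequenceMatcher(None, old, new).ratio() >= threshold

# Inside the refine step of self_refine:
# refined = call_llm_text(REFINE_PROMPT.format(task=task, draft=draft, plan=plan_str))
# if barely_changed(draft, refined):
#     break  # feedback isn't being applied; keep the current best and stop
```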


The approach works especially well for tasks requiring accuracy and completeness, like coding, technical procedures, troubleshooting guides, and detailed explanations, where initial drafts often miss important details.
 
Self-Improving Prompts: Example (this was the most common behavior)

What surprised me

  • The biggest gains come in the first 1–2 rounds. After that, returns fade.
  • Weak models can struggle to give good feedback; strong instruction-following models do best (I call them the most obedient models :-)).
  • If the feedback is vague, the rewrite is useless. If it is specific along concrete dimensions (tone, structure, missing details), the quality of the output excels.

How to use Self-Refine in a Customer Support Chatbot

Self-Refine is perfect for support bots because real customers are often vague (“my order didn’t arrive,” “can’t login”).

Goals

  • Clarify fast when the query is vague or light on details.
  • Ground answers in your KB/CRM (no hallucinations).
  • Converge in 1–2 refinement rounds to keep latency low.
  • Escalate cleanly when confidence stays low.

 How can we model this?
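
One way to sketch this flow in code is below. The prompt wording, thresholds, and the kb_search helper are illustrative assumptions rather than a production design; the self_refine loop and call_llm_text are the ones from earlier.

```python
# Illustrative only: prompt wording, thresholds, and kb_search are assumptions.
CLARIFY_PROMPT = """The customer wrote: "{query}"
If key details are missing (order number, error message, product), ask ONE short clarifying question.
Otherwise reply with the single word OK."""

SUPPORT_PROMPT = """You are a support agent. Answer the customer using ONLY the knowledge-base
excerpts below. If the excerpts do not cover the question, say so and offer to escalate.
QUESTION: {query}
KB EXCERPTS:
{kb}"""

def support_bot_reply(query: str, escalate_below: int = 60) -> str:
    # 1) Clarify fast when the query is vague
    clarification = call_llm_text(CLARIFY_PROMPT.format(query=query))
    if not clarification.strip().upper().startswith("OK"):
        return clarification  # ask the customer instead of guessing

    # 2) Ground the answer in your KB/CRM (kb_search is a stand-in for your retrieval call)
    kb = "\n".join(kb_search(query, k=3))

    # 3) Keep latency low: at most two refinement rounds
    result = self_refine(SUPPORT_PROMPT.format(query=query, kb=kb), max_iters=2)

    # 4) Escalate cleanly when confidence stays low
    if result["final_score"] < escalate_below:
        return "I want to make sure this gets resolved correctly, so I'm handing you over to a colleague."
    return result["final"]
```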

TL;DR for practitioners

Treat prompts like drafts. Ask the model to critique with structure, then refine under constraints. Screenshot the three steps (Generate → Feedback → Refine) to make the improvement visible. Once you see it, it’s hard to go back to one-shot prompting.

Self-refinement works better for some task types (code generation, technical documentation) than others (creative writing). 

Cost vs. quality tradeoff: Calculate the token cost per quality point gained. For some applications, a single well-crafted prompt may be more cost-effective than multiple iteration refinements.
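For instance, if two extra refinement calls were to consume, say, 6,000 additional tokens and lift the score by 42 points (as in the run above), that works out to roughly 140 tokens per quality point; whether that is worth it depends on your model pricing and traffic volume.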
Bottom line: self-refining prompts make your bot think twice—so customers don’t have to ask twice.