Teaching AI applications to Self-Critique
My Experiments with Self-Refining prompts
Most of us treat a prompt like a one-shot instruction. You ask, the model answers, and… that’s it. But in practice, good work is iterative: we write a draft, critique it, and then refine. Self-Refine applies that same loop to AI — the model acts as writer, critic, and editor in quick succession until the result feels right.
In this post, I’ll share what clicked for me, a simple three-step pattern, and a few mini-experiments you can try (and screenshot) to see the lift for yourself.
The 3-Step Loop (that changed how I prompt)
1) Generate – Ask for the initial output.
2) Feedback – Ask the model to critique its own output with specific, actionable notes.
3) Refine – Tell it to rewrite using only that feedback.
Repeat until it says, “no further refinement.”
The loop itself is simple; the magic is in good feedback prompts ("point out tone, missing details, structure" rather than a vague "improve it").
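For example, here is what that difference can look like in practice (illustrative prompt strings, not the exact ones used in the experiments below):

```python
# Vague feedback prompt: gives the model nothing concrete to act on
vague_feedback = "Improve this draft."

# Structured feedback prompt: names the dimensions to critique and demands actionable notes
structured_feedback = (
    "Critique the draft on: (1) tone, (2) missing details, (3) structure. "
    "For each, give one concrete, actionable fix. Do not rewrite the draft yet."
)
```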
How it works (in 10 seconds)
Task-specific effectiveness: Self-refinement works better for some task types (code generation, technical documentation) than for others (creative writing), and the right quality measure varies by task category.
Mini-Experiments
Building a self-improving web search Agent
```python
# Imports (module paths assumed for the SAP generative AI hub SDK and LangChain; adjust to your environment)
import json
import math
from typing import Any, Dict, List

from ai_core_sdk.ai_core_v2_client import AICoreV2Client
from gen_ai_hub.proxy.langchain.init_models import init_llm
from langchain.agents import AgentType, Tool, initialize_agent
from langchain_community.utilities import GoogleSearchAPIWrapper

# SAP AI Core client setup
ai_core_client = AICoreV2Client(
    base_url=BASE_URL, auth_url=AUTH_URL,
    client_id=CLIENT_ID, client_secret=CLIENT_SECRET,
)
llm = init_llm(MODEL_NAME, temperature=0.0, max_tokens=512)

# Google Search integration
search = GoogleSearchAPIWrapper(k=5)
google_tool = Tool.from_function(
    name="google_search",
    description="Search Google and return the first results",
    func=search.run,
)
agent = initialize_agent(tools=[google_tool], llm=llm, agent=AgentType.OPENAI_FUNCTIONS)
```

This gives us both a base LLM and an agent that can search the web when needed.

```python
GEN_PROMPT = """You are a careful, precise assistant.
TASK: {task}
Produce the best possible answer. Do not include any self-critique here."""
```
```python
CRITIC_PROMPT = """You are a meticulous reviewer. Given the TASK and the DRAFT answer,
1) list EXACTLY 3 concrete issues or improvement opportunities,
2) give a numbered improvement plan,
3) rate current quality from 0–100,
4) decide if refinement is still needed.
TASK: {task}
DRAFT: {draft}
Respond with a SINGLE JSON object and NOTHING ELSE:
{{"issues": ["...", "...", "..."], "plan": ["step 1", "step 2", "step 3"], "score": 0, "stop": false}}"""
```
```python
REFINE_PROMPT = """You are a senior editor. Apply the plan below to improve the DRAFT.
Keep all correct content; fix issues; increase clarity, correctness, and usefulness.
TASK: {task}
DRAFT: {draft}
PLAN:
{plan}
Return ONLY the improved answer (no commentary)."""
```
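The loop below leans on three small helpers that aren't shown in this excerpt: call_llm_text, extract_json_block, and pretty_print_blocks. Here is a minimal sketch of what they might look like, assuming the llm and agent objects from the setup above; the actual implementations may differ.

```python
import json
import re

def call_llm_text(prompt: str, use_agent: bool = False) -> str:
    """Send a prompt to the plain LLM (or to the search-enabled agent) and return plain text."""
    if use_agent:
        return agent.run(prompt)
    return llm.invoke(prompt).content

def extract_json_block(text: str) -> dict:
    """Pull the first {...} block out of a model response and parse it; return {} on failure."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

def pretty_print_blocks(title: str, body: str) -> None:
    """Print a titled block so each iteration's draft and feedback are easy to scan."""
    print(f"\n=== {title} ===\n{body}")
```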
```python
def self_refine(
    task: str,
    max_iters: int = 3,
    min_delta: int = 2,
    use_agent_for_initial: bool = False
) -> Dict[str, Any]:
    """
    Self-refine loop: GENERATE -> CRITIQUE -> REFINE until stop or plateau.
    Returns the final answer and the iteration history.
    """
    history: List[Dict[str, Any]] = []

    # 1) Initial draft
    draft = call_llm_text(GEN_PROMPT.format(task=task), use_agent=use_agent_for_initial)
    history.append({"iteration": 0, "draft": draft, "score": None, "feedback": None})
    best = draft
    best_score = -math.inf

    for i in range(1, max_iters + 1):
        # 2) Critique
        critic_raw = call_llm_text(CRITIC_PROMPT.format(task=task, draft=draft))
        critic = extract_json_block(critic_raw)

        # Safety coercion of the critic's JSON fields
        issues = critic.get("issues", [])[:3]
        plan = critic.get("plan", [])
        score = int(critic.get("score", 50))
        stop_flag = bool(critic.get("stop", False))

        history.append({
            "iteration": i,
            "draft": draft,
            "score": score,
            "feedback": {"issues": issues, "plan": plan, "raw": critic_raw}
        })

        # Plateau / stop conditions
        if stop_flag and i > 1:
            break
        if best_score != -math.inf and (score - best_score) < min_delta and i > 1:
            # No meaningful improvement
            break

        # Track best
        if score > best_score:
            best = draft
            best_score = score

        # 3) Refine
        plan_str = "\n".join(f"- {p}" for p in plan) if plan else "- Improve clarity and correctness."
        refined = call_llm_text(REFINE_PROMPT.format(task=task, draft=draft, plan=plan_str))
        draft = refined

    # Final scoring pass to attach a last score
    final_critic_raw = call_llm_text(CRITIC_PROMPT.format(task=task, draft=draft))
    final_critic = extract_json_block(final_critic_raw) or {"score": best_score if best_score != -math.inf else None}
    final_score = int(final_critic.get("score", best_score if best_score != -math.inf else 0))

    return {
        "task": task,
        "final": draft,
        "final_score": final_score,
        "history": history,
        "final_critic_raw": final_critic_raw
    }

# runner + example usage
def run_task_self_refine(
    task: str,
    max_iters: int = 3,
    min_delta: int = 2,
    use_agent_for_initial: bool = False,
    verbose: bool = True
) -> Dict[str, Any]:
    result = self_refine(
        task=task,
        max_iters=max_iters,
        min_delta=min_delta,
        use_agent_for_initial=use_agent_for_initial
    )
    if verbose:
        for rec in result["history"]:
            it = rec["iteration"]
            pretty_print_blocks(f"Iteration {it} — Draft", rec["draft"])
            if rec["feedback"]:
                fb = rec["feedback"]
                pretty_print_blocks(f"Iteration {it} — Feedback (score={rec['score']})", json.dumps({
                    "issues": fb["issues"],
                    "plan": fb["plan"]
                }, indent=2))
        pretty_print_blocks(f"FINAL (score={result['final_score']})", result["final"])
    return result
# Example task (swap in anything you like)
Search_Query = "How to integrate SAP Shopping Assistant with SAP Commerce Cloud?"
result = run_task_self_refine(
    Search_Query,
    max_iters=3,
    min_delta=2,
    use_agent_for_initial=True  # Use web search for initial draft
)
```

| Item | Iteration 0 | Iteration 1 (Feedback) | Iteration 2 | Iteration 2 (Feedback) | FINAL |
|---|---|---|---|---|---|
| UI Navigation | Generic | Ask for detailed paths | Adds named tabs/panels | Request more explicit steps | Clear, step-by-step paths |
| Performance | Separate checklist | Ask to integrate in steps | Still separate | Ask for actionable guidance | Embedded, actionable tips |
| Validation | Brief & vague | Ask to expand per component | Incomplete | Ask for full, concrete checks | 10 concrete, testable checks |
| Pitfalls | Short list | — | Same | Ask for fixes | Pitfalls paired with fixes |
| Completeness | OK | — | Cuts off | Ask to finish | Complete & ready to run |
Score Progress

| Iteration | Score | Notes |
|---|---|---|
| Iteration 0 | 45/100 | Basic initial draft |
| Iteration 1 | 68/100 | +23 points improvement |
| Iteration 2 | 87/100 | +19 points improvement |
But here's the reality check: This represents a best-case scenario. During my experiments, I also encountered:
- Complete refinement failures, where the system produced identical content across iterations despite clear feedback
- Content truncation, where later iterations became incomplete and cut off mid-sentence
- Score plateaus, where the quality metric stayed flat (75/100) even after visible improvements
What surprised me
How to use Self-Refine in a Customer Support Chatbot
Self-Refine is perfect for support bots because real customers are often vague (“my order didn’t arrive,” “can’t login”).
Goals
How can we model this?
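Here is one way to model it, as a minimal sketch rather than my production chatbot code: keep the same Generate, Feedback, Refine loop, but make the critic check support-specific criteria (did the reply acknowledge the issue, ask for the missing detail, avoid inventing policy?). The prompts and the `complete` callable below are illustrative assumptions.

```python
from typing import Callable

SUPPORT_GEN = """You are a customer support agent. Customer message: {msg}
Write a short, friendly reply. If key details are missing (order number, account email), ask for them."""

SUPPORT_CRITIC = """Review the DRAFT reply to the CUSTOMER message.
List up to 3 issues (wrong tone, missing clarifying question, invented policy). If there is nothing to fix, answer only OK.
CUSTOMER: {msg}
DRAFT: {draft}"""

SUPPORT_REFINE = """Rewrite the DRAFT reply so the ISSUES are fixed. Return only the reply.
CUSTOMER: {msg}
DRAFT: {draft}
ISSUES: {issues}"""

def refine_support_reply(msg: str, complete: Callable[[str], str], max_iters: int = 2) -> str:
    """Generate -> critique -> refine a support reply; `complete` is any text-in/text-out LLM call."""
    draft = complete(SUPPORT_GEN.format(msg=msg))
    for _ in range(max_iters):
        issues = complete(SUPPORT_CRITIC.format(msg=msg, draft=draft))
        if issues.strip().upper().startswith("OK"):
            break  # the critic found nothing to fix, so stop early
        draft = complete(SUPPORT_REFINE.format(msg=msg, draft=draft, issues=issues))
    return draft

# Example wiring, reusing call_llm_text from the agent setup above:
# reply = refine_support_reply("my order didn't arrive", complete=call_llm_text)
```

Capping max_iters low keeps latency predictable, which matters more in a live chat than squeezing out the last quality point.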
TLDR for practitioners
Treat prompts like drafts. Ask the model to critique with structure, then refine under constraints. Screenshot the three steps (Generate → Feedback → Refine) to make the improvement visible. Once you see it, it’s hard to go back to one-shot prompting.
Self-refinement works better for some task types (code generation, technical documentation) than others (creative writing).
Cost vs. quality tradeoff: Calculate the token cost per quality point gained. For some applications, a single well-crafted prompt may be more cost-effective than multiple iteration refinements.
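As a rough back-of-the-envelope, using the score progression from the search-agent run above and assumed token counts and pricing (both purely illustrative):

```python
# All numbers below are assumptions except the 45 -> 87 score progression reported above
tokens_per_iteration = 1500        # generate + critique + refine calls combined
iterations = 3
price_per_1k_tokens = 0.01         # USD, assumed model pricing
quality_gain = 87 - 45             # score points gained across the run

total_cost = tokens_per_iteration * iterations * price_per_1k_tokens / 1000
cost_per_point = total_cost / quality_gain
print(f"~${total_cost:.3f} total, ~${cost_per_point:.4f} per quality point")
```

If that number looks worse than simply spending more effort on a single well-crafted prompt, the one-shot prompt wins.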
Bottom line: self-refining prompts make your bot think twice—so customers don’t have to ask twice.