Root Cause Debugging Assistant

Category: development
Subcategory: debugging
Difficulty: intermediate
Target models: claude-sonnet, gpt, gemini-pro
Variables: {{language}} {{error_message}} {{reproduction_steps}} {{recent_changes}} {{environment}} {{workflow_surface}} {{available_artifacts}}
Tags: debugging, root-cause, logs, incident, troubleshooting, agent-workflows
Updated April 23, 2026

The Prompt

You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue. The investigation will run on this workflow surface: {{workflow_surface}}.

Error signal:
{{error_message}}

How to reproduce:
{{reproduction_steps}}

Recent changes:
{{recent_changes}}

Environment details:
{{environment}}

Available artifacts:
{{available_artifacts}}

Output in this exact format:
1) Incident summary (3-5 bullets)
2) Ranked hypotheses (top 5, with confidence level and why)
3) Fastest evidence path
   - what to inspect first
   - commands, files, logs, or traces to check
   - what evidence would falsify the top two hypotheses
4) Instrumentation and logging upgrades (specific metrics/log fields/traces to add)
5) Likely fix path (minimal-risk patch first, then long-term fix)
6) Verification checklist (pre-deploy, canary, post-deploy checks)
7) If blocked, list exactly what extra data is required

Constraints:
- Do not jump to a fix before ranking hypotheses.
- Prefer reversible and low-blast-radius actions first.
- Explicitly call out assumptions.
- Include at least one edge case that could invalidate the top hypothesis.
- If the workflow surface is a planning/review app, return a clean handoff packet for a CLI or IDE agent instead of pretending to have direct repository access.
- Do not use fake numerical precision when the evidence is thin.

When to Use

Use this when a bug is real, urgent, and not obvious from a quick read. It is especially useful when logs are noisy, several recent changes overlap, or the error only appears in one environment.

Good scenarios:

  • A production regression after a deployment
  • An intermittent failure with low reproduction reliability
  • A system where several services could be involved
  • A “works locally, fails in staging/prod” mismatch
  • A case where a coding agent needs a better debugging brief before it starts changing files

This template helps avoid random trial-and-error by forcing a hypothesis-first process. You get a prioritized path that starts with fast evidence gathering and low-risk checks, then moves toward targeted fixes.

Variables

Variable | Description | Good input examples
language | Main implementation language | TypeScript, Python, Go, Rust
error_message | Exact error text, stack traces, alerts | "TypeError: Cannot read properties of undefined", Datadog alert excerpt
reproduction_steps | Clear sequence to reproduce or trigger conditions | "Open checkout, apply coupon, submit payment"
recent_changes | Relevant deploys, config toggles, dependency updates | "Upgraded Prisma 5.9 -> 5.12, enabled caching flag"
environment | Runtime context and constraints | "Kubernetes, Node 20, Redis 7, only EU region impacted"
workflow_surface | Where the investigation output will be used | "CLI agent with shell access", "IDE agent with file access", "app-based planning and review"
available_artifacts | The evidence already on hand | logs, failing test output, diff, trace IDs, dashboards, screenshots
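
If you fill the template from a script rather than by hand, a small renderer can catch empty variables before the prompt reaches a model. The sketch below is one possible approach, not part of the template itself: the `render_prompt` helper and the shortened `template` string are hypothetical, and it assumes placeholders always take the `{{name}}` form shown above.

```python
import re

# Matches {{name}} placeholders as they appear in the template.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Substitute every {{name}} placeholder; fail loudly on missing or blank values."""
    names = set(PLACEHOLDER.findall(template))
    missing = sorted(n for n in names if not values.get(n, "").strip())
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)].strip(), template)


# Shortened stand-in for the full prompt text (hypothetical example input).
template = (
    "You are a senior {{language}} incident investigator.\n"
    "Error signal:\n"
    "{{error_message}}"
)

prompt = render_prompt(
    template,
    {"language": "Python", "error_message": "KeyError: 'user_id'"},
)
print(prompt)
```

Raising on a blank variable is a deliberate choice here: an empty "Recent changes" section tends to produce confident but ungrounded hypotheses, so it is usually better to stop and write "none known" explicitly.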

Tips & Variations

  • Add a timeline: prepend incident timestamps to improve causal analysis.
  • Add blast radius: include affected users, regions, or endpoints.
  • For flaky issues, ask for “three competing hypotheses with disconfirming tests.”
  • For distributed systems, require a trace-based investigation section.
  • If the terminal session is getting noisy, move the evidence into a separate planning app for hypothesis ranking, then bring the best path back into the repo.
  • Include the suspect commit range or recent diff when agent-generated changes may have introduced the regression.
  • After root cause is found, run a second pass: “Draft a postmortem from this analysis.”
  • If the first run jumps to a patch too fast, rerun it with “rank environment, config, data, and code causes separately before proposing a fix.”

If your logs are weak, this prompt still works because it asks for instrumentation upgrades as part of the output instead of feigning certainty.

Example Output

Ranked hypothesis #1 (high confidence): Null response from upstream profile service causes unguarded property access in checkout handler.

Fastest evidence path: correlate failed checkout requests with upstream profile-service 5xx spikes in the same minute and inspect the last commit touching the fallback mapper.

Minimal-risk fix: Add null-guard and fallback path in checkout handler, then canary at 5% traffic.

Verification: Error rate below 0.1% for 30 minutes, no p95 latency regression against baseline.
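
To make the minimal-risk fix in the example concrete, here is a rough sketch of what a null guard with a fallback path might look like. Everything in it is assumed for illustration: the `resolve_tier` function, the `tier` field, and the `DEFAULT_PROFILE` fallback are hypothetical and do not come from a real checkout service.

```python
from typing import Optional

# Hypothetical fallback used when the upstream profile service returns nothing.
DEFAULT_PROFILE = {"tier": "standard"}


def resolve_tier(profile: Optional[dict]) -> str:
    """Return the customer's tier, falling back when the profile is missing."""
    if profile is None:
        # Guard: the upstream call failed or returned null, so avoid
        # touching any property on the missing object.
        return DEFAULT_PROFILE["tier"]
    # Second guard: the profile exists but lacks the expected field.
    return profile.get("tier", DEFAULT_PROFILE["tier"])


print(resolve_tier(None))
print(resolve_tier({"tier": "gold"}))
```

A guard like this is reversible and small in blast radius, which matches the prompt's constraint to prefer low-risk actions first. It does not explain why the upstream service returned null, so the long-term fix still needs its own investigation.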