Root Cause Debugging Assistant

The Prompt

You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue through a {{workflow_surface}} workflow.

Error signal:
{{error_message}}

How to reproduce:
{{reproduction_steps}}

Recent changes:
{{recent_changes}}

Environment details:
{{environment}}

Available artifacts:
{{available_artifacts}}

Output in this exact format:
1) Incident summary (3-5 bullets)
2) Ranked hypotheses (top 5, with confidence level and why)
3) Fastest evidence path
   - what to inspect first
   - commands, files, logs, or traces to check
   - what evidence would falsify the top two hypotheses
4) Instrumentation and logging upgrades (specific metrics/log fields/traces to add)
5) Likely fix path (minimal-risk patch first, then long-term fix)
6) Verification checklist (pre-deploy, canary, post-deploy checks)
7) If blocked, list exactly what extra data is required

Constraints:
- Do not jump to a fix before ranking hypotheses.
- Prefer reversible and low-blast-radius actions first.
- Explicitly call out assumptions.
- Include at least one edge case that could invalidate the top hypothesis.
- If the workflow surface is a planning/review app, return a clean handoff packet for a CLI or IDE agent instead of pretending direct repo access.
- Do not use fake numerical precision when the evidence is thin.

When to Use

Use this when a bug is real, urgent, and not obvious from a quick read. It is especially useful when logs are noisy, several recent changes overlap, or the error only appears in one environment.

Good scenarios:

A production regression after a deployment
An intermittent failure with low reproduction reliability
A system where several services could be involved
A “works locally, fails in staging/prod” mismatch
A case where a coding agent needs a better debugging brief before it starts changing files

This template helps avoid random trial-and-error by forcing a hypothesis-first process. You get a prioritized path that starts with fast evidence gathering and low-risk checks, then moves toward targeted fixes.

Variables

Variable	Description	Good input examples
`language`	Main implementation language	TypeScript, Python, Go, Rust
`error_message`	Exact error text, stack traces, alerts	”TypeError: Cannot read properties of undefined”, Datadog alert excerpt
`reproduction_steps`	Clear sequence to reproduce or trigger conditions	”Open checkout, apply coupon, submit payment”
`recent_changes`	Relevant deploys, config toggles, dependency updates	”Upgraded Prisma 5.9 -> 5.12, enabled caching flag”
`environment`	Runtime context and constraints	”Kubernetes, Node 20, Redis 7, only EU region impacted”
`workflow_surface`	Where the investigation output will be used	`"CLI agent with shell access"`, `"IDE agent with file access"`, `"app-based planning and review"`
`available_artifacts`	The evidence already on hand	logs, failing test output, diff, trace IDs, dashboards, screenshots

Tips & Variations

Add a timeline: prepend incident timestamps to improve causal analysis.
Add blast radius: include affected users, regions, or endpoints.
For flaky issues, ask for “three competing hypotheses with disconfirming tests.”
For distributed systems, require a trace-based investigation section.
If the terminal session is getting noisy, move the evidence into a separate planning app for hypothesis ranking, then bring the best path back into the repo.
Include the suspect commit range or recent diff when agent-generated changes may have introduced the regression.
After root cause is found, run a second pass: “Draft a postmortem from this analysis.”
If the first run jumps to a patch too fast, rerun it with “rank environment, config, data, and code causes separately before proposing a fix.”

If your logs are weak, this prompt still works well because it requests instrumentation upgrades early instead of pretending certainty.

Example Output

Ranked hypothesis #1 (high confidence): Null response from upstream profile service causes unguarded property access in checkout handler.

Fastest evidence path: correlate failed checkout requests with upstream profile-service 5xx spikes in the same minute and inspect the last commit touching the fallback mapper.

Minimal-risk fix: Add null-guard and fallback path in checkout handler, then canary at 5% traffic.

Verification: Error rate below 0.1% for 30 minutes, no latency regression over p95 baseline.