Multimodal Prompt Test Harness Spec

The Prompt

You are a QA architect. Design a test harness specification for multimodal prompt workflows.

PRODUCT GOAL:
{{product_goal}}

PROMPT SUITE:
{{prompt_suite}}

MEDIA TYPES:
{{media_types}}

QUALITY THRESHOLDS:
{{quality_thresholds}}

FAILURE CONDITIONS:
{{failure_conditions}}

CI CONSTRAINTS:
{{ci_constraints}}

Return:
1. Test harness architecture (inputs, execution, evaluation, reporting).
2. Model-lane matrix (which prompts or fixtures run on which image/video/audio lane and why).
3. Test case matrix covering happy path, edge cases, and regressions.
4. Evaluation rubric for generated media outputs, including what can be automated vs what requires human review.
5. Failure triage protocol and severity levels.
6. Fallback test strategy for environments without native media generation.

Rules:
- Keep the framework provider-agnostic.
- Distinguish deterministic checks from human-review checks.
- Include guidance for reproducibility and prompt versioning.
- Include reference-media fixture handling when outputs depend on reference images, clips, or audio stems.

When to Use

Use this when product teams need repeatable quality checks for media-oriented prompts before shipping changes to users. It is especially helpful when image and video lanes expose different controls, such as reference-image generation, remix, first-frame guidance, or audio-enabled output.

Variables

product_goal: What user-facing workflow this harness protects.
prompt_suite: Set of prompts under test.
media_types: Which output modalities are in scope.
quality_thresholds: Acceptance criteria for pass/fail.
failure_conditions: Known critical breakpoints.
ci_constraints: Runtime, cost, environment, and tooling limits.
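
The variables map onto the {{...}} placeholders in the prompt. As a minimal sketch, assuming the template is held as a plain string and filled in before being sent to a model, the substitution could look like the following. The helper name render_prompt and the sample values are hypothetical, not part of the prompt itself.

```python
import re

# Matches {{name}} placeholders like the ones used in the prompt template.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, values: dict[str, str]) -> str:
    """Fill {{name}} placeholders; fail loudly if any value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"missing template variables: {missing}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


# Shortened template and illustrative values (assumptions for this sketch).
template = "PRODUCT GOAL:\n{{product_goal}}\n\nMEDIA TYPES:\n{{media_types}}"
values = {
    "product_goal": "Protect the storyboard-to-clip workflow.",
    "media_types": "image, video",
}
rendered = render_prompt(template, values)
```

Failing on a missing variable, rather than leaving the placeholder in place, keeps an incomplete prompt from reaching a test run unnoticed.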

Tips & Variations

Ask for a “daily smoke” and “weekly deep” test split for cost control.
Keep one fixed reference-image or reference-clip bundle for regression comparisons when the workflow depends on continuity.
Add golden-sample checks that compare new outputs against approved reference outputs, so regressions are measured against a stable baseline.
Include manual-review escalation rules for non-deterministic failures.
If media generation is unavailable in CI, run metadata and prompt-structure checks plus scheduled human audits.
Treat prompt-pack versioning and fixture versioning as separate things; changing one without the other makes regressions harder to interpret.
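
One way to keep prompt-pack and fixture versions separate is to record an independent content hash for each alongside every test run. The sketch below assumes prompts are held as strings and fixtures as raw bytes; the function names and the sample data are hypothetical.

```python
import hashlib


def content_hash(items: dict[str, bytes]) -> str:
    """Short SHA-256 digest over named blobs, independent of dict order."""
    digest = hashlib.sha256()
    for name in sorted(items):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name/content boundaries stay unambiguous
        digest.update(hashlib.sha256(items[name]).digest())
    return digest.hexdigest()[:12]


def run_fingerprint(prompts: dict[str, str], fixtures: dict[str, bytes]) -> dict[str, str]:
    """Version the prompt pack and the fixture bundle separately."""
    return {
        "prompt_pack": content_hash({k: v.encode("utf-8") for k, v in prompts.items()}),
        "fixtures": content_hash(fixtures),
    }


# Illustrative data only: one prompt edit, fixtures untouched.
fixtures = {"ref_image.png": b"\x89PNG placeholder bytes"}
before = run_fingerprint({"hero_shot": "A red kite over dunes"}, fixtures)
after = run_fingerprint({"hero_shot": "A red kite over dunes at dusk"}, fixtures)
```

Here the prompt_pack value changes between the two runs while the fixtures value stays the same, so a shift in results can be attributed to the prompt edit rather than to the reference media.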

Example Output

A strong output includes a test harness blueprint, model-lane matrix, prioritized case matrix, reproducibility rules, and a triage model that engineering and QA can operate jointly.
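
For illustration, the case matrix in such an output could be carried as structured data that a runner filters by test tier and by review type. This is a sketch under stated assumptions, not a prescribed schema: the class names, case IDs, and the rule that the weekly deep run also includes the daily smoke cases are all hypothetical choices.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMOKE = "daily smoke"
    DEEP = "weekly deep"


class Review(Enum):
    AUTOMATED = "deterministic"
    HUMAN = "human review"


@dataclass(frozen=True)
class HarnessCase:
    case_id: str
    lane: str  # e.g. "image", "video", "audio"
    tier: Tier
    review: Review


def select_cases(cases, run_tier):
    """Deep runs take every case; smoke runs take only smoke-tier cases."""
    if run_tier is Tier.DEEP:
        return list(cases)
    return [c for c in cases if c.tier is Tier.SMOKE]


def split_by_review(cases):
    """Separate cases a machine can gate on from those queued for people."""
    automated = [c for c in cases if c.review is Review.AUTOMATED]
    human = [c for c in cases if c.review is Review.HUMAN]
    return automated, human


# Hypothetical case matrix entries.
CASES = [
    HarnessCase("img-ref-001", "image", Tier.SMOKE, Review.AUTOMATED),
    HarnessCase("vid-first-frame-002", "video", Tier.DEEP, Review.HUMAN),
    HarnessCase("aud-stem-003", "audio", Tier.DEEP, Review.AUTOMATED),
]

smoke = select_cases(CASES, Tier.SMOKE)
automated, human = split_by_review(select_cases(CASES, Tier.DEEP))
```

Keeping the review type on each case makes the automated versus human-review split explicit in the matrix itself, which is what lets engineering and QA share one source of truth.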