Writing Scenarios

Scenarios are the core of AXIS testing. Each scenario defines a task for the agent, criteria for judging success, and optional setup and teardown steps. Well-written scenarios produce consistent, meaningful scores.

Anatomy of a Scenario

A scenario is an object with a name, prompt, rubric, and optional lifecycle hooks. The same shape applies whether you author it as JSON, a JavaScript or TypeScript module, or an inline entry in axis.config.{js,ts}. See Authoring Scenarios below for the available formats.

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Human-readable title shown in reports and CLI output. |
| prompt | string | Yes | The task description sent to the agent. |
| rubric | string \| object[] | Yes | Success criteria: a plain string or array of weighted checks. |
| setup | object[] | No | Lifecycle actions run before the agent starts. |
| teardown | object[] | No | Lifecycle actions run after scoring completes. |
| agents | string[] | No | Override which agents run this scenario. Defaults to all configured agents. |
| skills | string[] | No | Scenario-specific skills, merged with top-level and agent-level skills. |
| mcp_servers | object | No | Scenario-specific MCP servers, merged with top-level MCP servers (scenario wins on name conflict). |
| limits | object | No | Time and token limits for this scenario. Overrides default scenario limits. Fields: time_minutes, tokens. |
| artifacts | string[] | No | Glob patterns of files to capture from the workspace after teardown. Captured files are stored under .axis/reports/<id>/scenarios/<key>/<agent>/artifacts/ and embedded in the HTML report so they can be previewed and downloaded directly. Patterns support *, **, ?, and character classes. Merged with top-level artifacts. |
| variants | object[] | No | Run multiple configurations of the same scenario. See Variants. |

Authoring Scenarios

Scenarios live in the configured scenarios directory or are listed inline in axis.config.{js,ts}. The three formats below are interchangeable for the shape itself; pick the one that matches how you want to manage the scenario.

As a JSON file

The simplest form. axis init scaffolds this by default.

JSON scenarios live in your scenarios directory. The filename (without .json) becomes the scenario key. Do not include a key field, since the loader derives it from the file path.

{
  "name": "Debug and fix a broken script",
  "prompt": "There is a JavaScript file at src/add.js that has a bug. Find it, fix it, and verify the fix by running the test.",
  "rubric": [
    { "check": "Agent identified the bug (subtraction instead of addition)", "weight": 0.3 },
    { "check": "Agent fixed the bug so add(a, b) returns a + b", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.3 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src && echo 'function add(a,b) { return a-b; }\nmodule.exports = { add };' > src/add.js" },
    { "action": "run_script", "command": "mkdir -p test && echo 'const {add} = require(\"../src/add\");\nconsole.log(add(2,3) === 5 ? \"PASS\" : \"FAIL\");' > test/add.test.js" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf src test" }
  ]
}

As a JavaScript or TypeScript module

Default-export a scenario object or function for typed authoring.

Drop a .ts, .js, .mjs, .cjs, .mts, or .cts file into the scenarios directory. The default export must resolve to a scenario object: either the object itself, or a (sync or async) function returning one. The key is path-derived; an explicit key field is allowed but must match.

// scenarios/refactor-add.ts
import type { ScenarioInput } from "@netlify/axis";

export default {
  name: "Refactor a buggy add() function",
  prompt: "There is a bug in src/add.js. Find it, fix it, and verify the fix.",
  rubric: [
    { check: "Agent identified the bug", weight: 0.3 },
    { check: "Fix is correct", weight: 0.4 },
    { check: "Verification was run", weight: 0.3 },
  ],
} satisfies ScenarioInput;

A function default export is handy for env-driven prompts or fixture-derived rubrics:

// scenarios/dynamic.ts
export default async () => ({
  name: "Dynamic scenario",
  prompt: "Run task for " + (process.env.TARGET_ENV ?? "staging"),
  rubric: [{ check: "Task ran cleanly" }],
});

Helpers and shared logic

Module files in the scenarios directory whose default export is missing or is not a scenario object are silently skipped, so helpers, fixtures, and shared utilities can live alongside scenarios without any special configuration.
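A minimal sketch of the pattern (the file and export names here are hypothetical): a colocated helper with only named exports is ignored by the loader, while scenario modules import from it normally.

// scenarios/helpers.ts: no scenario-shaped default export, so the loader skips it.
export const verificationCheck = { check: "Agent ran the test and it passed" };

// scenarios/fix-bug.ts: a normal scenario module that imports the helper.
import { verificationCheck } from "./helpers.js";

export default {
  name: "Fix a seeded bug",
  prompt: "Find and fix the bug in src/add.js, then verify the fix.",
  rubric: [{ check: "Fix is correct" }, verificationCheck],
};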

Inline in axis.config

Generate scenarios programmatically (table-driven or env-parameterized).

In axis.config.{js,ts}, the scenarios field can mix directory paths, single-file paths, and inline scenario objects. Inline objects require an explicit key since there is no path to derive one from.

// axis.config.ts
import type { AxisConfig } from "@netlify/axis";
import refactorAdd from "./scenarios/refactor-add.js";

export default {
  scenarios: [
    "./scenarios",          // directory walked for .json/.ts/.js scenarios
    refactorAdd,             // module-imported scenario
    {                        // pure inline literal; key required
      key: "smoke-test",
      name: "Smoke test",
      prompt: "Verify the build runs cleanly.",
      rubric: [{ check: "Build succeeded" }],
    },
  ],
  agents: ["claude-code"],
} satisfies AxisConfig;

Reach for inline scenarios when you want to generate them from a fixture set, share base configuration across many entries, or parameterize by environment.
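As a sketch of the table-driven pattern (the page names and prompts below are hypothetical), a config can map a fixture list into inline entries:

// axis.config.ts: generating inline scenarios from a list.
import type { AxisConfig } from "@netlify/axis";

const pages = ["about", "pricing", "contact"];

export default {
  scenarios: pages.map((page) => ({
    key: `create-${page}-page`, // inline scenarios need an explicit key
    name: `Create the ${page} page`,
    prompt: `Create ${page}.html with a heading and placeholder copy.`,
    rubric: [{ check: `${page}.html exists and contains a heading` }],
  })),
  agents: ["claude-code"],
} satisfies AxisConfig;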

Writing Effective Prompts

The prompt is what the agent sees as its task. The quality of your prompt directly affects the consistency and usefulness of your scores.
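For example (an illustrative contrast, not a prescribed format), a concrete prompt names the files involved, the expected behavior, and how to verify it, while a vague one forces both the agent and the judge to guess:

// Hypothetical contrast between a vague and a concrete prompt.
const vague =
  "Fix the project."; // ambiguous scope; scores will vary run to run

const concrete =
  "The test in test/math.test.js is failing. Find the bug in src/math.js, " +
  "fix it, and verify the test passes."; // names files, behavior, verification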

Designing Rubrics

The rubric defines what "success" means. A judge LLM reads the agent's full transcript and evaluates each check on a 0 to 10 scale. Well-designed rubrics produce scores that reflect real quality differences.

Start simple: string rubrics

The simplest rubric is a plain string. The judge reads the transcript and gives a single 0 to 10 score for how well the agent met the description.

{
  "name": "Create an Express server",
  "prompt": "Create a working Express server that listens on port 3000.",
  "rubric": "The agent should create a working Express server on port 3000"
}

String rubrics work well for simple scenarios where you just want a holistic pass/fail judgment. The downside is that you get a single score with no visibility into what went right or wrong.

Add structure: check arrays

For more granular scoring, use an array of checks. Each check is evaluated independently, so you can see exactly which criteria the agent met and which it missed.

"rubric": [
  { "check": "Server starts on port 3000" },
  { "check": "GET / returns a 200 response" },
  { "check": "Server has error handling middleware" }
]

When no weight is specified, AXIS distributes weight equally across all checks. In this example, each check is worth one-third of the Goal Achievement score.

Control importance: weighted checks

Add weight to each check to control how much it contributes to the score. This is the recommended approach for most scenarios because it lets you express which outcomes matter most.

"rubric": [
  { "check": "Server starts on port 3000", "weight": 0.4 },
  { "check": "GET / returns a 200 response", "weight": 0.3 },
  { "check": "Server has error handling middleware", "weight": 0.3 }
]

In this example, the server starting is weighted highest because it is the core outcome. You can also mix weighted and unweighted checks: AXIS distributes the remaining weight equally across any checks without an explicit weight.
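For instance, in the hypothetical mix below, the explicitly weighted check keeps its 0.5 and the two unweighted checks split the remaining 0.5, receiving 0.25 each:

// One explicit weight; AXIS splits the remaining (1 - 0.5) / 2 = 0.25
// equally across the two unweighted checks.
const rubric = [
  { check: "Server starts on port 3000", weight: 0.5 },
  { check: "GET / returns a 200 response" },
  { check: "Server has error handling middleware" },
];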

Writing good checks

Strong vs Weak Checks

Weak: "Agent did a good job" -subjective, hard for the judge to score consistently.
Weak: "Agent used git" -checks behavior, not outcome. What if git was unnecessary?
Strong: "File output.csv contains at least 10 rows of valid CSV data" -concrete, verifiable.
Strong: "The test suite passes with npm test" -clear success criterion.

Setup and Teardown

Setup actions run before the agent starts. Use them to create the starting state that the scenario depends on: files to edit, databases to seed, servers to start.

Teardown actions run after scoring completes. Use them to clean up resources that should not persist between runs.

"setup": [
  { "action": "run_script", "command": "cp -R \"$AXIS_CONFIG_DIR/scenario-fixtures/my-project/.\" ." }
],
"teardown": [
  { "action": "run_script", "command": "rm -rf /tmp/my-project-artifacts" }
]

Lifecycle Details

Each action runs sequentially with a 30-second timeout. Setup failures abort the job and mark it as failed. Teardown failures are logged but do not block subsequent jobs or affect scores.
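Since a failed setup action aborts the whole job, a cheap precondition check at the top of setup fails fast before the agent ever runs. A sketch, assuming the scenario needs node on the PATH:

// Hypothetical fail-fast precondition: any non-zero exit in setup
// aborts the job before the agent starts.
const setup = [
  { action: "run_script", command: "command -v node >/dev/null || exit 1" },
  { action: "run_script", command: "mkdir -p src test" },
];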

Lifecycle environment variables

Setup and teardown scripts run with the workspace as their working directory. The following environment variables are available inside run_script commands:

| Variable | Value | Use it for |
|---|---|---|
| AXIS_CONFIG_DIR | Absolute path to the directory containing axis.config.json. | Referencing fixtures, helper scripts, or other files versioned alongside your project. |
| AXIS_OUTPUT | Path to a per-phase markdown file. Anything written here surfaces in the report. | Capturing setup or teardown observations; see Capturing notes for the report below. |
| AXIS_WORKSPACE | Absolute path to the per-job workspace (same as pwd). | Inspecting files when the script's cwd may have changed. |
| AXIS_PHASE | Either setup or teardown. | Sharing one script between phases by branching on the current phase (see the sketch below). |
| HOME | Set to the workspace temp directory (same as pwd). | Workspace isolation: tools that read $HOME stay scoped to the run. |
| PWD | The fresh per-job workspace temp directory. | Writing files the agent will see; relative paths land here. |
| System passthroughs | PATH, USER, SHELL, LANG, TERM, TMPDIR. | Standard shell tooling. Other host vars are not passed unless listed in env in the config. |
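As a sketch of the shared-script pattern from the AXIS_PHASE row (the directory name is hypothetical), one command can serve both phases by branching on the current phase:

// The same action object reused for setup and teardown; $AXIS_PHASE
// selects which branch runs.
const lifecycle = [
  {
    action: "run_script",
    command: 'if [ "$AXIS_PHASE" = "setup" ]; then mkdir -p fixtures; else rm -rf fixtures; fi',
  },
];

export default {
  name: "Shared lifecycle sketch",
  prompt: "Edit the files under fixtures/.",
  rubric: "Files under fixtures/ were edited as requested",
  setup: lifecycle,
  teardown: lifecycle,
};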

The AXIS_CONFIG_DIR variable enables the recommended pattern of keeping fixture data in a versioned scenario-fixtures/ directory next to axis.config.json, then copying just the bits a scenario needs into its workspace:

scenario-fixtures/
  my-project/
    package.json
    src/index.js
    .axis/baselines/main.json   # hidden files copied via the trailing /. on cp -R

scenarios/
  my-scenario.json              # setup: cp -R "$AXIS_CONFIG_DIR/scenario-fixtures/my-project/." .

Capturing notes for the report

Setup and teardown each receive their own $AXIS_OUTPUT file path. Anything written to that file is captured as GitHub-flavored markdown and surfaced as a "Setup notes" or "Teardown notes" panel in the agent's expanded detail row in the HTML report. Multiple actions in the same phase share one output file, so each script can append.

"teardown": [
  {
    "action": "run_script",
    "command": "{ echo '## Teardown'; echo; if [ -f summary.md ]; then echo \"- **summary.md**: $(wc -c < summary.md | tr -d ' ') bytes\"; else echo '_no summary.md was written_'; fi; } >> \"$AXIS_OUTPUT\""
  }
]

The notes panel for that run renders the markdown: a bullet list with bold filenames and the captured byte count.

Capture limits

Output is capped at 256 KB per phase — longer notes are truncated with a marker line. Output is captured even if a script fails, so partial notes still reach the report. Setup output is recorded before the agent runs and is not visible to the agent or the LLM judge.


Artifacts

Use artifacts to capture files the agent produced during the run. Each entry is a glob pattern relative to the scenario's workspace; matching files are copied into the report after teardown so you can inspect them later without keeping the workspace around.

{
  "name": "Generate API summary",
  "prompt": "Fetch the docs and write summary.md.",
  "rubric": "summary.md exists and covers the docs.",
  "artifacts": [
    "summary.md",
    "*.log",
    "out/**",
    "screenshots/*.png"
  ]
}

Captures run after teardown, so any cleanup scripts you write don't need to dance around the files you want to keep. You can also set an artifacts array at the top level of axis.config.json to capture the same patterns from every scenario; the two lists are merged.
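A minimal sketch of the top-level form (only the artifacts line is the point here; the rest mirrors the config example above):

// axis.config.ts: top-level artifact globs, merged into every scenario's list.
import type { AxisConfig } from "@netlify/axis";

export default {
  scenarios: ["./scenarios"],
  agents: ["claude-code"],
  artifacts: ["*.log"], // captured for every scenario, merged with scenario-level globs
} satisfies AxisConfig;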

Glob syntax

Patterns support *, **, ?, and character classes. Each pattern is matched relative to the scenario's workspace.

Where captured files live

Each captured file is copied to disk under the report directory, preserving its path relative to the workspace:

.axis/reports/<id>/scenarios/<scenario-key>/<agent>/artifacts/...

The same files are also embedded (base64) into the report manifest so the HTML report can preview and download them entirely client-side — no local server needed, even when the HTML report is opened directly from disk.

Viewing artifacts in the HTML report

Each scenario row gets an Artifacts panel when files were captured. Click Show artifacts to reveal a collapsible file tree, the eye icon to open a modal preview (text-like files render as text, images as images), or the download arrow for a single file. Download all (.zip) bundles every captured file into one archive in the browser.

Mind the size

Artifact contents are embedded in the report HTML, so capturing very large files (multi-MB logs, full builds, video) inflates every shared report. Capture targeted, diagnostic files — not entire build outputs.

Scenario Organization

The filename (without its extension) becomes the scenario key used in reports, CLI commands, and baseline comparisons. Nested directories create namespaced keys.
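For example, a layout like this (file names are illustrative) produces the namespaced keys shown:

scenarios/
  cms/
    create-post.json   # key: cms/create-post
  api/
    health.json        # key: api/health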

Use directories to group related scenarios. Agents can be configured to run only specific groups using glob patterns in the agent configuration:

{
  "agent": "claude-code",
  "scenarios": ["cms/*", "api/*"]
}

Agent-specific scenarios

Use the agents field in a scenario to restrict which agents run it. This is useful when a scenario depends on capabilities specific to one agent, or when you want to test different agents on different tasks.

{
  "name": "Use Claude Code MCP integration",
  "prompt": "Use the filesystem MCP server to list files in /tmp",
  "rubric": "Agent successfully used the MCP filesystem tool",
  "agents": ["claude-code"]
}

Variants

Variants let you run the same scenario under different configurations (different skills, MCP servers, prompts, or agent restrictions) without duplicating the entire scenario file. When variants is defined, the base scenario becomes a template: only the variants execute, each inheriting all fields from the parent. To also run the unmodified scenario as a control, add a variant with no overrides (the baseline pattern).

{
  "name": "Create a blog post",
  "prompt": "Create a new blog post titled 'Hello World' on the CMS.",
  "rubric": [
    { "check": "Blog post was created successfully", "weight": 0.5 },
    { "check": "Title matches 'Hello World'", "weight": 0.5 }
  ],
  "variants": [
    {
      "name": "with-netlify-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-custom-skill",
      "skills": ["./skills/blog-helper"]
    },
    {
      "name": "alt-prompt",
      "prompt": "Create a draft blog post titled 'Hello World' without publishing."
    }
  ]
}

This produces three scenario keys: create-post@with-netlify-mcp, create-post@with-custom-skill, and create-post@alt-prompt. Each variant inherits the parent's rubric, setup, and other fields, then applies its own overrides. The variant name is appended to the scenario key with an @ separator.

Variant fields

Only name is required on a variant. All other fields are optional and inherit from the parent when omitted.

| Field | Type | Behavior |
|---|---|---|
| name | string | Required. Must match /^[a-zA-Z0-9_-]+$/. Used in the scenario key. |
| prompt | string | Replaces the parent prompt. |
| rubric | string \| object[] | Replaces the parent rubric. |
| skills | string[] | Replaces the parent's scenario-level skills (top-level and agent-level skills still merge in). |
| mcp_servers | object | Merged with the parent's scenario-level MCP servers (variant wins on name conflict). |
| agents | string[] | Replaces the parent agent restriction. |
| setup / teardown | object[] | Replaces the parent lifecycle actions. |
| limits | object | Replaces the parent's limits. Fields: time_minutes, tokens. |
| artifacts | string[] | Replaces the parent's scenario-level artifact globs (top-level artifacts still merge in). |
| skip | boolean | Overrides the parent skip flag. |

Filtering Variants

CLI filters and agent-level scenarios globs work with variant keys. Filtering by the base key matches all its variants: --scenario create-post runs all three variants above. Use the full key to target a specific variant: --scenario create-post@with-netlify-mcp.

Example Scenarios

File creation

Generate a README from scratch with no setup required.

{
  "name": "Create a README",
  "prompt": "Create a README.md file for a Node.js project called 'my-api'. Include a title, description, install instructions, and usage example.",
  "rubric": [
    { "check": "README.md exists" },
    { "check": "Contains a project title and description" },
    { "check": "Contains npm install instructions" },
    { "check": "Contains a usage or getting started example" }
  ]
}

Bug fix with verification

Find and fix a bug in a pre-seeded project, then confirm the test passes.

{
  "name": "Fix failing test",
  "prompt": "The test in test/math.test.js is failing. Find the bug in src/math.js, fix it, and verify the test passes.",
  "rubric": [
    { "check": "Agent identified the root cause", "weight": 0.2 },
    { "check": "Agent fixed src/math.js correctly", "weight": 0.4 },
    { "check": "Agent ran the test and it passed", "weight": 0.4 }
  ],
  "setup": [
    { "action": "run_script", "command": "mkdir -p src test" },
    { "action": "run_script", "command": "echo 'exports.multiply = (a, b) => a + b;' > src/math.js" },
    { "action": "run_script", "command": "echo 'const {multiply} = require(\"../src/math\"); console.assert(multiply(3,4) === 12, \"Expected 12\"); console.log(\"PASS\");' > test/math.test.js" }
  ]
}

Multi-step with setup and teardown

Add a new API endpoint, write tests, and clean up.

{
  "name": "Add API endpoint with tests",
  "prompt": "Add a GET /api/health endpoint to the Express app in src/app.js. It should return { status: 'ok', uptime: process.uptime() }. Write a test in test/health.test.js.",
  "rubric": [
    { "check": "GET /api/health endpoint exists", "weight": 0.3 },
    { "check": "Returns JSON with status and uptime", "weight": 0.3 },
    { "check": "Test file exists and covers the endpoint", "weight": 0.2 },
    { "check": "All tests pass", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && npm install express" },
    { "action": "run_script", "command": "mkdir -p src test" }
  ],
  "teardown": [
    { "action": "run_script", "command": "rm -rf node_modules package.json src test" }
  ]
}

Multi-variant scenario

Test the same task with different tool configurations.

{
  "name": "Deploy a site",
  "prompt": "Deploy the project in the current directory to production.",
  "rubric": [
    { "check": "Site was deployed successfully", "weight": 0.5 },
    { "check": "Agent confirmed the deploy URL", "weight": 0.3 },
    { "check": "No errors in the deployment log", "weight": 0.2 }
  ],
  "setup": [
    { "action": "run_script", "command": "npm init -y && echo '<h1>Hello</h1>' > index.html" }
  ],
  "variants": [
    {
      "name": "baseline"
    },
    {
      "name": "with-mcp",
      "mcp_servers": {
        "netlify": { "type": "http", "url": "https://mcp.netlify.com" }
      }
    },
    {
      "name": "with-deploy-skill",
      "skills": ["./skills/deploy"]
    }
  ]
}

The first variant, baseline, has no overrides: it runs the scenario exactly as defined, giving you a control run to compare against. The other variants layer on different tool configurations. This produces three keys: deploy@baseline, deploy@with-mcp, and deploy@with-deploy-skill.