Writing Scenarios
Scenarios are the core of AXIS testing. Each scenario defines a task for the agent, criteria for judging success, and optional setup and teardown steps. Well-written scenarios produce consistent, meaningful scores.
Anatomy of a Scenario
A scenario is an object with a name, prompt, rubric, and optional lifecycle hooks. The same
shape applies whether you author it as JSON, a JavaScript or TypeScript module, or an inline
entry in axis.config.{js,ts}. See
Authoring Scenarios below for the available formats.
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Human-readable title shown in reports and CLI output. |
| prompt | string | Yes | The task description sent to the agent. |
| rubric | string \| object[] | Yes | Success criteria: a plain string or array of weighted checks. |
| setup | object[] | No | Lifecycle actions run before the agent starts. |
| teardown | object[] | No | Lifecycle actions run after scoring completes. |
| agents | string[] | No | Override which agents run this scenario. Defaults to all configured agents. |
| skills | string[] | No | Scenario-specific skills, merged with top-level and agent-level skills. |
| mcp_servers | object | No | Scenario-specific MCP servers, merged with top-level MCP servers (scenario wins on name conflict). |
| limits | object | No | Time and token limits for this scenario. Overrides default scenario limits. Fields: time_minutes, tokens. |
| artifacts | string[] | No | Glob patterns of files to capture from the workspace after teardown. Captured files are stored under .axis/reports/&lt;id&gt;/scenarios/&lt;key&gt;/&lt;agent&gt;/artifacts/ and embedded in the HTML report so they can be previewed and downloaded directly. Patterns support *, **, ?, and character classes. Merged with top-level artifacts. |
| variants | object[] | No | Run multiple configurations of the same scenario. See Variants. |
Authoring Scenarios
Scenarios live in the configured scenarios directory or are listed inline in
axis.config.{js,ts}. The three formats below are interchangeable for
the shape itself; pick the one that matches how you want to manage the scenario.
As a JSON file
The simplest form. axis init scaffolds this by default.
JSON scenarios live in your scenarios directory. The filename (without
.json) becomes the scenario key. Do not include a
key field, since the loader derives it from the file path.
{
"name": "Debug and fix a broken script",
"prompt": "There is a JavaScript file at src/add.js that has a bug. Find it, fix it, and verify the fix by running the test.",
"rubric": [
{ "check": "Agent identified the bug (subtraction instead of addition)", "weight": 0.3 },
{ "check": "Agent fixed the bug so add(a, b) returns a + b", "weight": 0.4 },
{ "check": "Agent ran the test and it passed", "weight": 0.3 }
],
"setup": [
{ "action": "run_script", "command": "mkdir -p src && echo 'function add(a,b) { return a-b; }\nmodule.exports = { add };' > src/add.js" },
{ "action": "run_script", "command": "mkdir -p test && echo 'const {add} = require(\"../src/add\");\nconsole.log(add(2,3) === 5 ? \"PASS\" : \"FAIL\");' > test/add.test.js" }
],
"teardown": [
{ "action": "run_script", "command": "rm -rf src test" }
]
}

As a JavaScript or TypeScript module
Default-export a scenario object or function for typed authoring.
Drop a .ts, .js, .mjs, .cjs,
.mts, or .cts file into the scenarios directory. The default
export must resolve to a scenario object: either the object itself, or a (sync or async)
function returning one. The key is path-derived; an explicit key field is
allowed but must match.
// scenarios/refactor-add.ts
import type { ScenarioInput } from "@netlify/axis";
export default {
name: "Refactor a buggy add() function",
prompt: "There is a bug in src/add.js. Find it, fix it, and verify the fix.",
rubric: [
{ check: "Agent identified the bug", weight: 0.3 },
{ check: "Fix is correct", weight: 0.4 },
{ check: "Verification was run", weight: 0.3 },
],
} satisfies ScenarioInput;

A function default export is handy for env-driven prompts or fixture-derived rubrics:
// scenarios/dynamic.ts
export default async () => ({
name: "Dynamic scenario",
prompt: "Run task for " + (process.env.TARGET_ENV ?? "staging"),
rubric: [{ check: "Task ran cleanly" }],
});

Module files in the scenarios directory whose default export is missing or is not a scenario object are silently skipped, so helpers, fixtures, and shared utilities can live alongside scenarios without any special configuration.
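For example, a shared-fixtures module with only named exports is ignored by the loader. The file name and its contents below are purely illustrative:

```typescript
// scenarios/fixtures.ts: no default export of a scenario object,
// so the loader skips it (hypothetical helper file).
export const BUGGY_ADD = "function add(a, b) { return a - b; }";

export function makeRubric(target: string) {
  return [{ check: `Agent fixed the bug in ${target}`, weight: 1 }];
}
```

Scenario modules in the same directory can import these helpers as usual.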
Inline in axis.config
Generate scenarios programmatically (table-driven or env-parameterized).
In axis.config.{js,ts}, the scenarios field can mix
directory paths, single-file paths, and inline scenario objects. Inline objects
require an explicit key since there is no path to derive
one from.
// axis.config.ts
import type { AxisConfig } from "@netlify/axis";
import refactorAdd from "./scenarios/refactor-add.js";
export default {
scenarios: [
"./scenarios", // directory walked for .json/.ts/.js scenarios
refactorAdd, // module-imported scenario
{ // pure inline literal; key required
key: "smoke-test",
name: "Smoke test",
prompt: "Verify the build runs cleanly.",
rubric: [{ check: "Build succeeded" }],
},
],
agents: ["claude-code"],
} satisfies AxisConfig;

Reach for inline scenarios when you want to generate them from a fixture set, share base configuration across many entries, or parameterize by environment.
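A table-driven config might map a fixture list onto inline scenarios. This is a sketch: the target list and the key/prompt wording are illustrative, not part of AXIS:

```typescript
// axis.config.ts (sketch): one inline scenario per entry in a fixture table.
const targets = ["blog", "docs", "shop"]; // hypothetical fixture list

const config = {
  scenarios: targets.map((t) => ({
    key: `smoke-${t}`, // inline scenarios need an explicit key
    name: `Smoke test: ${t}`,
    prompt: `Verify the ${t} site builds cleanly.`,
    rubric: [{ check: `The ${t} build succeeded` }],
  })),
  agents: ["claude-code"],
};

export default config;
```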
Writing Effective Prompts
The prompt is what the agent sees as its task. The quality of your prompt directly affects the consistency and usefulness of your scores.
- Be specific about what to do. "Fix the bug" is vague. "Find and fix the bug in src/add.js, where subtraction is used instead of addition" gives the agent a clear target. Specific prompts produce more consistent results across runs.
- Specify the expected output. If you want a file created, say what it should be called and what it should contain. If you want a test to pass, say which test command to run. Leaving the end state implicit forces the judge to guess what "success" means.
- Scope appropriately. A prompt that asks the agent to "set up a full CI/CD pipeline" tests many things at once and makes it hard to isolate what went wrong. Smaller, focused scenarios produce more actionable scores.
- Avoid giving the agent the answer. The point of testing is to observe how the agent discovers and solves the problem. If you tell the agent exactly which line to change, you are testing its ability to follow instructions, not its ability to debug.
Designing Rubrics
The rubric defines what "success" means. A judge LLM reads the agent's full transcript and evaluates each check on a 0 to 10 scale. Well-designed rubrics produce scores that reflect real quality differences.
Start simple: string rubrics
The simplest rubric is a plain string. The judge reads the transcript and gives a single 0 to 10 score for how well the agent met the description.
{
"name": "Create an Express server",
"prompt": "Create a working Express server that listens on port 3000.",
"rubric": "The agent should create a working Express server on port 3000"
}

String rubrics work well for simple scenarios where you just want a holistic pass/fail judgment. The downside is that you get a single score with no visibility into what went right or wrong.
Add structure: check arrays
For more granular scoring, use an array of checks. Each check is evaluated independently, so you can see exactly which criteria the agent met and which it missed.
"rubric": [
{ "check": "Server starts on port 3000" },
{ "check": "GET / returns a 200 response" },
{ "check": "Server has error handling middleware" }
]
When no weight is specified, AXIS distributes weight equally across all checks.
In this example, each check is worth one-third of the Goal Achievement score.
Control importance: weighted checks
Add weight to each check to control how much it contributes to the score. This is
the recommended approach for most scenarios because it lets you express which outcomes matter most.
"rubric": [
{ "check": "Server starts on port 3000", "weight": 0.4 },
{ "check": "GET / returns a 200 response", "weight": 0.3 },
{ "check": "Server has error handling middleware", "weight": 0.3 }
]
In this example, the server starting is weighted highest because it is the core outcome. You can
also mix weighted and unweighted checks: AXIS distributes the remaining weight equally across
any checks without an explicit weight.
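The remaining-weight rule can be sketched as a small helper. This is not the AXIS API; it assumes the judge's 0 to 10 per-check scores combine as a weighted sum:

```typescript
// Distribute leftover weight equally across checks without an explicit
// weight, then combine 0-10 per-check scores into one 0-10 score.
// Sketch only: not the actual AXIS scoring implementation.
type ScoredCheck = { score: number; weight?: number };

function goalScore(checks: ScoredCheck[]): number {
  const explicit = checks.reduce((sum, c) => sum + (c.weight ?? 0), 0);
  const unweighted = checks.filter((c) => c.weight === undefined).length;
  const share = unweighted > 0 ? (1 - explicit) / unweighted : 0;
  return checks.reduce((sum, c) => sum + c.score * (c.weight ?? share), 0);
}

// One check with weight 0.4 plus two unweighted checks: each
// unweighted check receives (1 - 0.4) / 2 = 0.3.
```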
Writing good checks
- Make checks observable. The judge evaluates checks by reading the transcript and examining the workspace. "Agent understood the problem" is hard to verify. "Agent modified src/add.js to use addition" is concrete and observable.
- One assertion per check. "Agent fixed the bug and ran the test" is two things. If the agent fixes the bug but skips the test, it is unclear how to score this check. Split compound assertions into separate checks with their own weights.
- Weight by importance. The core outcome (did it work?) should carry more weight than peripheral concerns (did it clean up after itself?). If every check has equal weight, a cosmetic failure impacts the score as much as a functional failure.
Weak: "Agent did a good job" is subjective and hard for the judge to score consistently.
Weak: "Agent used git" checks behavior, not outcome. What if git was unnecessary?
Strong: "File output.csv contains at least 10 rows of valid CSV data" is concrete and verifiable.
Strong: "The test suite passes with npm test" is a clear success criterion.
Setup and Teardown
Setup actions run before the agent starts. Use them to create the starting state that the scenario depends on: files to edit, databases to seed, servers to start.
Teardown actions run after scoring completes. Use them to clean up resources that should not persist between runs.
"setup": [
{ "action": "run_script", "command": "cp -R \"$AXIS_CONFIG_DIR/scenario-fixtures/my-project/.\" ." }
],
"teardown": [
{ "action": "run_script", "command": "rm -rf /tmp/my-project-artifacts" }
]

Each action runs sequentially with a 30-second timeout. Setup failures abort the job and mark it as failed. Teardown failures are logged but do not block subsequent jobs or affect scores.
Lifecycle environment variables
Setup and teardown scripts run with the workspace as their working directory. The following
environment variables are available inside run_script commands:
| Variable | Value | Use it for |
|---|---|---|
| AXIS_CONFIG_DIR | Absolute path to the directory containing axis.config.json. | Referencing fixtures, helper scripts, or other files versioned alongside your project. |
| AXIS_OUTPUT | Path to a per-phase markdown file. Anything written here surfaces in the report. | Capturing setup or teardown observations — see Capturing notes for the report below. |
| AXIS_WORKSPACE | Absolute path to the per-job workspace (same as pwd). | Inspecting files when the script's cwd may have changed. |
| AXIS_PHASE | Either setup or teardown. | Sharing one script between phases that branches on the current phase. |
| HOME | Set to the workspace temp directory (same as pwd). | Workspace isolation — tools that read $HOME stay scoped to the run. |
| PWD | The fresh per-job workspace temp directory. | Writing files the agent will see — relative paths land here. |
| System passthroughs | PATH, USER, SHELL, LANG, TERM, TMPDIR. | Standard shell tooling. Other host vars are not passed unless listed in env in the config. |
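AXIS_PHASE makes it possible to keep a single lifecycle script for both phases. The script path and its contents below are illustrative, not part of AXIS; the fallback default is only a safety net for running the script by hand:

```shell
#!/bin/sh
# scripts/lifecycle.sh (hypothetical): shared by setup and teardown.
# AXIS_* variables are supplied by the runner.
OUT="${AXIS_OUTPUT:-/tmp/axis-notes.md}"

if [ "$AXIS_PHASE" = "setup" ]; then
  # Seed the workspace from versioned fixtures.
  cp -R "$AXIS_CONFIG_DIR/scenario-fixtures/my-project/." .
  echo "- fixtures copied into $PWD" >> "$OUT"
else
  # Record a teardown note for the report.
  echo "## Teardown" >> "$OUT"
fi
```

Both phases would then reference the same command, e.g. { "action": "run_script", "command": "\"$AXIS_CONFIG_DIR/scripts/lifecycle.sh\"" }.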
The AXIS_CONFIG_DIR variable enables the recommended pattern of keeping fixture
data in a versioned scenario-fixtures/ directory next to
axis.config.json, then copying just the bits a scenario needs into its workspace:
scenario-fixtures/
  my-project/
    package.json
    src/index.js
    .axis/baselines/main.json   # hidden files copied via the trailing /. on cp -R
scenarios/
  my-scenario.json              # setup: cp -R "$AXIS_CONFIG_DIR/scenario-fixtures/my-project/." .

Capturing notes for the report
Setup and teardown each receive their own $AXIS_OUTPUT file path. Anything written
to that file is captured as GitHub-flavored markdown and surfaced as a
"Setup notes" or "Teardown notes" panel in the agent's expanded detail row in the HTML report.
Multiple actions in the same phase share one output file, so each script can append.
"teardown": [
{
"action": "run_script",
"command": "{ echo '## Teardown'; echo; if [ -f summary.md ]; then echo \"- **summary.md**: $(wc -c < summary.md | tr -d ' ') bytes\"; else echo '_no summary.md was written_'; fi; } >> \"$AXIS_OUTPUT\""
}
]

The notes panel for that run renders the markdown — bullet list with bold filenames and the captured byte count. Useful patterns:
- Snapshot the workspace after the agent finishes (file list, sizes, line counts).
- Probe an external resource the agent created (HTTP status, DB row counts).
- Record diagnostic context (commit SHAs, env versions) that scoring should not see but a human reviewer will want.
Output is capped at 256 KB per phase — longer notes are truncated with a marker line. Output is captured even if a script fails, so partial notes still reach the report. Setup output is recorded before the agent runs and is not visible to the agent or the LLM judge.
Common patterns
- Create test fixtures: Use setup to write files that the agent will need to read, edit, or debug.
- Seed data: Populate a database or create configuration files the agent should work with.
- Initialize a project: Clone a repo, install dependencies, or set up a specific project state.
- Clean up: Remove temp directories, stop background processes, or reset state in teardown.
Artifacts
Use artifacts to capture files the agent produced during the run. Each entry is a
glob pattern relative to the scenario's workspace; matching files are copied into the report
after teardown so you can inspect them later without keeping the workspace around.
{
"name": "Generate API summary",
"prompt": "Fetch the docs and write summary.md.",
"rubric": "summary.md exists and covers the docs.",
"artifacts": [
"summary.md",
"*.log",
"out/**",
"screenshots/*.png"
]
}
Captures run after teardown, so any cleanup scripts you write
don't need to dance around the files you want to keep. You can also set an artifacts
array at the top level of axis.config.json to capture the same patterns from every
scenario; the two lists are merged.
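Assuming the top-level field takes the same glob patterns described above, a shared capture list might look like this fragment of axis.config.json (patterns illustrative):

```json
{
  "artifacts": ["*.log", "screenshots/*.png"]
}
```

With this in place, every scenario's own artifacts list is merged with these two patterns.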
Glob syntax
- `*` matches any characters within a single path segment.
- `**` matches any number of path segments (including zero).
- `?` matches a single character within a segment.
- `[abc]` / `[a-z]` match character classes; `[!abc]` negates.
Where captured files live
Each captured file is copied to disk under the report directory, preserving its path relative to the workspace:
.axis/reports/&lt;id&gt;/scenarios/&lt;scenario-key&gt;/&lt;agent&gt;/artifacts/...

The same files are also embedded (base64) into the report manifest so the HTML report can preview and download them entirely client-side — no local server needed, even when the HTML report is opened directly from disk.
Viewing artifacts in the HTML report
Each scenario row gets an Artifacts panel when files were captured. Click Show artifacts to reveal a collapsible file tree, the eye icon to open a modal preview (text-like files render as text, images as images), or the download arrow for a single file. Download all (.zip) bundles every captured file into one archive in the browser.
Artifact contents are embedded in the report HTML, so capturing very large files (multi-MB logs, full builds, video) inflates every shared report. Capture targeted, diagnostic files — not entire build outputs.
Scenario Organization
The filename (without its extension) becomes the scenario key used in reports, CLI commands, and baseline comparisons. Nested directories create namespaced keys.
- scenarios/hello-world.json → key hello-world
- scenarios/cms/create-post.ts → key cms/create-post
- scenarios/api/auth/login.js → key api/auth/login
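The mapping can be pictured with a small sketch (not the actual loader, which may handle edge cases differently):

```typescript
import { relative } from "node:path";

// Sketch of path-derived scenario keys: the path relative to the
// scenarios directory, with "/" separators and the extension stripped.
function scenarioKey(scenariosDir: string, filePath: string): string {
  return relative(scenariosDir, filePath)
    .split("\\").join("/")      // normalize Windows separators
    .replace(/\.[^./]+$/, "");  // strip the file extension
}
```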
Use directories to group related scenarios. Agents can be configured to run only specific groups using glob patterns in the agent configuration:
{
"agent": "claude-code",
"scenarios": ["cms/*", "api/*"]
}

Agent-specific scenarios
Use the agents field in a scenario to restrict which agents run it. This is useful
when a scenario depends on capabilities specific to one agent, or when you want to test
different agents on different tasks.
{
"name": "Use Claude Code MCP integration",
"prompt": "Use the filesystem MCP server to list files in /tmp",
"rubric": "Agent successfully used the MCP filesystem tool",
"agents": ["claude-code"]
}

Variants
Variants let you run the same scenario under different configurations (different skills,
MCP servers, prompts, or agent restrictions) without duplicating the entire scenario file.
When variants is defined, the base scenario becomes a template: only the
variants execute, each inheriting all fields from the parent. To also run the
unmodified scenario as a control, add a variant with no overrides (the
baseline pattern).
{
"name": "Create a blog post",
"prompt": "Create a new blog post titled 'Hello World' on the CMS.",
"rubric": [
{ "check": "Blog post was created successfully", "weight": 0.5 },
{ "check": "Title matches 'Hello World'", "weight": 0.5 }
],
"variants": [
{
"name": "with-netlify-mcp",
"mcp_servers": {
"netlify": { "type": "http", "url": "https://mcp.netlify.com" }
}
},
{
"name": "with-custom-skill",
"skills": ["./skills/blog-helper"]
},
{
"name": "alt-prompt",
"prompt": "Create a draft blog post titled 'Hello World' without publishing."
}
]
}
This produces three scenario keys: create-post@with-netlify-mcp,
create-post@with-custom-skill, and create-post@alt-prompt. Each
variant inherits the parent's rubric, setup, and other fields, then
applies its own overrides. The variant name is appended to the scenario key with an
@ separator.
Variant fields
Only name is required on a variant. All other fields are optional and inherit
from the parent when omitted.
| Field | Type | Behavior |
|---|---|---|
| name | string | Required. Must match /^[a-zA-Z0-9_-]+$/. Used in the scenario key. |
| prompt | string | Replaces the parent prompt. |
| rubric | string \| object[] | Replaces the parent rubric. |
| skills | string[] | Replaces the parent's scenario-level skills (top-level and agent-level skills still merge in). |
| mcp_servers | object | Merged with the parent's scenario-level MCP servers (variant wins on name conflict). |
| agents | string[] | Replaces the parent agent restriction. |
| setup / teardown | object[] | Replaces the parent lifecycle actions. |
| limits | object | Replaces the parent's limits. Fields: time_minutes, tokens. |
| artifacts | string[] | Replaces the parent's scenario-level artifact globs (top-level artifacts still merge in). |
| skip | boolean | Overrides the parent skip flag. |
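The replace-vs-merge rules can be sketched with a hypothetical resolver (the real implementation may differ):

```typescript
// Sketch of variant resolution: most fields replace the parent's,
// mcp_servers merges (variant wins on conflict), and the key gains
// an "@<name>" suffix. Not the actual AXIS implementation.
type Obj = Record<string, unknown>;

function resolveVariant(
  parent: Obj & { key: string },
  variant: Obj & { name: string },
): Obj {
  const { name, ...overrides } = variant;
  return {
    ...parent,    // inherit every parent field
    ...overrides, // replace-style variant fields win outright
    mcp_servers: {
      ...(parent.mcp_servers as Obj | undefined),
      ...(variant.mcp_servers as Obj | undefined),
    },
    key: `${parent.key}@${name}`,
  };
}
```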
CLI filters and agent-level scenario globs work with variant keys. Filtering
by the base key matches all of its variants: --scenario create-post runs all three
variants above. Use the full key to target a specific variant:
--scenario create-post@with-netlify-mcp.
Example Scenarios
File creation
Generate a README from scratch with no setup required.
{
"name": "Create a README",
"prompt": "Create a README.md file for a Node.js project called 'my-api'. Include a title, description, install instructions, and usage example.",
"rubric": [
{ "check": "README.md exists" },
{ "check": "Contains a project title and description" },
{ "check": "Contains npm install instructions" },
{ "check": "Contains a usage or getting started example" }
]
}

Bug fix with verification
Find and fix a bug in a pre-seeded project, then confirm the test passes.
{
"name": "Fix failing test",
"prompt": "The test in test/math.test.js is failing. Find the bug in src/math.js, fix it, and verify the test passes.",
"rubric": [
{ "check": "Agent identified the root cause", "weight": 0.2 },
{ "check": "Agent fixed src/math.js correctly", "weight": 0.4 },
{ "check": "Agent ran the test and it passed", "weight": 0.4 }
],
"setup": [
{ "action": "run_script", "command": "mkdir -p src test" },
{ "action": "run_script", "command": "echo 'exports.multiply = (a, b) => a + b;' > src/math.js" },
{ "action": "run_script", "command": "echo 'const {multiply} = require(\"../src/math\");\nif (multiply(3, 4) !== 12) throw new Error(\"Expected 12\");\nconsole.log(\"PASS\");' > test/math.test.js" }
]
}

Multi-step with setup and teardown
Add a new API endpoint, write tests, and clean up.
{
"name": "Add API endpoint with tests",
"prompt": "Add a GET /api/health endpoint to the Express app in src/app.js. It should return { status: 'ok', uptime: process.uptime() }. Write a test in test/health.test.js.",
"rubric": [
{ "check": "GET /api/health endpoint exists", "weight": 0.3 },
{ "check": "Returns JSON with status and uptime", "weight": 0.3 },
{ "check": "Test file exists and covers the endpoint", "weight": 0.2 },
{ "check": "All tests pass", "weight": 0.2 }
],
"setup": [
{ "action": "run_script", "command": "npm init -y && npm install express" },
{ "action": "run_script", "command": "mkdir -p src test" }
],
"teardown": [
{ "action": "run_script", "command": "rm -rf node_modules package.json src test" }
]
}

Multi-variant scenario
Test the same task with different tool configurations.
{
"name": "Deploy a site",
"prompt": "Deploy the project in the current directory to production.",
"rubric": [
{ "check": "Site was deployed successfully", "weight": 0.5 },
{ "check": "Agent confirmed the deploy URL", "weight": 0.3 },
{ "check": "No errors in the deployment log", "weight": 0.2 }
],
"setup": [
{ "action": "run_script", "command": "npm init -y && echo '<h1>Hello</h1>' > index.html" }
],
"variants": [
{
"name": "baseline"
},
{
"name": "with-mcp",
"mcp_servers": {
"netlify": { "type": "http", "url": "https://mcp.netlify.com" }
}
},
{
"name": "with-deploy-skill",
"skills": ["./skills/deploy"]
}
]
}
The first variant, baseline, has no overrides: it runs the scenario exactly
as defined, giving you a control run to compare against. The other variants layer on
different tool configurations. This produces three keys:
deploy@baseline, deploy@with-mcp, and
deploy@with-deploy-skill.