Abstract
AI coding agents (Claude Code, Codex, OpenCode, etc.) increasingly rely on CLI tools to perform tasks — running tests, linting, building, deploying. These tools produce output designed for human consumption: colored text, progress bars, decorative formatting, verbose stack traces. Fed to an LLM, that output wastes tokens, injects noise, and degrades agent performance.
This RFC proposes a lightweight, voluntary convention: a --llm flag and LLM_OUTPUT environment variable that CLI tools can adopt to emit LLM-optimized output — concise, plain-text, high-signal, and free of decoration.
This follows the precedent of --verbose and --quiet: no formal standards body, no compliance authority, just an obviously useful convention that spreads through adoption.
Motivation
The problem
When an AI agent runs jest, it might receive:
PASS src/utils.test.ts
PASS src/helpers.test.ts
PASS src/format.test.ts
FAIL src/auth.test.ts
● should reject expired tokens
expect(received).toBe(expected)
Expected: 401
Received: 200
40 | const res = await request(app).get('/protected').set('Authorization', expiredToken);
41 |
> 42 | expect(res.status).toBe(401);
| ^
43 |
44 | });
at Object.<anonymous> (src/auth.test.ts:42:22)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
● should refresh token on 403
TypeError: refreshToken is not a function
13 | if (response.status === 403) {
14 | const newToken = await refreshToken();
> 15 | return retry(config, newToken);
| ^
16 | }
at handleResponse (src/api.ts:15:25)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
Test Suites: 1 failed, 3 passed, 4 total
Tests: 2 failed, 48 passed, 50 total
Snapshots: 0 total
Time: 3.842 s
Ran all test suites.
The agent doesn’t need the banner, the PASS lines, the decorative arrows, the code context window, or the summary table. It needs to know what failed and why — nothing more.
Why --json isn’t the answer
Many tools offer --json output. This is designed for machine parsing by scripts and pipelines. It’s a poor fit for LLMs because:
- It’s verbose. JSON representations of test results can be 10-50x larger than a concise text summary.
- It’s structured for traversal, not comprehension. An LLM doesn’t iterate over keys; it reads sequentially.
- It includes everything. JSON output typically contains all data — passing tests, timing, metadata — because scripts might need any of it. LLMs almost never do.
Why --quiet isn’t the answer
Quiet mode suppresses too much. An agent that gets only an exit code can tell that something failed, but not what or why. It still needs diagnostic information — just not decorative information.
The gap
There is currently no convention for output that is:
- Concise — minimal tokens, maximum signal
- Diagnostic — enough to understand and fix the problem
- Plain text — no ANSI codes, no box drawing, no spinners
- Stable — consistent enough to pattern-match across runs
- Selective — omits success details, focuses on failures and actionable info
This RFC fills that gap.
Specification
1. The --llm flag
CLI tools SHOULD support a --llm flag that switches output to LLM-optimized mode.
jest --llm
eslint --llm
cargo build --llm
The flag name is --llm because:
- It is specific and self-documenting.
- It has no known conflicts with existing CLI tools.
- It accurately describes the target consumer of the output.
- It is extensible to future values (see §5).
2. The LLM_OUTPUT environment variable
Tools SHOULD also check for the LLM_OUTPUT environment variable. This allows AI agent harnesses to set the mode globally for all subprocesses without modifying individual commands.
export LLM_OUTPUT=1
jest # behaves as if --llm was passed
eslint # behaves as if --llm was passed
Precedence: The --llm flag, if explicitly passed, MUST take priority over the environment variable. Explicitly passing a standard flag (e.g., --json, --verbose) SHOULD take priority over LLM_OUTPUT.
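As a minimal sketch of this precedence, assuming a Node.js CLI; the helper name resolveOutputMode and the set of modes are illustrative, not part of this spec:

```typescript
// Illustrative sketch of flag/env precedence, not a normative implementation.
type OutputMode = "human" | "llm" | "json";

function resolveOutputMode(argv: string[], env: NodeJS.ProcessEnv): OutputMode {
  // An explicitly passed flag always wins over the environment variable.
  if (argv.includes("--llm")) return "llm";
  if (argv.includes("--json")) return "json";
  // LLM_OUTPUT lets agent harnesses opt in globally for all subprocesses.
  if (env.LLM_OUTPUT && env.LLM_OUTPUT !== "0") return "llm";
  return "human";
}

// LLM_OUTPUT=1 mytool          -> "llm"
// LLM_OUTPUT=1 mytool --json   -> "json" (explicit standard flag wins)
console.log(resolveOutputMode(process.argv.slice(2), process.env));
```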
3. Output style guide
Tools implementing --llm SHOULD follow these principles:
3.1 Strip decoration
Remove all elements that exist for human visual comfort:
- ANSI color/style codes
- Progress bars and spinners
- Box-drawing characters and ASCII art
- Banners, logos, version headers
- Decorative separators and blank lines
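A rough sketch of what stripping amounts to in practice; the regular expression below covers common ANSI CSI sequences and is illustrative, not exhaustive — tools that render through a shared formatter can simply skip decoration at the source instead:

```typescript
// Illustrative sketch: what "strip decoration" means in practice.
const ANSI_CSI = /\x1b\[[0-9;]*[A-Za-z]/g; // colors, cursor movement, erase codes

function stripDecoration(text: string): string {
  return text
    .replace(ANSI_CSI, "")            // ANSI escape sequences
    .replace(/[\u2500-\u257F]/g, "")  // box-drawing characters
    .replace(/\n{3,}/g, "\n\n")       // collapse decorative blank-line runs
    .trimEnd();
}
```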
3.2 Report failures, not successes
For tools that check or validate (test runners, linters, type checkers, builds):
- DO report each failure with location and reason.
- DO NOT enumerate passing/successful items.
- DO include a one-line summary count (e.g., FAIL 2/50).
3.3 Compress stack traces
Stack traces are useful but typically over-detailed. In --llm mode:
- DO include the error message and type.
- DO include the top 1-3 frames originating in user code.
- DO NOT include frames from node_modules, standard library internals, or runtime machinery.
- DO NOT include code context windows (the > arrow source display).
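A sketch of this compression, assuming V8-style "at ..." frames; the filtering heuristic (a node_modules / node:internal denylist) is illustrative, and each tool will know its own internals better:

```typescript
// Illustrative sketch: keep the message plus the top user-code frames.
function compressStack(stack: string, maxFrames = 3): string {
  const [message, ...frames] = stack.split("\n");
  const userFrames = frames
    .map((f) => f.trim())
    .filter((f) => f.startsWith("at "))
    .filter((f) => !/node_modules|node:internal/.test(f))
    .slice(0, maxFrames);
  return [message, ...userFrames].join("\n");
}

// Applied to the TypeError from the Motivation section, this yields:
// TypeError: refreshToken is not a function
// at handleResponse (src/api.ts:15:25)
```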
3.4 Use a flat, scannable structure
Prefer a consistent, line-oriented format:
STATUS summary
--- location description
detail
detail
--- location description
detail
Specific formatting is left to each tool (see §4), but the output should be immediately understandable by reading top-to-bottom with no nesting beyond one level.
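As a sketch of how a tool might render into this shape; the Failure type and its field names are assumptions for illustration, not part of the convention:

```typescript
// Illustrative sketch of the flat, line-oriented layout from §3.4.
interface Failure {
  location: string;     // e.g. "src/auth.test.ts:42"
  description: string;  // e.g. "should reject expired tokens"
  details: string[];    // e.g. ["expected: 401", "received: 200"]
}

function renderLlm(failures: Failure[], total: number): string {
  if (failures.length === 0) return `PASS ${total}/${total}`;
  const lines = [`FAIL ${failures.length}/${total}`];
  for (const f of failures) {
    lines.push(`--- ${f.location} "${f.description}"`);
    for (const d of f.details) lines.push(`  ${d}`);
  }
  return lines.join("\n");
}
```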
3.5 Preserve essential information
--llm is not --quiet. The output MUST contain enough information for an LLM to take corrective action. The test: could a competent developer fix the problem given only this output and access to the codebase?
3.6 Plain text only
Output MUST be valid UTF-8 plain text. No ANSI escape sequences, no terminal control characters.
4. Reference: Jest example
Standard output — as shown in the Motivation section above.
jest --llm output:
FAIL 2/50
--- src/auth.test.ts:42 "should reject expired tokens"
expected: 401
received: 200
--- src/auth.test.ts:67 "should refresh token on 403"
TypeError: refreshToken is not a function
at handleResponse (src/api.ts:15)
Design notes:
| Decision | Rationale |
|---|---|
| No PASS lines | Agent only needs to act on failures |
| No code context | Agent can read the file itself if needed |
| One-level stack | Top user-code frame is almost always sufficient |
| Quoted test name | Helps the agent match output to specific test |
| expected/received preserved | Core diagnostic info for assertion failures |
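One way to prototype this today, without an upstream change, is a custom Jest reporter. The sketch below is a starting point rather than a complete implementation; the interfaces are hand-written to cover only the fields this sketch reads, not Jest's full reporter types:

```typescript
// llm-reporter.ts: illustrative sketch of a custom Jest reporter.
// After compiling, register it with: jest --reporters ./llm-reporter.js
interface AssertionResult { status: string; title: string; failureMessages: string[]; }
interface SuiteResult { testFilePath: string; testResults: AssertionResult[]; }
interface RunResults { numFailedTests: number; numTotalTests: number; testResults: SuiteResult[]; }

export default class LlmReporter {
  onRunComplete(_contexts: unknown, results: RunResults): void {
    if (results.numFailedTests === 0) {
      console.log(`PASS ${results.numTotalTests}/${results.numTotalTests}`);
      return;
    }
    console.log(`FAIL ${results.numFailedTests}/${results.numTotalTests}`);
    for (const suite of results.testResults) {
      for (const test of suite.testResults) {
        if (test.status !== "failed") continue;
        console.log(`--- ${suite.testFilePath} "${test.title}"`);
        // A real implementation would compress failureMessages per §3.3;
        // printing the first line keeps this sketch short.
        for (const message of test.failureMessages) {
          console.log(`  ${message.split("\n")[0]}`);
        }
      }
    }
  }
}
```

Gating such a reporter on the resolved mode from §2 would approximate --llm behavior until the flag is supported natively.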
5. Extensibility
The --llm flag and LLM_OUTPUT variable start as simple booleans. Future evolution MAY support values for different profiles:
jest --llm=compact # minimal (default, equivalent to --llm)
jest --llm=verbose # more context, still LLM-optimized
jest --llm=json # structured JSON, but filtered for relevance
LLM_OUTPUT=compact
LLM_OUTPUT=verbose
Tools SHOULD treat --llm with no value and LLM_OUTPUT=1 as equivalent to the default compact mode. Tools SHOULD treat unrecognized values as the default compact mode rather than erroring.
This extensibility is reserved for future use. Implementations today SHOULD only support the boolean form.
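If a tool does adopt the valued form later, the fallback rules above suggest a forgiving parser; a sketch, with an illustrative helper name and the profile names proposed in this section:

```typescript
// Illustrative sketch of §5's fallback rules for valued forms.
type LlmProfile = "compact" | "verbose" | "json";

function parseLlmProfile(value: string | undefined): LlmProfile {
  // `--llm` with no value and `LLM_OUTPUT=1` mean the default compact profile.
  if (value === undefined || value === "" || value === "1") return "compact";
  if (value === "verbose" || value === "json") return value;
  return "compact"; // unrecognized values fall back rather than erroring
}
```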
Adoption path
This convention succeeds through voluntary adoption, not standardization. The suggested path:
- Publish this RFC as a GitHub repository.
- Build reference implementations — Jest, ESLint, and pytest are high-impact starting points due to their frequent use in AI-assisted development.
- Propose upstream PRs to popular tools, or ship as wrapper plugins.
- Agent harnesses adopt LLM_OUTPUT: Claude Code, OpenCode, Codex, and similar tools set the env var by default.
Community contributions for additional tools are welcome.
FAQ
Why not just post-process existing output?
Agent harnesses could strip noise from tool output, but generic filtering is fragile against format changes, and anything it misses still reaches the LLM and wastes tokens. The tool itself knows best what's signal and what's decoration.
Why a new flag instead of --json with filtering?
JSON is designed for programmatic traversal. LLM output is designed for sequential reading. They are different targets with different optimal formats. Bolting LLM optimization onto JSON would produce worse results than a purpose-built text format.
Will this fragment CLI output into too many modes?
Most tools already support 2-4 output modes (default, --verbose, --quiet, --json). Adding one more is marginal complexity for significant value. The modes also map to clearly distinct consumers: humans (default/verbose), scripts (json/quiet), and LLMs (llm).
Why --llm and not --ai?
--ai is too broad and likely to conflict. The target consumer is specifically large language models reading output as text context. --llm is precise, has no known conflicts, and is immediately understood.
What about non-LLM AI agents?
The output style guide optimizes for text-in-context consumption, which applies to any system reading plain text — LLMs, future reasoning engines, or human-in-the-loop pipelines. The name --llm is chosen for clarity today; the convention is useful beyond the name.