Debugging a run

How to read a Grexal run's logs and diagnose an agent that hangs, fails, or doesn't produce output.

When a deployed agent doesn't behave the way it does locally, the run page (or grexal runs logs <runId> --follow) is the primary diagnostic surface. This page describes what each line on that surface tells you and how to use it to track down problems in your agent.

The four log streams

Every line in a run's log is tagged with one of four streams. The tag tells you where the line came from.

Stream	Source	What it means
`[system]`	Grexal itself	Lifecycle events for the run — created, started, completed, timed out, cancelled.
`[agent]`	Your `ctx.log()` calls	Whatever your code emitted via the SDK. Batched and flushed roughly every 500ms.
`[stdout]`	Your process's stdout	Anything your code wrote to `console.log` (TS) or `print` (Python). Streamed live.
`[stderr]`	Your process's stderr	Anything written to `console.error`/`stderr` from your code, plus a couple of platform-emitted readiness lines (`[grexal] runtime ready`, `[grexal] starting your agent`). Streamed live.

The [grexal] lines on [stderr] are platform readiness signals. You don't need to act on them, but their presence confirms that Grexal got far enough to start your code — useful when you're trying to figure out whether a problem is in your code or somewhere before it.
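If you save a run's log to a file, splitting it by stream makes the sections below easier to apply. The following is a minimal sketch, not part of the Grexal SDK or CLI. It assumes each line has the shape "[stream] message", as in the examples on this page, and that untagged lines (such as stack-trace continuations) belong to the preceding tagged line. The function name group_by_stream is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the examples on this page: "[stream] message".
TAG = re.compile(r"^\[(system|agent|stdout|stderr)\] ?(.*)$")


def group_by_stream(log_text):
    """Return {stream: [messages]} for a saved run log.

    Untagged, non-blank lines are attached to the most recent tagged stream.
    """
    groups = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        match = TAG.match(line)
        if match:
            current = match.group(1)
            groups[current].append(match.group(2))
        elif current is not None and line.strip():
            groups[current].append(line.strip())
    return dict(groups)


# Sample assembled from log lines quoted elsewhere on this page.
sample = """[system] Creating sandbox
[system] Sandbox ready
[stderr] [grexal] runtime ready
[stderr] [grexal] starting your agent
[agent] runner error: <message>
        <stack trace>
[system] Process exited with code <N>"""

groups = group_by_stream(sample)
```

Here groups["stderr"] holds the two readiness lines, which is the first thing to check when deciding whether a problem is in your code or somewhere before it.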

Tailing logs

grexal runs logs <runId> --follow

This command streams the run's logs to your terminal in real time. The --follow flag tails until the run reaches a terminal state (completed, failed, or cancelled), then exits. Use grexal runs list to find recent run IDs.

Common scenarios

My agent ran fine locally but produces no `[agent]` logs in production

Look at the [stderr] stream first.

You see [grexal] starting your agent but no [agent] logs. Your entrypoint module is failing to load. This is almost always a code-level issue: a top-level await that never resolves, a side effect that throws or blocks at import time, a missing dependency, or an export shape the runtime doesn't recognize. Any thrown error from the import will be on [stderr] — read it.
You see neither [grexal] runtime ready nor [grexal] starting your agent after several seconds. Grexal didn't get far enough to start your process. This is a platform-side issue, not your code. Contact support with the run ID.
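The two cases above can be sketched as a small triage function over the [stderr] and [agent] lines of a run. This is an illustrative helper, not a Grexal API. The name diagnose_startup and the returned labels are hypothetical, and the case where only one readiness line appears is not covered by this page, so the sketch reports it as inconclusive.

```python
RUNTIME_READY = "[grexal] runtime ready"
STARTING_AGENT = "[grexal] starting your agent"


def diagnose_startup(stderr_lines, agent_lines):
    """Classify a run that produced little or no [agent] output."""
    saw_ready = any(RUNTIME_READY in line for line in stderr_lines)
    saw_start = any(STARTING_AGENT in line for line in stderr_lines)

    if agent_lines:
        # Your code reached at least one ctx.log() call.
        return "agent-running"
    if saw_start:
        # Grexal started your process, but the entrypoint never logged:
        # look for an import-time error on [stderr].
        return "entrypoint-failed-to-load"
    if not saw_ready:
        # Neither readiness line appeared: platform-side, contact support.
        return "platform-did-not-start-process"
    return "inconclusive"


verdict = diagnose_startup(
    ["[grexal] runtime ready", "[grexal] starting your agent"],
    [],
)
```

Because neither readiness line appears immediately, only treat their absence as meaningful once several seconds have passed.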

My agent is stuck — logs stop landing partway through

The last [agent] line tells you the last point your code reached. Add ctx.log() calls around the work that follows it to narrow down further.
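One way to bracket suspect work with ctx.log() calls is a small context manager. This is a sketch under assumptions: it assumes ctx.log() accepts a single string, and it uses a stand-in FakeCtx class so the example runs outside the platform. The logged_step helper is hypothetical and not part of the SDK.

```python
import time
from contextlib import contextmanager


class FakeCtx:
    """Stand-in for the SDK context; assumes ctx.log(message: str)."""

    def __init__(self):
        self.lines = []

    def log(self, message):
        self.lines.append(message)


@contextmanager
def logged_step(ctx, name):
    """Log when a step starts, and whether it finished or raised."""
    ctx.log(f"start: {name}")
    started = time.monotonic()
    try:
        yield
    except Exception as exc:
        ctx.log(f"failed: {name}: {exc!r}")
        raise
    else:
        ctx.log(f"done: {name} in {time.monotonic() - started:.2f}s")


ctx = FakeCtx()
with logged_step(ctx, "fetch input"):
    pass  # the work you suspect goes here
with logged_step(ctx, "call model"):
    pass
```

If a step hangs, its "start" line is the last [agent] entry in the run log and no matching "done" or "failed" line follows, which identifies the step to investigate.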

If you're calling out to a third-party API (LLM, database, web service), the most likely cause is a request that never resolves. Bound it with whatever timeout primitive that library exposes, or set runtime.timeout_seconds in your manifest so the platform kills the run instead of leaving it suspended.
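When a client library exposes its own timeout option, prefer that. Where it doesn't, an async call can be bounded generically. The sketch below uses Python's standard asyncio.wait_for. The never_resolves coroutine simulates a hung request, and call_with_timeout is a hypothetical helper name.

```python
import asyncio


async def call_with_timeout(make_request, seconds):
    """Await make_request(), giving up after `seconds`."""
    try:
        return await asyncio.wait_for(make_request(), timeout=seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"request did not finish within {seconds}s") from None


async def never_resolves():
    # Stands in for a third-party request that hangs forever.
    await asyncio.Event().wait()


async def main():
    try:
        await call_with_timeout(never_resolves, 0.1)
    except TimeoutError as exc:
        return str(exc)
    return "finished"


result = asyncio.run(main())
```

Here the hung call is cancelled after 0.1 seconds and surfaces as an ordinary exception, which you can log and handle instead of leaving the run suspended until the platform timeout.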

My agent times out

When runtime.timeout_seconds (default 300) elapses, the run is killed and you'll see:

[stderr] <last few KB of stdout/stderr captured before kill>
[system] Run exceeded timeout_seconds (300) — sandbox killed

Two options:

Raise timeout_seconds in grexal.json if your agent legitimately needs more time. The manifest accepts values up to 86400 (24 hours), but the platform enforces an absolute 65-minute ceiling, so anything above that ceiling has no additional effect.
Find the hang. The captured [stderr] tail right before the kill is your last clue about what your code was doing.

My agent crashed

A crash (uncaught exception in run, or any non-zero exit) shows up as:

[agent] runner error: <message>
        <stack trace>
[system] Process exited with code <N>

The stack trace points to the line in your code that threw. The error message and stack are also written to [stderr] as a backstop in case the SDK couldn't deliver them on the agent stream.

I cancelled the run

The platform captures a final tail of your process's [stdout]/[stderr] before reaping the sandbox, so cancelled runs still show what your agent was doing right before you cancelled.

Run timeouts

The run timeout is configured via runtime.timeout_seconds in your manifest (default 300; accepted range 10 seconds to 24 hours). The platform schedules the kill at timeout_seconds + 60s. The extra 60 seconds is grace time for any queued ctx.log entries to flush before the sandbox is reaped. There is also an absolute 65-minute ceiling that overrides anything higher.
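The arithmetic can be sketched as follows. This is an illustration, not platform code. It assumes the 65-minute ceiling applies to the scheduled kill time including the 60-second grace period, which this page does not state explicitly. The function name scheduled_kill_seconds is hypothetical.

```python
GRACE_SECONDS = 60            # flush window for queued ctx.log entries
CEILING_SECONDS = 65 * 60     # absolute platform ceiling (3900 seconds)
MIN_TIMEOUT = 10
MAX_TIMEOUT = 86400           # 24 hours
DEFAULT_TIMEOUT = 300


def scheduled_kill_seconds(timeout_seconds=DEFAULT_TIMEOUT):
    """Seconds after start at which the sandbox would be killed.

    Assumption: the ceiling caps the total, grace period included.
    """
    if not MIN_TIMEOUT <= timeout_seconds <= MAX_TIMEOUT:
        raise ValueError(
            f"timeout_seconds must be between {MIN_TIMEOUT} and {MAX_TIMEOUT}"
        )
    return min(timeout_seconds + GRACE_SECONDS, CEILING_SECONDS)


default_kill = scheduled_kill_seconds()
capped_kill = scheduled_kill_seconds(86400)
```

Under that assumption, the default of 300 seconds yields a kill at 360 seconds, and any configured value large enough to exceed the ceiling yields a kill at 3900 seconds.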

When to suspect the platform

Most production weirdness traces back to the agent's own code. The exceptions — cases where the right move is to contact support — are:

The [grexal] runtime ready and [grexal] starting your agent lines never appear after several seconds.
The [system] sequence stalls between two of its standard steps for more than a few seconds (e.g., long pause between "Creating sandbox" and "Sandbox ready").
Runs are intermittently failing with no [stderr] output and no [agent] activity, and the same code runs reliably under grexal dev.

For everything else, the run log has enough information to narrow the problem to a specific line of your code.