How to Set Up Structured Logging for Agents

Updated May 2026
Setting up structured logging for an AI agent means defining a consistent JSON schema for all events, wrapping every LLM call and tool invocation with instrumentation that emits events in that schema, configuring log levels adapted for agent verbosity, adding decision context capture that records why the agent chose each action, and implementing tiered retention that keeps full detail for failures while managing storage cost for routine successes.

Structured logging is the foundation that everything else in agent observability is built on: metrics are computed from logs, traces are assembled from logs, dashboards display log-derived data, and investigations query logs directly. Getting the logging layer right from the start makes every subsequent observability capability easier to build. Getting it wrong, or deferring it, means retrofitting instrumentation into a running system while simultaneously losing the data you would have collected in the meantime.

Define the Log Schema

Create a JSON schema that every log event must conform to. The required fields for all event types are: timestamp (ISO 8601 with millisecond precision), session_id (a unique identifier for the user interaction or task), step_index (an integer that orders events within a session), event_type (a string from a fixed enumeration: llm_call, tool_call, tool_response, planning, evaluation, error, completion), and level (DEBUG, INFO, WARN, ERROR).

Each event type defines additional required fields in its payload. For llm_call: model, input_tokens, output_tokens, latency_ms, and prompt_summary, plus the full prompt and response_text at DEBUG level. For tool_call: tool_name and arguments (as a JSON object, truncated at INFO level if they are large and logged in full at DEBUG level). For tool_response: tool_name, status (success, error, timeout), latency_ms, response_summary, and at DEBUG level the full response. For error: error_type, error_message, failing_step, and always the full payload of the failing step regardless of log level. For completion: outcome (success, failure, partial), total_steps, total_tokens, total_cost_usd.

Document the schema in a shared location that all developers working on the agent can reference. Add schema validation at the log ingestion point, even if it is just a warning that a required field is missing, to catch instrumentation gaps before they become blind spots during investigation.
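The validation step can be sketched in a few lines of Python. This is a minimal illustration, not a full JSON Schema validator: the function name validate_event is hypothetical, and nesting the per-type fields under a "payload" key is an assumption about how the schema is laid out. It only checks the level-independent fields and returns warnings rather than rejecting events.

```python
REQUIRED_FIELDS = ("timestamp", "session_id", "step_index", "event_type", "level")
EVENT_TYPES = {"llm_call", "tool_call", "tool_response", "planning",
               "evaluation", "error", "completion"}
LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}

# Level-independent payload fields per event type (assumed to live under "payload").
PAYLOAD_FIELDS = {
    "llm_call": ("model", "input_tokens", "output_tokens", "latency_ms"),
    "tool_call": ("tool_name", "arguments"),
    "tool_response": ("tool_name", "status", "latency_ms", "response_summary"),
    "error": ("error_type", "error_message", "failing_step"),
    "completion": ("outcome", "total_steps", "total_tokens", "total_cost_usd"),
}


def validate_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    event_type = event.get("event_type")
    if isinstance(event_type, str) and event_type not in EVENT_TYPES:
        problems.append(f"unknown event_type: {event_type}")
    if "level" in event and event["level"] not in LEVELS:
        problems.append(f"unknown level: {event['level']}")
    payload = event.get("payload", {})
    required = PAYLOAD_FIELDS.get(event_type, ()) if isinstance(event_type, str) else ()
    for name in required:
        if name not in payload:
            problems.append(f"missing payload field: {name}")
    return problems
```

At the ingestion point, a non-empty result can be logged as a warning alongside the event instead of dropping it, so instrumentation gaps are visible without losing data.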

Instrument LLM Calls

Create a wrapper around your LLM client that captures telemetry for every call. The wrapper records the start time before the call, extracts token counts and the response content from the API response, computes the latency, and emits a structured log event with all the schema-defined fields. The wrapper should be the only path through which the agent calls the LLM, so that no call can happen without being logged.
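A wrapper along these lines might look like the following sketch. The client is assumed to be a plain function, here called complete, that takes a prompt and returns a dict with "text", "input_tokens", and "output_tokens"; real client libraries expose these values differently, so the extraction lines would need adapting. The name log_llm_call and the "payload" nesting are also illustrative.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable


def log_llm_call(
    complete: Callable[[str], dict],  # hypothetical client function
    prompt: str,
    *,
    model: str,
    session_id: str,
    step_index: int,
    emit: Callable[[str], None] = print,
) -> str:
    """Call the LLM once and emit one llm_call event as a JSON line."""
    start = time.perf_counter()
    response = complete(prompt)
    latency_ms = round((time.perf_counter() - start) * 1000, 1)
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "session_id": session_id,
        "step_index": step_index,
        "event_type": "llm_call",
        "level": "INFO",
        "payload": {
            "model": model,
            "input_tokens": response["input_tokens"],
            "output_tokens": response["output_tokens"],
            "latency_ms": latency_ms,
        },
    }
    emit(json.dumps(event))
    return response["text"]
```

This sketch does not handle exceptions raised by the client; in practice the wrapper would also emit an error event before re-raising.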

For the prompt summary field, generate a truncated or hashed version of the prompt that is useful for debugging without storing the full prompt text at INFO level. A practical approach is to log the first and last hundred characters of the prompt plus its total length, which is usually enough to identify the prompt version and the general context without the full payload. At DEBUG level, log the complete prompt and response for full investigative capability.
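The head-and-tail approach, combined with a hash, can be sketched as below. The function name summarize_prompt and the field names inside the summary are illustrative choices, not part of any fixed schema.

```python
import hashlib


def summarize_prompt(prompt: str, edge: int = 100) -> dict:
    """Build an INFO-level summary: first and last `edge` characters, length, and a hash."""
    summary = {
        "length": len(prompt),
        # The hash lets you match identical prompts without storing their text.
        "sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    if len(prompt) <= 2 * edge:
        # Short prompts fit entirely in the head, so no separate tail is needed.
        summary["head"], summary["tail"] = prompt, ""
    else:
        summary["head"], summary["tail"] = prompt[:edge], prompt[-edge:]
    return summary
```

Note that prompts shorter than twice the edge length are stored whole in this sketch, which may matter if short prompts can contain sensitive input.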

If your LLM client supports streaming responses, the instrumentation must accumulate the streamed tokens to compute the output token count and capture the full response text. Streaming complicates instrumentation but does not change the schema; the log event is emitted when the stream completes, with the same fields as a non-streaming call.
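One way to instrument a stream is a generator that passes chunks through to the caller while accumulating them, then emits the event once the stream is exhausted. In this sketch the stream is assumed to yield plain text chunks, and count_tokens is a hypothetical hook: the whitespace-splitting default is only a crude placeholder for a real tokenizer or for usage figures reported by the API. The event is trimmed to the streaming-relevant fields.

```python
import time
from typing import Callable, Iterable, Iterator


def logged_stream(
    chunks: Iterable[str],
    emit: Callable[[dict], None],
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Iterator[str]:
    """Yield chunks unchanged, then emit one llm_call event when the stream ends."""
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    text = "".join(parts)
    emit({
        "event_type": "llm_call",
        "output_tokens": count_tokens(text),  # placeholder count, see lead-in
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "response_text": text,
    })
```

If the consumer stops reading before the stream ends, this sketch emits nothing; a production version would likely emit a partial event in that case.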

Instrument Tool Calls

Create a similar wrapper for each tool the agent can call. Before invoking the tool, the wrapper emits a tool_call event with the arguments the agent passed and records the start time. After the tool returns or fails, it records the response, the latency, and the outcome status, and emits a tool_response event. Separating the call and the response into two events makes it possible to see, in the logs, when a tool call was initiated but the response never arrived, which is invisible if you only log after the response returns.
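A minimal sketch of such a wrapper follows. The name call_tool, the convention that tools are plain Python callables taking keyword arguments, and the "payload" nesting are assumptions for illustration. The important detail is that the tool_call event is emitted before the tool runs.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable


def call_tool(tool_name: str, tool: Callable[..., Any], arguments: dict, *,
              session_id: str, step_index: int, emit: Callable[[dict], None]):
    """Emit tool_call before running the tool and tool_response afterwards."""
    def event(event_type: str, payload: dict) -> dict:
        return {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "session_id": session_id,
            "step_index": step_index,
            "event_type": event_type,
            "level": "INFO",
            "payload": payload,
        }

    # Emitted first, so a call that never returns still leaves a trace.
    emit(event("tool_call", {"tool_name": tool_name, "arguments": arguments}))
    start = time.perf_counter()
    try:
        result, status = tool(**arguments), "success"
    except TimeoutError as exc:
        result, status = exc, "timeout"
    except Exception as exc:
        result, status = exc, "error"
    emit(event("tool_response", {
        "tool_name": tool_name,
        "status": status,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "response_summary": str(result)[:500],
    }))
    return status, result
```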

For tools that return large payloads, such as search results with many documents or API responses with nested data, truncate the response in the log event's response_summary field at INFO level and log the full response at DEBUG level. Define a consistent truncation policy, such as keeping the first five hundred characters and the total byte count, so that the truncated version is useful for identifying the response without overwhelming the log storage.
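A truncation policy of this kind can be written as one small function shared by all tool wrappers. The name summarize_response and the shape of the returned dict are illustrative.

```python
def summarize_response(response: str, max_chars: int = 500) -> dict:
    """Keep the first `max_chars` characters plus the total size in bytes."""
    return {
        "preview": response[:max_chars],
        "total_bytes": len(response.encode("utf-8")),
        "truncated": len(response) > max_chars,
    }
```

Recording the byte count alongside the preview makes it possible to spot unusually large responses from INFO-level logs alone.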

If a tool call fails and the agent retries, each attempt should be a separate pair of tool_call and tool_response events with the same step_index but an incremented attempt_index. This makes retry behavior visible in the logs without conflating the retry with the original call.
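The retry convention might be implemented as in the sketch below. The function name call_with_retries and the fixed attempt limit are assumptions, and the events are trimmed to the fields relevant to retries (a real event would also carry timestamp, session_id, and level).

```python
from typing import Any, Callable


def call_with_retries(tool_name: str, tool: Callable[..., Any], arguments: dict, *,
                      step_index: int, emit: Callable[[dict], None],
                      max_attempts: int = 3) -> Any:
    """Run a tool up to `max_attempts` times, logging each attempt as its own event pair."""
    last_error = None
    for attempt_index in range(max_attempts):
        # step_index stays fixed across attempts; only attempt_index changes.
        ids = {"step_index": step_index, "attempt_index": attempt_index,
               "tool_name": tool_name}
        emit({"event_type": "tool_call", **ids, "arguments": arguments})
        try:
            result = tool(**arguments)
        except Exception as exc:
            last_error = exc
            emit({"event_type": "tool_response", **ids, "status": "error",
                  "response_summary": repr(exc)})
            continue
        emit({"event_type": "tool_response", **ids, "status": "success",
              "response_summary": str(result)[:500]})
        return result
    raise RuntimeError(f"{tool_name} failed after {max_attempts} attempts") from last_error
```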

Configure Agent-Adapted Log Levels

Set your production default to INFO and implement the ability to switch individual sessions to DEBUG on demand. The simplest implementation is a configuration flag or header that can be set per request, which the logging wrapper checks when deciding whether to emit full payloads or truncated summaries.
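A per-session switch can be as simple as the sketch below. The header name X-Agent-Debug, the in-memory debug_sessions set, and both function names are hypothetical; a real deployment would more likely read the flag from a configuration service.

```python
from typing import Optional

DEBUG_HEADER = "X-Agent-Debug"  # hypothetical header name
debug_sessions = set()          # session IDs switched to DEBUG on demand


def effective_level(session_id: str, headers: Optional[dict] = None) -> str:
    """Return DEBUG for flagged sessions or requests, INFO otherwise."""
    headers = headers or {}
    if session_id in debug_sessions or headers.get(DEBUG_HEADER) == "1":
        return "DEBUG"
    return "INFO"


def llm_payload(prompt: str, response_text: str, level: str) -> dict:
    """Always include a truncated summary; add full text only at DEBUG."""
    payload = {"prompt_summary": {"head": prompt[:100], "tail": prompt[-100:],
                                  "length": len(prompt)}}
    if level == "DEBUG":
        payload["prompt"] = prompt
        payload["response_text"] = response_text
    return payload
```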

Configure WARN events to fire on specific conditions: a tool call that required a retry, an LLM call whose output did not match the expected format, context usage that exceeded eighty percent of the model's context window, or a task that took more than twice the median number of steps. These WARN events are your early-warning signals, and they should be queryable as a group so you can track the WARN rate over time as an early indicator of degradation.
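The four conditions can be collected into one function that returns the reasons a step or task should be logged at WARN. The function name, parameter names, and reason strings are illustrative; how the median step count is obtained is left to the caller.

```python
def warn_reasons(*, attempts: int, format_ok: bool, context_tokens: int,
                 context_limit: int, steps: int, median_steps: int) -> list:
    """Return the WARN conditions that apply; an empty list means no warning."""
    reasons = []
    if attempts > 1:
        reasons.append("tool_retry")
    if not format_ok:
        reasons.append("unexpected_output_format")
    # Integer comparison for "more than 80 percent" avoids float rounding.
    if context_tokens * 10 > context_limit * 8:
        reasons.append("context_above_80_percent")
    if steps > 2 * median_steps:
        reasons.append("steps_above_2x_median")
    return reasons
```

Attaching the reason strings to the WARN event makes it straightforward to query the WARN rate per condition as well as in aggregate.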

Configure ERROR events to always include the full payload of the failing step, regardless of the current log level. An error you cannot investigate is an error you cannot fix, and saving a few kilobytes of log storage by truncating error details is a false economy that costs hours of debugging time when the error recurs.

Add Decision Context Capture

For each LLM call event, add a reasoning field that captures the model's explanation for its next action. If your agent framework exposes chain-of-thought output, extract the planning or reasoning section and include it in the log event. If the framework does not expose reasoning, the LLM's tool call choice and the preceding context are the best available proxies.

To manage volume, capture full decision context for all failures and for a configurable percentage of successes, typically five to ten percent. For the remaining successes, log the action without the reasoning text. This gives you a representative sample of normal decision-making for analysis while keeping the detailed reasoning available for every failure investigation.
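One way to implement the sampling is to hash the session ID, so that the keep-or-drop decision is the same for every event in a session. The sketch below assumes reasoning text is buffered per session until the task outcome is known, since a failure can only be identified at completion; the function name keep_reasoning and the five percent default are illustrative.

```python
import hashlib


def keep_reasoning(session_id: str, failed: bool, sample_rate: float = 0.05) -> bool:
    """Keep reasoning for every failure and for a stable sample of successes."""
    if failed:
        return True
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hash-based sampling is a design choice here rather than a requirement: random sampling works too, but it can keep reasoning for some steps of a session and drop it for others.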

Set Retention Policies

Configure three retention tiers. The first tier keeps full-fidelity data (all events including DEBUG-level payloads) for failed tasks and sampled successes, retained for ninety days. The second tier keeps INFO-level event skeletons for all tasks, retained for thirty to ninety days. The third tier keeps aggregated metrics computed from the events, retained indefinitely for trend analysis.
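The three tiers can be expressed as a small configuration table plus two helper functions. The tier names, the helper names, and the choice of thirty days for the second tier (the low end of the range above) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# None means "retain indefinitely".
RETENTION = {
    "full": timedelta(days=90),      # all events, including DEBUG payloads
    "skeleton": timedelta(days=30),  # INFO-level events for all tasks
    "aggregate": None,               # metrics computed from the events
}


def tiers_for_task(outcome: str, sampled: bool) -> list:
    """Failed tasks and sampled successes also go to the full-fidelity tier."""
    tiers = ["skeleton", "aggregate"]
    if outcome != "success" or sampled:
        tiers.insert(0, "full")
    return tiers


def is_expired(tier: str, stored_at: datetime, now: datetime) -> bool:
    """True once data in the given tier is older than its retention window."""
    window = RETENTION[tier]
    return window is not None and now - stored_at > window
```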

Implement automated lifecycle rules that move data between tiers as it ages. Most managed log services support retention policies that automatically delete events beyond the configured window. For the aggregated metrics tier, set up a scheduled job that computes daily and hourly aggregates from the raw events before they are deleted, preserving the statistical trends without the per-event detail.
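The aggregation job might compute per-day totals from completion events, as in the sketch below. It assumes events are dicts with an ISO 8601 timestamp and the completion fields nested under "payload"; the function name daily_aggregates and the chosen metrics are illustrative, and an hourly variant would group on a longer timestamp prefix.

```python
from collections import defaultdict


def daily_aggregates(events) -> dict:
    """Group completion events by calendar day and sum the headline metrics."""
    days = defaultdict(lambda: {"tasks": 0, "successes": 0,
                                "total_tokens": 0, "total_cost_usd": 0.0})
    for event in events:
        if event["event_type"] != "completion":
            continue
        day = days[event["timestamp"][:10]]  # "YYYY-MM-DD" prefix of the timestamp
        payload = event["payload"]
        day["tasks"] += 1
        day["successes"] += int(payload["outcome"] == "success")
        day["total_tokens"] += payload["total_tokens"]
        day["total_cost_usd"] += payload["total_cost_usd"]
    return dict(days)
```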

If your logs may contain personal data from user inputs, apply encryption at rest, access controls, and a data handling policy that addresses privacy requirements. Consider pseudonymizing user identifiers in the INFO-level tier so that operational analysis does not require access to personally identifiable information.
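Pseudonymization is commonly done with a keyed hash, so the same user always maps to the same token but the mapping cannot be reversed without the key. The sketch below uses HMAC-SHA256 from the Python standard library; the function name, the sixteen-character truncation, and how the key is stored and rotated are assumptions left to the deployment.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, key: bytes) -> str:
    """Return a stable, non-reversible token for a user identifier."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in log queries
```

Keeping the key outside the logging system means that analysts with access to the INFO-level tier can still group events by user without being able to recover the original identifier.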

Key Takeaway

Structured logging for agents requires a defined schema, consistent instrumentation at every decision point, log levels adapted to agent verbosity, decision context for investigative depth, and tiered retention that balances cost against the ability to diagnose problems after the fact.