Building Dashboards for AI Agent Systems
The Headline Layer
The headline layer is what you glance at to decide whether to keep working on something else or start investigating. It should contain no more than five numbers, each with a clear green, yellow, or red state based on your service level objectives. The specific numbers depend on your agent's purpose, but a strong default set is: task success rate (the percentage of tasks that completed correctly over a trailing window), median end-to-end latency (how long the typical user waits), cost per task (a rolling average in dollars or tokens), error rate (the percentage of tasks that failed outright, as opposed to completing with a wrong result), and optionally a quality score if you have automated evaluation.
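As a minimal sketch of how a headline number might be mapped to a green, yellow, or red state, assume each metric has a warning threshold and a breach threshold derived from its service level objective. The `Slo` class, the metric names, and every threshold value below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical thresholds for one headline metric."""
    warn: float              # crossing this turns the metric yellow
    breach: float            # crossing this turns the metric red
    higher_is_better: bool   # True for success rate, False for latency or cost


def headline_status(value: float, slo: Slo) -> str:
    """Map a metric value to 'green', 'yellow', or 'red'."""
    if slo.higher_is_better:
        if value < slo.breach:
            return "red"
        if value < slo.warn:
            return "yellow"
        return "green"
    if value > slo.breach:
        return "red"
    if value > slo.warn:
        return "yellow"
    return "green"


# Illustrative thresholds only; real values come from your own objectives.
SLOS = {
    "task_success_rate": Slo(warn=0.97, breach=0.95, higher_is_better=True),
    "median_latency_s": Slo(warn=8.0, breach=15.0, higher_is_better=False),
    "cost_per_task_usd": Slo(warn=0.10, breach=0.25, higher_is_better=False),
    "error_rate": Slo(warn=0.02, breach=0.05, higher_is_better=False),
}

# Made-up current readings for the four default headline numbers.
current = {
    "task_success_rate": 0.96,
    "median_latency_s": 6.2,
    "cost_per_task_usd": 0.31,
    "error_rate": 0.01,
}

statuses = {name: headline_status(value, SLOS[name]) for name, value in current.items()}
```

With these made-up readings, success rate lands in yellow and cost per task in red, while latency and error rate stay green.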
Each headline number should be shown alongside its recent baseline, typically the same metric over the previous seven or thirty days, so that you can distinguish a genuine anomaly from normal variation. A success rate of ninety-two percent is alarming if the baseline is ninety-eight percent but unremarkable if the baseline is ninety-one percent. Without the comparison, you would have to recall the baseline yourself, which nobody does reliably, especially during an incident when stress degrades recall.
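One possible way to make the baseline comparison concrete is to compare the current value with the mean and spread of the previous seven daily values. The z-score rule, the threshold of three standard deviations, and the sample numbers are assumptions for this sketch, not a recommended detection method.

```python
from statistics import mean, pstdev


def compare_to_baseline(current, baseline_values, z_threshold=3.0):
    """Compare a current reading with the mean and spread of its baseline."""
    base = mean(baseline_values)
    spread = pstdev(baseline_values)
    delta = current - base
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        anomalous = delta != 0
    else:
        anomalous = abs(delta) / spread > z_threshold
    return {"baseline": base, "delta": delta, "anomalous": anomalous}


# Two made-up weeks of daily success rates.
healthy_week = [0.980, 0.979, 0.981, 0.980, 0.982, 0.978, 0.980]  # mean 0.98
noisy_week = [0.90, 0.92, 0.91, 0.93, 0.89, 0.91, 0.91]           # mean 0.91

# The same current reading of 0.92 is judged against each baseline.
against_healthy = compare_to_baseline(0.92, healthy_week)
against_noisy = compare_to_baseline(0.92, noisy_week)
```

Under these assumptions the reading of 0.92 is flagged against the 0.98 baseline and not flagged against the 0.91 baseline, which mirrors the example in the text.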
The headline layer should also include a prominent deployment marker that shows when the most recent changes were deployed: prompt updates, model changes, tool modifications, or framework upgrades. In agent incidents, the first question after a metric changes is very often "did we deploy something?" Putting the deployment timeline on the same screen as the metrics answers that question without a detour to another tool.
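A deployment marker can be backed by something as simple as a list of timestamped change records. The sketch below, with a hypothetical `Deployment` record and an assumed six-hour lookback, lists the changes that landed shortly before a metric moved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one deployed change."""
    at: datetime
    kind: str   # "prompt", "model", "tool", or "framework"
    note: str


def recent_deployments(change_at, deployments, lookback=timedelta(hours=6)):
    """Return deployments inside the lookback window, most recent first."""
    window_start = change_at - lookback
    hits = [d for d in deployments if window_start <= d.at <= change_at]
    return sorted(hits, key=lambda d: d.at, reverse=True)


# Made-up deployment history.
deployments = [
    Deployment(datetime(2024, 5, 1, 9, 0), "framework", "dependency upgrade"),
    Deployment(datetime(2024, 5, 2, 11, 30), "tool", "search tool schema change"),
    Deployment(datetime(2024, 5, 2, 13, 45), "prompt", "system prompt revision"),
]

metric_changed_at = datetime(2024, 5, 2, 14, 0)
suspects = recent_deployments(metric_changed_at, deployments)
```

Here `suspects` contains the prompt revision followed by the tool change; the framework upgrade from the previous day falls outside the window.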
The Investigation Layer
When a headline metric goes yellow or red, the investigation layer is where you go to narrow the problem. Its purpose is to decompose the aggregate into its components so you can identify which specific thing is wrong.
For a success rate drop, the investigation layer shows error rate broken down by error type (tool failure, model refusal, timeout, parsing error, quality failure), success rate by task type or user segment, and a timeline of individual failures. This decomposition typically reduces the investigation space from "the system is broken" to "the search tool started returning errors at 2 PM," which is actionable.
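The decomposition for a success rate drop could be sketched as three small aggregations over task records. The record fields (`hour`, `task_type`, `error_type`) and the sample data are hypothetical; a real system would read them from its own task store.

```python
from collections import Counter

# Made-up task records; error_type is None for successful tasks.
tasks = [
    {"hour": 13, "task_type": "research", "error_type": None},
    {"hour": 13, "task_type": "booking", "error_type": None},
    {"hour": 14, "task_type": "research", "error_type": "tool_failure"},
    {"hour": 14, "task_type": "research", "error_type": "tool_failure"},
    {"hour": 14, "task_type": "booking", "error_type": None},
    {"hour": 15, "task_type": "research", "error_type": "timeout"},
    {"hour": 15, "task_type": "booking", "error_type": None},
    {"hour": 15, "task_type": "research", "error_type": None},
]


def error_breakdown(tasks):
    """Count failed tasks by error type."""
    return Counter(t["error_type"] for t in tasks if t["error_type"] is not None)


def success_rate_by(tasks, key):
    """Success rate for each distinct value of the given record field."""
    totals, successes = Counter(), Counter()
    for t in tasks:
        totals[t[key]] += 1
        if t["error_type"] is None:
            successes[t[key]] += 1
    return {value: successes[value] / totals[value] for value in totals}


def first_seen(tasks):
    """Earliest hour at which each error type appears."""
    seen = {}
    for t in sorted(tasks, key=lambda t: t["hour"]):
        if t["error_type"] is not None:
            seen.setdefault(t["error_type"], t["hour"])
    return seen


by_error = error_breakdown(tasks)
by_task_type = success_rate_by(tasks, "task_type")
onset = first_seen(tasks)
```

In this sample, the breakdown points at tool failures in research tasks starting at hour 14, which is the kind of narrowed statement the paragraph above describes.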
For a latency increase, the investigation layer shows latency broken down by step type (LLM call, tool call, retrieval, post-processing), latency by model (if you use multiple models), and the distribution of task lengths in steps. A latency increase caused by a slow tool looks completely different from one caused by the agent taking more reasoning steps, and the investigation layer makes this distinction visible immediately.
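To illustrate why the two causes look different, the following sketch summarizes span records by step type and by steps per task. The span fields and the three sample tasks are assumptions made up for the example.

```python
from collections import Counter, defaultdict
from statistics import mean


def summarize(spans):
    """Mean duration per step type and mean number of steps per task."""
    durations = defaultdict(list)
    steps = Counter()
    for span in spans:
        durations[span["step_type"]].append(span["duration_s"])
        steps[span["task_id"]] += 1
    return {
        "mean_duration_by_step": {k: mean(v) for k, v in durations.items()},
        "mean_steps_per_task": mean(list(steps.values())),
    }


def make_spans(task_id, step_durations):
    """Build span records from (step_type, duration_s) pairs."""
    return [
        {"task_id": task_id, "step_type": kind, "duration_s": seconds}
        for kind, seconds in step_durations
    ]


# A normal task: three steps, fast tool.
normal = make_spans("t1", [("llm_call", 1.0), ("tool_call", 0.5), ("llm_call", 1.0)])
# Same shape, but the tool call has become slow.
slow_tool = make_spans("t2", [("llm_call", 1.0), ("tool_call", 4.5), ("llm_call", 1.0)])
# Fast tool, but the agent takes more reasoning steps.
more_steps = make_spans(
    "t3",
    [("llm_call", 1.0), ("tool_call", 0.5)] * 3 + [("llm_call", 1.0)],
)

normal_summary = summarize(normal)
slow_tool_summary = summarize(slow_tool)
more_steps_summary = summarize(more_steps)
```

The slow-tool task shows a higher mean `tool_call` duration with the same step count, while the extra-steps task shows unchanged per-step durations and a higher step count.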
For a cost spike, the investigation layer shows cost by model, cost by tool (since some tools indirectly increase cost by forcing retries or additional reasoning), cost by component (system prompt, context, history, output, retries), and the distribution of task costs. Cost spikes are frequently caused by a small number of very expensive tasks rather than a broad increase, and the distribution view identifies these outliers so you can investigate them individually.
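A distribution view can be approximated by ranking tasks by cost and measuring how much of the total the top few account for. The function name, the task identifiers, and the costs (in cents) below are all invented for this sketch.

```python
from statistics import median


def cost_concentration(task_costs, top_n):
    """Return the top_n most expensive tasks and their share of total cost."""
    ranked = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    total = sum(task_costs.values())
    top = ranked[:top_n]
    share = sum(cost for _, cost in top) / total if total else 0.0
    return top, share


# Made-up costs in cents: eighteen cheap tasks and two expensive outliers.
task_costs_cents = {f"task-{i:02d}": 5 for i in range(18)}
task_costs_cents["task-18"] = 455
task_costs_cents["task-19"] = 455

top, share = cost_concentration(task_costs_cents, top_n=2)
typical = median(task_costs_cents.values())
```

In this sample the median task costs 5 cents, yet two tasks account for 91 percent of total spend, so the spike is an outlier problem rather than a broad increase.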
The investigation layer should support interactive filtering. Clicking on a specific error type, task type, or time window should filter all the other panels to that selection. This cross-filtering is what makes the investigation layer genuinely useful rather than just a collection of static charts, because agent problems often involve correlations between multiple dimensions that you cannot see without holding one dimension constant while examining the others.
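Cross-filtering amounts to applying one selection to the shared record set and recomputing every panel from the filtered result. The sketch below assumes hypothetical failure records with `error_type`, `tool`, and `task_type` fields.

```python
from collections import Counter

# Made-up failure records.
failures = [
    {"error_type": "timeout", "tool": "search", "task_type": "research"},
    {"error_type": "timeout", "tool": "search", "task_type": "research"},
    {"error_type": "timeout", "tool": "search", "task_type": "booking"},
    {"error_type": "parsing_error", "tool": "calendar", "task_type": "booking"},
    {"error_type": "tool_failure", "tool": "payments", "task_type": "booking"},
]


def cross_filter(records, **selection):
    """Keep only records that match every selected field value."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in selection.items())
    ]


def panels(records):
    """Recompute every panel from the same (possibly filtered) records."""
    return {
        "by_error_type": Counter(r["error_type"] for r in records),
        "by_tool": Counter(r["tool"] for r in records),
        "by_task_type": Counter(r["task_type"] for r in records),
    }


unfiltered = panels(failures)
# Simulates clicking "timeout" in the error-type panel.
timeouts_only = panels(cross_filter(failures, error_type="timeout"))
```

Holding the error type constant reveals that every timeout in this sample involves the search tool, a correlation the unfiltered panels do not show on their own.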
The Detail Layer
The detail layer is individual task inspection: opening a specific task's trace and seeing every step the agent took, with full context and reasoning. The transition from the investigation layer should be seamless: clicking on a specific failed task in the investigation timeline opens its trace in the detail view, preserving the context of how you got there.
A good detail view for an agent task shows the complete span tree with each span's type, timing, status, and key metadata visible at a glance. Expanding a span reveals its full payload: for an LLM call, the prompt summary and the model's response; for a tool call, the arguments and the result; for an error, the full error message and stack trace. The view should make the causal chain visually obvious: the user asked this, the model decided to do that, the tool returned this, the model evaluated it and decided to try something else, and so on until the final output.
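A span tree of this kind can be modeled as a small recursive structure. The `Span` class and its fields are hypothetical, and the text rendering stands in for whatever UI component a real dashboard would use; payload expansion is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record: one step in an agent task."""
    name: str
    kind: str          # e.g. "task", "llm_call", "tool_call"
    duration_s: float
    status: str = "ok"
    children: list["Span"] = field(default_factory=list)


def render(span, depth=0):
    """Return one summary line per span, indented by tree depth."""
    indent = "  " * depth
    line = f"{indent}[{span.kind}] {span.name} {span.duration_s:.1f}s {span.status}"
    lines = [line]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up trace for a single failed task.
trace = Span("book flight", "task", 6.0, "error", [
    Span("plan", "llm_call", 1.5),
    Span("search_flights", "tool_call", 3.0, "error"),
    Span("explain failure", "llm_call", 1.5),
])

lines = render(trace)
print("\n".join(lines))
```

Each line carries the span's type, timing, and status, so the sequence of decisions and results can be read top to bottom.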
For failed tasks, the detail view should highlight the failure point in the span tree so you do not have to scan through a long sequence of successful steps to find the one that broke. For slow tasks, it should highlight the bottleneck span. For expensive tasks, it should show the cumulative cost at each step so you can see where the budget was consumed. These highlights turn the detail view from a raw data display into a guided investigation tool.
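The three highlights can each be computed in a line or two from an ordered list of steps. The step fields and values below are invented for illustration, with cost kept in whole cents to avoid rounding noise.

```python
from itertools import accumulate

# Made-up steps for one task, in execution order.
steps = [
    {"name": "plan", "status": "ok", "duration_s": 1.0, "cost_cents": 2},
    {"name": "search", "status": "ok", "duration_s": 7.5, "cost_cents": 1},
    {"name": "summarize", "status": "ok", "duration_s": 2.0, "cost_cents": 12},
    {"name": "book", "status": "error", "duration_s": 0.5, "cost_cents": 0},
]


def failure_point(steps):
    """Index of the first step with an error status, or None."""
    return next((i for i, s in enumerate(steps) if s["status"] == "error"), None)


def bottleneck(steps):
    """Index of the step with the longest duration."""
    return max(range(len(steps)), key=lambda i: steps[i]["duration_s"])


def cumulative_cost(steps):
    """Running total of cost after each step."""
    return list(accumulate(s["cost_cents"] for s in steps))


failed_at = failure_point(steps)
slowest = bottleneck(steps)
running_cost = cumulative_cost(steps)
```

For this sample the failure is at the `book` step, the bottleneck is `search`, and the running cost shows that `summarize` consumed most of the budget.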
Design Principles That Last
Time is the primary axis. Every chart on every layer should show how the metric changes over time, because trends are more informative than snapshots. A current success rate of ninety-five percent means little unless you know whether it was ninety-eight percent yesterday or ninety-nine percent last week. Time-series charts with consistent windows (one hour, one day, one week) let you see both acute incidents and gradual degradation.
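One way to get consistent windows is to floor each event timestamp to the start of its window and aggregate per bucket. This is a sketch with naive timestamps and made-up events; the function name and the event shape `(timestamp, succeeded)` are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def success_rate_series(events, window):
    """Success rate per fixed-size time window, ordered by window start."""
    buckets = defaultdict(lambda: [0, 0])  # window start -> [successes, total]
    for at, succeeded in events:
        # Floor the timestamp to the start of its window.
        start = EPOCH + ((at - EPOCH) // window) * window
        buckets[start][0] += int(succeeded)
        buckets[start][1] += 1
    return [(start, ok / total) for start, (ok, total) in sorted(buckets.items())]


# Made-up task completions: (timestamp, succeeded).
events = [
    (datetime(2024, 5, 2, 10, 5), True),
    (datetime(2024, 5, 2, 10, 40), True),
    (datetime(2024, 5, 2, 11, 10), False),
    (datetime(2024, 5, 2, 11, 50), True),
]

hourly = success_rate_series(events, timedelta(hours=1))
daily = success_rate_series(events, timedelta(days=1))
```

The same events produce a dip in the hourly series and a single blended value in the daily series, which is why it helps to offer more than one window.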
Less is more. Every metric you add to a dashboard competes for attention with every other metric. A dashboard with fifty panels is effectively invisible because no human can monitor fifty things simultaneously. Start with the minimum set that covers your service level objectives and add metrics only when a specific question arises that the existing metrics cannot answer. Remove metrics when the question they answer is no longer relevant. A dashboard that grows without pruning eventually becomes useless.
Alerts belong on the dashboard, not beside it. If you have alerting rules, show their thresholds as horizontal lines on the corresponding charts, and show the alert history as markers on the timeline. This integration means that when an alert fires, the person responding can immediately see the metric that triggered it in context, with its history and its relationship to other metrics, without switching to a separate alerting console.
Make the happy path invisible. When everything is green, the dashboard should be boring. Healthy metrics should get a calm visual treatment, with muted colors and a compact layout, so that anomalies stand out. A dashboard where everything is bright and attention-grabbing at all times provides no visual signal for actual problems, because there is no contrast between normal and abnormal. Reserve red, bold text, and expanded panels for states that actually require action.
Version your dashboards with your agent. When you make a significant change to the agent, review whether the dashboard still asks the right questions. A prompt redesign may make some metrics irrelevant and introduce new dimensions worth tracking. A tool change may require new per-tool panels. Treating the dashboard as a living artifact that evolves with the system, rather than a one-time build, keeps it aligned with what you actually need to see.
Build dashboards in three layers: headlines for instant health assessment, investigation panels for narrowing problems, and trace-level detail for root cause analysis. Keep the headline layer to five numbers or fewer, and let everything else surface only when needed.