How to Fine-Tune Models from Agent Experience
This guide covers the practical path from raw agent activity to a measurably better model. Fine-tuning from experience is the most powerful form of agent learning because it bakes improvement permanently into the weights, and the most dangerous if done carelessly, because a model trained on its own unchecked output can spiral into degradation. The steps below are ordered to keep the process grounded: verify before you train, measure honestly, and deploy reversibly. Treat fine-tuning as the final stage of a learning system, undertaken only once stable behavior and verified data exist.
Collect and Verify Trajectories
Start by gathering complete trajectories of the agent's task attempts, each capturing the input, the steps and tool calls, the reasoning, and the final outcome. A trajectory is richer than an input-output pair because it preserves the process you want the model to learn, not just the result. This collection rests on the comprehensive logging that every learning system needs as its foundation.
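As a minimal sketch of what one logged trajectory might look like, the record below captures the input, the steps with their tool calls and reasoning, and the final outcome. The class and field names (`Step`, `Trajectory`, `verified_success`, and so on) are illustrative assumptions, not a standard schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Optional


@dataclass
class Step:
    reasoning: str                      # the agent's stated reasoning for this step
    tool_name: Optional[str] = None     # None when the step makes no tool call
    tool_args: dict[str, Any] = field(default_factory=dict)
    tool_result: Optional[str] = None


@dataclass
class Trajectory:
    task_id: str
    task_input: str
    steps: list[Step] = field(default_factory=list)
    final_output: Optional[str] = None
    # Left unset at collection time; a separate verifier fills it in later.
    verified_success: Optional[bool] = None


def to_jsonl_line(traj: Trajectory) -> str:
    """Serialize one trajectory as a single JSON line for an append-only log."""
    return json.dumps(asdict(traj), sort_keys=True)
```

Keeping `verified_success` empty at logging time keeps collection and verification as separate stages, so nothing is labeled by the process that produced it.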
Then verify each trajectory before it is allowed near a training set, because verification is what separates genuine improvement from self-reinforcing error. Label each trajectory by a reliable outcome check: whether the generated code passed its tests, whether the extracted data matched a known reference, whether the task reached its goal state. Where an automatic verifier exists, this is objective and scalable. Where it does not, fall back to a model judge or human review, accepting the added cost and noise. Never treat the agent's own confidence as verification; an attempt the agent believes succeeded is not the same as one confirmed to have succeeded. The reasoning behind this is detailed in learning from experience.
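The labeling rule can be sketched as follows, assuming trajectories are plain dictionaries and that a verifier is any function returning True or False. The field names (`final_output`, `reference`, `agent_confidence`) and the `matches_reference` check are hypothetical; the point is that the label comes only from the outcome check, and the agent's own confidence is never read.

```python
from typing import Callable, Optional


def label_trajectory(traj: dict, verifier: Optional[Callable[[dict], bool]]) -> dict:
    """Return a copy of traj with 'label' and 'label_source' set.

    The agent's self-reported confidence, if present, is deliberately ignored.
    """
    labeled = dict(traj)
    if verifier is None:
        # No automatic check exists: mark for a model judge or human review.
        labeled["label"] = "unverified"
        labeled["label_source"] = "none"
    else:
        labeled["label"] = "success" if verifier(traj) else "failure"
        labeled["label_source"] = "automatic"
    return labeled


def matches_reference(traj: dict) -> bool:
    """Example outcome check: extracted output equals a known reference."""
    return traj["final_output"] == traj["reference"]
```

Failures are labeled and kept, not dropped, so they remain available for diagnosis and for methods that learn from negative signal.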
Capture failed trajectories as deliberately as successful ones. Failures are not waste; they are the negative signal that outcome-based methods learn from, and they reveal the specific ways the agent goes wrong, which is invaluable for diagnosis even if you only train on successes. A collection that discards failures throws away half the information its own activity produced.
Curate the Training Dataset
Verified trajectories are not yet a good training set; they must be curated. For methods that learn by imitation, keep only the high-quality successes, since the model will reproduce whatever it is shown, and a few bad examples teach a disproportionate amount of bad behavior. Filter out trajectories that succeeded by luck, that relied on malformed inputs, or whose outcome is ambiguous.
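One way to express these filters, assuming each trajectory dictionary already carries a verified `label` plus review flags, is sketched below. The flag names (`input_valid`, `ambiguous_outcome`, `flagged_lucky`) are hypothetical, and how a lucky success gets flagged in the first place is left to your own review process.

```python
def is_trainable(traj: dict) -> bool:
    """Keep only verified successes with valid inputs and unambiguous outcomes."""
    return (
        traj.get("label") == "success"
        # A missing validity flag counts as invalid, to stay conservative.
        and traj.get("input_valid", False)
        and not traj.get("ambiguous_outcome", False)
        and not traj.get("flagged_lucky", False)
    )


def curate(trajectories: list[dict]) -> tuple[list[dict], int]:
    """Return the kept trajectories and the number that were filtered out."""
    kept = [t for t in trajectories if is_trainable(t)]
    return kept, len(trajectories) - len(kept)
```

Reporting the dropped count alongside the kept set makes it easy to notice when a filter is removing far more data than expected.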
Balance the dataset deliberately so that important but rare cases are well represented rather than drowned out by common ones, and so coverage spans the full range of task types the model should handle. Redact personally identifiable and sensitive information before it enters the set, and version the resulting dataset immutably so the training run is reproducible and traceable. The dataset is a designed artifact whose quality caps the quality of the model trained on it, and the full set of practices appears in training data collection.
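Balancing and versioning can be sketched in a few lines. This example assumes each example has a `task_type` field, caps the common types so rare ones are not drowned out, and derives a version identifier from a content hash so that any change to the data yields a new version. The cap and the 16-character identifier are illustrative choices.

```python
import hashlib
import json
import random
from collections import defaultdict


def balance_by_type(examples: list[dict], cap_per_type: int, seed: int = 0) -> list[dict]:
    """Downsample each task type to at most cap_per_type examples."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["task_type"]].append(ex)
    balanced = []
    for task_type in sorted(by_type):
        group = by_type[task_type]
        if len(group) > cap_per_type:
            group = rng.sample(group, cap_per_type)
        balanced.extend(group)
    return balanced


def dataset_version(examples: list[dict]) -> str:
    """Order-independent content hash used as an immutable version id."""
    canonical = sorted(json.dumps(ex, sort_keys=True) for ex in examples)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return digest[:16]
```

Recording the version id with every training run ties each model back to the exact data it was trained on.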
Choose a Fine-Tuning Method
Match the method to your data and your goal. Supervised fine-tuning on successful trajectories, sometimes called rejection-sampling fine-tuning or best-of-N training, is the simplest and most stable approach: the model learns to imitate its own best behavior, raising its baseline toward what it previously achieved only occasionally. It is the right starting point for most teams.
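A rough sketch of the selection step behind best-of-N training follows: sample several attempts per task, keep only the verified successes, and turn the best of them into prompt and completion pairs. The field names and the tie-break (prefer the attempt with the fewest steps) are assumptions for illustration.

```python
from collections import defaultdict


def select_sft_examples(attempts: list[dict], max_per_task: int = 1) -> list[dict]:
    """Pick up to max_per_task verified successes per task as SFT examples.

    Each attempt is a dict with 'task_id', 'prompt', 'completion',
    'verified' (bool) and 'num_steps' (int).
    """
    by_task = defaultdict(list)
    for attempt in attempts:
        if attempt["verified"]:
            by_task[attempt["task_id"]].append(attempt)
    examples = []
    for task_id in sorted(by_task):
        # Assumed tie-break: shorter successful attempts are preferred.
        best = sorted(by_task[task_id], key=lambda a: a["num_steps"])[:max_per_task]
        examples.extend({"prompt": a["prompt"], "completion": a["completion"]} for a in best)
    return examples
```

Tasks with no verified success contribute nothing to the training set, which is the "rejection" in rejection sampling.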
Lightweight adapter tuning, such as LoRA, trains a small set of additional parameters while leaving the base model's weights frozen, which dramatically reduces cost and the risk of catastrophic forgetting and lets you maintain multiple specialized variants cheaply. Preference optimization, such as direct preference optimization (DPO), is appropriate when your data takes the form of comparisons (which of two outputs was better) rather than single correct answers, and it excels at qualities like tone and helpfulness. Outcome-based reinforcement learning can extract more from the data by learning from failures as well as successes, but it is more complex to run and tune, so reserve it for when simpler methods have plateaued.
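To make the cost reduction from LoRA concrete, the arithmetic below compares the trainable parameters of one weight matrix under full fine-tuning with those of a rank-r LoRA update, which adds two small matrices of shapes r x d_in and d_out x r. The layer size and rank are illustrative assumptions, not recommendations.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA update: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # illustrative layer size (assumption)
rank = 8             # illustrative LoRA rank (assumption)

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(full, lora, full // lora)  # prints: 16777216 65536 256
```

For this layer the adapter trains 256 times fewer parameters than a full update, which is also why storing many specialized variants is cheap.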
When in doubt, start with the simplest method that could work and escalate only if it falls short. Supervised fine-tuning on verified successes is stable, well understood, and sufficient for a large fraction of real goals, and it provides a clean baseline against which to judge whether a more complex method is actually worth its added cost and fragility. Complexity adopted prematurely tends to cost more than it returns.
Hold Out an Evaluation Set
Before training, set aside a portion of your data that the training process will never see, and keep your standing evaluation set entirely separate from training as well. This held-out data is the only honest detector of two failure modes that fine-tuning commonly introduces. The first is overfitting, where the model memorizes the training examples and scores well on them while failing to generalize; a held-out set reveals this as a gap between training and held-out performance.
The second is capability loss, where training on new data degrades skills the model already had, which a narrow evaluation would miss entirely. Guard against it by evaluating on a broad set that covers the model's prior capabilities, not just the new behavior you are training. Without held-out data, you can convince yourself a model improved when it merely memorized its own homework, which is why this step is non-negotiable rather than optional.
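One simple way to guarantee that held-out data never leaks into training is to assign each task to a side by hashing its identifier, so every attempt at the same task always lands on the same side across runs. The sketch below assumes each trajectory has a `task_id`; the function names and the 10% default are hypothetical.

```python
import hashlib


def split_side(task_id: str, holdout_fraction: float = 0.1) -> str:
    """Deterministically assign a task to 'holdout' or 'train'."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return "holdout" if bucket < holdout_fraction * 10_000 else "train"


def split_by_task(trajectories: list[dict], holdout_fraction: float = 0.1):
    """Split trajectories so no task appears on both sides."""
    train, holdout = [], []
    for traj in trajectories:
        if split_side(traj["task_id"], holdout_fraction) == "holdout":
            holdout.append(traj)
        else:
            train.append(traj)
    return train, holdout
```

Splitting by task instead of by individual trajectory matters because several attempts at the same task are near-duplicates, and putting some in each side would inflate held-out scores.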
Run the Fine-Tune and Evaluate
With method chosen and data prepared, run the training. Keep the changes conservative at first: a modest amount of training on a clean dataset usually produces better, safer results than aggressive training that risks overfitting and forgetting. Mixing in a sample of broad, general data alongside the task-specific examples helps preserve existing capabilities while the new behavior is learned.
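Mixing in general data can be sketched as below. The 20% share of general examples is an illustrative assumption, not a value from this guide, and the right ratio depends on your model and task.

```python
import random


def mix_datasets(task_examples, general_examples, general_fraction=0.2, seed=0):
    """Return a shuffled mix in which about general_fraction of items are general data."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed
```

If fewer general examples are available than the ratio asks for, the sketch uses all of them, so the achieved share can be lower than requested.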
Once trained, evaluate the new model thoroughly before it goes anywhere near production. Score it on the held-out set to check generalization, on the standing eval set to confirm it improved on the target behavior, and on a broad capability set to confirm it did not regress elsewhere. Track success rate, regression rate, cost, and latency. A new model version earns deployment only by demonstrably beating the current one on the metrics that matter, with no unacceptable regressions. That is exactly the bar that formal agent benchmarks and evaluation are designed to enforce.
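The deployment bar can be written down as an explicit gate. In this sketch the metric names (`target_success`, `broad_success`) and the one-point regression tolerance are assumptions; cost and latency limits could be added in the same style.

```python
def should_promote(candidate: dict, current: dict, max_regression: float = 0.01) -> bool:
    """Promote only if the candidate improves the target metric
    without an unacceptable drop on the broad capability set."""
    improved = candidate["target_success"] > current["target_success"]
    no_regression = candidate["broad_success"] >= current["broad_success"] - max_regression
    return improved and no_regression
```

Encoding the gate as code keeps the decision from drifting toward whatever result a given training run happened to produce.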
Watch the gap between training performance and held-out performance as the clearest early warning of overfitting. When the model scores far better on the data it trained on than on data it has never seen, it has memorized rather than generalized, and pushing further will only widen the gap. Stopping training while the two move together, rather than chasing the highest possible training score, produces a model that performs in the real world.
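A minimal sketch of this stopping rule, assuming you record a training score and a held-out score at each checkpoint, is shown below. The `patience` parameter (how many checkpoints without held-out improvement to tolerate) is an assumed knob.

```python
def generalization_gap(train_score: float, heldout_score: float) -> float:
    """Positive and growing values suggest memorization rather than generalization."""
    return train_score - heldout_score


def pick_stopping_checkpoint(history, patience: int = 2):
    """history: (step, train_score, heldout_score) tuples in training order.

    Returns the step with the best held-out score seen before the held-out
    score failed to improve for `patience` consecutive checkpoints.
    """
    best_step, best_score, stale = None, float("-inf"), 0
    for step, _train_score, heldout_score in history:
        if heldout_score > best_score:
            best_step, best_score, stale = step, heldout_score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_step
```

Note that the rule selects on held-out performance only; the training score is tracked for the gap but never used to choose a checkpoint.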
Deploy with Canary and Rollback
A model that passes offline evaluation has earned a careful release, not an immediate full rollout. Deploy it as a canary: route a small fraction of live traffic to the new version while the rest stays on the current one, and compare their real-world outcomes directly. This controlled comparison catches the problems that offline evaluation misses, because no fixed evaluation set fully replicates the messiness of real traffic.
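Canary routing can be as simple as the sketch below, which assumes each request carries a stable identifier and hashes it so the same identifier is always served by the same version. The function name and the "canary" and "stable" labels are illustrative.

```python
import hashlib


def route(request_id: str, canary_fraction: float) -> str:
    """Send roughly canary_fraction of traffic to the new model version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_fraction * 10_000 else "stable"
```

Because the assignment depends only on the identifier, raising `canary_fraction` moves additional traffic onto the canary without reshuffling requests that were already routed there.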
Keep the previous version ready to take over instantly, and define the conditions under which you will roll back, such as a drop in success rate or a spike in errors or cost. Expand the new version's share of traffic only as it proves itself. This canary-and-rollback discipline makes fine-tuning a reversible, low-risk step rather than a gamble, and it closes the loop: the deployed model generates fresh trajectories that feed the next round of learning. Fitting this into the broader system is covered in setting up learning pipelines.
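Rollback conditions are easiest to honor when they are written down before the release. The sketch below assumes each version reports simple counters (`requests`, `successes`, `errors`, `total_cost`); every threshold in it is an illustrative assumption to be replaced with your own limits.

```python
def canary_decision(
    canary: dict,
    stable: dict,
    min_requests: int = 200,
    max_success_drop: float = 0.02,
    max_error_rate: float = 0.05,
    max_cost_ratio: float = 1.25,
) -> str:
    """Return 'hold', 'rollback', or 'expand' for the canary version."""
    if canary["requests"] < min_requests or stable["requests"] == 0:
        return "hold"  # not enough traffic yet to compare fairly

    canary_success = canary["successes"] / canary["requests"]
    stable_success = stable["successes"] / stable["requests"]
    canary_errors = canary["errors"] / canary["requests"]
    canary_cost = canary["total_cost"] / canary["requests"]
    stable_cost = stable["total_cost"] / stable["requests"]

    if canary_success < stable_success - max_success_drop:
        return "rollback"
    if canary_errors > max_error_rate:
        return "rollback"
    if stable_cost > 0 and canary_cost / stable_cost > max_cost_ratio:
        return "rollback"
    return "expand"
```

A result of "expand" only means the canary has earned a larger share of traffic; the same check keeps running at each new traffic level.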
Fine-tune from experience by collecting complete trajectories, verifying each one before training, curating a clean balanced dataset, and choosing a method matched to your data, starting with supervised fine-tuning or LoRA. Always hold out data the training never sees, evaluate for both gains and regressions, and deploy as a canary with instant rollback. Verification and held-out evaluation are what keep self-improvement from becoming self-degradation.