Setting Resource Limits for AI Containers
AI workloads are uniquely resource-intensive compared to typical web applications. Model inference can spike CPU usage to 100 percent on all cores, a single loaded model can consume 40 GB or more of RAM, and GPU VRAM exhaustion causes immediate container crashes with no graceful degradation. Setting appropriate resource limits for each service in your Compose stack prevents these scenarios from affecting your entire system. These steps cover CPU, memory, and GPU resource management with specific numbers and strategies for common AI agent configurations.
Understand Docker Resource Controls
Docker enforces resource limits through Linux control groups (cgroups), a kernel feature that restricts and accounts for resource usage per process group. When you set a memory limit on a container, Docker creates a cgroup with that limit, and the Linux kernel enforces it. If the container's processes try to allocate more memory than the limit allows, the kernel OOM (out-of-memory) killer kills a process inside the container; when that is the container's main process, the container exits.
CPU limits work differently from memory limits. A CPU limit does not cause termination. Instead, it throttles the container by restricting how much CPU time it can use within each scheduling period. A container with a 2-CPU limit can use at most two cores' worth of compute time per period, even if the host has 16 cores sitting idle. Once the container's processes have used up that budget, they are paused and resume in the next scheduling period.
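To make the throttling arithmetic concrete, here is a minimal sketch of how a CPU limit translates into a time budget per scheduling period. It assumes the default 100000-microsecond period (Docker documents a CPU limit of 1.5 as equivalent to a quota of 150000 per period of 100000); the function name is hypothetical.

```python
def cpus_to_cfs_quota(cpus: float, period_us: int = 100_000) -> int:
    """Return the CPU time budget, in microseconds, per scheduling period.

    Assumes the default 100 ms period; a host configured differently
    would need a different period_us.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return round(cpus * period_us)


# Show the budget for a fractional and a multi-core limit.
for limit in (0.5, 2.0):
    print(f"cpus={limit}: {cpus_to_cfs_quota(limit)} us of CPU time per period")
```

A limit of 2.0 therefore allows 200000 microseconds of CPU time in every 100000-microsecond window, summed across all cores the container runs on.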
GPU resources are managed differently from CPU and memory. Docker delegates GPU access to the NVIDIA Container Toolkit, which uses a device-level model rather than cgroups. You grant a container access to whole GPUs or specific GPU devices, not to a fraction of GPU compute time or VRAM. Docker cannot divide one GPU between containers: if two containers are given the same device, both see the entire GPU and compete for its VRAM and compute, with the GPU time-slicing between their CUDA contexts at the hardware level.
Resource reservations are distinct from resource limits. A reservation guarantees a minimum amount of resources for a container, while a limit caps the maximum. For AI workloads, reservations ensure your model server always has enough memory to hold the loaded model, while limits prevent it from consuming so much that other services cannot function.
Set Memory Limits for Each Service
Memory is the most critical resource to limit for AI agent stacks because memory exhaustion causes immediate, unrecoverable termination. The kernel OOM killer does not give the container time to save state or shut down gracefully. It kills the process immediately with a signal that cannot be caught, and any in-progress operations, unsaved state, or buffered writes are lost.
Calculate memory requirements for each service based on actual usage, not estimates. Run your stack without memory limits for a test period and monitor peak memory usage with docker stats. Note the peak memory for each service during typical operation and during peak load (like when loading a new model or processing a large batch). Set your memory limit to 120 to 150 percent of the observed peak to provide headroom for unexpected spikes.
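The headroom rule above can be sketched as a small calculation. This is an illustrative helper, not part of any Docker tooling: the function name is hypothetical, and the peak value is assumed to be one you read from docker stats yourself.

```python
def memory_limit_range_mib(peak_mib: int) -> tuple:
    """Return (low, high) memory limits in MiB for an observed peak.

    low is 120 percent and high is 150 percent of the peak, both rounded
    up so the limit never falls below the intended headroom.
    """
    if peak_mib <= 0:
        raise ValueError("peak_mib must be positive")
    low = (peak_mib * 120 + 99) // 100   # integer ceiling of peak * 1.2
    high = (peak_mib * 150 + 99) // 100  # integer ceiling of peak * 1.5
    return low, high


# Example: a service that peaked at 5000 MiB during load testing.
low, high = memory_limit_range_mib(5000)
print(f"set the memory limit between {low} MiB and {high} MiB")
```

Choosing a value nearer the upper end makes sense for services with bursty behavior, such as a model server that loads new models at runtime.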
For model servers like Ollama or vLLM, memory requirements depend on the model size and quantization level. A 7B model at 4-bit quantization requires approximately 4 GB of VRAM plus 2 to 3 GB of system RAM for the server process. A 70B model at 4-bit quantization requires approximately 35 to 40 GB of VRAM plus 4 to 6 GB of system RAM. The container memory limit governs system RAM only, so size it for the server process plus any model data held in or staged through host memory. For CPU-only inference that means the entire model, because the weights then live in system RAM instead of VRAM.
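The figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below estimates the size of the weights alone; it is an approximation under the assumption that every weight is stored at the same bit width, and it deliberately excludes the KV cache, activations, and runtime overhead, which is why real requirements sit somewhat above the result. The function name is hypothetical.

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and server overhead.
    """
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("inputs must be positive")
    return params_billion * bits_per_weight / 8


for size in (7, 70):
    print(f"{size}B at 4-bit: about {quantized_weights_gb(size, 4)} GB of weights")
```

Comparing the result against the VRAM of the target GPU before deployment gives an early warning when a model cannot fit.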
PostgreSQL memory usage depends on your shared_buffers and work_mem settings. A typical configuration for an AI agent database uses 256 MB to 1 GB of shared_buffers plus per-connection work_mem. Set the container memory limit to at least twice the shared_buffers value plus 50 MB per expected concurrent connection plus 200 MB for the PostgreSQL process overhead.
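The sizing rule for the database container can be written out as follows. This is a direct transcription of the rule of thumb above rather than an official PostgreSQL formula; the function name and the example values are hypothetical.

```python
def postgres_memory_limit_mb(shared_buffers_mb: int, max_connections: int) -> int:
    """Minimum container memory limit in MB for a PostgreSQL service.

    Twice shared_buffers, plus 50 MB per expected concurrent connection,
    plus 200 MB of process overhead.
    """
    if shared_buffers_mb <= 0 or max_connections < 0:
        raise ValueError("invalid sizing inputs")
    return 2 * shared_buffers_mb + 50 * max_connections + 200


# Example: 512 MB of shared_buffers and 20 concurrent connections.
print(postgres_memory_limit_mb(512, 20), "MB")
```

Treat the result as a floor and confirm it against observed peak usage once the stack is running.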
Configure CPU Limits and Reservations
CPU limits in Docker Compose use the cpus setting under deploy.resources.limits. The value represents the number of CPU cores the container can use. A value of 2.0 means the container can use at most two full cores. A value of 0.5 means the container can use at most half of one core. Fractional values are useful for lightweight services like log collectors or monitoring sidecars.
AI agent stacks have uneven CPU demands across services. The model server needs the most CPU during inference (especially for CPU-only inference without GPU). The agent runtime needs moderate CPU for request processing, tool execution, and orchestration logic. The database needs CPU for query processing but is often I/O-bound rather than CPU-bound. Allocate CPU limits proportionally: give the model server the largest share, the agent a moderate share, and the database a smaller share.
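One way to turn "proportional" into numbers is to assign each service a weight and split the cores you are willing to hand out accordingly. The sketch below assumes this weighting approach; the service names and weights are hypothetical and should come from your own measurements.

```python
def allocate_cpus(total_cores: float, weights: dict) -> dict:
    """Split total_cores across services in proportion to their weights.

    Results are rounded to two decimals, which is fine-grained enough
    for a cpus value in a Compose file.
    """
    total_weight = sum(weights.values())
    if total_cores <= 0 or total_weight <= 0:
        raise ValueError("total_cores and the sum of weights must be positive")
    return {
        name: round(total_cores * weight / total_weight, 2)
        for name, weight in weights.items()
    }


# Example: model server weighted highest, database lowest.
shares = allocate_cpus(8, {"model-server": 4, "agent": 3, "db": 1})
for service, cpus in shares.items():
    print(f"{service}: cpus {cpus}")
```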
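CPU reservations set aside a minimum share of CPU for a service. If your host has 8 cores and you reserve 4 cores for the model server, the intent is that those 4 cores stay available to the model server even when other containers are under heavy load. How strictly this is enforced depends on the runtime: Swarm, for example, uses reservations to schedule a service only onto a node with that much unreserved capacity. Confirm how your own Compose setup behaves under load rather than assuming a hard guarantee. Without reservations, a CPU-intensive agent task could starve the model server and cause inference timeouts.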
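The layout of the limits and reservations settings can be illustrated by building the structure as a Python dictionary and printing it as JSON. This is a sketch of the shape of a service's deploy.resources section only; the helper name and all values are hypothetical, and you would normally write the same keys directly in your Compose file.

```python
import json


def resources_block(cpu_limit: float, mem_limit: str,
                    cpu_reservation: float, mem_reservation: str) -> dict:
    """Build the deploy.resources section for one Compose service.

    Memory values are passed through as strings such as "12g"; only the
    CPU values are checked for consistency here.
    """
    if cpu_reservation > cpu_limit:
        raise ValueError("CPU reservation cannot exceed the CPU limit")
    return {
        "deploy": {
            "resources": {
                "limits": {"cpus": str(cpu_limit), "memory": mem_limit},
                "reservations": {"cpus": str(cpu_reservation),
                                 "memory": mem_reservation},
            }
        }
    }


# Illustrative values for a model server on an 8-core host.
print(json.dumps(resources_block(6.0, "12g", 4.0, "8g"), indent=2))
```

Keeping the reservation below the limit leaves room for the service to burst while still protecting its baseline.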
For CPU-only inference (no GPU), consider not setting CPU limits on the model server and instead limiting all other services. This lets the model server use any idle CPU capacity for faster inference while the limits on other services prevent them from interfering during inference. Monitor CPU utilization across all services to verify that this approach does not cause CPU starvation for critical services like the database.
Allocate GPU Resources
GPU allocation in Docker Compose uses the deploy.resources.reservations.devices section. Each device entry specifies the driver (nvidia), either a count or a list of device_ids, and the capabilities (gpu). Unlike CPU and memory, GPU allocation is all-or-nothing at the device level: you grant a container access to whole GPUs, never to a fraction of one. Docker does not enforce exclusivity, so if two services list the same device, both see the entire GPU and compete for it. Giving each GPU to exactly one service is a convention you maintain in the Compose file.
VRAM cannot be limited through Docker configuration. When a container has access to a GPU, it can use all available VRAM on that GPU. If a model server loads a model that exceeds available VRAM, the CUDA runtime returns an out-of-memory error. Prevent this by knowing your model's VRAM requirements and confirming that they fit within the GPU's VRAM capacity before deployment.
On multi-GPU systems, use device_ids to assign specific GPUs to specific services. This prevents two services from competing for the same GPU VRAM. Assign your primary inference GPU (typically the one with the most VRAM or highest performance) to the model server and secondary GPUs to embedding generation, batch processing, or training services.
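The device entry described above can be sketched as a Python dictionary that mirrors the Compose structure. The helper name is hypothetical, and the GPU indices are examples only; the check reflects that count and device_ids are alternatives, so an entry should set one or the other.

```python
def gpu_reservation(device_ids=None, count=None) -> dict:
    """Build a deploy.resources.reservations.devices section for one service.

    Exactly one of device_ids (specific GPUs) or count (any N GPUs)
    must be given.
    """
    if (device_ids is None) == (count is None):
        raise ValueError("set exactly one of device_ids or count")
    device = {"driver": "nvidia", "capabilities": ["gpu"]}
    if device_ids is not None:
        # Device IDs are written as strings in the Compose file.
        device["device_ids"] = [str(i) for i in device_ids]
    else:
        device["count"] = count
    return {"deploy": {"resources": {"reservations": {"devices": [device]}}}}


# Example split: GPU 0 for the model server, GPU 1 for embeddings.
model_server = gpu_reservation(device_ids=[0])
embeddings = gpu_reservation(device_ids=[1])
print(model_server)
print(embeddings)
```

Pinning by device_ids rather than count keeps the assignment stable and makes it obvious in review when two services point at the same GPU.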
If you need to share a single GPU between multiple services, consider running a single model server that handles all inference requests and allocating the GPU exclusively to that server. Other services send inference requests to the model server over the Docker network rather than accessing the GPU directly. This approach relies on the model server's queuing and batching capabilities to manage GPU access efficiently.
Monitor and Adjust Resource Allocation
Run docker stats to see real-time resource usage for all containers. The output shows CPU percentage, memory usage versus limit, network I/O, and block I/O for each container. Watch this output during typical operation and during peak load to understand your actual resource consumption patterns.
Compare actual usage against your configured limits to identify over-provisioned and under-provisioned services. If a service consistently uses less than 50 percent of its memory limit, you may be able to reduce the limit and free memory for other services. If a service frequently approaches 90 percent of its limit, increase the limit before the OOM killer intervenes.
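The comparison above can be automated with a short script. This sketch assumes its input is text with one "name, tab, memory percentage" pair per line, which is the shape you would get from docker stats --no-stream --format "{{.Name}}\t{{.MemPerc}}". The sample data is made up for illustration, and the function name is hypothetical.

```python
def classify_memory_usage(stats_text: str) -> dict:
    """Map each container name to 'reduce', 'increase', or 'ok'.

    Below 50 percent of the limit suggests the limit can be reduced;
    90 percent or more suggests it should be increased.
    """
    verdicts = {}
    for line in stats_text.strip().splitlines():
        name, mem_perc = line.split("\t")
        percent = float(mem_perc.strip().rstrip("%"))
        if percent >= 90:
            verdicts[name.strip()] = "increase"
        elif percent < 50:
            verdicts[name.strip()] = "reduce"
        else:
            verdicts[name.strip()] = "ok"
    return verdicts


# Illustrative sample, not real measurements.
SAMPLE = "model-server\t93.10%\nagent\t61.25%\npostgres\t22.40%"
for container, verdict in classify_memory_usage(SAMPLE).items():
    print(f"{container}: {verdict}")
```

A single reading is only a snapshot, so run the check repeatedly over typical and peak load before changing any limit.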
Set up automated alerting for resource usage thresholds. Alert when any container exceeds 80 percent of its memory limit, when CPU throttling occurs for more than 30 seconds continuously, or when GPU VRAM usage exceeds 90 percent. These alerts give you time to investigate and adjust before resource exhaustion causes service failures.
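For the CPU throttling alert, one possible data source is the cpu.stat file that cgroup v2 exposes for each cgroup, which includes a cumulative throttled_usec counter. The sketch below assumes a cgroup v2 host and that you have already read the file contents twice, some interval apart; where the file lives for a given container depends on your setup, so no path is hard-coded. The function names and sample values are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup v2 cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_seconds(before: str, after: str) -> float:
    """Seconds the cgroup spent throttled between two cpu.stat readings."""
    delta_us = (parse_cpu_stat(after)["throttled_usec"]
                - parse_cpu_stat(before)["throttled_usec"])
    return delta_us / 1_000_000


# Illustrative readings taken some interval apart, not real measurements.
BEFORE = "usage_usec 500000000\nnr_periods 6000\nnr_throttled 100\nthrottled_usec 1000000"
AFTER = "usage_usec 620000000\nnr_periods 6600\nnr_throttled 520\nthrottled_usec 36000000"

elapsed = throttled_seconds(BEFORE, AFTER)
if elapsed > 30:
    print(f"ALERT: throttled for {elapsed} s since the last reading")
```

Total throttled time within a window is only an approximation of continuous throttling, so pair it with a short sampling interval if you need to distinguish one long stall from many brief ones.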
Review and adjust resource allocations after any significant change to your agent stack: adding a new model, increasing batch sizes, adding concurrent users, or upgrading hardware. Resource requirements that were appropriate for a 7B model may be wildly inadequate for a 70B model, and limits that worked with 10 concurrent users may cause throttling with 100.
Document your resource allocation decisions and the reasoning behind them. Record the model sizes, expected concurrency, observed peak usage, and calculated limits for each service. This documentation helps future team members understand why specific limits were chosen and provides the baseline data needed to adjust limits for changed requirements.
Set memory limits to 120 to 150 percent of observed peak usage, allocate CPU proportionally with the model server getting the largest share, assign whole GPUs to specific services rather than sharing, and monitor actual usage continuously to refine your allocations over time.