GPU Access in Docker for AI Workloads
GPU access is the single most impactful configuration for AI agent stacks that run local models. Without it, inference runs on CPU and can be 10 to 100 times slower depending on the model architecture and size. These steps cover the complete process from toolkit installation to multi-GPU allocation and monitoring, with specific guidance for the containerized model servers most commonly used in AI agent deployments.
Install the NVIDIA Container Toolkit
The NVIDIA Container Toolkit is the software layer that enables Docker containers to access host GPUs. It provides a container runtime hook that intercepts container creation requests and injects the GPU device files, the host's driver libraries (including the CUDA driver library), and utilities such as nvidia-smi into the container. The CUDA runtime and higher-level libraries still come from the container image. Without this toolkit, containers cannot see or use the host GPU even if the NVIDIA driver is installed on the host.
Before installing the toolkit, verify that your host has a compatible NVIDIA GPU and a working driver installation. Run nvidia-smi on the host to confirm the driver is loaded and your GPU is detected. The output shows your GPU model, the driver version, and the highest CUDA version that driver supports. If nvidia-smi is not found or returns an error, install the NVIDIA driver before proceeding with the container toolkit.
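As a minimal sketch of this pre-flight check, the helper below fails early when nvidia-smi is missing and otherwise prints one line per GPU. The function name check_host_gpu is hypothetical, and a POSIX-style shell on the host is assumed.

```shell
# Hypothetical helper: confirm the host driver works before installing the toolkit.
check_host_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: install the NVIDIA driver first" >&2
        return 1
    fi
    # One CSV line per GPU: model name, driver version, total VRAM.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}
```

Call check_host_gpu on the host and continue with the toolkit installation only if it exits with status 0.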
Install the toolkit from the NVIDIA package repository for your Linux distribution. For Ubuntu and Debian, add the NVIDIA container toolkit repository, update your package list, and install nvidia-container-toolkit. After installation, run sudo nvidia-ctk runtime configure --runtime=docker to register the NVIDIA runtime in the Docker daemon configuration, then restart the Docker daemon so it picks up the change. On systems using systemd, the restart command is sudo systemctl restart docker.
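The Ubuntu and Debian steps might look like the sketch below. The repository URLs and keyring path are assumed from NVIDIA's installation guide and may change, so check the guide before running it. The function name is hypothetical, and the commands are wrapped in a function so that nothing runs until you call it.

```shell
# Sketch for Ubuntu/Debian; requires curl, gpg, and sudo rights.
install_nvidia_container_toolkit() {
    keyring=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    # Add NVIDIA's signing key and apt repository (URLs assumed from NVIDIA's guide).
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |
        sudo gpg --dearmor -o "$keyring" || return 1
    curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
        sed "s#deb https://#deb [signed-by=$keyring] https://#g" |
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list >/dev/null || return 1

    # Install the toolkit, register the runtime with Docker, and restart the daemon.
    sudo apt-get update &&
        sudo apt-get install -y nvidia-container-toolkit &&
        sudo nvidia-ctk runtime configure --runtime=docker &&
        sudo systemctl restart docker
}
```

Run install_nvidia_container_toolkit once on the host. The chain stops at the first failing step, so a non-zero exit status means the installation is incomplete.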
Verify the installation by running a test container with GPU access: docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi. This command pulls a minimal CUDA container, passes all host GPUs to it, and runs nvidia-smi inside the container. If the output shows your GPU information, the toolkit is working correctly. If it fails, check that the Docker daemon was restarted and that your user is in the docker group.
Configure GPU Access in Docker Compose
Docker Compose exposes GPU access through the deploy.resources.reservations.devices section of a service definition. You specify the driver (nvidia), the number of GPUs to allocate (count), and the required capabilities (gpu). This configuration tells Docker to pass GPU devices to the container when it starts.
For a single-GPU system where your model server needs the entire GPU, set count to 1 or use the string all to pass every available GPU. Using all is simpler but less precise. If you later add a second GPU, the all setting passes both GPUs to the service, which may not be what you want. Specifying an explicit count gives you predictable behavior regardless of how many GPUs the host has.
The capabilities list is required and controls which GPU features the container can access. The generic gpu capability requests a GPU device and is what model inference needs. You can also list NVIDIA driver capabilities alongside it: compute enables CUDA, and utility makes tools like nvidia-smi available inside the container. When only gpu is listed, the NVIDIA runtime falls back to its default driver capabilities, which normally include both compute and utility unless the image overrides them. For most AI agent workloads, specifying gpu is sufficient.
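The fields described above fit together as in the following sketch, which writes a Compose file that reserves one GPU for a model server. The file name compose.gpu.yaml, the service name, the ollama/ollama image, the port, and the volume are illustrative assumptions, not requirements.

```shell
# Write a minimal Compose file with a single-GPU reservation.
cat > compose.gpu.yaml <<'EOF'
services:
  ollama:                        # assumed service name
    image: ollama/ollama         # assumed model server image
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1           # or "all" to pass every GPU
              capabilities: [gpu]
volumes:
  ollama:
EOF
```

Running docker compose -f compose.gpu.yaml config validates the file without starting anything, and docker compose -f compose.gpu.yaml up -d starts the service.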
When multiple services need GPU access, you must allocate GPUs carefully to avoid oversubscription. Two services sharing a single GPU will compete for VRAM and compute time, leading to out-of-memory errors or degraded performance. If you have one GPU and multiple GPU-hungry services, consider running them on separate machines or using a model server that handles queuing and batching internally.
Allocate Specific GPUs to Services
On multi-GPU systems, you can assign specific GPUs to specific services using device_ids instead of count. The two fields are mutually exclusive, so set only one of them per device entry. Device IDs correspond to the GPU indices shown by nvidia-smi, starting from 0. To assign GPU 0 to your model server and GPU 1 to a training service, set device_ids to ["0"] in the model server and ["1"] in the training service.
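A two-service layout along these lines could be sketched as follows. The file name, both service names, and the trainer image are hypothetical placeholders. Only the device_ids entries reflect the assignment described above.

```shell
# Write a Compose file that pins each service to one GPU by index.
cat > compose.multi-gpu.yaml <<'EOF'
services:
  model-server:                           # hypothetical service name
    image: ollama/ollama                  # assumed model server image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]           # host GPU 0
              capabilities: [gpu]
  trainer:                                # hypothetical service name
    image: example.com/trainer:latest     # placeholder image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]           # host GPU 1
              capabilities: [gpu]
EOF
```

Check the indices against nvidia-smi -L on the host before relying on them, since they are positions in the host's GPU list rather than stable identifiers.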
GPU assignment by device ID gives you predictable resource allocation. Your model server always gets the same GPU, which means you can choose your fastest or highest-VRAM GPU for inference and dedicate slower GPUs to background tasks. This is particularly useful when your system has mixed GPU models, like an RTX 4090 for inference and an RTX 3060 for embedding generation.
The NVIDIA_VISIBLE_DEVICES environment variable provides an alternative way to control GPU visibility. It takes effect when the container runs under the NVIDIA runtime, for example with runtime: nvidia in Compose or --runtime=nvidia on the command line. Setting it to a comma-separated list of GPU indices (like 0,2) restricts the container to those GPUs. Inside the container, the visible GPUs are renumbered starting from 0, so host GPU 2 becomes GPU 1 if the container also sees GPU 0.
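The renumbering can be illustrated with a small helper that maps a NVIDIA_VISIBLE_DEVICES value to the indices a container would see. This is a sketch: map_visible_devices is a hypothetical name, and it assumes the container lists GPUs in ascending host order, as nvidia-smi does. CUDA applications may order devices differently unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```shell
# Hypothetical helper: show how host GPU indices are renumbered inside a container.
map_visible_devices() {
    # $1: comma-separated host GPU indices, e.g. "0,2"
    printf '%s\n' "$1" | tr ',' '\n' | sort -n |
        awk '{ printf "host GPU %s -> container GPU %d\n", $1, NR - 1 }'
}

# To compare against a real container (needs Docker and the NVIDIA runtime):
#   docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,2 \
#       nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
```

For example, map_visible_devices 0,2 reports that host GPU 2 appears as container GPU 1.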
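For AI agent stacks that use Ollama, the model server handles GPU allocation internally. Ollama detects the GPUs visible in its container and loads models into VRAM automatically. If you pass multiple GPUs to an Ollama container, it can distribute model layers across them for models that exceed the VRAM of a single GPU. To limit which GPUs Ollama uses, pass only those GPUs to the container with device_ids or NVIDIA_VISIBLE_DEVICES, or set CUDA_VISIBLE_DEVICES in the container environment. Note that Ollama's num_gpu model parameter sets how many model layers are offloaded to the GPU, not how many GPUs are used.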
Monitor GPU Usage and VRAM
Monitoring GPU resources inside Docker containers is essential for AI workloads where VRAM exhaustion causes immediate container crashes. The simplest monitoring approach is running nvidia-smi inside the container periodically. Use docker exec with your container name to run nvidia-smi and see current GPU utilization, VRAM usage, temperature, and power consumption.
For continuous monitoring, run nvidia-smi in a loop with the -l flag and a refresh interval in seconds. This gives you a live dashboard of GPU metrics. For production deployments, export GPU metrics to your monitoring system using the NVIDIA DCGM (Data Center GPU Manager) exporter, which provides Prometheus-compatible metrics for GPU utilization, memory usage, temperature, power draw, and error counts.
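One way to combine docker exec with the -l flag is sketched below. The function name watch_container_gpu and the 5-second default are assumptions, and the container must have nvidia-smi available (the utility driver capability).

```shell
# Hypothetical helper: stream GPU metrics from inside a running container as CSV.
watch_container_gpu() {
    if [ "$#" -lt 1 ]; then
        echo "usage: watch_container_gpu CONTAINER [INTERVAL_SECONDS]" >&2
        return 2
    fi
    # -l repeats the query every N seconds until interrupted with Ctrl-C.
    docker exec "$1" nvidia-smi \
        --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
        --format=csv -l "${2:-5}"
}
```

For example, watch_container_gpu ollama 2 would print a new CSV row per GPU every two seconds for a container named ollama.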
VRAM management is the most critical monitoring concern for AI agent deployments. The weights of a loaded model occupy a roughly fixed amount of VRAM determined by parameter count and quantization level. At 4-bit quantization, a 7B model uses approximately 4 GB of VRAM, a 13B model approximately 8 GB, and a 70B model approximately 35 to 40 GB. The KV cache and batch buffers come on top of that and grow with context length and the number of concurrent requests. Monitor VRAM usage to ensure your model server has enough headroom for inference on top of the static weight allocation.
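The arithmetic behind these figures can be sketched as parameters times bits per weight divided by 8. The helper below (estimate_weights_gb is a hypothetical name) computes the weights alone in decimal gigabytes. Real usage is higher because of quantization metadata, the KV cache, and runtime buffers, which is why the figures quoted above sit above these raw numbers.

```shell
# Hypothetical helper: raw weight size in GB for a quantized model.
estimate_weights_gb() {
    # $1: parameter count in billions, $2: bits per weight after quantization
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_weights_gb 7 4     # 3.5
estimate_weights_gb 13 4    # 6.5
estimate_weights_gb 70 4    # 35.0
```

Treat the result as a lower bound when deciding whether a model fits on a given GPU.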
Set up alerts for GPU temperature and VRAM usage thresholds. GPUs throttle performance when they overheat, typically above 80 to 85 degrees Celsius, and containers crash when VRAM is exhausted. An alert at 75 percent VRAM usage and 80 degrees gives you time to investigate before performance degrades or services fail.
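A threshold check of this kind could be sketched as a filter over nvidia-smi CSV output. The function name and the VRAM_ALERT_PCT and TEMP_ALERT_C variables are hypothetical. The defaults mirror the 75 percent and 80 degree thresholds mentioned above.

```shell
# Hypothetical helper: read "index, memory.used, memory.total, temperature.gpu"
# lines (no header, no units) on stdin and print an alert per exceeded threshold.
check_gpu_thresholds() {
    awk -F', *' -v max_pct="${VRAM_ALERT_PCT:-75}" -v max_temp="${TEMP_ALERT_C:-80}" '
        {
            pct = 100 * $2 / $3
            if (pct >= max_pct) { printf "ALERT gpu=%s vram=%.0f%%\n", $1, pct; bad = 1 }
            if ($4 >= max_temp) { printf "ALERT gpu=%s temp=%sC\n", $1, $4; bad = 1 }
        }
        END { exit bad }   # non-zero exit status when any alert fired
    '
}

# Typical use on a GPU host:
#   nvidia-smi --query-gpu=index,memory.used,memory.total,temperature.gpu \
#       --format=csv,noheader,nounits | check_gpu_thresholds
```

The non-zero exit status makes the check easy to call from cron or a health-check script. For production alerting, the DCGM exporter route remains the more robust option.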
Troubleshoot Common GPU Issues
The most common GPU issue is a "could not select device driver" or "no GPU devices available" error when starting a container. This almost always means the NVIDIA Container Toolkit is not installed or the Docker daemon was not restarted after installation. Confirm the toolkit is present with nvidia-ctk --version, run sudo nvidia-ctk runtime configure --runtime=docker, restart Docker, and check that docker info lists nvidia among its runtimes.
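As a rough diagnostic, you can check whether Docker reports an nvidia runtime. The helper below is a sketch with a hypothetical name. It inspects the JSON that docker info prints for its runtimes, and a registered runtime is one useful signal rather than proof that GPU access works.

```shell
# Hypothetical helper: look for an "nvidia" entry in Docker's runtime list.
has_nvidia_runtime() {
    # stdin: JSON from: docker info --format '{{json .Runtimes}}'
    if grep -q '"nvidia"'; then
        echo "nvidia runtime registered"
    else
        echo "nvidia runtime missing: run nvidia-ctk runtime configure and restart Docker"
        return 1
    fi
}

# Typical use on the host:
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime
```

If the runtime is registered and containers still fail to start, rerun the nvidia-smi test container to see the exact error.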
CUDA version mismatches between the host driver and the container image cause "CUDA driver version is insufficient" errors. The CUDA version in your container image must be supported by the NVIDIA driver on the host. The CUDA version in the nvidia-smi header on the host is the newest one that driver supports. Check the NVIDIA CUDA compatibility matrix to verify. Generally, newer drivers support older CUDA versions, but older drivers cannot run newer CUDA versions. Upgrading the host driver or choosing an image built for an older CUDA version usually resolves this.
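A conservative version check can be sketched as a major.minor comparison between the CUDA version the host driver reports and the one in the image tag. The function name is hypothetical, and the rule is a simplification: it ignores CUDA minor-version compatibility and forward-compatibility packages, so treat a "driver too old" result as a prompt to consult the compatibility matrix.

```shell
# Hypothetical helper: is the image's CUDA version within what the host driver supports?
cuda_image_supported() {
    # $1: CUDA version from the host's nvidia-smi header, e.g. 12.2
    # $2: CUDA version of the container image, e.g. 12.4.1
    awk -v host="$1" -v image="$2" 'BEGIN {
        split(host, h, "."); split(image, i, ".")
        ok = (h[1] + 0 > i[1] + 0) || (h[1] + 0 == i[1] + 0 && h[2] + 0 >= i[2] + 0)
        print (ok ? "compatible" : "driver too old")
        exit (ok ? 0 : 1)
    }'
}

# The host value can be read from the nvidia-smi header, for example:
#   nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p'
```

For example, a host reporting CUDA 12.2 fails the check against the 12.4.1 image used in the verification step, while a host reporting 12.4 passes.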
Out-of-memory errors during model loading mean the model is too large for the available VRAM. Solutions include using a more aggressively quantized model (4-bit instead of 8-bit), using a smaller model entirely, offloading some model layers to system RAM (slower but prevents the crash), or upgrading to a GPU with more VRAM. Ollama handles partial offloading automatically when a model exceeds available VRAM.
If GPU performance inside the container is significantly worse than on bare metal, check power management and contention. GPUs clock down when idle, and without persistence mode the driver may unload between jobs, which adds initialization latency to the first request after an idle period. Enable persistence mode with sudo nvidia-smi -pm 1 on the host to keep the driver loaded. Persistence mode does not lock clocks, so also check the reported performance state and clock speeds under load. Finally, verify that the GPU is not being shared with another container or host process that competes for compute time.
Install the NVIDIA Container Toolkit on the host, configure GPU reservations in your Compose file, assign specific GPUs to services on multi-GPU systems, and monitor VRAM usage to prevent out-of-memory crashes during inference.