Common Docker Issues with AI Deployments

Updated May 2026
Docker deployments for AI agents encounter a predictable set of issues that stem from the unique resource demands of AI workloads. GPU passthrough failures, out-of-memory container kills, networking misconfigurations, and volume permission problems account for the vast majority of Docker-related support requests in AI agent projects. Understanding these common issues and their solutions saves hours of debugging.

GPU Access Failures

The most common GPU issue is a "could not select device driver" or "no GPU devices available" error when starting a container that requests GPU access. This almost always means the NVIDIA Container Toolkit is not installed, the Docker daemon was not restarted after toolkit installation, or the toolkit runtime configuration was not applied. Fix this by running sudo nvidia-ctk runtime configure --runtime=docker followed by sudo systemctl restart docker. Both commands need root privileges because they modify the Docker daemon configuration and restart a system service.
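Before restarting anything, it can help to confirm whether the runtime configuration was ever applied. The sketch below assumes the daemon configuration is a JSON file (on most Linux installs, /etc/docker/daemon.json) and that nvidia-ctk registers a runtime named nvidia under the runtimes key. The function name is illustrative, not part of any Docker or NVIDIA tool.

```python
import json


def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if the daemon config text registers a runtime named 'nvidia'."""
    try:
        config = json.loads(daemon_json_text)
    except json.JSONDecodeError:
        # An unparsable config file cannot have been configured correctly.
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes", {})
    return isinstance(runtimes, dict) and "nvidia" in runtimes
```

To use it, pass in the contents of the daemon configuration file, for example the result of pathlib.Path("/etc/docker/daemon.json").read_text(). A False result suggests rerunning the nvidia-ctk configuration step. A True result with GPU errors still occurring points toward a daemon that was not restarted.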

CUDA version mismatches between the host driver and the container image cause "CUDA driver version is insufficient for CUDA runtime version" errors. The CUDA runtime version in your container image must be compatible with the NVIDIA driver version installed on the host. Generally, newer drivers support older CUDA runtimes, but older drivers cannot run newer CUDA versions. Check the NVIDIA CUDA compatibility matrix and upgrade your host driver if needed.
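The driver check itself is a version comparison. The sketch below assumes you have looked up the minimum driver version for your CUDA runtime in NVIDIA's compatibility matrix and read the host driver version from nvidia-smi. The function names are illustrative.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '535.104.05' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(host_driver: str, minimum_driver: str) -> bool:
    """Return True if the host driver is at least the required minimum."""
    return parse_version(host_driver) >= parse_version(minimum_driver)


# The minimum below is an illustrative value only; take the real one from
# NVIDIA's compatibility matrix for the CUDA version in your image.
print(driver_is_sufficient("535.104.05", "525.60.13"))
print(driver_is_sufficient("470.82.01", "525.60.13"))
```

Comparing tuples of integers rather than raw strings avoids the trap where "470" sorts after "1000" alphabetically.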

A frequent variant of this problem is GPU access that works with docker run --gpus all but fails in Docker Compose because the deploy.resources.reservations.devices section is not formatted correctly. The YAML syntax for GPU reservations is strict: the devices key must be a list, and each entry in that list specifies capabilities (required), driver, and either count or device_ids, but not both. A common formatting mistake is putting capabilities at the wrong indentation level, so that it sits beside the devices key instead of inside a device entry.
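The structure described above can be checked mechanically once the Compose file has been loaded into Python data, for example with a YAML parser such as PyYAML's yaml.safe_load. The following is a minimal sketch that operates on the resulting dictionary for one service. The function name and the error messages are illustrative, and the checks cover only the rules mentioned here, not the full Compose specification.

```python
def gpu_reservation_errors(service: dict) -> list[str]:
    """Return a list of problems found in one service's GPU reservation."""
    try:
        devices = service["deploy"]["resources"]["reservations"]["devices"]
    except (KeyError, TypeError):
        return ["deploy.resources.reservations.devices is missing"]
    if not isinstance(devices, list):
        return ["devices must be a list"]

    errors = []
    for index, device in enumerate(devices):
        if not isinstance(device, dict):
            errors.append(f"devices[{index}] must be a mapping")
            continue
        # capabilities belongs inside the device entry, not beside 'devices'.
        if not isinstance(device.get("capabilities"), list):
            errors.append(f"devices[{index}].capabilities must be a list")
        if "count" in device and "device_ids" in device:
            errors.append(f"devices[{index}] sets both count and device_ids")
    return errors


# A correctly nested reservation for one NVIDIA GPU.
good_service = {
    "deploy": {"resources": {"reservations": {"devices": [
        {"driver": "nvidia", "count": 1, "capabilities": ["gpu"]},
    ]}}}
}
print(gpu_reservation_errors(good_service))
```

A service whose capabilities key has drifted out to the reservations level is reported as missing capabilities in the device entry, which is the indentation mistake in question.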

On multi-GPU systems, "CUDA out of memory" errors occur when two services try to use the same GPU simultaneously. Assign specific GPUs to specific services using device_ids in the Compose file rather than allocating all GPUs to every service. This prevents VRAM contention between model servers, embedding generators, and other GPU-consuming workloads.
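One way to audit a GPU assignment plan is to list any device that more than one service claims. The sketch below assumes you have collected each service's device_ids into a dictionary. The service names are hypothetical.

```python
def shared_gpus(assignments: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each GPU id claimed by more than one service to those services."""
    owners: dict[str, list[str]] = {}
    for service, device_ids in assignments.items():
        for gpu in device_ids:
            owners.setdefault(gpu, []).append(service)
    return {gpu: names for gpu, names in owners.items() if len(names) > 1}


# Hypothetical plan: the embedder overlaps with the model server on GPU 1.
plan = {"model-server": ["0", "1"], "embedder": ["1"]}
print(shared_gpus(plan))
```

An empty result means every GPU has a single owner. Sharing a GPU is sometimes intentional, so treat a non-empty result as a prompt to check VRAM budgets rather than as an error.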

Out-of-Memory Container Kills

Docker containers that exceed their memory limit are terminated immediately by the Linux OOM (out-of-memory) killer. The container process receives a SIGKILL signal with no opportunity to save state, close connections, or shut down gracefully. The docker compose ps -a output shows the container as having exited with code 137 (128 plus 9, the signal number of SIGKILL). Because any SIGKILL produces this code, confirm an OOM kill by checking that docker inspect reports OOMKilled as true in the container's State.
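Exit codes above 128 conventionally mean that a process was ended by a signal, with the signal number added to 128. A small sketch of that decoding, using Python's standard signal module on a Linux host, looks like this. The function name is illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a container exit code, decoding signal-based codes."""
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            return f"exit code {code} (unrecognized signal {number})"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(143))
```

The same decoding shows that code 143 corresponds to SIGTERM, which is what a container reports after a normal docker stop that it did not handle.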

AI agent containers are particularly susceptible to OOM kills because model loading and inference have high, variable memory requirements. A model server might use 4 GB of RAM during normal operation but spike to 8 GB during model loading. If the memory limit is set to 6 GB, the container works fine until you load a new model, then gets killed immediately.

Fix OOM kills by increasing the memory limit for the affected service. Monitor actual memory usage with docker stats before setting limits, and add 30 to 50 percent headroom above the observed peak. For model servers, the limit should cover the server process's own RAM plus any model weights held in system RAM. Weights that are memory-mapped from disk generally count toward the container's memory usage, while weights that reside only in GPU VRAM do not.
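The headroom rule is simple arithmetic. The sketch below assumes you have already noted the peak usage in MiB while watching docker stats during model loading, since docker stats reports current rather than historical usage. The function name and the default of 40 percent are illustrative choices within the 30 to 50 percent range.

```python
import math


def recommended_limit_mib(peak_mib: float, headroom: float = 0.4) -> int:
    """Return a memory limit in MiB: the observed peak plus fractional headroom."""
    if peak_mib <= 0:
        raise ValueError("peak_mib must be positive")
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    # Round up so the limit never falls below the intended headroom.
    return math.ceil(peak_mib * (1 + headroom))


# An 8 GiB peak during model loading with 50 percent headroom.
print(recommended_limit_mib(8192, headroom=0.5))
```

Applied to the earlier example, a server that peaks at 8 GB during model loading needs a limit well above that peak, not the 6 GB that only covers steady-state usage.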

If increasing the memory limit is not possible because the host has limited RAM, reduce memory usage by using more aggressively quantized models (4-bit instead of 8-bit reduces memory by roughly half), reducing batch sizes, or enabling model offloading features that split the model between GPU VRAM and system RAM.
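The "roughly half" figure follows from the size of the weights alone: parameters times bits per weight, divided by eight to get bytes. The sketch below estimates only weight storage. It ignores activations, caches, and quantization metadata, so real usage will be higher. The 7-billion-parameter model is a hypothetical example.

```python
def weight_memory_gib(num_parameters: int, bits_per_weight: int) -> float:
    """Estimate memory for model weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 2**30


# Hypothetical 7-billion-parameter model at two quantization levels.
for bits in (8, 4):
    print(bits, "bit:", round(weight_memory_gib(7_000_000_000, bits), 2), "GiB")
```

Because the estimate is linear in bits per weight, halving the bit width halves the weight footprint, while the other memory consumers stay about the same.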

Networking and Service Discovery Problems

When services fail to connect to each other, the cause is usually a wrong hostname. In Docker Compose, services connect using the service name defined in the Compose file, not localhost, not the container ID, and not the container IP address. If your agent configuration uses localhost as the database host, it will fail because localhost inside a container refers to the container itself, not other services.

Port mapping confusion is another common networking issue. Services within a Compose stack communicate using their internal container ports, not the host-mapped ports. If your PostgreSQL service maps port 5432 to host port 5433 (with 5433:5432), other containers in the stack still connect on port 5432 because they communicate over the internal Docker network. The host mapping only affects connections from outside Docker.
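The hostname and port rules can be combined into one small helper that builds the address another container should use. The sketch below handles the common short-syntax mappings (such as 5433:5432 or 127.0.0.1:5433:5432/tcp) but not port ranges. The service name db and database name agent are hypothetical.

```python
def container_port(mapping: str) -> int:
    """Return the container-side port from a short-syntax port mapping."""
    target = mapping.split(":")[-1]   # the last field is the container port
    target = target.split("/")[0]     # drop an optional protocol such as /tcp
    return int(target)


def internal_url(service: str, mapping: str, database: str) -> str:
    """Build the URL other containers in the stack should use."""
    return f"postgresql://{service}:{container_port(mapping)}/{database}"


# Other containers use the service name and port 5432, never localhost:5433.
print(internal_url("db", "5433:5432", "agent"))
```

A client running on the host, outside Docker, would use localhost and the host-side port 5433 instead.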

DNS resolution failures (service name not resolving) typically occur when containers are on different Docker networks. If you define custom networks in your Compose file, make sure that services that need to communicate share at least one common network. A service on the frontend network cannot resolve the hostname of a service on the backend network unless one of them joins both networks.
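Whether two services can resolve each other reduces to a set intersection over their network memberships. The sketch below assumes you have extracted each service's networks from the Compose file into a dictionary. The service and network names are hypothetical.

```python
def share_network(networks_by_service: dict[str, set[str]], a: str, b: str) -> bool:
    """Return True if services a and b are attached to at least one common network."""
    return bool(networks_by_service[a] & networks_by_service[b])


# Hypothetical stack: 'api' bridges the two networks, 'web' and 'db' do not meet.
stack = {
    "web": {"frontend"},
    "api": {"frontend", "backend"},
    "db": {"backend"},
}
print(share_network(stack, "web", "db"))
print(share_network(stack, "api", "db"))
```

In this layout web cannot reach db directly, which is often intentional: the api service is the only path between the two networks.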

Connection timeouts during startup happen when a service tries to connect to a dependency before the dependency is ready. Use depends_on with condition: service_healthy rather than condition: service_started so that Docker waits for the dependency to pass its health check before starting dependent services. This only works if the dependency actually has a health check, defined either with a healthcheck section in the Compose file or in its image.
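A quick audit of the services section can flag dependencies that are not gated on health. The sketch below works on the services mapping after the Compose file has been loaded into Python data. The function name is illustrative. It only sees health checks declared in the Compose file, so a dependency whose health check comes from its image will be reported as a false positive.

```python
def unhealthy_dependencies(services: dict) -> list[str]:
    """List depends_on entries that do not wait for a healthy dependency."""
    problems = []
    for name, service in services.items():
        depends_on = service.get("depends_on", {})
        if isinstance(depends_on, list):
            # The short list syntax only waits for the container to start.
            depends_on = {dep: {"condition": "service_started"} for dep in depends_on}
        for dep, options in depends_on.items():
            if (options or {}).get("condition") != "service_healthy":
                problems.append(f"{name} -> {dep}: condition is not service_healthy")
            elif "healthcheck" not in services.get(dep, {}):
                problems.append(f"{name} -> {dep}: no healthcheck in the Compose file")
    return problems


# Hypothetical stack where the agent waits for a healthy database.
services = {
    "agent": {"depends_on": {"db": {"condition": "service_healthy"}}},
    "db": {"healthcheck": {"test": ["CMD-SHELL", "pg_isready"]}},
}
print(unhealthy_dependencies(services))
```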

Volume Permission Errors

Permission denied errors when containers try to write to volumes are common when the container runs as a non-root user. A new named volume is owned by root unless the image already contains a directory at the mount path, in which case Docker copies that directory's contents and ownership into the empty volume on first use. A container running as UID 1000 cannot write to a root-owned directory. Fix this by creating the data directory in your Dockerfile and changing its ownership to the application user with chown, or by running a one-shot setup container that sets permissions before the main container starts.
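To diagnose this from inside a container, compare the process's user with the owner and mode of the data directory. The following is a rough sketch that looks only at the owner, group, and mode bits. It ignores supplementary groups, ACLs, SELinux, and read-only mounts, so a True result is not a guarantee. The function name is illustrative.

```python
import os
import stat


def can_write(path: str, uid: int, gid: int) -> bool:
    """Approximate whether a process with this uid and gid may write to path."""
    info = os.stat(path)
    if uid == 0:
        return True  # root bypasses ordinary permission bits
    if info.st_uid == uid:
        return bool(info.st_mode & stat.S_IWUSR)
    if info.st_gid == gid:
        return bool(info.st_mode & stat.S_IWGRP)
    return bool(info.st_mode & stat.S_IWOTH)
```

Inside the container, calling can_write with the volume's mount path, os.getuid(), and os.getgid() shows whether the current user matches what the volume expects. A False result for UID 1000 on a directory owned by UID 0 is the situation described above.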

On Linux hosts with SELinux enabled (common on RHEL, CentOS, and Fedora), volume mounts may fail with permission denied even when file permissions are correct. Docker needs the :z or :Z suffix on bind mount paths to apply the correct SELinux context. The :z suffix allows sharing between containers, while :Z makes the mount private to one container.

Volume data from a previous installation may have incompatible permissions or ownership when you change the container user or image. If you switch from an image that runs as root to one that runs as UID 999 (common for PostgreSQL), the existing volume data is still owned by root and the new container cannot access it. Fix this by running a one-time ownership change from a temporary container that runs as root and mounts the same volume.
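The one-time fix amounts to a single docker run invocation. The sketch below only builds the command as an argument list so that it can be reviewed before anything is executed. The volume name pgdata, the mount point /data, and the choice of the alpine image are assumptions for illustration. Compose usually prefixes volume names with the project name, so check docker volume ls for the real one.

```python
import shlex


def chown_volume_command(volume: str, uid: int, gid: int,
                         mount_point: str = "/data") -> list[str]:
    """Build a docker run command that re-owns a volume's contents."""
    return [
        "docker", "run", "--rm",
        "--user", "0:0",                       # run as root inside the container
        "--volume", f"{volume}:{mount_point}",
        "alpine",
        "chown", "-R", f"{uid}:{gid}", mount_point,
    ]


# Hypothetical volume name; print the command for review before running it.
command = chown_volume_command("pgdata", 999, 999)
print(shlex.join(command))
```

After reviewing the printed command, it can be executed with subprocess.run(command, check=True). Stop the service that uses the volume first so that files are not changing during the ownership update.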

Slow Image Builds and Large Images

Docker image builds that take excessively long usually result from poor layer caching. If your Dockerfile copies source code before installing dependencies, every code change invalidates the dependency installation layer, forcing a full reinstall. Fix this by copying requirements.txt and running pip install before copying your application code. This way, the dependency layer is cached and reused for most builds.
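The ordering rule can be illustrated with a cache-friendly Dockerfile and a rough check for the opposite pattern. The sketch below is a heuristic, not a Dockerfile parser: it only recognizes a COPY whose first argument is a bare dot and a RUN line containing pip install. The constant and function names are illustrative.

```python
# A cache-friendly ordering: dependency manifest first, source code last.
CACHE_FRIENDLY = """\
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
"""


def copies_source_before_install(dockerfile_text: str) -> bool:
    """Return True if the whole context is copied before the first pip install."""
    for line in dockerfile_text.splitlines():
        instruction = line.strip()
        upper = instruction.upper()
        if upper.startswith("RUN ") and "pip install" in instruction:
            return False   # dependencies are installed first
        if upper.startswith("COPY ") and instruction.split()[1:2] == ["."]:
            return True    # 'COPY . <dest>' appears before any install
    return False


print(copies_source_before_install(CACHE_FRIENDLY))
```

With the cache-friendly ordering, editing application code invalidates only the final COPY layer, so the pip install layer is reused until requirements.txt itself changes.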

Large image sizes (over 2 GB) are common with AI agent containers because ML packages like PyTorch, TensorFlow, and their dependencies are large. Reduce image size by using slim base images (python:3.11-slim instead of python:3.11), using multi-stage builds to exclude build tools from the final image, and removing cache files after package installation.

Build context size affects build speed because Docker sends the entire build context to the daemon before starting the build. If your project directory contains large files like model weights, datasets, or virtual environments, the build context transfer takes a long time. Add these files to .dockerignore to exclude them from the build context.
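To decide what belongs in .dockerignore, it helps to see which top-level entries dominate the build context. The sketch below sums file sizes per top-level entry of a project directory. It does not read an existing .dockerignore, so it reports the raw directory contents. The function name is illustrative.

```python
import os


def largest_entries(context_dir: str, top: int = 5) -> list[tuple[str, int]]:
    """Return the largest top-level entries of a directory as (name, bytes)."""
    sizes = {}
    for entry in os.scandir(context_dir):
        if entry.is_symlink():
            continue
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(root, name)
                    if not os.path.islink(path):
                        total += os.path.getsize(path)
        else:
            total = entry.stat(follow_symlinks=False).st_size
        sizes[entry.name] = total
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Running it on the project root typically surfaces directories of model weights, datasets, or virtual environments, each of which is a candidate for a line in .dockerignore.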

Pulling base images on every build suggests that local image caching is not working properly or that docker system prune is being run too aggressively. Base images should be cached locally after the first pull. If you use CI/CD systems, configure them to cache Docker layers between builds to avoid pulling and rebuilding from scratch on every pipeline run.

Key Takeaway

Most Docker issues with AI deployments come from GPU access configuration, memory limits that are too tight for model loading, networking misconfigurations (using localhost instead of service names), and volume permissions. Check these areas first when debugging any deployment problem.