Building an AI Stack with Ollama

Updated May 2026
Ollama is the most popular foundation for self-hosted AI stacks because it simplifies model management to a few commands while providing an OpenAI-compatible API that other stack components can use without modification. This guide covers installing Ollama, selecting and managing models, configuring multi-model serving, integrating with other stack components, and optimizing performance for your hardware.

Step 1: Install and Configure Ollama

Ollama installs as a single binary on Linux, macOS, and Windows. On Linux, the official install script downloads and configures everything in one command. On macOS, download the application from the Ollama website. On Windows, use the official installer. For Docker-based stacks, use the ollama/ollama container image with GPU access enabled through the NVIDIA Container Toolkit.

After installation, Ollama runs as a background service that serves on port 11434 by default. The service starts automatically on system boot and manages model lifecycle, GPU allocation, and request queuing. You can verify the installation by running ollama list (which shows an empty model list on first install) and ollama --version.

For Docker deployments, configure Ollama's container with the following considerations: mount a volume for model storage at /root/.ollama (so models persist across container restarts), pass GPU access using --gpus all, and set the OLLAMA_HOST environment variable to 0.0.0.0 to allow connections from other containers on the Docker network. These settings ensure Ollama works correctly as part of a multi-container stack.

Step 2: Select and Pull Models

Choose models based on your primary use case. For general-purpose chat and instruction following, pull llama3.1:8b or qwen2.5:7b. For code generation and programming assistance, pull deepseek-coder-v2:16b or codellama:13b. For embedding (needed for RAG), pull nomic-embed-text. Each model downloads with a simple command like ollama pull llama3.1:8b.

Ollama's model library contains hundreds of models in various sizes and quantization levels. The tag after the model name specifies the variant: llama3.1:8b is the 8-billion parameter version, llama3.1:70b is the 70-billion parameter version, and llama3.1:8b-q4_0 specifies a particular quantization level. When no quantization tag is specified, Ollama pulls the recommended default (usually Q4_K_M, which balances quality and size well).

Consider keeping multiple models available for different tasks. A 7B model handles routine chat and simple tasks with fast response times. A 13B model tackles more complex reasoning and detailed content generation. An embedding model supports your RAG pipeline. Ollama loads and unloads models from GPU memory on demand, so having multiple models downloaded does not consume additional VRAM when they are not actively in use. Only the currently loaded model occupies GPU memory.

Step 3: Configure the Ollama API

Ollama exposes two API formats: its native API (endpoints like /api/generate and /api/chat) and an OpenAI-compatible API (endpoint /v1/chat/completions). Most third-party tools connect through the OpenAI-compatible endpoint because it uses the same format as the OpenAI API, making integration seamless. Open WebUI, n8n, LangChain, and most AI frameworks can connect to Ollama by pointing their OpenAI API base URL to http://localhost:11434/v1.

Model parameters like temperature, top_p, top_k, and maximum tokens can be set per-request through the API or globally through Ollama's Modelfile system. A Modelfile lets you create custom model configurations that set default parameters, system prompts, and response formatting. This is useful for creating task-specific model variants: a "coding" configuration with low temperature for precise code generation, a "creative" configuration with higher temperature for brainstorming, and a "summarizer" configuration with a specialized system prompt for document condensation.

Step 4: Set Up Multi-Model Serving

Ollama manages multiple models through automatic loading and unloading. When a request specifies a model that is not currently loaded, Ollama loads it into GPU memory (unloading a previously loaded model if necessary to free VRAM). This process takes a few seconds for the first request to a new model, after which subsequent requests are served at full speed. The keep_alive parameter controls how long a model stays loaded after the last request, defaulting to 5 minutes.

For stacks that regularly use multiple models simultaneously (for example, one model for chat and another for embedding), you can configure Ollama to keep both models loaded by setting OLLAMA_NUM_PARALLEL to allow concurrent model loading. This requires enough GPU VRAM to hold all simultaneously loaded models. On a 24 GB GPU, you can comfortably keep a 7B chat model (4 GB) and an embedding model (0.5 GB) loaded simultaneously with plenty of room for KV cache.

When VRAM is limited, a more practical approach is to designate specific models for specific stack components. Open WebUI uses one model for interactive chat. n8n workflows use the same or a different model for automated processing. Embedding requests use the embedding model. Ollama queues requests and manages model swapping automatically, though frequent model switches add latency as models load and unload.

Step 5: Integrate with Stack Components

Open WebUI connects to Ollama by setting the OLLAMA_BASE_URL environment variable to the Ollama endpoint (http://ollama:11434 in Docker). Open WebUI automatically discovers all available models and presents them in its model selector. No additional configuration is needed for basic integration.

n8n connects to Ollama through its Ollama Chat Model credential. Create a new credential with the Ollama base URL, then use it in AI Agent nodes, LLM Chain nodes, and other AI-related workflow nodes. n8n treats Ollama as any other LLM provider, so all of n8n's AI workflow capabilities are available with local models.

Qdrant integration works through the embedding pipeline. Your RAG application (whether Open WebUI's built-in RAG, an n8n workflow, or custom code) sends text to Ollama's embedding endpoint, receives vectors, and stores them in Qdrant. At query time, the same process generates a query vector that Qdrant searches against. The integration point is your application code or workflow logic, not a direct Ollama-to-Qdrant connection.

Step 6: Optimize Performance

The most impactful performance optimization is choosing the right model size for your GPU. A model that fits entirely in VRAM generates responses 5 to 10 times faster than one that partially offloads to CPU. Check GPU VRAM usage with nvidia-smi while a model is loaded. If VRAM usage is at or near capacity, consider a smaller model or more aggressive quantization.

Context window size directly affects memory usage and speed. The default 4096-token context uses about 500 MB of additional VRAM for a 7B model. Increasing to 32K tokens uses about 4 GB additional. Set the context size to the minimum your application needs. For simple chat interactions, 4096 tokens is sufficient. For RAG with large context, 8192 to 16384 tokens is a good range. Only use 32K or larger contexts when working with very long documents.

Monitor system resources during operation to identify bottlenecks. High CPU usage during generation indicates the model is partially running on CPU (common when VRAM is insufficient). High RAM usage suggests too many stack components competing for system memory. Slow initial responses after model switches indicate frequent model loading, which can be reduced by increasing the keep_alive timeout or pre-loading models at startup.

For Docker-based deployments, ensure Ollama has access to sufficient shared memory by setting the shm_size option in your docker-compose configuration. The default shared memory allocation in Docker (64 MB) can be insufficient for large models and cause unexpected crashes during inference. Setting shared memory to 1 GB or higher eliminates this issue. Also configure Docker's logging driver to prevent container logs from consuming excessive disk space over time, especially when running inference continuously with high request volumes.

Key Takeaway

Ollama simplifies the hardest part of self-hosted AI: managing models and GPU resources. Start with a single model, verify it works through the API, connect your other stack components, then expand to multiple models and optimize performance based on actual usage patterns.