Popular Self-Hosted AI Stack Combinations

Updated May 2026
While the number of possible component combinations in a self-hosted AI stack is enormous, several specific configurations have proven themselves through widespread adoption and real production use. These stacks represent tested, documented combinations where the components work well together, community support is strong, and the tradeoffs are well understood.

The Starter Stack: Ollama and Open WebUI

The most common entry point into self-hosted AI is the Ollama and Open WebUI combination. Ollama handles model downloading, GPU management, and inference serving through a clean REST API on port 11434. Open WebUI provides a polished ChatGPT-style web interface that connects to Ollama and adds conversation history, model switching, document uploads for basic RAG, web search integration, and multi-user support with authentication.

This two-component stack deploys in under five minutes with Docker Compose. The hardware requirement is modest: any machine with 8 GB of RAM and a GPU with at least 6 GB of VRAM runs a 7B model comfortably. Open WebUI stores conversations in a local SQLite database by default, which means you get persistent chat history with zero additional configuration. The web interface is responsive and works well on mobile devices, making it practical as a daily-driver AI assistant.

The limitations of this stack are the limitations of its simplicity. There is no dedicated vector database for large-scale document search (Open WebUI's built-in RAG works for small document collections but does not scale). There is no workflow automation or agent orchestration beyond basic chat. There is no programmatic API for integrating AI into other applications. When these limitations matter, you upgrade to a more complete stack rather than replacing what you have.

The Knowledge Stack: Ollama, Open WebUI, and Qdrant

Adding Qdrant to the starter stack transforms it from a chat interface into a knowledge system. Qdrant provides production-grade vector storage and search, enabling RAG over large document collections (thousands to millions of documents) with fast, filtered similarity search. Open WebUI can connect to Qdrant for its RAG pipeline, or you can build a custom RAG application using Qdrant's API directly.

This stack is the natural choice for teams that need a private AI assistant that can answer questions about their company's documentation, codebase, knowledge base, or internal communications. You ingest documents into Qdrant (chunking, embedding, and indexing them), and queries automatically retrieve relevant context before sending prompts to the LLM. The addition of Qdrant adds approximately 200 MB of RAM usage and minimal CPU overhead.

Alternatively, teams already running PostgreSQL can achieve similar results by adding the pgvector extension instead of deploying Qdrant as a separate service. This reduces the number of components to manage and keeps vector data in the same database as other application data. The tradeoff is slightly lower vector search performance at large scale and fewer vector-specific features (payload filtering, quantized vectors) compared to Qdrant.

The Automation Stack: Ollama, Open WebUI, n8n, and Qdrant

The automation stack adds n8n as an orchestration layer, connecting AI models to external services and enabling multi-step agent workflows. n8n's visual workflow builder lets you create AI-powered automations that trigger on events (new email, Slack message, scheduled time), process information through LLM calls, search for context in Qdrant, and take actions (send notifications, update databases, create tickets, post messages).

This stack is particularly powerful for business automation use cases. A customer support workflow might monitor an email inbox, classify incoming messages with the LLM, search a knowledge base in Qdrant for relevant answers, generate a draft response, and send it for human review before delivery. A content monitoring workflow might fetch RSS feeds, summarize new articles, compare them against your company's knowledge base, and alert relevant team members about competitive developments. These workflows run continuously in the background without human intervention.

n8n integrates with over 400 external services through pre-built nodes, so connecting your AI to Slack, Google Workspace, Notion, GitHub, Jira, HubSpot, and hundreds of other tools requires configuration rather than code. For services without pre-built nodes, n8n's HTTP Request node and Code node let you integrate with any API. The AI Agent node provides a built-in tool-calling loop that handles the iterative nature of agent interactions automatically.

The Production Stack: vLLM, PostgreSQL with pgvector, Redis, and LangGraph

Production deployments serving multiple concurrent users typically move beyond Ollama to a stack optimized for throughput, reliability, and operational control. vLLM replaces Ollama for two to five times better concurrent request handling through PagedAttention and continuous batching. PostgreSQL with pgvector provides both relational storage (user accounts, conversation history, audit logs) and vector search in a single, well-understood database. Redis adds session caching, rate limiting, and fast temporary storage. LangGraph provides programmatic orchestration with explicit state management, error recovery, and observability.

This stack requires more operational expertise to deploy and maintain. vLLM configuration involves specifying model parameters, GPU memory allocation, and batching settings. PostgreSQL needs proper connection pooling (PgBouncer), regular vacuuming, and backup procedures. Redis needs persistence configuration and memory management. LangGraph requires Python development skills and a deployment pipeline for code changes. The reward for this complexity is a system that handles dozens of concurrent users reliably, survives component restarts gracefully, and provides the monitoring and debugging tools needed for production operation.

Many production stacks also include monitoring and observability tools: Prometheus for metrics collection, Grafana for dashboards, and either LangSmith or a custom solution for LLM-specific tracing (viewing the full chain of LLM calls, tool invocations, and retrieval results for each user request). These tools are not strictly necessary but become essential for diagnosing issues and optimizing performance as usage grows.

The Developer Stack: Ollama, Continue, and Qdrant

Developers building AI-assisted coding environments often use a stack centered around Ollama for inference and Continue (an open-source AI coding assistant) as the interface. Continue integrates with VS Code and JetBrains IDEs, connecting to local Ollama models for code completion, explanation, refactoring, and chat. Adding Qdrant enables codebase-aware assistance where the model searches your repository's code and documentation to provide contextually relevant suggestions.

This stack keeps all code and prompts local, which is critical for teams working with proprietary or sensitive codebases that cannot be sent to external APIs. The quality of code assistance depends heavily on the model: DeepSeek Coder models provide the best code generation quality in self-hosted setups, while general-purpose models like Llama 3.1 handle code explanation and refactoring well but produce less idiomatic code for specialized languages and frameworks.

Choosing Your Stack

The right stack depends on what you are building. For personal AI assistance and experimentation, the starter stack (Ollama and Open WebUI) provides everything you need with minimal complexity. For knowledge-intensive applications like documentation search and research assistance, add Qdrant for proper vector search. For business automation, add n8n for workflow orchestration. For multi-user production services, invest in the production stack with vLLM, PostgreSQL, and LangGraph.

Start with the simplest stack that meets your needs and upgrade individual components as you hit specific limitations. The beauty of a layered architecture is that you can replace any component without rebuilding the entire system. Swap Ollama for vLLM when you need better concurrency. Add Qdrant when built-in RAG is not sufficient. Introduce LangGraph when n8n workflows become too limiting. Each upgrade addresses a concrete problem rather than adding speculative complexity.

Keep in mind that the community around each stack configuration matters as much as the technical capabilities. Ollama and Open WebUI have the largest user communities, which means extensive documentation, active forums, and rapid bug fixes. n8n has a growing community of AI workflow builders sharing templates and best practices. LangGraph benefits from LangChain's large developer community and extensive tutorials. When evaluating stack options, check the community resources available for your chosen combination, because you will inevitably encounter configuration issues and edge cases where community knowledge saves significant debugging time.

Documentation quality varies significantly between components. Before committing to a stack, verify that setup guides, API references, and troubleshooting resources exist for each component you plan to use. A technically superior tool with poor documentation will cost you more time than a simpler alternative with excellent guides. Open WebUI, Qdrant, n8n, and Ollama all maintain comprehensive documentation with active community contributions. Evaluate documentation quality as a first-class decision criterion alongside features and performance when selecting your stack components.

Key Takeaway

Start with Ollama and Open WebUI. Add Qdrant when you need document search. Add n8n when you need automation. Move to vLLM and LangGraph when you need production-grade concurrent access. Each stack builds on the previous one, so every step forward preserves your existing work.