Automate 3000+ Apps AI Agent Workspace Custom AI Chatbot AI Support From Your Docs AI Meeting Notes Proxies For Automation

The Complete Self-Hosted AI Stack Explained

Updated May 2026

A self-hosted AI stack consists of five interconnected layers: an LLM inference engine for text generation, an embedding system for semantic search, a memory layer for persistent knowledge, a tool layer for external integrations, and an orchestration framework that coordinates everything into functioning AI agents. Each layer has multiple open-source options, and understanding how they connect is the first step toward building your own system.

Why Think in Layers

Breaking an AI system into layers serves the same purpose as architectural layers in any software system: it creates clear boundaries of responsibility, makes components interchangeable, and lets you reason about each piece independently. When your RAG pipeline returns poor results, the layer model tells you to investigate either your chunking strategy (embedding layer), your retrieval parameters (also embedding layer), or your prompt construction (orchestration layer). Without this mental model, debugging AI systems becomes guesswork.

The layer approach also simplifies upgrades. When a faster inference engine appears, you swap it at the LLM layer without touching your embedding pipeline or orchestration logic. When a better vector database launches, you migrate your embeddings without changing your model configuration. Each layer has its own API contract, and as long as the contract is honored, the rest of the system does not need to know what changed.

Layer 1: LLM Inference

The inference layer is responsible for loading model weights, accepting text prompts, and generating responses. This is the most compute-intensive layer and the one most affected by hardware constraints. The primary choice here is between Ollama (simple, great for development and single-user setups), vLLM (optimized for production with concurrent users), and llama.cpp (maximum efficiency for constrained hardware).

The inference engine exposes an API, typically OpenAI-compatible, that all other layers communicate with. This API abstraction is crucial because it means your orchestration code, your RAG pipeline, and your agent logic do not need to know which model is running underneath. You can switch from Llama 3.1 to Qwen 2.5 by changing a single configuration value.

Model selection at this layer involves tradeoffs between size, quality, and speed. A 7B parameter model at 4-bit quantization generates fast responses on modest hardware but struggles with complex reasoning. A 70B model produces more nuanced output but requires substantial GPU resources. Most stacks start with a 7B to 13B model and upgrade only when the quality gap becomes a real limitation for their use case.

Layer 2: Embeddings and Vector Search

The embedding layer transforms text into numerical representations that enable semantic search. When a user asks a question, the system embeds the question, searches the vector database for similar content, and passes the most relevant results to the LLM as context. This is the core mechanism behind retrieval-augmented generation (RAG), and it is what allows AI systems to answer questions about specific documents they were never trained on.

This layer has two components: the embedding model (which converts text to vectors) and the vector database (which stores and searches those vectors). Popular embedding models include nomic-embed-text for English content and BGE for multilingual needs. For vector storage, Qdrant offers the best combination of performance and features, while pgvector integrates naturally into existing PostgreSQL deployments.

Layer 3: Memory and State

The memory layer manages persistent state across conversations and sessions. At minimum, this means storing conversation history so agents can refer to earlier messages. More advanced implementations include summarized long-term memory (condensed knowledge from past interactions), entity memory (structured facts about users, projects, and concepts), and episodic memory (specific events and their outcomes).

Memory interacts closely with the embedding layer. Long-term memories are often embedded and stored in the same vector database used for document search, allowing the agent to retrieve relevant memories using the same similarity search mechanism. Session state and conversation history typically live in a relational database (PostgreSQL, SQLite) or key-value store (Redis) for fast structured access.

Layer 4: Tools and Integrations

The tool layer gives AI agents the ability to act on the world beyond text generation. This includes reading and writing files, making API calls, querying databases, sending messages, browsing the web, and executing code. Each tool is defined by its name, a description of what it does, its parameters, and its return format. The LLM receives these definitions and decides when and how to invoke them.

MCP (Model Context Protocol) has become the standard for packaging and distributing tools. An MCP server hosts one or more tools behind a standardized interface, and any MCP-compatible client can discover and use them. This has created a growing ecosystem of community tools that you can add to your stack without writing integration code from scratch. File system access, GitHub integration, database queries, and web search are all available as pre-built MCP servers.

Layer 5: Orchestration

Orchestration ties the other four layers into a functioning system. It defines the control flow: when to retrieve context, which model to call, which tools are available for each step, how to handle errors, and when a task is complete. Simple orchestration is a linear pipeline (retrieve, augment, generate). Complex orchestration involves agent loops, multi-agent collaboration, conditional branching, and human-in-the-loop checkpoints.

The choice of orchestration tool depends on your team and use case. LangGraph offers programmatic control through Python with explicit state management. n8n provides a visual workflow builder that non-developers can use. Dify combines a visual builder with built-in RAG and model management. Custom orchestration code gives maximum flexibility but requires more development effort.

How the Layers Connect

In a typical request flow, the orchestration layer receives a user message, queries the memory layer for relevant conversation history, searches the embedding layer for related documents, constructs a prompt that includes the retrieved context and available tool definitions, sends this prompt to the LLM layer, parses any tool calls from the response, executes those tools through the tool layer, and returns the final result to the user. This entire cycle might repeat multiple times for complex tasks where the agent needs to gather information iteratively.

The connections between layers are usually HTTP APIs or function calls within the same process. Ollama exposes a REST API on port 11434. Qdrant exposes REST and gRPC APIs. MCP servers communicate over standard I/O or HTTP. The orchestration layer acts as the central coordinator that knows how to talk to each component. Docker Compose networking makes this straightforward since all containers share a network and can reference each other by service name.

Common Deployment Architectures

The most common deployment architecture for self-hosted AI stacks is Docker Compose on a single machine. Each layer runs in its own container: Ollama for inference, Qdrant for vector search, PostgreSQL for relational storage, n8n or a custom application for orchestration, and Open WebUI for the user interface. Docker Compose defines the entire stack in a single YAML file, manages container lifecycle, and provides internal DNS so containers can reference each other by service name. This approach works well for development, small teams, and single-server production deployments.

For larger deployments, Kubernetes orchestrates the same containers across multiple machines. Kubernetes adds automatic scaling by spinning up additional inference containers when load increases, load balancing by distributing requests across multiple model servers, health checking by restarting failed containers automatically, and rolling updates by deploying new versions without downtime. The operational complexity of Kubernetes is substantial, and most self-hosted AI deployments do not need it. Consider Kubernetes only when you have genuinely outgrown a single server and have the infrastructure expertise to manage a cluster.

Regardless of deployment architecture, separate your data storage from your compute containers. Mount database directories and model files on persistent volumes that survive container restarts and replacements. Back up these volumes regularly. A container can be rebuilt from its image in seconds, but data lost from a failed volume is gone permanently. This separation of compute and storage is a fundamental principle that applies whether you run Docker Compose on a laptop or Kubernetes on a cloud cluster.

Key Takeaway

A self-hosted AI stack is five layers working together: inference, embedding, memory, tools, and orchestration. Understanding each layer independently, and the interfaces between them, gives you the foundation to build, debug, and upgrade any AI application.

Why Think in Layers

Layer 1: LLM Inference

Layer 2: Embeddings and Vector Search

Layer 3: Memory and State

Layer 4: Tools and Integrations

Layer 5: Orchestration

How the Layers Connect

Common Deployment Architectures

Related Articles

Choosing Components for Your AI Stack

The LLM Layer: Choosing Your AI Models

Popular Self-Hosted AI Stack Combinations

Run AI Locally