Minimal AI Stack: The Cheapest Setup That Works

Updated May 2026

You do not need expensive GPUs or powerful servers to run a useful self-hosted AI stack. A minimal setup using CPU inference, lightweight models, and efficient components costs as little as zero dollars on hardware you already own. This guide covers the cheapest configurations that deliver genuinely useful AI capabilities, the tradeoffs involved, and how to get the most from limited resources.

What Minimal Means

A minimal AI stack strips away everything except the essentials: a model that can generate text and a way to interact with it. At the bare minimum, this is Ollama running a small model on CPU with the command-line interface as your interaction method. No web UI, no vector database, no orchestration framework. This setup costs nothing beyond the computer you already have and delivers a functional AI assistant for text generation, question answering, code help, and creative writing.
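Besides the command line, the same bare setup can be scripted through the local HTTP API that Ollama exposes. The following is a minimal sketch, assuming an Ollama server is running on its default port (11434) and that a model has already been pulled; the function names `build_generate_payload` and `generate` are illustrative helpers, not part of Ollama.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for one non-streaming completion request."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, url: str = OLLAMA_URL,
             timeout: float = 600.0) -> str:
    """Send one prompt to the Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # CPU inference can be slow, so the timeout is deliberately generous.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running and a model pulled (for example with `ollama pull llama3.1:8b`), calling `generate("llama3.1:8b", "Explain CPU inference in one sentence.")` should return the model's reply as a string. Only the standard library is used, so nothing beyond Python and Ollama needs to be installed.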

The minimal viable stack adds Open WebUI for a proper web interface and conversation history, bringing the total to two components. This remains well within the capabilities of any modern computer with 8 GB of RAM, even without a dedicated GPU. The experience is slower than GPU inference (5 to 15 tokens per second versus 30 to 100 on GPU), but the output quality is effectively the same: the same model weights produce equivalent results whether the computation happens on CPU or GPU.
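To get a feel for what these speeds mean in practice, the arithmetic can be sketched as follows. The 300-token answer length is a hypothetical figure chosen for illustration, and the calculation ignores the time spent processing the prompt.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


ANSWER_TOKENS = 300  # hypothetical length of a medium-sized answer

# Speeds taken from the ranges quoted in the text above.
for label, speed in [("CPU, low end", 5), ("CPU, high end", 15),
                     ("GPU, low end", 30), ("GPU, high end", 100)]:
    wait = seconds_to_generate(ANSWER_TOKENS, speed)
    print(f"{label:14s} {speed:>3} tok/s -> {wait:5.1f} s")
```

Under these assumptions, the same answer takes about a minute at 5 tokens per second and about three seconds at 100, which is the difference between waiting for a result and chatting interactively.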

The defining characteristic of a minimal stack is CPU inference. Every other layer (memory, embedding, tools, orchestration) is optional and can be added later. The hardware requirement is simply a computer with enough RAM to hold the model weights: 8 GB for a 7B model at 4-bit quantization (the model needs about 4 GB, and the system needs the rest), 16 GB for comfortable operation with larger context windows, or 32 GB for running 13B models on CPU.
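The RAM figures above follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. Here is a rough sketch, assuming an effective 4.8 bits per weight for a typical 4-bit quantization (slightly above the nominal 4 bits) and a hypothetical 3 GB reserved for the operating system and context; both numbers are illustrative assumptions rather than measurements.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_ram(params_billions: float, bits_per_weight: float,
                ram_gb: float, reserved_gb: float = 3.0) -> bool:
    """True if the weights plus a reserve for OS and context fit in RAM.

    reserved_gb is a hypothetical allowance, not a measured value.
    """
    return weight_size_gb(params_billions, bits_per_weight) + reserved_gb <= ram_gb


ASSUMED_BITS = 4.8  # assumed effective bits per weight for 4-bit quantization

for params, ram in [(7, 8), (13, 8)]:
    size = weight_size_gb(params, ASSUMED_BITS)
    verdict = "fits" if fits_in_ram(params, ASSUMED_BITS, ram) else "does not fit"
    print(f"{params}B model: ~{size:.1f} GB of weights, {verdict} in {ram} GB RAM")
```

Under these assumptions a 7B model needs roughly 4 GB for its weights and fits in 8 GB, while a 13B model does not.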

Hardware That Works

Almost any computer manufactured in the last five years can run a minimal AI stack. The key requirement is RAM, not CPU speed or GPU presence. An older laptop with an Intel i5 or AMD Ryzen 5, 8 GB of RAM, and a standard SSD can run a 7B model through Ollama. Response generation will be slow (3 to 8 tokens per second depending on the specific CPU), but the model produces useful output for tasks that do not require real-time interaction.

Apple Silicon Macs (M1, M2, M3, M4) are exceptionally good for inference without a discrete GPU. The unified memory architecture lets the model access system RAM at much higher bandwidth than traditional computers, and Ollama runs the model on the chip's integrated GPU through Metal rather than on the CPU cores alone. An M1 MacBook Air with 16 GB of unified memory runs a 7B model at 15 to 25 tokens per second, which feels responsive enough for interactive chat. This makes older Apple Silicon Macs one of the best value propositions for minimal self-hosted AI.

Raspberry Pi 5 with 8 GB of RAM represents the absolute hardware floor. Ollama runs on ARM Linux, and a 3B to 4B parameter model (like Llama 3.2 3B or Phi-3 Mini) fits in 8 GB with room for the operating system. Performance is limited to 1 to 3 tokens per second, making it impractical for interactive use but functional for batch processing, automated analysis, and background tasks where response time does not matter. The board itself costs under 100 dollars.

Choosing Models for Limited Hardware

On CPU-only hardware, model size directly determines both quality and speed. Smaller models generate faster but produce less capable output. The sweet spot for minimal stacks in 2026 is the 7B to 8B parameter range at 4-bit quantization: small enough to run on 8 GB of RAM, large enough to handle most practical tasks competently. Llama 3.1 8B and Qwen 2.5 7B are among the strongest options at this size.

For extremely constrained hardware (under 8 GB RAM or Raspberry Pi), 3B to 4B parameter models like Phi-3 Mini (3.8B) and Llama 3.2 3B are surprisingly capable for their size. They handle basic question answering, simple coding tasks, text summarization, and translation well. They struggle with complex reasoning, long-form content generation, and tasks requiring deep domain knowledge. Think of them as knowledgeable assistants that work best when given clear, specific questions rather than open-ended complex tasks.

Quantization level matters more on CPU than on GPU. On GPU, the speed difference between Q4 (4-bit) and Q5 (5-bit) quantization is usually negligible. On CPU, generation is limited by memory bandwidth, so each additional bit per weight means more data to read for every token and slower generation. For minimal stacks, Q4_K_M offers the best balance: it runs fast, uses less RAM, and retains quality well. Only use higher-precision quantization (Q5, Q6, Q8) if you have RAM to spare and do not mind slower responses.
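A common rule of thumb makes the bandwidth argument concrete: generating one token requires reading roughly all of the weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below applies that rule. The bits-per-weight values are approximate assumptions for each quantization type, and the 50 GB/s bandwidth is a hypothetical figure for a dual-channel laptop, not a benchmark.

```python
# Approximate effective bits per weight (assumed values, not exact).
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def max_tokens_per_second(params_billions: float, quant: str,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    weights_gb = params_billions * APPROX_BITS_PER_WEIGHT[quant] / 8
    return bandwidth_gb_s / weights_gb


ASSUMED_BANDWIDTH_GB_S = 50.0  # hypothetical memory bandwidth

for quant in APPROX_BITS_PER_WEIGHT:
    bound = max_tokens_per_second(7, quant, ASSUMED_BANDWIDTH_GB_S)
    print(f"7B at {quant:7s}: at most ~{bound:4.1f} tok/s")
```

Real throughput lands below this bound, but the ordering holds: on a bandwidth-limited machine, moving from Q4 to Q8 nearly halves the achievable speed.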

Getting the Most from Minimal Hardware

Prompt engineering matters more on smaller models. A 70B model can interpret vague instructions and produce good results. A 7B model needs clear, specific prompts with explicit instructions, relevant context, and examples of desired output format. Spending time crafting better prompts yields more improvement than any hardware upgrade on a minimal stack.
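One way to make prompts consistently explicit is to assemble them from named parts. The following is a minimal sketch of that idea; the section titles and the `build_prompt` helper are illustrative choices, not a required format for any particular model.

```python
def build_prompt(task: str, context: str = "", output_format: str = "",
                 example: str = "") -> str:
    """Assemble a structured prompt, skipping any section left empty."""
    sections = [
        ("Task", task),
        ("Context", context),
        ("Output format", output_format),
        ("Example of the desired output", example),
    ]
    parts = [f"{title}:\n{body.strip()}" for title, body in sections if body.strip()]
    return "\n\n".join(parts)


prompt = build_prompt(
    task="Summarize the meeting notes below for someone who did not attend.",
    context="Notes: budget approved; launch moved to March; two new hires.",
    output_format="Exactly three bullet points, each under 15 words.",
    example="- Budget for Q2 was approved without changes.",
)
print(prompt)
```

Keeping the task, context, format, and example in separate labeled blocks gives a small model less to infer, which is where vague prompts tend to fail.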

Reduce context window size if memory is tight. Most models default to a context window of 4096 or 8192 tokens, but you can reduce this through Ollama's num_ctx parameter, set either in a Modelfile or per request, to free up RAM. A 2048-token context window is sufficient for most single-turn interactions and uses significantly less memory during inference. Only expand the context window when you genuinely need long conversations or large documents in context.
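The memory saved by a smaller context window comes mostly from the key-value cache, which grows linearly with context length. The sketch below estimates that cache and shows how a reduced window can be requested per call through the `num_ctx` option of Ollama's API. The architecture numbers are assumed values for Llama 3.1 8B (32 layers, 8 key-value heads, head dimension 128) with a 16-bit cache; other models and cache settings will differ.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate key-value cache size for a given context length.

    The leading factor 2 accounts for storing both a key and a value
    vector per layer for every token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed architecture for Llama 3.1 8B; verify against the model card.
LLAMA_3_1_8B = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}


def payload_with_context(model: str, prompt: str, num_ctx: int) -> dict:
    """Request body for Ollama's /api/generate with an explicit context size."""
    return {"model": model, "prompt": prompt, "stream": False,
            "options": {"num_ctx": num_ctx}}


for ctx in (8192, 4096, 2048):
    mib = kv_cache_bytes(ctx, **LLAMA_3_1_8B) / 1024**2
    print(f"context {ctx:5d} tokens -> ~{mib:6.0f} MiB of KV cache")
```

Under these assumptions, dropping from 8192 to 2048 tokens shrinks the cache from about 1 GiB to about 256 MiB, which is a meaningful saving on an 8 GB machine.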

Consider a hybrid approach: run the minimal stack locally for routine tasks and low-sensitivity work, and use a commercial API (pay-per-token) for occasional complex tasks that exceed your local model's capabilities. This gives you the privacy and cost benefits of self-hosting for 90 percent of usage while accessing frontier model quality when it genuinely matters. The total cost stays low because the expensive API calls are infrequent.
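The hybrid approach boils down to a routing decision made per task. Here is one possible sketch of such a rule. The `Task` fields, the character threshold, and the choice to keep sensitive prompts local no matter how complex they are all count as assumptions for illustration; the right policy depends on your own privacy and budget constraints.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    sensitive: bool = False             # contains private data
    needs_deep_reasoning: bool = False  # flagged by the user as complex


def choose_backend(task: Task, max_local_prompt_chars: int = 6000) -> str:
    """Return "local" or "api" for a task (hypothetical routing policy)."""
    # Assumption: private data never leaves the machine.
    if task.sensitive:
        return "local"
    # Complex or very long prompts go to the paid API.
    if task.needs_deep_reasoning or len(task.prompt) > max_local_prompt_chars:
        return "api"
    # Everything else stays on the local model.
    return "local"


print(choose_backend(Task("Summarize this paragraph.")))
print(choose_backend(Task("Design a migration plan.", needs_deep_reasoning=True)))
```

Making the default branch local keeps API spending tied to an explicit decision rather than to habit.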

Use Cases That Work on Minimal Hardware

Certain AI tasks perform surprisingly well on minimal hardware because they do not require complex reasoning or massive context windows. Text summarization works reliably on 7B models: give the model a document and ask for a summary, and even CPU inference produces accurate, useful results. Translation between common language pairs works well on small models trained on multilingual data. Sentiment analysis and text classification are straightforward pattern-matching tasks that small models execute accurately. Code explanation, as opposed to code generation, works well because the model is describing existing logic rather than creating new solutions.
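Classification is a good fit for small models partly because the output can be constrained to a fixed set of labels and checked in code. The sketch below builds such a prompt and normalizes whatever the model replies. The label set and helper names are hypothetical, and the model call itself is left out so the example stays self-contained.

```python
from typing import Optional, Sequence

LABELS = ("positive", "negative", "neutral")  # illustrative label set


def build_sentiment_prompt(text: str, labels: Sequence[str] = LABELS) -> str:
    """Ask for exactly one label so the reply is easy to parse."""
    options = ", ".join(labels)
    return (
        f"Classify the sentiment of the text as one of: {options}.\n"
        "Reply with the single label and nothing else.\n\n"
        f"Text: {text}"
    )


def parse_label(reply: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Return the first label mentioned in the reply, or None if none appears."""
    lowered = reply.lower()
    found = {label: lowered.find(label) for label in labels if label in lowered}
    if not found:
        return None
    return min(found, key=found.get)


print(build_sentiment_prompt("The battery life is fantastic."))
print(parse_label("Positive."))
```

Treating an unparseable reply as `None` rather than guessing makes it easy to retry or flag the item, which matters when a small model occasionally ignores the format instruction.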

Tasks that struggle on minimal hardware are those requiring deep reasoning chains, large context windows, or creative generation at high quality. Writing a detailed technical report, analyzing complex multi-step logic problems, generating production-quality code for unfamiliar frameworks, or maintaining coherent narratives over thousands of words all benefit significantly from larger models. For these tasks, a minimal stack works best when combined with occasional cloud API access for the complex queries that exceed local capabilities.

Expanding Beyond Minimal

When you outgrow the minimal stack, add components one at a time based on the specific limitation you are hitting. If you need document search, add Qdrant or pgvector for vector storage and enable RAG in Open WebUI. If you need automated workflows, add n8n for orchestration. If response speed is the bottleneck, add a GPU or switch to a smaller, faster model. Each addition addresses a concrete problem while preserving everything you have already configured.

The most common first upgrade is adding a GPU. Even an entry-level NVIDIA RTX 3060 with 12 GB of VRAM transforms the experience from 5 tokens per second on CPU to 30 or more tokens per second on GPU, making interactive chat feel responsive rather than sluggish. Used RTX 3060 cards are available for 200 to 300 dollars in mid-2026, making this the single highest-impact upgrade for a minimal stack. The software configuration stays identical: once the NVIDIA drivers are installed, Ollama detects and uses the GPU automatically.

The second most common upgrade is moving Open WebUI's data to a proper database server. The default SQLite database in Open WebUI works for personal use but does not support concurrent access well. Migrating to PostgreSQL provides reliable multi-user access, proper backup capabilities, and the option to add pgvector for RAG in the same database. PostgreSQL runs comfortably in a Docker container with 256 MB of RAM allocated, adding negligible overhead to your minimal stack.

Key Takeaway

A useful self-hosted AI stack runs on any computer with 8 GB of RAM. Start with Ollama and a 7B model on CPU, add Open WebUI for a proper interface, and invest in better prompts rather than better hardware. Upgrade to GPU inference only when CPU speed becomes a genuine bottleneck for your workflow.