The LLM Layer: Choosing Your AI Models

Updated May 2026
The LLM layer is the foundation of your self-hosted AI stack: the inference engine that loads model weights, processes prompts, and generates text responses. Your choice of inference engine (Ollama, vLLM, or llama.cpp) and model (Llama, Mistral, Qwen, DeepSeek) determines response quality, generation speed, and hardware requirements for your entire system.

What the LLM Layer Does

The LLM layer accepts text prompts and returns generated text. Underneath that simple interface, it manages model weight loading (moving gigabytes of parameters into GPU VRAM or system RAM), tokenization (converting text to numerical tokens the model understands), attention computation (the mathematical core of transformer models), and token sampling (selecting each output token based on probability distributions). The inference engine abstracts all of this behind an API, typically compatible with the OpenAI chat completions format, so the rest of your stack interacts with a simple HTTP endpoint.

This layer is the most hardware-intensive component of your stack. A 7-billion parameter model at 4-bit quantization requires approximately 4 GB of VRAM. A 13B model needs around 8 GB. A 70B model needs 35 to 40 GB. Everything else in the stack (vector databases, orchestration, tools) is CPU-bound and uses minimal resources compared to model inference. Your GPU budget effectively determines the maximum model size you can run, which in turn determines the quality ceiling for your AI applications.

Inference Engines Compared

Ollama is the most popular inference engine for self-hosted AI because it prioritizes simplicity. Installation is a single command on Linux, macOS, and Windows. Pulling a model is equally simple: ollama pull llama3.1:8b downloads, verifies, and configures the model automatically. Ollama detects your GPU, allocates VRAM, and begins serving on port 11434 with no manual configuration. For models too large for your GPU, it automatically splits layers between GPU and CPU, trading speed for the ability to run larger models.

vLLM targets production workloads where throughput and concurrency matter. Its key innovation, PagedAttention, manages GPU memory like a virtual memory system, allocating and freeing blocks dynamically as requests come and go. Combined with continuous batching (processing multiple requests through the model simultaneously rather than one at a time), vLLM achieves two to five times the throughput of Ollama on the same hardware when serving concurrent users. The tradeoff is more complex deployment and fewer supported model formats.

llama.cpp provides the most efficient inference for constrained hardware. Written in C/C++ with extensive SIMD optimization, it achieves the highest tokens-per-second on CPU-only machines and the most efficient use of limited GPU VRAM. Its GGUF model format supports flexible quantization schemes that let you trade quality for memory and speed with fine granularity. The cost is a less user-friendly interface and fewer high-level features compared to Ollama or vLLM.

Model Selection in 2026

The open-source model landscape in mid-2026 offers strong options at every size tier. At the 7B to 8B parameter level, Llama 3.1 8B and Qwen 2.5 7B deliver the best general-purpose performance. Both handle instruction following, conversation, summarization, and light reasoning well. Qwen has a slight edge on multilingual tasks and mathematical reasoning, while Llama excels at English language generation and following complex instructions.

At the 13B to 14B tier, model quality improves noticeably for reasoning, code generation, and handling nuanced instructions. The difference between 7B and 13B is most apparent on tasks requiring multi-step logic, long-form content generation, and accurate technical explanations. If your hardware supports it, a 13B model is a meaningful upgrade from 7B for knowledge-intensive applications.

The 70B tier represents the quality ceiling for most self-hosted setups. Models at this size compete with earlier commercial offerings on many benchmarks and handle complex reasoning, coding, and analysis with strong accuracy. Running a 70B model requires either a high-end workstation GPU (48 GB VRAM from the A6000 or dual consumer GPUs) or aggressive quantization with partial CPU offloading on a 24 GB card. The quality improvement over 13B is real but the hardware cost is substantial.

DeepSeek models deserve special mention for code generation tasks. DeepSeek Coder V2 and its successors consistently outperform similarly-sized general-purpose models on programming benchmarks and produce more accurate, idiomatic code across dozens of programming languages. If your AI stack primarily serves coding assistance, prioritizing DeepSeek at the LLM layer yields better results than using a general model.

Quantization Explained

Quantization reduces model size by representing weights with fewer bits. A full-precision model uses 16-bit floating-point numbers (FP16), requiring 2 bytes per parameter. A 7B model at FP16 needs 14 GB of VRAM. Quantizing to 4-bit (Q4_K_M in GGUF terminology) reduces this to roughly 4 GB while retaining most of the model's quality. The quality loss from 4-bit quantization is typically 1 to 3 percent on standard benchmarks, which is imperceptible for most practical applications.

Common quantization levels include Q8 (8-bit, minimal quality loss, half the size of FP16), Q5 (5-bit, good balance of quality and compression), Q4 (4-bit, the most popular for self-hosted use), and Q3 (3-bit, noticeable quality degradation but useful for running larger models on limited hardware). GGUF files from sources like Hugging Face typically offer multiple quantization variants so you can choose the right tradeoff for your hardware.

Hardware Requirements

NVIDIA GPUs dominate the self-hosted AI hardware landscape due to CUDA's mature software ecosystem. Consumer cards suitable for inference include the RTX 3060 12GB (budget option, runs 7B models well), RTX 3090 24GB (runs 13B at full speed and 70B with heavy quantization), and RTX 4090 24GB (fastest consumer card, excellent for 7B to 13B). Professional and server cards like the A100 40GB/80GB, A6000 48GB, and H100 80GB handle larger models without compromise.

AMD GPUs have improved their AI inference support through ROCm, and Ollama now supports AMD cards natively. Performance is generally 10 to 30 percent behind comparable NVIDIA cards due to less mature driver and library support, but the hardware is often significantly cheaper. For budget-conscious deployments, AMD offers compelling value.

CPU-only inference is viable for smaller models (up to 7B) on machines with sufficient RAM and modern processors. Performance is roughly 10 to 20 times slower than GPU inference, which makes it unsuitable for interactive chat but acceptable for batch processing, background analysis, and applications where latency is not critical. Apple Silicon Macs offer surprisingly good CPU inference performance due to their unified memory architecture and neural engine acceleration.

Model Updates and Migration

The open-source model landscape evolves rapidly, with significant new releases appearing every few months. When a new model arrives that benchmarks better than your current choice, resist the urge to switch immediately. Instead, run the new model alongside your current one using the same set of test prompts that represent your actual workload. Compare results on the dimensions that matter for your application: response accuracy, instruction following, output formatting, and generation speed. Benchmark scores do not always translate to real-world improvements for specific use cases.

Model migration in Ollama is straightforward: pull the new model, test it through the API, and update your configuration to point at the new model name. The old model remains available for fallback until you explicitly remove it. This zero-downtime migration process means you can switch models without disrupting active users or workflows. For production deployments using vLLM, model changes require restarting the server with the new model path, so schedule migrations during maintenance windows.

Keep at least one previous model version available for rollback. Occasionally a new model that performs better on benchmarks produces worse results for your specific prompts or introduces unexpected behavioral changes like different formatting conventions, altered refusal patterns, or changed verbosity levels. Having the previous model available lets you roll back immediately if the new model causes problems, rather than waiting for a fix or workaround. Model files are large but storage is cheap compared to the cost of degraded AI performance.

Document which model version each component of your stack uses, especially when different workflows use different models. A configuration file or environment variable that specifies the model name for each service makes it easy to audit and update model assignments across the entire stack from a single location.

Key Takeaway

Start with Ollama and a 7B model at 4-bit quantization. This combination works on modest hardware, provides good quality for most tasks, and lets you focus on building your application rather than optimizing infrastructure. Upgrade the model or engine only when you identify a concrete quality or performance gap.