Self-Hosted LLMs: Run Language Models Yourself
In This Guide
What Self-Hosted Actually Means
When you use ChatGPT, Claude, or Gemini through their web interfaces or APIs, your prompts travel to servers owned by OpenAI, Anthropic, or Google. Those companies process your input, generate a response, and send it back. You pay per token, you share infrastructure with millions of other users, and your data passes through systems you do not control.
Self-hosting flips this model entirely. You download model weights, typically ranging from a few gigabytes for smaller quantized models to hundreds of gigabytes for full-precision frontier models, and run inference on machines you own or rent exclusively. The model runs as a process on your hardware. Prompts never leave your network. Responses generate locally. There is no API call, no usage meter, and no third-party data processor sitting between you and the model.
The practical reality of self-hosting has improved dramatically since 2024. What once required deep expertise in CUDA, Python dependency management, and model conversion now takes a single command in tools like Ollama. A developer can run a capable 8-billion parameter model on a laptop with 16GB of RAM. An engineering team can deploy a 70-billion parameter model behind an OpenAI-compatible API endpoint on a single GPU server. The barrier to entry has dropped from ML engineer with a week of setup time to any developer with a terminal.
That said, self-hosting is not a universal solution. It introduces operational responsibility, hardware costs, and model selection decisions that managed APIs abstract away. Understanding when self-hosting makes sense, and when it creates unnecessary complexity, is the first step toward making a good decision.
Why Organizations Self-Host LLMs
The motivations for self-hosting cluster into four categories, and most organizations are driven by a combination of them rather than a single factor.
Data privacy and regulatory compliance is the most common driver in enterprise settings. Healthcare organizations handling patient records, financial institutions processing transaction data, and legal firms working with privileged communications all face strict rules about where data can be processed. The EU AI Act, which entered full enforcement in early 2026, requires organizations to document where AI-processed data flows, demonstrate control over model behavior, and maintain audit trails. Running models locally simplifies compliance dramatically because the data never leaves your infrastructure.
Cost reduction at scale becomes significant once an organization processes more than roughly 100 million tokens per month. Cloud API pricing, typically ranging from $0.15 to $15 per million tokens depending on the model, adds up quickly in production workloads. A single GPU server running Ollama or vLLM can handle the same volume for a fixed monthly cost, often paying for itself within a few months. The economics favor self-hosting even more for workloads with predictable, steady demand rather than sporadic bursts.
Customization and control matters for teams that need models tailored to specific domains. Self-hosting allows fine-tuning on proprietary datasets, creating specialized models for tasks like medical coding, legal document analysis, or internal knowledge retrieval. You can also control every aspect of the inference pipeline, from context window sizes and temperature settings to custom stopping conditions and token filtering that cloud APIs may not expose.
Latency and availability concerns drive self-hosting for applications that cannot tolerate network round trips or third-party downtime. Edge deployments, real-time coding assistants, and on-premises industrial applications all benefit from inference that happens locally. When your model runs on the same network as your application, response times drop from hundreds of milliseconds to single-digit milliseconds for time-to-first-token.
The Model Landscape in 2026
The open-weight model ecosystem has reached a level of quality that would have seemed improbable two years ago. Multiple model families now deliver performance competitive with cloud-only offerings across a range of tasks, and the gap continues to narrow with each release cycle.
Meta Llama
Meta Llama family remains the most widely deployed open-weight model series. Llama 3.1 and 3.3 refined the 8B and 70B parameter sizes with improved reasoning and 128K token context windows. The Llama 4 release in April 2025 introduced Mixture of Experts (MoE) architecture to the family. Llama 4 Scout uses 109 billion total parameters with 16 experts, activating only 17 billion parameters per token, while offering an unprecedented 10 million token context window. Llama 4 Maverick scales to 400 billion total parameters with 128 experts, maintaining the same 17B active parameter count but with higher quality across benchmarks. Scout runs on a single H100 GPU or even on Apple Silicon Macs with 32GB unified memory when quantized, making it remarkably accessible despite its large total parameter count.
Mistral
Mistral has established itself as the European counterpart to Meta in the open-weight space. Mistral Small 4, released in March 2026, packs 119 billion total parameters into a MoE architecture that activates only 24 billion per token, combining instruction following, reasoning, image understanding, and coding into a single model. Mistral Medium 3.5, released in April 2026, is their first flagship dense model at 128 billion parameters, scoring 77.6% on SWE Bench Verified and offering configurable reasoning effort per request. Both models ship under permissive licenses and run on standard inference tools.
Other Notable Families
Qwen from Alibaba has become a strong contender, particularly for multilingual workloads and coding tasks. DeepSeek continues to push the boundaries of reasoning capability in open models. NVIDIA Nemotron Cascade 2 delivers approximately 54 tokens per second on consumer GPUs at 30 billion parameters, emphasizing practical inference speed. Google Gemma models provide strong performance at smaller sizes suitable for edge deployment. The landscape is competitive enough that no single model family dominates across all use cases.
Tools and Runtimes
The tooling for self-hosted LLMs has consolidated around a handful of mature options, each optimized for different use cases. Choosing the right tool matters as much as choosing the right model.
Ollama
Ollama has become the default starting point for local LLM work. A single command, ollama pull llama3.2, downloads and configures a model with no Python environment, no dependency management, and no CUDA version conflicts. Ollama wraps llama.cpp with a clean CLI and REST API, making it the fastest path from interest to running inference. Version 0.17.5 (March 2026) added cloud model offloading, web search APIs, multimodal support, streaming tool calls, and thinking model support. For individual developers, small teams, and prototyping, Ollama is the right choice. Its limitation is concurrency: performance degrades noticeably beyond four or five simultaneous users, making it unsuitable for production API serving without additional infrastructure.
vLLM
vLLM is the production inference engine. Its core innovation, PagedAttention, borrows virtual memory concepts from operating systems to manage GPU KV cache memory dynamically, reducing memory fragmentation by 50% or more and increasing throughput by 2-4x for concurrent requests. On NVIDIA Blackwell GPUs running Llama 3.1 70B with NVFP4 quantization, vLLM achieves over 8,000 tokens per second compared to Ollama at 484, a 16.6x throughput advantage. Time-to-first-token clocks in at 10.7ms versus 65ms. If you need to serve a model to many concurrent users, spread a large model across multiple GPUs, or maximize tokens per dollar, vLLM is the standard choice.
Other Tools
llama.cpp provides the underlying inference engine that Ollama builds upon, offering direct control for users who want maximum flexibility with GGUF format models. LM Studio provides the most polished graphical interface for managing local models, with version 0.4.0 adding headless server mode and continuous batching. SGLang has emerged as a genuine contender for workloads involving RAG, agents, or multi-turn conversations, thanks to its RadixAttention prefix-cache reuse. Llamafile packages models into single executable files for maximum portability across platforms.
Hardware Requirements
Hardware selection depends on three factors: the model size you want to run, the quantization level you find acceptable, and the throughput you need.
Consumer hardware can run models up to roughly 13 billion parameters at full quality or 30-70 billion parameters with aggressive quantization. A modern laptop or desktop with 16GB of RAM handles 7-8B models comfortably. Apple Silicon Macs with 32GB or more of unified memory can run quantized versions of much larger models because the CPU and GPU share the same memory pool, eliminating the VRAM bottleneck that limits dedicated GPU setups. NVIDIA RTX 4090 cards with 24GB of VRAM represent the top end of consumer GPU capability, sufficient for 13B models at full precision or 30B models at 4-bit quantization.
Professional and server hardware opens up the full range of model sizes. NVIDIA A100 GPUs with 80GB of VRAM handle 70B parameter models at full precision. H100 GPUs offer higher throughput for the same memory capacity. For the largest models, multi-GPU configurations using NVLink allow tensor parallelism across two, four, or eight GPUs. A server with four H100s (320GB total VRAM) can run Mistral Medium 3.5 full 128B parameters with room for context window and KV cache.
A quick estimation formula: multiply the model parameter count by the bytes per weight at your chosen precision. A 7B model at 4-bit quantization needs roughly 4-6GB. A 70B model at the same quantization needs approximately 35-45GB. Add 10-20% overhead for KV cache and runtime framework memory.
Cost Economics
The cost comparison between self-hosted and cloud-hosted LLMs depends heavily on usage volume and pattern.
Low volume (under 10 million tokens per month): Cloud APIs win. The operational overhead of maintaining self-hosted infrastructure exceeds the API costs at this scale. A team spending $15-50 per month on API calls would spend more on electricity and maintenance for a dedicated server.
Medium volume (10-100 million tokens per month): The comparison becomes nuanced. A single consumer GPU setup running Ollama can handle this volume for roughly $50-100 per month in electricity and amortized hardware costs, compared to $100-1,500 in API fees depending on the model tier. The break-even point depends on which cloud model you would otherwise use and your tolerance for managing hardware.
High volume (over 100 million tokens per month): Self-hosting almost always wins on cost. A server with two A100 GPUs costs roughly $2,000-3,000 per month in a colocation facility (including power and cooling) and can process hundreds of millions of tokens daily. The equivalent cloud API cost would run $1,500-15,000 per month or more depending on the model.
These calculations assume steady workloads. Bursty, unpredictable demand favors cloud APIs because you only pay for what you use. Self-hosted infrastructure costs the same whether it is serving tokens or sitting idle.
Quality Tradeoffs
The quality gap between self-hosted and cloud models has narrowed considerably but has not disappeared. The largest cloud models, GPT-4o, Claude Opus, and Gemini Ultra, still outperform the best open-weight models on complex reasoning, nuanced instruction following, and creative tasks. The margin is shrinking with each generation, and for many practical applications, the difference is not meaningful.
Quantization introduces its own quality tradeoff. Modern quantization methods like GGUF Q4_K_M and Q5_K_M preserve 95-98% of full-precision quality on most benchmarks. Two years ago, 4-bit quantization caused noticeable degradation. Today, it is nearly imperceptible for most tasks. The 5-bit sweet spot offers the best balance between memory savings and quality retention, though 4-bit remains practical for applications that prioritize speed and memory efficiency over maximum accuracy.
Where self-hosted models can actually exceed cloud offerings is in specialized domains. A fine-tuned 8B model trained on your organization internal data can outperform a general-purpose 400B cloud model for domain-specific tasks. The ability to customize through fine-tuning, combine multiple specialized models, and control the entire inference pipeline creates advantages that raw model size cannot match.
Getting Started
The fastest path to running your first self-hosted LLM takes about five minutes. Install Ollama (available for macOS, Linux, and Windows), run ollama pull llama3.2 to download a capable 3B parameter model, and start a conversation with ollama run llama3.2. That is genuinely all it takes to have a functional local language model.
From that starting point, the path forward depends on your goals. If you want to use the model from your own applications, Ollama exposes an API at localhost:11434 that is compatible with the OpenAI client library format, meaning most existing code that calls OpenAI can point to your local model with a one-line URL change. If you need higher quality, pull a larger model: ollama pull llama3.1:70b downloads the 70B parameter version, though you will need sufficient RAM or VRAM.
For production deployments, the path leads through vLLM. Install it via pip, load a model from Hugging Face, and serve it behind an OpenAI-compatible API endpoint. The migration from Ollama to vLLM typically involves changing the endpoint URL and nothing else in your application code, since both speak the same API protocol.
The articles below cover each aspect of self-hosted LLMs in depth, from foundational concepts through specific tools, models, and deployment strategies.