Ollama: Complete Guide to Local AI Models

Updated May 2026
Ollama is a free, open source tool that lets you download and run large language models on your own computer with a single terminal command. It handles model management, quantization, and GPU acceleration automatically, giving developers and researchers full local access to models like Llama 4, Qwen3, DeepSeek-R1, and thousands more without sending data to any external server.

What Is Ollama

Ollama is an open source application that simplifies the entire process of downloading, configuring, and running large language models on local hardware. Released under the MIT license, it provides a command line interface and a REST API that together make local model inference as straightforward as pulling and running a Docker container. You type ollama run llama4 in your terminal, and within moments you have a fully functional language model running on your own machine, ready for conversation or integration into your applications.

Under the hood, Ollama is built on top of llama.cpp, the high performance C++ inference engine originally created by Georgi Gerganov. Ollama wraps this engine with a user-friendly layer that handles model downloading from the Ollama library, automatic GPU detection and configuration, memory management, and a persistent API server. The result is a tool that abstracts away the complexity of local model inference while preserving the performance benefits of running natively compiled code on your hardware.

Ollama supports macOS, Linux, and Windows, with native GPU acceleration for NVIDIA GPUs through CUDA, AMD GPUs through ROCm, and Apple Silicon through Metal. It uses the GGUF model format, which stores quantized model weights optimized for efficient inference on consumer hardware. As of May 2026, the Ollama library hosts over 4,500 models spanning every major model family, and the tool has become one of the most widely used local inference solutions in the AI development community.

The project follows a model management paradigm similar to Docker. You pull models from a central registry, they are stored locally on disk, and you can list, delete, copy, or customize them. Ollama also supports Modelfiles, which function like Dockerfiles for language models, letting you define a base model, set parameters like temperature and context length, specify a system prompt, and save the resulting configuration as a new named model.

Why Run AI Models Locally

The most immediate advantage of local inference is data privacy. When you send prompts to a cloud API, your data travels across the internet and is processed on servers you do not control. For organizations working with sensitive information in healthcare, legal, financial, or government contexts, this creates compliance and security challenges that local inference eliminates entirely. With Ollama, every prompt and every response stays on your machine, never leaving your network.

Cost is another significant factor. Cloud API pricing accumulates quickly for applications that generate substantial token volume. A development team running frequent model calls during prototyping, testing, and debugging can easily spend hundreds of dollars per month on API fees. Ollama has no usage fees, no rate limits, and no per-token costs. Your only expense is the hardware you already own or choose to purchase, and that hardware serves you indefinitely across every project.

Latency and availability also improve with local inference. Cloud APIs introduce network round-trip time on every request, and they are subject to rate limits, outages, and variable response times based on server load. A local model responds in milliseconds with consistent performance, regardless of internet connectivity. This makes local inference particularly valuable for real-time applications, offline development, and scenarios where predictable response timing matters.

Local models also provide complete control over model behavior. You can fine-tune models on your own data, create custom system prompts, adjust generation parameters precisely, and test different model versions without waiting for a provider to update their API. This flexibility accelerates experimentation and gives you reproducible results that do not change when a cloud provider silently updates their model weights.

Finally, running models locally removes vendor lock-in. You are not dependent on any single provider's pricing decisions, terms of service changes, or model deprecation schedules. If a new open source model outperforms your current choice, you can switch to it with a single command. This independence is increasingly valuable as the AI landscape evolves rapidly and new models appear every few weeks.

How Ollama Works Under the Hood

When you install Ollama, it sets up a background service that listens on port 11434 by default. This service manages model storage, handles model loading and unloading from memory, and exposes a REST API for interaction. The service starts automatically on system boot for macOS and Linux installations, and can be managed through standard system service commands.

Model storage uses the GGUF format, a binary file format designed specifically for fast loading and efficient inference with quantized model weights. Quantization reduces the precision of model parameters from their original 16-bit or 32-bit floating point values to smaller representations like 4-bit or 8-bit integers. This dramatically reduces memory requirements and speeds up inference, with only a modest reduction in output quality. The most commonly used quantization level is Q4_K_M, which offers the best balance of quality, speed, and memory usage for most applications.

When you run a model for the first time, Ollama checks if the model files exist locally. If they do not, it downloads them from the Ollama library, a public registry of pre-quantized models. Models are stored in layers, similar to Docker images, so models that share a base architecture can share common layers and reduce disk usage. Once downloaded, a model remains cached locally until you explicitly delete it.

GPU detection happens automatically at startup. On systems with NVIDIA GPUs, Ollama uses CUDA for acceleration. On AMD GPUs, it uses ROCm. On Apple Silicon, it uses the Metal framework, which has the unique advantage of unified memory architecture where the GPU can access the same RAM as the CPU without copying data between separate memory pools. This makes Apple Silicon particularly efficient for local inference, as a 32GB M-series Mac can effectively use all 32GB as GPU-accessible memory.

When a model is loaded for inference, Ollama reads the GGUF file and maps the model layers to available GPU memory. If the model fits entirely in VRAM, all computation happens on the GPU. If it does not fit, Ollama automatically splits the model between GPU and CPU, loading as many layers as possible onto the GPU and handling the remainder on the CPU. This partial offloading lets you run models larger than your VRAM capacity, though at reduced speed for the CPU-processed layers.

The Modelfile system provides a declarative way to customize models. A Modelfile specifies a base model, sets parameters like temperature, top_p, top_k, repeat_penalty, and context window size, defines a system prompt, and can include adapter weights for fine-tuned versions. Running ollama create with a Modelfile produces a new named model that you can use exactly like any other model. This makes it simple to create task-specific configurations without modifying the underlying model weights.

The Ollama Model Library

The Ollama model library is a curated registry of pre-quantized models available for immediate download and use. As of May 2026, it contains over 4,500 model variants spanning dozens of model families. Each model in the library comes in multiple size variants and quantization levels, letting you choose the right trade-off between quality and resource requirements for your specific hardware and use case.

The most popular model families available through Ollama include Meta's Llama series, Alibaba's Qwen series, DeepSeek's reasoning models, Google's Gemma models, Mistral AI's models, and Microsoft's Phi series. Each family has distinct strengths. Llama 4 Scout uses a mixture-of-experts architecture with 17 billion active parameters out of 109 billion total, delivering strong general performance while remaining runnable on hardware with roughly 10GB of VRAM. Qwen3 has become the fastest-growing model on the platform, excelling particularly at coding tasks. DeepSeek-R1 specializes in chain-of-thought reasoning with step-by-step problem solving that significantly outperforms standard models on math, logic, and analytical tasks.

Vision models have expanded substantially on the platform. Gemma 4 from Google supports image input alongside text and includes native tool calling capabilities. Llama 3.2 Vision handles image understanding at the 11B parameter scale. These multimodal models accept images as part of the conversation and can describe, analyze, and reason about visual content entirely on your local hardware.

Embedding models are also available through the library. These specialized models convert text into numerical vector representations that capture semantic meaning, enabling similarity search, clustering, and retrieval augmented generation pipelines. Models like nomic-embed-text and mxbai-embed-large provide high quality embeddings that you can generate locally without sending your documents to an external service.

Code-specialized models form another important category. Beyond the general-purpose models that handle coding well, dedicated code models like CodeGemma, StarCoder2, and the coding-optimized variants of Qwen3 provide focused performance for code generation, completion, refactoring, and explanation tasks. These models understand dozens of programming languages and can generate syntactically correct, contextually appropriate code for most common development scenarios.

Each model in the library includes documentation on its parameter count, quantization options, recommended hardware, license terms, and benchmark performance. You can browse the library at ollama.com/library or search directly from the command line using ollama list for installed models and the web interface for available downloads.

Hardware Requirements at a Glance

The critical resource for local model inference is memory, specifically GPU VRAM for accelerated inference or system RAM for CPU-only operation. The general rule is that your available memory determines the largest model you can run effectively, and fitting the entire model in GPU memory is the single most important factor for achieving good performance.

At the popular Q4_K_M quantization level, an 8B parameter model requires approximately 6GB of memory, a 14B model needs about 10GB, a 32B model needs 20 to 22GB, and a 70B model needs roughly 43GB. These figures represent the model weights alone and do not include the additional memory needed for the KV cache during inference, which scales with context length and can add 1 to 4GB depending on your settings.

For NVIDIA GPUs, the RTX 3060 12GB and RTX 4060 Ti 16GB represent good entry points for running 7 to 14B models at full GPU speed. The RTX 4090 with 24GB of VRAM handles 14B models comfortably and can run 32B models with some CPU offloading. Professional cards like the A6000 with 48GB enable 70B model inference entirely on the GPU. Multi-GPU setups are supported, allowing you to split a model across two or more cards.

Apple Silicon provides a uniquely efficient platform for local inference thanks to its unified memory architecture. An M2 Max with 32GB of unified memory can run models that would require 32GB of dedicated VRAM on a discrete GPU, without the overhead of copying data between separate CPU and GPU memory pools. The M4 Max and M4 Ultra chips push this further, with configurations supporting 64GB, 128GB, or even 192GB of unified memory accessible to the GPU, making them capable of running the largest open source models available.

CPU-only inference remains an option for machines without compatible GPUs, though performance is dramatically lower. A model running entirely on the CPU typically generates 5 to 10 tokens per second compared to 40 to 80 tokens per second on a capable GPU. This makes CPU-only operation viable for testing and light development work, but impractical for production use or latency-sensitive applications.

Getting Started in Minutes

Installing Ollama takes a single command on macOS and Linux. On macOS, you download the application from ollama.com or install it through Homebrew with brew install ollama. On Linux, the installer script handles everything: curl -fsSL https://ollama.com/install.sh | sh. On Windows, a standard installer is available from the Ollama website. All three platforms set up the background service automatically.

Once installed, running your first model is equally simple. The command ollama run llama4 downloads the Llama 4 Scout model if it is not already cached locally, loads it into memory, and opens an interactive chat session in your terminal. You can type messages and receive responses immediately. To exit the session, type /bye. Other useful commands include ollama list to see your installed models, ollama pull to download a model without starting a session, and ollama rm to delete a model and free disk space.

For programmatic access, the Ollama API server runs on http://localhost:11434 as soon as the service starts. You can send requests using curl, any HTTP client library, or the official Ollama client libraries for Python and JavaScript. The API follows the same patterns as the OpenAI API, making it straightforward to switch between local and cloud models in your applications by changing only the base URL and model name.

If you prefer a graphical chat interface, the open source project Open WebUI provides a ChatGPT-style web interface that connects to Ollama directly. It supports conversation history, model switching, file uploads for multimodal models, and multiple users, making it a popular choice for teams that want a shared local AI chat environment.

The Ollama REST API

The Ollama API provides a complete set of HTTP endpoints for model interaction and management. The two primary generation endpoints are POST /api/generate for single-turn text completion and POST /api/chat for multi-turn conversations with message history. Both endpoints support streaming responses, where tokens are sent as they are generated, and non-streaming mode, where the complete response is returned as a single JSON object.

The chat endpoint accepts a messages array with role and content fields, following the same format used by the OpenAI API. This compatibility means that many libraries and frameworks designed for OpenAI can work with Ollama by changing the base URL to http://localhost:11434/v1, which exposes an OpenAI-compatible endpoint. Supported parameters include temperature for controlling randomness, top_p and top_k for sampling strategies, num_predict for maximum output length, and system for setting a system prompt.

The embedding endpoint at POST /api/embed generates vector representations of text input. It supports batching multiple text inputs in a single request and provides options for dimension reduction and automatic truncation of text that exceeds the model's context window. This endpoint is essential for building retrieval augmented generation systems, semantic search engines, and document similarity applications using local models.

Model management endpoints handle the full lifecycle of models on your system. GET /api/tags lists all installed models with their sizes and modification dates. POST /api/pull downloads a model from the registry. DELETE /api/delete removes a model from local storage. POST /api/show returns detailed metadata about a model including its Modelfile parameters, template format, and license information. POST /api/create builds a new model from a Modelfile specification. GET /api/ps shows currently loaded models and their memory usage.

Environment variables let you configure the API server behavior. OLLAMA_HOST sets the listen address and port, defaulting to 127.0.0.1:11434. OLLAMA_NUM_PARALLEL controls how many concurrent requests a single model can handle. OLLAMA_MAX_LOADED_MODELS determines how many models can stay loaded in memory simultaneously. OLLAMA_KEEP_ALIVE sets how long a model stays in memory after its last request, defaulting to 5 minutes.

Common Use Cases for Ollama

Local coding assistance is one of the most popular applications for Ollama. Tools like Claude Code, GitHub Copilot CLI, and OpenCode can connect to a local Ollama instance for code generation, explanation, and refactoring. Running a coding model locally means your source code never leaves your machine, which is particularly important when working on proprietary or security-sensitive projects. Models like Qwen3, DeepSeek-R1, and code-specific variants provide strong coding performance across dozens of programming languages.

Retrieval augmented generation, commonly known as RAG, is another major use case. In a RAG pipeline, you embed your documents locally using Ollama embedding models, store the vectors in a local database like ChromaDB or pgvector, and then use Ollama chat models to answer questions about your documents using retrieved context. The entire pipeline runs on your hardware, making it suitable for processing confidential documents, internal knowledge bases, and proprietary research materials.

AI agent systems increasingly support Ollama as a model backend. Frameworks like LangChain, CrewAI, AutoGen, and n8n can route their LLM calls through a local Ollama instance instead of a cloud API. This enables agent development and testing without accumulating API costs, and it provides consistent behavior that does not depend on cloud provider availability or model updates. For agent workflows that involve many sequential model calls, the elimination of network latency can also improve total execution time.

Content generation and analysis tasks benefit from local models when you need to process large volumes of text without per-token costs. Summarizing documents, extracting structured data from unstructured text, translating between languages, and generating product descriptions or marketing copy are all tasks where local models can handle the workload at a fraction of the cost of cloud APIs, especially for batch processing scenarios.

Education and research represent a growing use case as well. Students and researchers can experiment with language models freely, without worrying about costs or usage quotas. They can study model behavior, test different prompting strategies, compare model architectures, and build prototype applications, all with the full transparency that comes from running the model directly on their own hardware.

Ollama in Production Environments

While Ollama is primarily designed for development and single-user scenarios, it can serve in production environments with the right configuration. Running Ollama in Docker provides containerization, easy deployment, and GPU passthrough through the NVIDIA Container Toolkit. The official Docker image at ollama/ollama supports named volumes for model persistence, environment variable configuration, and Docker Compose orchestration with related services like Open WebUI.

For multi-user production workloads with high concurrency requirements, Ollama may not be the optimal choice. Tools like vLLM and text-generation-inference are purpose-built for high-throughput serving with features like PagedAttention for efficient GPU memory management, continuous batching for handling many concurrent requests, and optimized scheduling algorithms. In benchmarks, vLLM delivers over 35 times the request throughput compared to Ollama under heavy concurrent load. The recommended pattern for many teams is using Ollama during development and switching to vLLM or a similar serving framework for production deployment.

For moderate production loads with a handful of concurrent users, Ollama performs well when configured appropriately. Setting OLLAMA_NUM_PARALLEL to match your expected concurrency, pre-loading models with ollama pull before they are needed, and ensuring sufficient VRAM to keep models fully GPU-resident all contribute to stable, responsive performance. Monitoring memory usage and model load times helps identify when a more specialized serving solution becomes necessary.

Integration with reverse proxies like nginx or Caddy adds TLS termination, authentication, and load balancing in front of the Ollama API. This is particularly useful when exposing Ollama to a local network for team use, where you want to add access controls without modifying Ollama configuration directly. The HTTP-based design of the API makes it straightforward to integrate with standard web infrastructure tools.

Explore This Topic

Getting Started

Models and Performance

Technical Setup and Integration

Comparisons and Use Cases