Ollama Performance: Speed and Quality by Model
Token Generation Speed by GPU
Performance numbers vary based on model, quantization, context length, and specific GPU model, but general ranges give useful guidance. On an NVIDIA RTX 4090 with 24GB VRAM at Q4_K_M quantization, 8B models generate 60 to 80+ tokens per second. 14B models run at 35 to 55 tokens per second. 32B models that fit entirely in VRAM achieve 15 to 30 tokens per second. Models requiring CPU offloading drop to 5 to 15 tokens per second depending on how much spills to system RAM.
On an RTX 4060 Ti 16GB, expect roughly 40 to 60 tokens per second for 8B models and 25 to 40 for 14B models at Q4_K_M. The RTX 3060 12GB delivers similar per-token speeds but is limited to smaller models or lower quantization levels. For budget GPUs like the RTX 3060 with 8GB, 8B models in Q4_K_M run at 30 to 50 tokens per second, which is comfortable for interactive use.
Apple Silicon performance benefits from the unified memory architecture, which lets the GPU address most of system memory. An M2 Pro with 16GB runs 8B models at 25 to 40 tokens per second. An M2 Max with 32GB handles 14B to 32B models at 15 to 35 tokens per second depending on the model. Higher-end chips such as the M4 Max push these numbers higher thanks to greater memory bandwidth, reaching 40 to 60 tokens per second for 14B models.
The VRAM Cliff
The single most important performance factor is whether your model fits entirely in GPU memory. When a model fits in VRAM, you get full GPU-accelerated inference. When it does not, the model splits between GPU and CPU, and the CPU-processed layers run 5 to 10 times slower than the GPU layers. This creates a dramatic performance cliff rather than a gradual degradation.
For example, a 14B model at Q4_K_M needs about 10GB of VRAM. On a 12GB GPU, it fits comfortably and runs at full speed. On an 8GB GPU, roughly 20 to 30 percent of the model spills to CPU, cutting effective generation speed by 40 to 60 percent. The drop is larger than the spilled fraction because every token must pass through all layers in sequence, so the time spent in the much slower CPU layers dominates the total time per token.
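A rough way to see why the slowdown outpaces the spilled fraction is to treat per-token time as the sum of the time spent in GPU layers and in CPU layers. The sketch below is a simplified model, not a benchmark: the function name relative_speed is hypothetical, it assumes every layer costs the same on the GPU, and it uses the 5x to 10x CPU slowdown quoted above.

```python
def relative_speed(cpu_fraction: float, cpu_slowdown: float) -> float:
    """Speed relative to a fully GPU-resident model (1.0 means full speed).

    Assumes each layer takes one unit of time on the GPU and
    cpu_slowdown units on the CPU, and that layers run one after another.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_slowdown < 1.0:
        raise ValueError("cpu_slowdown must be at least 1")
    # Per-token time is the sum of GPU-layer time and CPU-layer time.
    relative_time = (1.0 - cpu_fraction) + cpu_fraction * cpu_slowdown
    return 1.0 / relative_time


if __name__ == "__main__":
    for cpu_fraction in (0.2, 0.3):
        for cpu_slowdown in (5.0, 10.0):
            speed = relative_speed(cpu_fraction, cpu_slowdown)
            print(f"{cpu_fraction:.0%} on CPU, {cpu_slowdown:.0f}x slower: "
                  f"{speed:.0%} of full speed")
```

Under these assumptions, spilling 20 percent of the layers at a 5x slowdown already cuts speed to roughly 56 percent of the fully loaded figure, which is in line with the 40 to 60 percent loss described above.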
This cliff effect means that choosing a slightly smaller model that fits entirely in VRAM almost always produces a better experience than choosing a larger model that requires CPU offloading. A fully GPU-loaded 8B model at 50 tokens per second feels faster and more responsive than a partially offloaded 14B model at 15 tokens per second, even if the larger model produces marginally better quality output.
Quantization Impact on Quality and Speed
Quantization level affects both quality and speed, but the impact on each is different. Moving from Q8_0 (8-bit) to Q4_K_M (4-bit) roughly halves memory usage, increases speed by 20 to 40 percent, and reduces output quality by a small but measurable amount. For most practical applications, Q4_K_M output is hard to tell apart from full precision, and the memory and speed benefits make it the default recommendation.
Q5_K_M offers a middle ground, using roughly two-thirds of the memory of Q8_0 (a little over a third of FP16) with quality that is very close to Q8_0. If you have the VRAM headroom to run Q5_K_M without CPU offloading, it provides a noticeable quality improvement over Q4_K_M for tasks that demand precision, like complex reasoning and nuanced writing.
Going below Q4, to Q3_K or Q2_K, saves additional memory but introduces more noticeable quality degradation. These ultra-low quantization levels are primarily useful for running very large models on limited hardware where the alternative is not running the model at all. A Q3_K 70B model that barely fits in VRAM may still produce better results than a Q4_K_M 32B model, despite the heavier quantization.
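The memory side of these trade-offs can be estimated from parameter count and bits per weight. The following is a minimal sketch: the bits-per-weight values are approximate figures that vary somewhat by model, and the 1.5GB overhead for KV cache and runtime buffers is an assumption chosen for illustration, as are the function names.

```python
# Approximate bits per weight for common formats. These are assumed,
# rounded values; real files differ slightly by model architecture.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weights_gb(n_params: float, quant: str) -> float:
    """Estimated size of the model weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits_in_vram(n_params: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb."""
    return weights_gb(n_params, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"14B at {quant}: about {weights_gb(14e9, quant):.1f} GB of weights")
    print("14B Q4_K_M fits in 12GB:", fits_in_vram(14e9, "Q4_K_M", 12))
    print("14B Q4_K_M fits in 8GB:", fits_in_vram(14e9, "Q4_K_M", 8))
```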
Context Length and Performance
Longer context windows consume more memory and slow down inference. The KV cache, which stores the attention state for all tokens in the context, grows linearly with context length and can consume 1 to 4GB of additional memory for contexts of 8192 to 32768 tokens. This memory comes out of your available VRAM or system RAM, potentially pushing a model from fully GPU-loaded to partially CPU-offloaded.
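The linear growth of the KV cache can be checked with a short calculation. This sketch assumes a 16-bit cache and an architecture with 32 layers, 8 key/value heads, and a head dimension of 128 (typical of an 8B model with grouped-query attention); these parameters and the function name are illustrative assumptions, and other models will differ.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to store keys and values for every token in the context."""
    # Factor of 2: one tensor for keys and one for values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


if __name__ == "__main__":
    for context_len in (2048, 8192, 32768):
        size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              context_len=context_len)
        print(f"context {context_len}: {size / 1024**3:.2f} GiB")
```

With these assumed parameters the cache needs 1 GiB at 8192 tokens and 4 GiB at 32768 tokens, matching the range given above.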
Generation speed also decreases as context grows because each new token must attend to all previous tokens. A token generated deep into a long context takes longer than one generated near the start, so the slowdown builds up over the course of a long conversation or document. For interactive chat sessions, keeping context under 4096 tokens provides the best balance of conversation memory and generation speed.
If your use case requires long contexts, such as analyzing lengthy documents or maintaining extended conversations, account for the KV cache memory when choosing your model and quantization. A model that fits perfectly in 12GB VRAM with a 2048 context might spill to CPU with a 16384 context, crossing the performance cliff discussed above.
Prompt Processing vs Token Generation
Ollama reports two speed metrics: prompt processing (input) and token generation (output). Prompt processing happens when the model reads and processes your input text. Token generation is the autoregressive loop that produces the response. These two phases have different performance characteristics.
Prompt processing is inherently parallel because all input tokens can be processed simultaneously. GPUs excel at parallel computation, so prompt processing is typically very fast, often processing thousands of tokens per second. Prompts of up to a couple of thousand tokens usually process in about a second or less on capable hardware, while very long prompts take proportionally longer.
Token generation is sequential because each new token depends on all previous tokens. This makes it slower than prompt processing and creates the bottleneck that determines perceived response speed. The tokens-per-second figures quoted throughout this guide refer to token generation speed, as this is what determines how quickly you see the model's response appear.
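Both rates can be computed from the timing fields Ollama includes in a completed API response: prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration, with durations in nanoseconds. The helper below is a minimal sketch; the function name and the sample numbers are made up for illustration, and it returns None for a rate when a field is missing.

```python
from typing import Optional


def generation_rates(response: dict) -> dict:
    """Compute tokens per second for both phases from an Ollama response dict."""

    def rate(count_key: str, duration_key: str) -> Optional[float]:
        count = response.get(count_key, 0)
        duration_ns = response.get(duration_key, 0)
        if not count or not duration_ns:
            return None  # field missing or zero, rate cannot be computed
        return count / (duration_ns / 1e9)  # durations are in nanoseconds

    return {
        "prompt_tokens_per_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "generation_tokens_per_s": rate("eval_count", "eval_duration"),
    }


if __name__ == "__main__":
    # Made-up sample values, not a real measurement.
    sample = {
        "prompt_eval_count": 1000,
        "prompt_eval_duration": 500_000_000,
        "eval_count": 100,
        "eval_duration": 2_000_000_000,
    }
    print(generation_rates(sample))
```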
Optimization Tips
Keep models that you use frequently loaded in memory by setting OLLAMA_KEEP_ALIVE to a longer duration. The default 5-minute timeout means the model unloads after idle periods, requiring a reload that takes several seconds on the next request. Setting it to a longer value or to -1 for indefinite keeps the model ready for immediate response.
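Besides the environment variable, the Ollama API accepts a keep_alive field on individual requests, and a request that names a model without a prompt simply loads it. The sketch below only builds the JSON body for such a preload request to /api/generate; the helper name and the model tag are placeholders chosen for illustration.

```python
import json
from typing import Union


def build_preload_body(model: str, keep_alive: Union[str, int] = "30m") -> bytes:
    """JSON body that asks Ollama to load a model and keep it resident.

    keep_alive may be a duration string such as "30m" or -1 for indefinite.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")


if __name__ == "__main__":
    # "llama3.1:8b" is a placeholder model tag.
    print(build_preload_body("llama3.1:8b", keep_alive=-1).decode("utf-8"))
```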
Close other GPU-consuming applications when running larger models. Video editors, games, and even multiple browser tabs with GPU-accelerated content can consume VRAM that would otherwise be available for model inference. Freeing up even 1 to 2GB of VRAM can mean the difference between a model fitting fully in VRAM or spilling to CPU.
Monitor actual VRAM usage with nvidia-smi on NVIDIA systems or Activity Monitor on macOS. Understanding your actual VRAM consumption helps you make informed decisions about model size, quantization, and context length. The ollama ps command shows which models are currently loaded, how much memory they are using, and whether each one is running entirely on the GPU or split between GPU and CPU.
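On NVIDIA systems the same check can be scripted. This sketch parses the CSV output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits, which reports values in MiB; the function names are hypothetical, and query_vram only works on a machine where nvidia-smi is installed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram(output: str) -> list:
    """Parse 'used, total' lines (MiB) into a list of (used, total) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def free_vram_mib(output: str) -> list:
    """Free VRAM in MiB for each GPU listed in the output."""
    return [total - used for used, total in parse_vram(output)]


def query_vram() -> list:
    """Run nvidia-smi and return (used, total) in MiB for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(result.stdout)
```

Comparing the free figure against the estimated model size plus KV cache gives a quick indication of whether a model is likely to load fully onto the GPU.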
For batch processing tasks, disable streaming by setting stream to false in the API request so that each response arrives as a single JSON object. Streaming adds overhead from sending individual token responses over HTTP, which is negligible for interactive use but adds up when processing thousands of requests in sequence.
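A batch loop without streaming might look like the following sketch. It assumes an Ollama server on the default local address and uses only the Python standard library; the helper names and the model tag are placeholders, and error handling and retries are left out for brevity.

```python
import json
import urllib.request

# Default local Ollama address; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str,
                  url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request that asks for one complete, non-streamed response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_batch(model: str, prompts: list, url: str = OLLAMA_URL) -> list:
    """Send prompts one at a time and collect the generated text."""
    results = []
    for prompt in prompts:
        with urllib.request.urlopen(build_request(model, prompt, url)) as resp:
            results.append(json.loads(resp.read())["response"])
    return results


if __name__ == "__main__":
    # Requires a running Ollama server; "llama3.1:8b" is a placeholder tag.
    print(generate_batch("llama3.1:8b", ["Summarize: GPUs are fast."]))
```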
Performance is dominated by whether your model fits in GPU memory. Choose the largest model that fits entirely in VRAM at Q4_K_M quantization for the best balance of quality and speed. A fully GPU-loaded smaller model almost always outperforms a partially CPU-offloaded larger model in practice.