GPU and Hardware for AI

Computing power is the primary constraint for training and running Large Language Models (LLMs). While CPUs can run models, GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) are required for practical performance due to their ability to handle massive parallel matrix operations.

VRAM and Model Fitting

The critical hardware metric for local LLM inference is VRAM (Video RAM). A model must fit within the available VRAM to run at acceptable speeds; otherwise, it must “offload” to system RAM, which is significantly slower.

VRAM Estimation Table

The amount of VRAM needed depends on the model size (parameters) and the quantisation level (how many bits are used to represent each weight).

Model SizeQuantisationVRAM NeededNote
7B4-bit (GGUF Q4)~4 GBEntry-level consumer GPUs
7B8-bit~8 GBHigh quality, standard GPUs
13B4-bit~8 GBBalanced performance
30B4-bit~16 GBMid-to-high end consumer GPUs
30B2-3 bit~8–12 GBViable on edge hardware (e.g., Raspberry Pi)
70B4-bit~40 GBHigh-end workstation (A100/H100 or multi-GPU)
70B2-bit~20 GBMinimum for massive models on consumer gear

Key Tool: The “LLM GPU fit tool” is used by the community to calculate if a specific model version will fit in available VRAM before downloading.

Hardware Ecosystem (2026)

Consumer Hardware

  • NVIDIA RTX Series (e.g., 4090): The gold standard for local LLMs due to CUDA cores and high VRAM.
  • Apple M-Series (Unified Memory): Highly effective because the GPU can access the entire system RAM as VRAM, allowing larger models (70B+) to run on high-spec Macs.
  • Raspberry Pi: Recent milestones show that highly quantised 30B models (Qwen) can now achieve real-time inference on sub-$100 hardware.

Hardware Ecosystem (2026)

Consumer Hardware

  • NVIDIA RTX Series (e.g., 4090): The gold standard for local LLMs due to CUDA cores and high VRAM.
  • Apple M-Series (Unified Memory): Highly effective because the GPU can access the entire system RAM as VRAM, allowing larger models (70B+) to run on high-spec Macs.
  • Raspberry Pi: Recent milestones show that highly quantised 30B models (Qwen) can now achieve real-time inference on sub-$100 hardware.

Specialized AI Hardware

  • TPUs (Tensor Processing Units): Google’s custom ASICs designed specifically for TensorFlow/JAX workloads. They provide massive acceleration for tensor operations and are available via Google Cloud infrastructure to empower developers to build NLP and AI projects without managing underlying hardware.
  • Edge AI: Small-scale models like Nemotron 3 Nano are designed specifically for on-device deployment where VRAM is extremely limited.

TensorFlow Hardware Integration

TensorFlow provides specific abstractions to manage hardware:

  • Distributed Execution: Ability to run models across a cluster of machines or multiple GPUs/CPUs on a single machine.
  • TensorFlow Lite: Optimises models for mobile and embedded devices, reducing memory footprint and increasing inference speed.
  • TensorFlow Serving: High-performance system for deploying models to production servers.

Production Realities

  • Resource Intensity: Deep learning is computationally expensive. Training requires massive GPU clusters, while inference requires a specific VRAM “floor” to be functional.
  • Bottlenecks: Performance is often limited by memory bandwidth (how fast data moves to the GPU) rather than raw compute power.

Local Inference Runtimes (2026)

RuntimeBest forNotes
llama.cppRaw performance — fastest tokens/sec30–70% faster than Ollama on same model. Now has router mode + web UI. GGUF first-class
LM StudioGUI + simplicityPolished interface, proper GGUF support, model switching, regularly updated llama.cpp backend
vLLMMulti-user / productionProper concurrency, production-ready, industry standard
SGLangMulti-user / productionAlternative to vLLM for production concurrency
oMLX / MLXMac / Apple SiliconNative Apple Silicon, continuous batching, MTP support
Ollama⚠️ Caution (2026)See below

Ollama — 2026 status

Ollama was the go-to local inference tool in 2023–2025 but has accrued significant problems. Key issues:

  • 30–70% slower than llama.cpp on same model (confirmed by llama.cpp creator Georgi Gerganov via X): Ollama’s MXFP4 kernels have too much branching, attention sinks implementation is inefficient
  • Proprietary model format (2024–2025): forked ggml, stored models in hashed filenames in their own registry — models were trapped, couldn’t use with llama.cpp/LM Studio. Switched back to llama.cpp in v0.30.0-rc15 (May 2026) due to falling behind on new architectures (MTP, structured output, hybrid attention) and community pressure
  • Ollama Cloud reliability (as of May 2026): 29.7% failure rate on Qwen3.5, 95% failure rate reported across all models, 60+ second timeouts, broken tool calling, hostile rate limiting ($100/month users throttled after 5 days)
  • Misleading model naming: listed DeepSeek-R1-Distill-Qwen-32B simply as “DeepSeek-R1” — drove confusion about actual model capabilities
  • Trust broken: was local-first, now VC-backed platform company pushing cloud

Practical advice: continue using Ollama for embeddings + simple local tasks where you already have it set up. For new local inference infrastructure, prefer llama.cpp directly or LM Studio for GUI. For production multi-user, use vLLM or SGLang.

Source: Andrew Zhu — “Why You Should Completely Avoid Ollama in 2026” (2026-05-27, 377 claps, 619 claps total reactions)


See also:

  • AI-ML — for the mathematical foundations of the patterns GPUs accelerate.
  • Python — the primary language used to interface with GPU hardware via libraries like PyTorch and TensorFlow.