What Hardware Do You Actually Need?
With modern Q4/Q5 quantization, 7B and many 13B models run comfortably on 8–16 GB VRAM. 24 GB VRAM is the sweet spot for quantized 30B–70B models — not a hard minimum for all mid-size inference.
For cost-sensitive setups, CPUs paired with 4-bit or 8-bit quantization via llama.cpp remain a viable path. Quantization trades a measurable but often acceptable reduction in output precision for dramatically reduced memory requirements.
Storage requirements are more forgiving than older guides suggest. A heavily quantized 7B model typically needs 4–8 GB of disk space. A quantized 70B model requires roughly 30–40 GB. Storage climbs into the hundreds of gigabytes only if you store multiple model variants or keep full-precision weights. Plan your storage tier before you plan your software stack.
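The arithmetic behind these figures is simple: weight size is parameter count times bits per weight. A minimal sketch of that estimate in Python; the 20% runtime overhead factor is an illustrative assumption, since real usage varies with KV cache size, context length, and serving runtime:

```python
# Rough memory-footprint estimator for quantized LLM weights.
# The 20% overhead factor is an assumed illustrative value, not a measurement.

def weight_gb(n_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

def est_total_gb(n_params_billions: float, bits_per_weight: float,
                 overhead: float = 0.2) -> float:
    """Weights plus a rough allowance for KV cache and runtime buffers."""
    return weight_gb(n_params_billions, bits_per_weight) * (1 + overhead)

# A 4-bit 7B model: 3.5 GB of weights, ~4.2 GB with overhead --
# consistent with the 4-8 GB disk figure above.
print(round(est_total_gb(7, 4), 1))   # 4.2
# A 4-bit 70B model: 35 GB of weights, ~42 GB with overhead.
print(round(est_total_gb(70, 4), 1))  # 42.0
```

The same formula explains why full-precision (16-bit) weights quadruple these numbers and push storage into the hundreds of gigabytes.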
Network matters too. A reliable, low-latency connection between your server and clients affects perceived response speed — especially under concurrent load.
Why Would You Host an LLM Locally Instead of Using an API?
Self-hosting gives you complete control over data privacy, compliance, latency, and cost — none of which you fully own with a cloud API.
The reasons break down clearly:
- Privacy and control: Sensitive data never leaves your infrastructure.
- GDPR and regulatory compliance: For hospitals (HIPAA) and fintech (BaFin, PCI-DSS), a model that physically stays on your premises removes most jurisdictional red tape. Regulators can trace the full processing path because logs and weights are stored on-premises.
- Custom encryption: You can enable LUKS disk encryption or server-side encryption on S3-compatible object storage without waiting on a cloud vendor.
- Offline use: When there is no internet connection at all, a local model keeps running. This matters for edge deployments and air-gapped environments.
- Cost predictability: At scale, a fixed hardware cost is more manageable than variable per-token API fees.
- Faster audits: On-prem storage of weights and logs gives compliance teams direct access without involving third parties.
Which Serving Stack Should You Use?
The right serving stack depends on your use case: Ollama for developer simplicity, vLLM for production-grade API performance, and LocalAI when you need multimodal extensions.
Here is how the major tools map to use cases:
- Ollama: Simple model management with commands like ollama run llama3.2. Offers an OpenAI-compatible API and supports an extensive model library including Llama, Mistral, Gemma, Phi, and Qwen. Best for developers who need API integration and flexibility.
- LM Studio: UI-driven experience designed for beginners. Excellent for getting a model running quickly with minimal configuration.
- vLLM: High-throughput token streaming, batching, and GPU scheduling. The standard choice for production deployments with OpenAI-compatible /v1/chat/completions, /v1/embeddings, and /v1/models routes. Best for teams that need enterprise-grade serving.
- LocalAI: Extends beyond text with wide format support (GGUF, ONNX), tool calling, and MCP support. The strongest choice when your application requires multimodal capabilities, autonomous agents, or broad model format coverage.
- Jan: Privacy-first, built for fully offline use with optional mobile support. Suited for users who prioritize data isolation.
- Docker Model Runner: Well-suited for container-first workflows that integrate model serving into the existing Docker ecosystem.
- Lemonade: Targets AMD Ryzen AI hardware, leveraging NPU and integrated GPU for strong performance on that specific platform.
- Backyard AI: Designed for character-based creative writing interactions.
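Several of the stacks above (Ollama, vLLM, LocalAI) expose the same OpenAI-compatible chat-completions interface, so client code is portable between them. A minimal sketch using only the Python standard library; the localhost URL and model name are assumptions for a default local Ollama install:

```python
import json
import urllib.request

# Default Ollama endpoint; vLLM and LocalAI serve the same route on their
# own ports. Both the URL and the model name are assumptions here.
API_URL = "http://localhost:11434/v1/chat/completions"

def chat_body(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def send(body: dict) -> dict:
    """POST the request to the local server (requires a running instance)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = chat_body("llama3.2", "Summarize GDPR in one sentence.")
# reply = send(body)  # uncomment with a local server running
```

Because the request shape is standardized, swapping Ollama for vLLM later means changing only the base URL.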
How Do You Set Up a Self-Hosted LLM Step by Step?
The setup sequence is: provision hardware, install Linux and drivers, choose a serving stack, configure your API, and add monitoring and security.
- Provision hardware. For GPU inference, 8–16 GB VRAM handles quantized 7B–13B models; 24 GB is the sweet spot for 30B–70B quantized models. For CPU-only setups, configure the machine for quantized inference via llama.cpp. Ensure sufficient NVMe storage for model weights.
- Install a Linux OS. Debian or Ubuntu are the recommended distributions.
- Install the NVIDIA driver, CUDA, and the NVIDIA Container Toolkit. These are required before any GPU-accelerated serving stack will run in containers.
- Choose a serving stack. For development: Ollama or LM Studio. For production: vLLM (GPU) or LocalAI (multimodal). For orchestrated environments: run your production stack on Kubernetes.
- Configure your API endpoints. Expose /v1/chat/completions, /v1/embeddings, and /v1/models routes over a secure HTTPS endpoint. Use an Ingress controller with TLS.
- Deploy model weights. Download and store the model files. For heavily quantized 7B models, allocate 4–8 GB. For quantized 70B models, plan for 30–40 GB. Storage scales into the hundreds of gigabytes only when storing multiple variants or full-precision weights.
- Tune quantization and batch size. Use 4-bit or 8-bit quantization to reduce memory load. Tune batch size for optimal throughput on your specific hardware.
- Add monitoring. Integrate Prometheus-compatible exporters and OpenTelemetry to track latency, GPU utilization, and request rates.
- Harden security. Configure firewalls, VPNs, and optional LUKS disk encryption.
- Set up automated updates. Maintain update pipelines for both model files and serving software.
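After working through the steps above, a quick readiness check is to list the models your endpoint actually serves. A sketch; the response shape follows the OpenAI-style /v1/models convention, and the sample payload below is illustrative, not real server output:

```python
import json
import urllib.request

def list_models(base_url: str) -> list:
    """Fetch /v1/models from a live server and return the served model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return [m["id"] for m in json.load(resp)["data"]]

def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload["data"]]

# Illustrative response shape -- actual ids depend on what you deployed.
sample = {"object": "list", "data": [{"id": "llama3.2", "object": "model"}]}
print(model_ids(sample))  # ['llama3.2']
```

Wiring this check into a cron job or Kubernetes liveness probe catches a stalled server before users do.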
How Do You Run a Self-Hosted LLM on Kubernetes?
Kubernetes provides the orchestration layer needed to manage GPU-intensive LLM workloads at scale, with autoscaling, persistent storage, and production-grade reliability.
The Kubernetes path involves several deliberate choices:
Deploy containers on Kubernetes with GPU node pools. Use Helm charts or custom manifests to define compute resources, autoscaling policies, and persistent volumes for storing model weights. Without persistent volumes correctly configured, model weights must be re-downloaded every time a pod restarts — a significant operational problem for large models.
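A compact sketch of what that looks like in manifest form. This is illustrative only: the names, image tag, and storage size are placeholders, not a tested production configuration.

```yaml
# Sketch only -- names, image, and sizes are placeholder assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi            # enough for a quantized 70B model
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 1
  selector:
    matchLabels: {app: llm-server}
  template:
    metadata:
      labels: {app: llm-server}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # schedules the pod onto a GPU node
          volumeMounts:
            - name: weights
              mountPath: /models  # weights survive pod restarts
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: model-weights
```

The persistentVolumeClaim is the piece that prevents the re-download problem described above: the weights live on the claim, not in the pod's ephemeral filesystem.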
Expose a secure HTTPS endpoint using an Ingress controller with TLS termination. The endpoint should implement the standard OpenAI-compatible routes: /v1/chat/completions, /v1/embeddings, and /v1/models.
For observability, integrate Prometheus-compatible exporters and OpenTelemetry. Track GPU utilization, token throughput, request latency, and error rates. Without this telemetry, capacity planning and incident response become guesswork.
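If you roll a small exporter of your own rather than relying on a stack's built-in metrics, the Prometheus text exposition format is simple to emit. A minimal sketch; the metric names and values are illustrative assumptions:

```python
def prometheus_lines(metrics: dict) -> str:
    """Render a dict of gauge values in Prometheus text exposition format."""
    out = []
    for name, value in sorted(metrics.items()):
        out.append(f"# TYPE {name} gauge")
        out.append(f"{name} {value}")
    return "\n".join(out) + "\n"

# Illustrative values; a real exporter would read these from the runtime.
text = prometheus_lines({
    "gpu_utilization_ratio": 0.87,
    "request_latency_seconds": 0.42,
})
print(text)
```

Serving this text from a /metrics HTTP endpoint is all a Prometheus scraper needs; in practice the official prometheus_client library handles the endpoint and metric types for you.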
Running LLMs on Kubernetes gives you complete control over data privacy, latency, and costs — with the orchestration layer capable of managing these resource-intensive workloads effectively.
What Are the Limitations and Risks?
Self-hosting an LLM is operationally demanding. Hardware costs, security responsibility, and manual maintenance are real constraints — not edge cases.
Hardware requirements are more accessible than they were. Modern 7B–30B models from 2025–2026, combined with better quantization, now outperform older 70B models on many practical benchmarks. The entry bar has dropped — but 70B models still require hardware most small teams need to budget for carefully.
Disk space is a real constraint, though less extreme than older guides suggested. Heavily quantized 7B models need 4–8 GB. Quantized 70B models require roughly 30–40 GB. The hundreds-of-gigabytes figure applies only when storing multiple variants or full-precision weights. Storage planning is still mandatory.
Security is fully your responsibility. SSL configuration, firewall rules, VPN access, and disk encryption are tasks that cloud providers handle by default. On-premises, they fall to your team.
Manual maintenance is ongoing. Model files and serving software require regular updates. Unlike managed APIs, there is no automatic patching.
Compliance is faster but not automatic. While self-hosting removes the need for many third-party Data Transfer Impact Assessments, you are still responsible for implementing and documenting the required controls — access logging, encryption at rest, and access via VPN or VLAN.
Quantization trade-offs exist. Aggressive 4-bit quantization reduces hardware requirements significantly, but at the cost of a measurable reduction in output precision; the size of that loss varies by model, task, and quantization method.
The 2026 ecosystem has matured significantly in API standardization and quantization tooling, but hardware accessibility and operational complexity remain genuine barriers for smaller teams.