How to run high-performance LLMs locally?
Keep your data private and reduce cloud bills by hosting Llama 3, Mistral, or Gemma on your own infrastructure with Ollama, llama.cpp, and vLLM.
You want LLMs on your own metal, under your rules, without spraying data into random US-West regions. Good. Let's walk through how to actually do that fast and reliably, not as a weekend science project.
1. First: What Does "Local" Actually Mean for You?
"Local" can mean three very different things:
- Developer laptop / workstation – MacBook, Linux box, maybe a single consumer GPU. Good for: internal tools, prototyping, small-team assistants (Ollama, llama.cpp)
- On-prem / private cluster – Your own racks, or at least a private VPC account under your control. Good for: production APIs, RAG systems, multi-tenant internal apps (vLLM, custom servers)
- Edge / constrained hardware – Minis, NUCs, ARM boards, offline boxes. Good for: hardcore data isolation, low-latency edge use cases (heavily quantized llama.cpp)
All three can run Llama-class models locally. The stack you choose will differ.
2. Picking the Model: Llama 3 as the Default Starting Point
Meta's Llama 3 family basically became the default "serious open model": 8B and 70B parameter variants, both pre-trained and instruction-tuned, meant to be used for a wide range of tasks.
Key facts:
- You are allowed to self-host Llama 3 under Meta's license (with some conditions, especially at big scale / MAU thresholds)
- You can download weights from Meta or hubs like Hugging Face and run them on your own hardware
- There are plenty of guides for running Llama 3 locally with Python servers, llama.cpp, or tools like Ollama and GPT4All
So when people search "self-hosting Llama 3," they're really asking: What runtime should I use? How do I make it fast enough on my hardware? How do I do that in a way Legal and Security can live with?
3. Three Main Runtimes You Should Actually Care About
You'll see a zoo of options, but 90% of serious local setups land on some combination of: Ollama, llama.cpp, vLLM. They solve different problems.
3.1 Ollama – Easy Mode for Local LLMs
What it is: Ollama is a CLI + server that makes it trivially easy to run LLMs locally on macOS, Windows, and Linux. You run ollama pull llama3 and it handles downloading, packaging, and spinning up a local API. It uses Dockerfile-like "Modelfiles" to bundle weights + config, and is built on top of optimized runtimes like llama.cpp.
When it's a good fit:
- You want to run Llama 3, Mistral, DeepSeek, etc. locally with minimal effort
- You care about developer productivity and simple integration: local REST API, easy model switching, simple config for quantization / GPU usage
- You don't want to manually deal with GGUF files, GPU flags, or custom servers yet
When it's not enough: You need cluster-level throughput, multi-GPU sharding, or heavy multi-tenant serving. You want full control over serving internals (batching, scheduling) and deep integration into your infra.
Think of Ollama as "local LLM platform for humans". Great for teams getting off the ground and for internal-only workflows.
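To make the "simple integration" point concrete, here's a minimal sketch of calling Ollama's local REST API from Python, assuming you've already pulled the model with ollama pull llama3 and the server is running on its default port:

```python
import requests

# Ollama's local API listens on port 11434 by default.
# Assumes the model is already pulled: `ollama pull llama3`
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize why teams self-host LLMs, in two sentences.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Swapping models is a one-line change to the "model" field once you've pulled them.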
3.2 llama.cpp – Bare-Metal Control, Runs Anywhere
What it is: llama.cpp is a C/C++ inference engine for running LLMs efficiently on CPUs and GPUs, with support for quantized GGUF models and tons of backends (CUDA, Metal, ROCm, etc.). It's famously capable of running large models like Llama 3 on laptops, desktops, and even Raspberry Pis (slowly, but it works).
Why people use it:
- Runs basically everywhere: Linux, macOS, Windows, ARM, embedded
- Heavy quantization support: 8-bit, 6-bit, 4-bit GGUF variants and more to squeeze models onto cheaper hardware
- No Python runtime required: good for hardened environments and strict ops teams
When it's a good fit: You're serious about CPU or mixed hardware (not just big NVIDIA boxes). You want fine-grained control over quantization and performance tradeoffs. You're comfortable wrapping it in your own service (REST/gRPC, auth, logging, etc.).
Tradeoff: More power and portability, more work. Ollama uses llama.cpp under the hood for many models; llama.cpp is the low-level engine, Ollama is the nicer DX wrapper.
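If you want the llama.cpp engine without writing your own C++ wrapper, the llama-cpp-python bindings are a common middle ground. A minimal sketch, assuming you've already downloaded a quantized GGUF file (the path below is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point it at whichever quantized GGUF you downloaded.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```

The same tradeoff applies: you get direct control over quantization level, context size, and GPU offload, but you own the service layer around it.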
3.3 vLLM – High-Throughput GPU Inference Engine
What it is: vLLM is a high-performance LLM inference engine and server designed for GPU clusters: continuous batching, paged attention, prefix caching, and all the tricks you need to squeeze maximum throughput out of expensive GPUs.
In real-world benchmarks, vLLM routinely delivers several-fold throughput improvements over naive serving stacks on multi-request workloads.
When it's a good fit:
- You're running Llama 3 or other big models on A100 / H100 / similar GPUs
- You care about serving many concurrent users with stable latency
- You want features like: continuous batching, paged attention / KV-cache optimization, multi-GPU / model sharding, OpenAI-style serving APIs
Tradeoff: vLLM assumes you're OK with Python and GPUs and that you treat LLMs like a proper service (Kubernetes, observability, etc.). It's overkill for a single developer laptop, perfect for on-prem LLM APIs.
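For a feel of the API, here's a minimal offline-inference sketch with vLLM's Python interface; the model ID and sampling settings are illustrative, and in production you'd normally run its OpenAI-compatible server instead:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Example model ID; swap in whatever weights you've mirrored locally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = [
    "Write a one-line docstring for a function that deduplicates a list.",
    "List three risks of logging raw LLM prompts.",
]

# vLLM batches these prompts internally and schedules them across the GPU.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```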
4. Performance Levers: How to Make Local LLMs Not Suck
Tools are nice, but if you ignore the basics, you still end up with a slow local chatbot that times out under real load.
4.1 Hardware: Match Model Size to VRAM/Memory
Rules of thumb:
- Full-precision 70B models are not for your 8 GB GPU
- Use: 7–8B models for commodity GPUs / laptops; 8–14B models for mid-range single GPUs; 70B class only if you have serious VRAM (or you accept heavy quantization / CPU offload)
For high-throughput on-prem serving: prioritize VRAM and bandwidth (A100/H100, MI300, etc.). Fewer big GPUs used well > many small GPUs used naively.
4.2 Quantization: Your Best Friend for Local LLMs
Quantization is how you run 70B-class models on hardware that shouldn't reasonably hold them:
- Convert model weights from fp16 → int8 / int4
- You lose a bit of quality and gain a lower memory footprint, higher throughput, and the ability to run on cheaper hardware
llama.cpp and tools around it are heavily optimized for GGUF quantized models.
Guidelines: For experiments / internal tools, 4-bit quant is usually fine. For precision-sensitive domains, start with 8-bit or mixed-precision and eval before going more aggressive. vLLM also supports quantized models and LoRA adapters.
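Back-of-the-envelope math is usually enough to tell whether a model will fit. A rough sketch, counting weights only (KV cache, activations, and runtime overhead come on top):

```python
def approx_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough memory for the weights alone, ignoring KV cache and runtime overhead."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return bytes_total / 1e9  # decimal GB, close enough for sizing

for model, size_b in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{model} @ {label}: ~{approx_weight_memory_gb(size_b, bits):.0f} GB")

# Llama 3 8B  @ fp16: ~16 GB -> tight on a 16 GB GPU once KV cache is added
# Llama 3 70B @ int4: ~35 GB -> feasible on a single 48 GB card, with room for cache
```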
4.3 Batching & Scheduling (Where vLLM Earns Its Keep)
If you're serving more than a handful of users, throughput matters as much as per-request latency.
vLLM's whole reason to exist is better GPU utilization via:
- Continuous batching – dynamically merges requests into larger batches as tokens stream, instead of static batches
- Paged attention – better KV cache management → more concurrent requests per GPU
What you should actually tune: max batch size, number of concurrent model replicas, context length vs throughput tradeoff, token limits per request.
At on-prem scale, vLLM behind an API gateway is usually your best bet for "Llama 3 as a service" with sane latency and cost.
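Those knobs map directly onto vLLM's engine arguments. A sketch of where they live, with illustrative values you'd tune against your own traffic (the same options exist as --max-model-len, --max-num-seqs, --gpu-memory-utilization, and --tensor-parallel-size flags on the OpenAI-compatible server):

```python
from vllm import LLM

# Illustrative values only; tune against your own latency/throughput measurements.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=8192,            # context length: longer = fewer sequences fit in KV cache
    max_num_seqs=128,              # upper bound on sequences batched together
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim for weights + KV cache
    tensor_parallel_size=1,        # >1 to shard the model across GPUs
)
```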
5. Data Sovereignty & Governance: Don't Wing This
Running LLMs locally is not just about hugging your GPUs. It's also:
- Regulatory – data residency, GDPR, sector-specific rules
- Contractual – customer DPAs and security commitments
- Licensing – model licenses, MAU caps, commercial clauses
Minimum you should do:
- Network isolation – LLM hosts in a subnet with no outbound internet by default. Only allow egress where you explicitly need it.
- Log hygiene – Decide up front what you log: prompts? outputs? just metrics? Mask PII or sensitive fields if you're touching customer data.
- Access control – LLM API behind auth and RBAC (per app / team). Audit log who is sending what where.
- License tracking – Keep a list of models, their licenses, and how you use them. Llama 3's license allows self-hosting but has conditions for very large-scale products; Legal needs to know.
If you skip this and your "private" LLM starts ingesting production data, you're one audit away from a headache.
6. Concrete Stack Patterns That Actually Make Sense
6.1 Pattern A – Developer-First Local Stack (Ollama + RAG)
Good for: internal assistants, prototypes, small teams.
- Runtime: Ollama on dev machines or a small shared server
- Model: Llama 3 8B / 8B-instruct quantized
- Extras: Local RAG via something like Chroma / Qdrant + LangChain/LlamaIndex. VSCode / CLI integration for code or Q&A workflows
Pros: minimal friction, no one needs to be a GPU whisperer. Cons: not designed for large org-wide scale.
6.2 Pattern B – On-Prem LLM API with vLLM (Self-Hosted Llama 3)
Good for: production internal apps, RAG backends, multi-team usage.
- Runtime: vLLM in Kubernetes or similar, fronted by an API Gateway (OpenAI-style or custom)
- Model: Llama 3 8B or 70B loaded in fp16 or quantized, depending on hardware
- Infra: Dedicated GPU nodes, autoscaling based on QPS and latency, Prometheus/Grafana or similar for metrics
This is the pattern for "self-hosting Llama 3 as a private foundation model" that other internal services can call.
6.3 Pattern C – Hardened / Edge / Air-Gapped (llama.cpp)
Good for: strict data isolation, regulated environments, or lightweight edge deployments.
- Runtime: llama.cpp built into your own binary or service
- Model: Llama 3 or similar in GGUF, heavily quantized, optional GPU offload
- Infra: no Python; packaged as a service or even a static binary; runs offline
This is how you get "LLM inside the firewall, literally cannot call out" setups, or stick a small LLM at the edge.
7. Setup Checklist: From Zero to Private LLM That Doesn't Suck
Here's the short version you can hand to your infra/ML team:
- Pick your model – Start with Llama 3 8B-class; only go 70B when you know you need it. Confirm the license works for your use case.
- Pick your runtime – Need easy local dev → Ollama. Need portable, low-level control → llama.cpp. Need high-throughput GPU serving → vLLM.
- Estimate hardware – Size VRAM vs model (and quantization). Decide single-machine vs cluster.
- Lock down data paths – Network isolation for the LLM hosts. Logging and PII policy. Auth around the API.
- Tune performance – Turn on quantization where acceptable. Tune batch sizes, context limits, and concurrency. Add caching (prefix/ KV cache / app-level).
- Measure – Track tokens, latency, throughput. Run evaluation on your real tasks (RAG, code, agents). Iterate—don't assume default configs are optimal.
Do this and "run high-performance LLMs locally" stops being a vague aspiration and becomes a well-defined, owned piece of your infra instead of someone else's black box.
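For the "measure" step, even a crude script beats guessing. A minimal benchmark sketch against an OpenAI-style completions endpoint; the URL and model name are placeholders for whatever your stack exposes:

```python
import time
import requests

BASE_URL = "http://localhost:8000/v1"  # placeholder: your vLLM / gateway endpoint
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder: your served model name

def one_request(prompt: str, max_tokens: int = 256) -> tuple[float, int]:
    """Time a single completion request and report generated-token count."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    return elapsed, tokens

latency, tokens = one_request("Summarize our incident response policy in five bullets.")
print(f"{tokens} tokens in {latency:.1f}s -> {tokens / latency:.1f} tok/s")
```

Run it with realistic prompts and concurrency before touching any tuning knobs, so you have a baseline to compare against.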
Top Tools
- Ollama: The easiest way to get up and running on macOS, Windows, and Linux.
- vLLM: High-throughput serving engine for production.
- llama.cpp: For running models on consumer hardware (CPUs/Apple Silicon).
8. Kubernetes Deployment Architecture (Production On-Prem)
If you're serious about on-prem LLMs, here's the actual Kubernetes layout your infra team can deploy. This isn't hand-wavy: it's YAML and pseudo-code showing how to orchestrate vLLM, a vector DB, and a RAG controller on Kubernetes.
8.1 High-level Topology
We'll use three namespaces:
- llm – for vLLM pods (Llama, Qwen)
- rag – for the orchestrator and vector DB (Qdrant)
- edge – for the API gateway that sits in front of everything
Each vLLM pod gets GPU resources. The orchestrator calls vLLM and Qdrant over internal K8s services. The gateway exposes everything externally via Ingress.
8.2 vLLM Deployment (Llama)
```yaml
# llm/vllm-llama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama
  namespace: llm
spec:
  selector:
    app: vllm-llama
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/Llama-3-8B-Instruct
            - --tensor-parallel-size=1
            - --dtype=float16
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: llama-model-pvc
```
Key: Mount the model from a PVC. If you have multiple GPUs, increase --tensor-parallel-size.
8.3 vLLM Deployment (Qwen)
```yaml
# llm/vllm-qwen-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen
  namespace: llm
spec:
  selector:
    app: vllm-qwen
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/Qwen2.5-7B-Instruct
            - --tensor-parallel-size=1
            - --dtype=float16
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: qwen-model-pvc
```
Same pattern—just swap out the model path. Each model gets its own pod and GPU.
8.4 Vector DB (Qdrant)
```yaml
# rag/qdrant-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: rag
spec:
  selector:
    app: qdrant
  ports:
    - port: 6333
      targetPort: 6333
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  namespace: rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:latest
          ports:
            - containerPort: 6333
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
      volumes:
        - name: qdrant-storage
          persistentVolumeClaim:
            claimName: qdrant-pvc
```
Mount a PVC so you don't lose your embeddings on pod restart.
8.5 RAG Orchestrator
```yaml
# rag/orchestrator-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rag-orchestrator
  namespace: rag
spec:
  selector:
    app: rag-orchestrator
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-orchestrator
  namespace: rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-orchestrator
  template:
    metadata:
      labels:
        app: rag-orchestrator
    spec:
      containers:
        - name: orchestrator
          image: your-registry/rag-orchestrator:v1
          ports:
            - containerPort: 8080
          env:
            - name: VLLM_LLAMA_URL
              value: "http://vllm-llama.llm.svc.cluster.local:8000"
            - name: VLLM_QWEN_URL
              value: "http://vllm-qwen.llm.svc.cluster.local:8000"
            - name: QDRANT_URL
              value: "http://qdrant.rag.svc.cluster.local:6333"
```
This is your brain. It decides: RAG or not? Llama or Qwen? Then constructs the prompt and calls the right vLLM service.
8.6 API Gateway / Ingress
```yaml
# edge/gateway-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: edge
spec:
  selector:
    app: api-gateway
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: edge
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: gateway
          image: your-registry/api-gateway:v1
          ports:
            - containerPort: 8080
          env:
            - name: ORCHESTRATOR_URL
              value: "http://rag-orchestrator.rag.svc.cluster.local:8080"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: edge
spec:
  rules:
    - host: llm.yourcompany.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
```
External clients hit llm.yourcompany.internal. The gateway routes to the orchestrator. Add auth, rate limiting, and logging here.
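As a sketch of what "add auth here" can look like, here's a minimal FastAPI gateway that checks an API key and forwards requests to the orchestrator. It's illustrative, not the contents of the your-registry/api-gateway:v1 image, and the API_KEYS environment variable is a stand-in for a real secret store:

```python
# api-gateway/main.py (illustrative sketch, not the actual gateway image)
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()

ORCHESTRATOR_URL = os.environ["ORCHESTRATOR_URL"]            # set in the Deployment above
VALID_KEYS = set(os.environ.get("API_KEYS", "").split(","))  # stand-in for a proper secret backend

@app.post("/generate")
async def generate(request: Request, x_api_key: str = Header(default="")):
    # Reject unauthenticated callers before anything touches the LLM path.
    if x_api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    payload = await request.json()
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(f"{ORCHESTRATOR_URL}/generate", json=payload)
    # A real gateway would also add rate limiting and structured audit logging here.
    return resp.json()
```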
8.7 Monitoring & Sovereignty Hooks
- Prometheus scraping – Add prometheus.io/scrape: "true" annotations on pods. vLLM and Qdrant expose metrics.
- Logging – Use a DaemonSet (Fluentd/Fluent Bit) to ship logs to your SIEM.
- Network policies – Restrict cross-namespace traffic. Only rag can talk to llm.
- Data sovereignty – All traffic stays internal. No external API calls. Encrypt PVCs if required by compliance.
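For the "only rag can talk to llm" rule, here's a sketch of a NetworkPolicy restricting ingress to the llm namespace; it assumes your namespaces carry the standard kubernetes.io/metadata.name label:

```yaml
# llm/allow-rag-only.yaml (sketch; adjust labels to your cluster's conventions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-rag-only
  namespace: llm
spec:
  podSelector: {}          # applies to every pod in the llm namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: rag
      ports:
        - protocol: TCP
          port: 8000
```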
8.8 Evolution Path
- Swap models – Update the PVC and deployment. No code change.
- Add a judge – Deploy another vLLM pod for a smaller "guard" model. Route sensitive requests through it first.
- LangGraph orchestration – Replace the orchestrator with a LangGraph agent that can do multi-step tool use. Keep the same K8s structure.
8.9 Orchestrator Pseudo-Code
Here's what rag-orchestrator actually does when a request comes in:
```python
# rag-orchestrator/main.py (illustrative; classify_intent and embed are assumed helpers)
import os

import requests
from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient

app = FastAPI()

VLLM_LLAMA_URL = os.environ["VLLM_LLAMA_URL"]
VLLM_QWEN_URL = os.environ["VLLM_QWEN_URL"]
qdrant_client = QdrantClient(url=os.environ["QDRANT_URL"])


class GenerateRequest(BaseModel):
    prompt: str
    model_preference: str = "llama"
    max_tokens: int = 512
    temperature: float = 0.2


@app.post("/generate")
def generate(request: GenerateRequest):
    # 1. Decide: do we need RAG?
    needs_rag = classify_intent(request.prompt)  # classifier or simple keyword check

    context = ""
    if needs_rag:
        # 2. Query the vector DB
        query_embedding = embed(request.prompt)  # your embedding model of choice
        results = qdrant_client.search(
            collection_name="docs",
            query_vector=query_embedding,
            limit=3,
        )
        context = "\n\n".join(r.payload["text"] for r in results)

    # 3. Route to a model
    if request.model_preference == "qwen":
        llm_url = VLLM_QWEN_URL
    else:
        llm_url = VLLM_LLAMA_URL

    # 4. Construct the prompt
    if context:
        full_prompt = (
            "Use the following context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {request.prompt}\n\nAnswer:"
        )
    else:
        full_prompt = request.prompt

    # 5. Call vLLM (OpenAI-compatible completions endpoint)
    response = requests.post(
        f"{llm_url}/v1/completions",
        json={
            "model": "local",  # must match the model name the vLLM server registered
            "prompt": full_prompt,
            "max_tokens": request.max_tokens,
            "temperature": request.temperature,
        },
        timeout=300,
    )

    # 6. Return
    return {
        "text": response.json()["choices"][0]["text"],
        "model_used": llm_url,
        "used_rag": needs_rag,
    }
```
Why this works: The orchestrator is stateless. It just routes, augments, and calls. You can scale it horizontally. All the heavy lifting (inference, vector search) happens in dedicated services.
Now your infra team has real YAML and real code to deploy, tune, and monitor. No more diagrams that say "just run vLLM somehow."