How to run high-performance LLMs locally?
Keep your data private and reduce cloud bills by hosting Llama 3, Mistral, or Gemma on your own infrastructure with Ollama, llama.cpp, and vLLM.
You want LLMs on your own metal, under your rules, without spraying data into random US-West regions. Good. Let's walk through how to actually do that fast and reliably, not as a weekend science project.
1. First: What Does "Local" Actually Mean for You?
"Local" can mean three very different things:
- Developer laptop / workstation – MacBook, Linux box, maybe a single consumer GPU. Good for: internal tools, prototyping, small-team assistants (Ollama, llama.cpp)
- On-prem / private cluster – Your own racks, or at least a private VPC account under your control. Good for: production APIs, RAG systems, multi-tenant internal apps (vLLM, custom servers)
- Edge / constrained hardware – Minis, NUCs, ARM boards, offline boxes. Good for: hardcore data isolation, low-latency edge use cases (heavily quantized llama.cpp)
All three can run Llama-class models locally. The stack you choose will differ.
2. Picking the Model: Llama 3 as the Default Starting Point
Meta's Llama 3 family basically became the default "serious open model": 8B and 70B parameter variants, both pre-trained and instruction-tuned, meant to be used for a wide range of tasks.
Key facts:
- You are allowed to self-host Llama 3 under Meta's license (with some conditions, especially at big scale / MAU thresholds)
- You can download weights from Meta or hubs like Hugging Face and run them on your own hardware
- There are plenty of guides for running Llama 3 locally with Python servers, llama.cpp, or tools like Ollama and GPT4All
So when people search "self-hosting Llama 3," they're really asking: What runtime should I use? How do I make it fast enough on my hardware? How do I do that in a way Legal and Security can live with?
3. Three Main Runtimes You Should Actually Care About
You'll see a zoo of options, but 90% of serious local setups land on some combination of: Ollama, llama.cpp, vLLM. They solve different problems.
3.1 Ollama – Easy Mode for Local LLMs
What it is: Ollama is a CLI + server that makes it trivially easy to run LLMs locally on macOS, Windows, and Linux. You run ollama pull llama3 and it handles downloading, packaging, and spinning up a local API. It uses Dockerfile-like "Modelfiles" to bundle weights + config, and is built on top of optimized runtimes like llama.cpp.
When it's a good fit:
- You want to run Llama 3, Mistral, DeepSeek, etc. locally with minimal effort
- You care about developer productivity and simple integration: local REST API, easy model switching, simple config for quantization / GPU usage
- You don't want to manually deal with GGUF files, GPU flags, or custom servers yet
When it's not enough: You need cluster-level throughput, multi-GPU sharding, or heavy multi-tenant serving. You want full control over serving internals (batching, scheduling) and deep integration into your infra.
Think of Ollama as "local LLM platform for humans". Great for teams getting off the ground and for internal-only workflows.
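To make the "simple integration" point concrete, here's a minimal sketch of calling Ollama's local REST API from Python, assuming you've already pulled the model with ollama pull llama3 and the server is running on its default port:

```python
import requests

# Ollama's local API listens on port 11434 by default.
# Assumes the model is already pulled: `ollama pull llama3`
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize why teams self-host LLMs, in two sentences.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Swapping models is a one-line change to the "model" field once you've pulled them.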
3.2 llama.cpp – Bare-Metal Control, Runs Anywhere
What it is: llama.cpp is a C/C++ inference engine for running LLMs efficiently on CPUs and GPUs, with support for quantized GGUF models and tons of backends (CUDA, Metal, ROCm, etc.). It's famously capable of running large models like Llama 3 on laptops, desktops, and even Raspberry Pis (slowly, but it works).
Why people use it:
- Runs basically everywhere: Linux, macOS, Windows, ARM, embedded
- Heavy quantization support: 8-bit, 6-bit, 4-bit GGUF variants and more to squeeze models onto cheaper hardware
- No Python runtime required: good for hardened environments and strict ops teams
When it's a good fit: You're serious about CPU or mixed hardware (not just big NVIDIA boxes). You want fine-grained control over quantization and performance tradeoffs. You're comfortable wrapping it in your own service (REST/gRPC, auth, logging, etc.).
Tradeoff: More power and portability, more work. Ollama uses llama.cpp under the hood for many models; llama.cpp is the low-level engine, Ollama is the nicer DX wrapper.
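If you want the llama.cpp engine without writing your own C++ wrapper, the llama-cpp-python bindings are a common middle ground. A minimal sketch, assuming you've already downloaded a quantized GGUF file (the path below is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point it at whichever quantized GGUF you downloaded.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```

The same tradeoff applies: you get direct control over quantization level, context size, and GPU offload, but you own the service layer around it.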
3.3 vLLM – High-Throughput GPU Inference Engine
What it is: vLLM is a high-performance LLM inference engine and server designed for GPU clusters: continuous batching, paged attention, prefix caching, and all the tricks you need to squeeze maximum throughput out of expensive GPUs.
In real-world benchmarks, vLLM routinely delivers several-fold throughput improvements over naive serving stacks on multi-request workloads.
When it's a good fit:
- You're running Llama 3 or other big models on A100 / H100 / similar GPUs
- You care about serving many concurrent users with stable latency
- You want features like: continuous batching, paged attention / KV-cache optimization, multi-GPU / model sharding, OpenAI-style serving APIs
Tradeoff: vLLM assumes you're OK with Python and GPUs and that you treat LLMs like a proper service (Kubernetes, observability, etc.). It's overkill for a single developer laptop, perfect for on-prem LLM APIs.
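For a feel of the API, here's a minimal offline-inference sketch with vLLM's Python interface; the model ID and sampling settings are illustrative, and in production you'd normally run its OpenAI-compatible server instead:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Example model ID; swap in whatever weights you've mirrored locally.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = [
    "Write a one-line docstring for a function that deduplicates a list.",
    "List three risks of logging raw LLM prompts.",
]

# vLLM batches these prompts internally and schedules them across the GPU.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```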
4. Performance Levers: How to Make Local LLMs Not Suck
Tools are nice, but if you ignore the basics, you still end up with a slow local chatbot that times out under real load.
4.1 Hardware: Match Model Size to VRAM/Memory
Rules of thumb:
- Full-precision 70B models are not for your 8 GB GPU
- Use: 7–8B models for commodity GPUs / laptops; 8–14B models for mid-range single GPUs; 70B class only if you have serious VRAM (or you accept heavy quantization / CPU offload)
For high-throughput on-prem serving: prioritize VRAM and bandwidth (A100/H100, MI300, etc.). Fewer big GPUs used well > many small GPUs used naively.
4.2 Quantization: Your Best Friend for Local LLMs
Quantization is how you run 70B-class models on hardware that shouldn't reasonably hold them:
- Convert model weights from fp16 → int8 / int4
- You lose a bit of quality and gain a lower memory footprint, higher throughput, and the ability to run on cheaper hardware
llama.cpp and tools around it are heavily optimized for GGUF quantized models.
Guidelines: For experiments / internal tools, 4-bit quant is usually fine. For precision-sensitive domains, start with 8-bit or mixed-precision and eval before going more aggressive. vLLM also supports quantized models and LoRA adapters.
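Back-of-the-envelope math is usually enough to tell whether a model will fit. A rough sketch, counting weights only (KV cache, activations, and runtime overhead come on top):

```python
def approx_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough memory for the weights alone, ignoring KV cache and runtime overhead."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return bytes_total / 1e9  # decimal GB, close enough for sizing

for model, size_b in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{model} @ {label}: ~{approx_weight_memory_gb(size_b, bits):.0f} GB")

# Llama 3 8B  @ fp16: ~16 GB -> tight on a 16 GB GPU once KV cache is added
# Llama 3 70B @ int4: ~35 GB -> feasible on a single 48 GB card, with room for cache
```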
4.3 Batching & Scheduling (Where vLLM Earns Its Keep)
If you're serving more than a handful of users, throughput matters as much as per-request latency.
vLLM's whole reason to exist is better GPU utilization via:
- Continuous batching – dynamically merges requests into larger batches as tokens stream, instead of static batches
- Paged attention – better KV cache management → more concurrent requests per GPU
What you should actually tune: max batch size, number of concurrent model replicas, context length vs throughput tradeoff, token limits per request.
At on-prem scale, vLLM behind an API gateway is usually your best bet for "Llama 3 as a service" with sane latency and cost.
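Those knobs map directly onto vLLM's engine arguments. A sketch of where they live, with illustrative values you'd tune against your own traffic (the same options exist as --max-model-len, --max-num-seqs, --gpu-memory-utilization, and --tensor-parallel-size flags on the OpenAI-compatible server):

```python
from vllm import LLM

# Illustrative values only; tune against your own latency/throughput measurements.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=8192,            # context length: longer = fewer sequences fit in KV cache
    max_num_seqs=128,              # upper bound on sequences batched together
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim for weights + KV cache
    tensor_parallel_size=1,        # >1 to shard the model across GPUs
)
```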
5. Data Sovereignty & Governance: Don't Wing This
Running LLMs locally is not just about hugging your GPUs. It's also:
- Regulatory – data residency, GDPR, sector-specific rules
- Contractual – customer DPAs and security commitments
- Licensing – model licenses, MAU caps, commercial clauses
Minimum you should do:
- Network isolation – LLM hosts in a subnet with no outbound internet by default. Only allow egress where you explicitly need it.
- Log hygiene – Decide up front what you log: prompts? outputs? just metrics? Mask PII or sensitive fields if you're touching customer data.
- Access control – LLM API behind auth and RBAC (per app / team). Audit log who is sending what where.
- License tracking – Keep a list of models, their licenses, and how you use them. Llama 3's license allows self-hosting but has conditions for very large-scale products; Legal needs to know.
If you skip this and your "private" LLM starts ingesting production data, you're one audit away from a headache.
6. Concrete Stack Patterns That Actually Make Sense
6.1 Pattern A – Developer-First Local Stack (Ollama + RAG)
Good for: internal assistants, prototypes, small teams.
- Runtime: Ollama on dev machines or a small shared server
- Model: Llama 3 8B / 8B-instruct quantized
- Extras: Local RAG via something like Chroma / Qdrant + LangChain/LlamaIndex. VSCode / CLI integration for code or Q&A workflows
Pros: minimal friction, no one needs to be a GPU whisperer. Cons: not designed for large org-wide scale.
6.2 Pattern B – On-Prem LLM API with vLLM (Self-Hosted Llama 3)
Good for: production internal apps, RAG backends, multi-team usage.
- Runtime: vLLM in Kubernetes or similar, fronted by an API Gateway (OpenAI-style or custom)
- Model: Llama 3 8B or 70B loaded in fp16 or quantized, depending on hardware
- Infra: Dedicated GPU nodes, autoscaling based on QPS and latency, Prometheus/Grafana or similar for metrics
This is the pattern for "self-hosting Llama 3 as a private foundation model" that other internal services can call.
6.3 Pattern C – Hardened / Edge / Air-Gapped (llama.cpp)
Good for: strict data isolation, regulated environments, or lightweight edge deployments.
- Runtime: llama.cpp built into your own binary or service
- Model: Llama 3 or similar in GGUF, heavily quantized, optional GPU offload
- Infra: no Python; packaged as a service or even a static binary; runs offline
This is how you get "LLM inside the firewall, literally cannot call out" setups, or stick a small LLM at the edge.
7. Setup Checklist: From Zero to Private LLM That Doesn't Suck
Here's the short version you can hand to your infra/ML team:
- Pick your model – Start with Llama 3 8B-class; only go 70B when you know you need it. Confirm the license works for your use case.
- Pick your runtime – Need easy local dev → Ollama. Need portable, low-level control → llama.cpp. Need high-throughput GPU serving → vLLM.
- Estimate hardware – Size VRAM vs model (and quantization). Decide single-machine vs cluster.
- Lock down data paths – Network isolation for the LLM hosts. Logging and PII policy. Auth around the API.
- Tune performance – Turn on quantization where acceptable. Tune batch sizes, context limits, and concurrency. Add caching (prefix/ KV cache / app-level).
- Measure – Track tokens, latency, throughput. Run evaluation on your real tasks (RAG, code, agents). Iterate—don't assume default configs are optimal.
Do this and "run high-performance LLMs locally" stops being a vague aspiration and becomes a well-defined, owned piece of your infra instead of someone else's black box.
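For the "measure" step, even a crude script beats guessing. A minimal benchmark sketch against an OpenAI-style completions endpoint; the URL and model name are placeholders for whatever your stack exposes:

```python
import time
import requests

BASE_URL = "http://localhost:8000/v1"  # placeholder: your vLLM / gateway endpoint
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder: your served model name

def one_request(prompt: str, max_tokens: int = 256) -> tuple[float, int]:
    """Time a single completion request and report generated-token count."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    return elapsed, tokens

latency, tokens = one_request("Summarize our incident response policy in five bullets.")
print(f"{tokens} tokens in {latency:.1f}s -> {tokens / latency:.1f} tok/s")
```

Run it with realistic prompts and concurrency before touching any tuning knobs, so you have a baseline to compare against.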
Top Tools
- Ollama: The easiest way to get up and running on macOS, Windows, and Linux.
- vLLM: High-throughput serving engine for production.
- llama.cpp: For running models on consumer hardware (CPUs/Apple Silicon).
8. Kubernetes Deployment Architecture (Production On-Prem)
If you're serious about on-prem LLMs, here's the actual Kubernetes layout your infra team can deploy. This isn't hand-wavy: it's YAML and pseudo-code showing how to orchestrate vLLM, a vector DB, and a RAG controller on Kubernetes.
8.1 High-level Topology
We'll use three namespaces:
- llm – for vLLM pods (Llama, Qwen)
- rag – for the orchestrator and vector DB (Qdrant)
- edge – for the API gateway that sits in front of everything
Each vLLM pod gets GPU resources. The orchestrator calls vLLM and Qdrant over internal K8s services. The gateway exposes everything externally via Ingress.
8.2 vLLM Deployment (Llama)
```yaml
# llm/vllm-llama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama
  namespace: llm
spec:
  selector:
    app: vllm-llama
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/Llama-3-8B-Instruct
            - --tensor-parallel-size=1
            - --dtype=float16
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: llama-model-pvc
```
Key: Mount the model from a PVC. If you have multiple GPUs, increase --tensor-parallel-size.
8.3 vLLM Deployment (Qwen)
```yaml
# llm/vllm-qwen-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen
  namespace: llm
spec:
  selector:
    app: vllm-qwen
  ports:
    - port: 8000
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/Qwen2.5-7B-Instruct
            - --tensor-parallel-size=1
            - --dtype=float16
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: qwen-model-pvc
```
Same pattern—just swap out the model path. Each model gets its own pod and GPU.
8.4 Vector DB (Qdrant)
```yaml
# rag/qdrant-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: qdrant
  namespace: rag
spec:
  selector:
    app: qdrant
  ports:
    - port: 6333
      targetPort: 6333
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  namespace: rag
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:latest
          ports:
            - containerPort: 6333
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
      volumes:
        - name: qdrant-storage
          persistentVolumeClaim:
            claimName: qdrant-pvc
```
Mount a PVC so you don't lose your embeddings on pod restart.
8.5 RAG Orchestrator
```yaml
# rag/orchestrator-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rag-orchestrator
  namespace: rag
spec:
  selector:
    app: rag-orchestrator
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-orchestrator
  namespace: rag
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rag-orchestrator
  template:
    metadata:
      labels:
        app: rag-orchestrator
    spec:
      containers:
        - name: orchestrator
          image: your-registry/rag-orchestrator:v1
          ports:
            - containerPort: 8080
          env:
            - name: VLLM_LLAMA_URL
              value: "http://vllm-llama.llm.svc.cluster.local:8000"
            - name: VLLM_QWEN_URL
              value: "http://vllm-qwen.llm.svc.cluster.local:8000"
            - name: QDRANT_URL
              value: "http://qdrant.rag.svc.cluster.local:6333"
```
This is your brain. It decides: RAG or not? Llama or Qwen? Then constructs the prompt and calls the right vLLM service.
8.6 API Gateway / Ingress
```yaml
# edge/gateway-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: edge
spec:
  selector:
    app: api-gateway
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: edge
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
        - name: gateway
          image: your-registry/api-gateway:v1
          ports:
            - containerPort: 8080
          env:
            - name: ORCHESTRATOR_URL
              value: "http://rag-orchestrator.rag.svc.cluster.local:8080"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  namespace: edge
spec:
  rules:
    - host: llm.yourcompany.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
```
External clients hit llm.yourcompany.internal. The gateway routes to the orchestrator. Add auth, rate limiting, and logging here.
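As a sketch of what "add auth here" can look like, here's a minimal FastAPI gateway that checks an API key and forwards requests to the orchestrator. It's illustrative, not the contents of the your-registry/api-gateway:v1 image, and the API_KEYS environment variable is a stand-in for a real secret store:

```python
# api-gateway/main.py (illustrative sketch, not the actual gateway image)
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()

ORCHESTRATOR_URL = os.environ["ORCHESTRATOR_URL"]            # set in the Deployment above
VALID_KEYS = set(os.environ.get("API_KEYS", "").split(","))  # stand-in for a proper secret backend

@app.post("/generate")
async def generate(request: Request, x_api_key: str = Header(default="")):
    # Reject unauthenticated callers before anything touches the LLM path.
    if x_api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    payload = await request.json()
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(f"{ORCHESTRATOR_URL}/generate", json=payload)
    # A real gateway would also add rate limiting and structured audit logging here.
    return resp.json()
```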
8.7 Monitoring & Sovereignty Hooks
- Prometheus scraping – Add prometheus.io/scrape: "true" annotations on pods. vLLM and Qdrant expose metrics.
- Logging – Use a DaemonSet (Fluentd/Fluent Bit) to ship logs to your SIEM.
- Network policies – Restrict cross-namespace traffic. Only rag can talk to llm.
- Data sovereignty – All traffic stays internal. No external API calls. Encrypt PVCs if required by compliance.
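For the "only rag can talk to llm" rule, here's a sketch of a NetworkPolicy restricting ingress to the llm namespace; it assumes your namespaces carry the standard kubernetes.io/metadata.name label:

```yaml
# llm/allow-rag-only.yaml (sketch; adjust labels to your cluster's conventions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-rag-only
  namespace: llm
spec:
  podSelector: {}          # applies to every pod in the llm namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: rag
      ports:
        - protocol: TCP
          port: 8000
```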
8.8 Evolution Path
- Swap models – Update the PVC and deployment. No code change.
- Add a judge – Deploy another vLLM pod for a smaller "guard" model. Route sensitive requests through it first.
- LangGraph orchestration – Replace the orchestrator with a LangGraph agent that can do multi-step tool use. Keep the same K8s structure.
8.9 Orchestrator Pseudo-Code
Here's what rag-orchestrator actually does when a request comes in:
```python
# rag-orchestrator/main.py (illustrative; classify_intent and embed are assumed helpers)
import os

import requests
from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient

app = FastAPI()

VLLM_LLAMA_URL = os.environ["VLLM_LLAMA_URL"]
VLLM_QWEN_URL = os.environ["VLLM_QWEN_URL"]
qdrant_client = QdrantClient(url=os.environ["QDRANT_URL"])


class GenerateRequest(BaseModel):
    prompt: str
    model_preference: str = "llama"
    max_tokens: int = 512
    temperature: float = 0.2


@app.post("/generate")
def generate(request: GenerateRequest):
    # 1. Decide: do we need RAG?
    needs_rag = classify_intent(request.prompt)  # classifier or simple keyword check

    context = ""
    if needs_rag:
        # 2. Query the vector DB
        query_embedding = embed(request.prompt)  # your embedding model of choice
        results = qdrant_client.search(
            collection_name="docs",
            query_vector=query_embedding,
            limit=3,
        )
        context = "\n\n".join(r.payload["text"] for r in results)

    # 3. Route to a model
    if request.model_preference == "qwen":
        llm_url = VLLM_QWEN_URL
    else:
        llm_url = VLLM_LLAMA_URL

    # 4. Construct the prompt
    if context:
        full_prompt = (
            "Use the following context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {request.prompt}\n\nAnswer:"
        )
    else:
        full_prompt = request.prompt

    # 5. Call vLLM (OpenAI-compatible completions endpoint)
    response = requests.post(
        f"{llm_url}/v1/completions",
        json={
            "model": "local",  # must match the model name the vLLM server registered
            "prompt": full_prompt,
            "max_tokens": request.max_tokens,
            "temperature": request.temperature,
        },
        timeout=300,
    )

    # 6. Return
    return {
        "text": response.json()["choices"][0]["text"],
        "model_used": llm_url,
        "used_rag": needs_rag,
    }
```
Why this works: The orchestrator is stateless. It just routes, augments, and calls. You can scale it horizontally. All the heavy lifting (inference, vector search) happens in dedicated services.
Now your infra team has real YAML and real code to deploy, tune, and monitor. No more diagrams that say "just run vLLM somehow."