Local LLM Performance Architect

Local AI VRAM Calculator

Accurately calculate weight and KV cache footprint for local AI runs. Compare model fitting across GGUF & EXL2 quantizations, view matching local GPU setups, and check renting economics.

Grouped-Query Attention (GQA) Supported Static client-side app Quantization-aware Data refreshed · Jun 2026

GitHub 𝕏 @weekend__builds

⚙️ Config Engine

Choose a base model or set up a custom architectural profile below.

Model Profile

Model Preset

Model Size 8.19B

Model Quantization

Attention Architecture Auto-filled by preset

Layers (L) 36

Hidden (H) 4096

Query Heads 32

KV Heads 8

Context & Runtime

Context Window (Tokens) 8,192

KV Cache Quant

Engine Overhead 1.5 GB

Your Available VRAM 16 GB

Economics Configuration

Tokens/Sec 25

Output/Job 2000

Cloud Price ($/hr) 0.99

Jobs / Month 300

Fits

VRAM OK. Ready for local execution.

Analyzing mathematical memory margins...

0.0 GB of 16.0 GB VRAM utilized

📊 VRAM Allocation Blueprint

Interactive graphical breakdown of where every megabyte goes.

Model Weights 0.0 GB

KV Cache 0.0 GB

Engine Overhead 0.0 GB

VRAM Limit 16.0 GB

Out of VRAM 0.0 GB

Model Weights

Params × Bitrate / 8

KV Cache Size

2×L×KV_H×(H/Q_H)×Ctx×Bytes

Total Required

Weights + KV + Overhead

Inference Job

Execution duration/run

💻 Local GPU Compatibility Recommender

See how physical consumer GPUs and cloud hardware fit your configured parameters.

💰 Buy vs. Cloud Rental Economics

Calculates direct cloud GPU API run costs versus purchasing dedicated physical hardware.

- Estimated cost per cloud API run.

- Cloud rental spent per month.

- Hardware payback timeline.

🧮 View Mathematical Formula Specs

[Expand]


              Weights Size (GB)
              (Params (B) × Bitrate / 8) + 0.15 GB baseline. Includes standard GGUF header/norms padding factors.


              KV Cache (GB)
              2 × L × KV_H × (H / Q_H) × Ctx × Bytes / 1e9.
              GQA reduces this significantly by keeping KV Heads smaller than Query Heads.


              Inference duration
              Output tokens / (Tokens/Sec). Cost matches execution time scaled by hourly rental.