1. About llama.cpp
llama.cpp is a high-performance, cross-platform inference engine for Large Language Models (LLMs) written entirely in C and C++. Its primary goal is to enable LLM inference with minimal setup, zero external dependencies, and state-of-the-art performance on a vast range of hardware — from consumer laptops and Raspberry Pis to multi-GPU servers.
Rather than wrapping Python ML frameworks (PyTorch, TensorFlow), llama.cpp is a from-scratch implementation built atop the ggml tensor library. This approach yields:
- Zero external runtime dependencies — no Python, no PyTorch, no CUDA runtime required
- First-class Apple Silicon support — ARM NEON, Accelerate, and Metal frameworks
- Extreme quantization — supports 1.5-bit through 8-bit integer quantization (legacy Q4_0/Q5_0/Q8_0 block formats plus K-quant and IQ variants)
- Hardware diversity — CPU (x86 with AVX/AVX2/AVX512/AMX, ARM NEON, RISC-V with RVV), GPU (CUDA, Metal, Vulkan, SYCL, HIP, MUSA, WebGPU), and additional accelerator/library backends (CANN for Ascend NPUs, OpenVINO, ZenDNN)
- Server-mode — OpenAI API-compatible HTTP server for production deployments
- Multimodal support — vision (LLaVA, Qwen2-VL), audio (Whisper-style), and cross-attention encoder-decoder models
The project serves as the primary development playground for the ggml tensor library, driving innovations in quantized matrix multiplication, SIMD kernels, and GPU compute.
- Repository: github.com/ggml-org/llama.cpp
- License: MIT (Core C/C++ library under MIT; bindings and examples may vary)
- Core Language: C/C++ with Python scripts for model conversion
2. Architecture Overview
The system is organized into a clean layering:
┌─────────────────────────────────────────────────────────────┐
│ Tools & Applications │
│ llama-cli │ llama-server │ llama-perplexity │ llama-bench │
│ llama-quantize │ llama-tokenize │ llama-gguf-split │ ... │
├─────────────────────────────────────────────────────────────┤
│ Common Library (common/) │
│ Sampling │ N-gram cache │ PEG Parser │ JSON Schema → GBNF │
│ Chat templates │ HTTP client │ Presets │ Speculative │
├─────────────────────────────────────────────────────────────┤
│ libllama Core (src/) │
│ Model Loading │ Context Management │ Graph Building │
│ KV Cache │ Vocabulary │ Grammar Engine │ LoRA Adapters │
│ Quantization │ Samplers │ Chat Formats │ Session State │
├─────────────────────────────────────────────────────────────┤
│ Model Implementations (src/models/) │
│ 150+ architectures: LLaMA, Mistral, Qwen, DeepSeek, │
│ Gemma, Phi, Mamba, RWKV, BERT, Falcon, Grok, etc. │
├─────────────────────────────────────────────────────────────┤
│ ggml Tensor Library (ggml/) │
│ Tensor Ops │ Automatic Differentiation │ Optimizers │
│ Backend Abstraction │ Memory Allocator │ Threading │
├─────────────────────────────────────────────────────────────┤
│ GPU Backend Implementations (ggml/src/) │
│ CUDA │ Metal │ Vulkan │ SYCL │ HIP │ CANN │ WebGPU │ RPC │
│ OpenCL │ OpenVINO │ ZenDNN │ zDNN │ VirtGPU │ Hexagon │
└─────────────────────────────────────────────────────────────┘
2.1 File Format: GGUF
All models are stored in the GGUF (GGML Universal Format) binary format:
[Magic "GGUF"] [Version] [Tensor Count] [KV Metadata Count]
→ KV Pairs: (key-string, type, value) — architecture, hyperparameters, tokenizer config, etc.
→ Tensor Info: (name, dimensions, dtype, offset in blob)
→ Tensor Data Blob: contiguous binary weight data
GGUF is a self-describing format that evolved from earlier GGML/GGJT formats. Key features:
- Key-Value metadata: Stores architecture name, model hyperparameters (n_layer, n_head, n_embd, etc.), RoPE config, tokenizer data (vocab, merges, scores, chat template), and arbitrary metadata
- Alignable data section: Supports configurable alignment (default 32 bytes) for direct memory-mapped access
- Multi-file splits: Large models can be split across multiple .gguf files (e.g., model-00001-of-00003.gguf)
- Extensible: New architectures add new KV keys and tensor names without breaking backward compatibility
The Python script convert_hf_to_gguf.py (668KB) handles conversion from Hugging Face PyTorch/SafeTensors checkpoints to GGUF. There are also specialized converters for LoRA adapters, llama2.c models, and GGML-to-GGUF migration.
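GGUF files can also be read programmatically through ggml's C API. The sketch below is illustrative only — the header location and exact signatures track the current ggml tree — and opens a file without loading tensor data, then prints its metadata counts and architecture key:

```c
#include "gguf.h"   // shipped with ggml (declared in ggml.h in older trees)
#include <stdio.h>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;

    // no_alloc = true: parse headers and metadata only, do not allocate tensor data
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * gctx = gguf_init_from_file(argv[1], params);
    if (!gctx) return 1;

    printf("kv pairs: %lld, tensors: %lld\n",
           (long long) gguf_get_n_kv(gctx), (long long) gguf_get_n_tensors(gctx));

    const int64_t kid = gguf_find_key(gctx, "general.architecture");
    if (kid >= 0) {
        printf("architecture: %s\n", gguf_get_val_str(gctx, kid));
    }

    gguf_free(gctx);
    return 0;
}
```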
3. The ggml Tensor Library (Core Foundation)
The ggml/ directory contains the foundational tensor computation library. It is a standalone project (also used by whisper.cpp and stable-diffusion.cpp).
3.1 Core Tensor System (ggml.h, ggml.c)
Key features:
- Multi-dimensional tensors: Up to 4 dimensions, with support for view/reshape/permute
- Data types: F32, F16, BF16, and ~30 quantized types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K through Q6_K, IQ1_S through IQ4_XS, TQ1_0/TQ2_0 ternary, MXFP4, NVFP4, and more)
- Computation graph: Build a graph of tensor operations (ggml_cgraph), then compute forward (inference) or backward (gradients)
- Automatic differentiation: Mark tensors as parameters with ggml_set_param(), then compute gradients via the backward pass
- Optimizers: Built-in Adam, L-BFGS, and SGD for training/fine-tuning
- SIMD-optimized kernels: Hand-tuned assembly/C for AVX2, AVX512, ARM NEON, RISC-V RVV, WASM SIMD, etc.
- Thread pool: Parallel computation across multiple CPU threads
Key operations (ggml_ops): MatMul (regular and quantized), convolution, attention (scaled dot-product), RoPE, RMS norm, layer norm, softmax, SiLU/GELU/ReLU activations, ArgMax, cross-entropy loss, and more. Full operation list in docs/ops.md.
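As a concrete illustration of the build-then-compute model, here is a minimal CPU-only sketch. Function names follow the current ggml headers; ggml_graph_compute_with_ctx has moved between ggml.h and ggml-cpu.h across versions, so treat the includes as an assumption:

```c
#include "ggml.h"
#include "ggml-cpu.h"   // ggml_graph_compute_with_ctx lives here in recent trees

int main(void) {
    // small scratch arena; real code sizes this from the graph it plans to build
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // a: 4 rows of 2, b: 3 rows of 2; ggml_mul_mat contracts over the shared
    // inner dimension (2), producing a 4x3 result
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // record the operations needed to produce c
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // ... fill a and b with data, then execute the graph on 4 CPU threads
    ggml_graph_compute_with_ctx(ctx, gf, 4);

    ggml_free(ctx);
    return 0;
}
```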
3.2 Quantization Engine (ggml-quants.c, ggml-quants.h)
The quantization subsystem is arguably the most performance-critical component. It implements:
Block quantization scheme: Weights are divided into small blocks (typically 32 elements). Each block stores:
- A shared scale (and optionally min value)
- Each weight is quantized to low-bit representation relative to the block scale
Quantized types (from lowest to highest precision):
| Type | Bits/Weight | Description |
|---|---|---|
| IQ1_S | 1.56 | 1-bit importance-weighted |
| IQ1_M | 1.75 | 1-bit importance-weighted medium |
| TQ1_0 | 1.69 | Ternary quantization |
| TQ2_0 | 2.06 | Ternary quantization |
| IQ2_XXS | 2.06 | 2-bit extremely extra-small |
| IQ2_XS | 2.31 | 2-bit extra-small |
| Q2_K | 2.56-3.06 | 2-bit K-quant (small/medium variants) |
| IQ3_XXS | 3.06 | 3-bit extremely extra-small |
| Q3_K | 3.50-4.25 | 3-bit K-quant (small/medium/large) |
| IQ4_NL | 4.50 | 4-bit non-linear quantization |
| Q4_K | 4.50-5.50 | 4-bit K-quant (small/medium) |
| Q5_K | 5.50-6.20 | 5-bit K-quant (small/medium) |
| Q6_K | 6.56 | 6-bit K-quant |
| Q8_0 | 8.00 | 8-bit block |
| F16 | 16.00 | Half-precision float |
| BF16 | 16.00 | Bfloat16 |
| F32 | 32.00 | Full-precision float |
"K-quant" (Q2_K through Q6_K) uses a clever super-block scheme: larger blocks (~256 elements) contain multiple sub-blocks with their own scales, plus a super-block scale. This provides superior accuracy vs. size compared to simple per-block quantization.
"IQ" (Importance-aware Quantization) applies non-uniform quantization that assigns more bits to important weights and fewer to unimportant ones, improving perplexity at very low bitrates.
Each quantized type has:
- quantize_row_*: Convert a float row → quantized row (a minimal sketch follows below)
- dequantize_row_*: Convert a quantized row → float row
- Matmul kernels (dispatched from ggml_compute_forward_mul_mat) that operate directly on quantized data — no dequantize → matmul → requantize pipeline
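To make the block layout concrete, here is an illustrative sketch of a Q8_0-style block and its reference quantizer. The real definitions live in ggml's sources and store the scale as an fp16 ggml_half; the names here are simplified:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32   // elements per block

// one block: a shared scale plus 32 signed 8-bit codes
typedef struct {
    float  d;            // block scale (fp16 in the real layout)
    int8_t qs[QK8_0];    // quantized weights
} block_q8_0_sketch;

static void quantize_block_q8_0(const float * x, block_q8_0_sketch * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; i++) {
        amax = fmaxf(amax, fabsf(x[i]));
    }
    const float d  = amax / 127.0f;              // largest value maps to +/-127
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = d;
    for (int i = 0; i < QK8_0; i++) {
        y->qs[i] = (int8_t) roundf(x[i] * id);   // dequantize later as qs[i] * d
    }
}
```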
3.3 Backend Abstraction (ggml-backend.h, ggml-backend-impl.h)
The backend system provides a device abstraction:
// simplified sketch — the full interface is defined in ggml-backend-impl.h
struct ggml_backend_i {
    const char * (*get_name)(ggml_backend_t backend);
    // memory management
    ggml_backend_buffer_type_t (*get_default_buffer_type)(ggml_backend_t backend);
    // graph execution
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
    // synchronization
    void (*synchronize)(ggml_backend_t backend);
    // ...
};
Supported backends:
| Backend | File(s) | Target |
|---|---|---|
| CPU | ggml-cpu.h + ggml.c | All CPUs (x86, ARM, RISC-V) |
| CUDA | ggml-cuda.h | NVIDIA GPUs |
| Metal | ggml-metal.h | Apple Silicon (M1/M2/M3/M4) |
| Vulkan | ggml-vulkan.h | Cross-vendor GPU |
| SYCL | ggml-sycl.h | Intel & NVIDIA GPU |
| HIP | ggml-cuda.h (ROCm path) | AMD GPU |
| MUSA | ggml-cuda.h (MUSA path) | Moore Threads GPU |
| CANN | ggml-cann.h | Ascend NPU |
| OpenVINO | ggml-openvino.h | Intel CPU/GPU/NPU |
| WebGPU | ggml-webgpu.h | Browser (via Emscripten) |
| RPC | ggml-rpc.h | Remote GPU (network) |
| BLAS | ggml-blas.h | BLAS libraries |
| OpenCL | ggml-opencl.h | Adreno GPU |
| ZenDNN | ggml-zendnn.h | AMD CPU |
| zDNN | ggml-zdnn.h | IBM Z/LinuxONE |
| VirtGPU | ggml-virtgpu.h | Virtual GPU API |
| Hexagon | ggml-hexagon.h | Snapdragon (in progress) |
The scheduler (ggml_backend_sched) orchestrates computation across multiple backends simultaneously. It partitions the computation graph and offloads compatible operations to GPU backends while keeping others on CPU — enabling CPU+GPU hybrid inference for models larger than VRAM.
3.4 Memory Allocator (ggml-alloc.h, ggml-alloc.c)
The allocator uses a multi-pool strategy tailored for the computation graph pattern:
- Temporary allocator for intermediate tensors
- Persistent allocator for weights and long-lived tensors
- Hash-based (ggml-alloc) and address-based (ggml-alloc-addr) strategies
- Integrated with backend memory management for GPU allocations
3.5 Optimization Module (ggml-opt.h, ggml-opt.cpp)
Provides training infrastructure:
- Adam/AdamW: Standard adaptive optimization with weight decay
- L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno for smaller optimization tasks
- SGD: Stochastic gradient descent with momentum
- Datasets: ggml_opt_dataset abstraction for training data batching
- Used by llama.cpp for embedding model fine-tuning
4. Core llama.cpp Library (src/)
4.1 Model Architecture (src/llama-model.cpp + src/models/*.cpp)
The model system supports ~150+ architectures. Each architecture is defined by:
Architecture enum (llama-arch.h): A single entry for each model type (e.g., LLM_ARCH_LLAMA, LLM_ARCH_QWEN2, LLM_ARCH_DEEPSEEK2, LLM_ARCH_MAMBA2)
Tensor naming (LLM_TN helper): Maps logical tensor roles (e.g., LLM_TENSOR_ATTN_Q, LLM_TENSOR_FFN_GATE) to GGUF tensor names (e.g., blk.0.attn_q.weight, blk.0.ffn_gate.weight).
Model class hierarchy (src/models/models.h as base):
llama_model (abstract base)
├── llama_model_llama — LLaMA 1/2/3, Mistral, Yi, etc.
├── llama_model_qwen2 — Qwen 2
├── llama_model_deepseek2 — DeepSeek V2/V3 (with MLA)
├── llama_model_mamba — Mamba state-space model
├── llama_model_rwkv6 — RWKV-6
├── llama_model_gemma3 — Gemma 3 (multimodal)
├── llama_model_qwen2vl — Qwen2-VL (vision-language)
├── ... and 140+ more
Each model class implements:
- model_load(): Load architecture-specific tensors from GGUF and set up hyperparameters
- Graph building: Typically build_llama(), build_qwen2(), etc., which construct the computation graph using llm_graph_context primitives (attention, FFN, norms, etc.)
Key hyperparameter structures:
- llama_hparams: Architecture-level parameters (n_layer, n_head, n_embd, n_ff, etc.)
- llama_cparams: Context-level parameters (n_ctx, n_batch, n_ubatch, pooling type, etc.)
4.2 Graph Building (src/llama-graph.h)
The graph construction is the heart of inference. The llm_graph_context class provides:
Layer building blocks:
- build_norm(): Layer norm or RMS norm (with learnable weights)
- build_qkv(): Compute Q, K, V projections (supports fused and separate paths)
- build_ffn(): Feed-forward network with various activation styles (SiLU, GELU, ReLU, SwiGLU, GeGLU, etc.)
- build_moe_ffn(): Mixture-of-Experts FFN with routing
- build_attn() / build_attn_mha(): Multi-head attention with KV cache support
Input abstractions (llm_graph_input_i):
Each input type abstracts a portion of the ubatch data into graph tensors:
- llm_graph_input_embd: Token embeddings or pre-computed embeddings
- llm_graph_input_pos: Position IDs
- llm_graph_input_attn_kv: KV cache indices and attention mask
- llm_graph_input_rs: Recurrent state copy indices (for Mamba/RWKV)
- llm_graph_input_no_cache: Full self-attention without caching
Graph reuse optimization: The system tracks whether a new batch would produce the same graph topology as a previous one. If so, it reuses the existing graph and only updates input tensors — avoiding expensive re-graph-building.
4.3 Context Management (src/llama-context.cpp, src/llama-context.h)
The llama_context struct manages the runtime state:
Key responsibilities:
- Batch decoding: Splits large logical batches into micro-batches (ubatch) that fit in GPU memory
- Graph scheduling: Reserves the backend scheduler, builds/reuses computation graphs
- Memory management: KV cache allocation, state save/load, per-sequence memory buffers
- Output buffers: Logits, embeddings, and sampling output management
- Multi-sequence support: Up to n_seq_max concurrent sequences with independent KV cache slots
- LoRA adapter switching: Hot-swap LoRA adapters without reloading the base model
Key methods:
- decode(): Process a batch of tokens through the model
- encode(): For encoder-decoder models, process the encoder
- process_ubatch(): The inner loop that feeds one micro-batch through the graph
- state_save_file() / state_load_file(): Full context state serialization
- state_seq_*(): Per-sequence state save/load (for speculative decoding rollback)
4.4 Model Loading (src/llama-model-loader.cpp)
The loader handles GGUF file parsing:
- Multi-file support: Handles sharded models (e.g., <prefix>-00001-of-00003.gguf)
- Memory mapping: Uses mmap() where supported for zero-copy weight loading
- Weight encoding: Supports BF16 and quantized tensor format conversion during load
- Metadata extraction: Reads all KV pairs to configure hyperparameters, tokenizer, and chat template
- Tensor allocation: Allocates tensor storage via the backend-specific buffer types (CPU RAM, GPU VRAM)
- Layer offloading: Distributes layers across GPU (specified by count) and CPU — see the example below
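In practice, offloading is controlled from the command line. A typical invocation (the model path is illustrative):

```
# offload the first 35 transformer layers to the GPU, keep the rest on the CPU
llama-cli -m ./models/llama-7b-Q4_K_M.gguf -p "Hello" -ngl 35
```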
4.5 Vocabulary & Tokenization (src/llama-vocab.cpp, src/llama-vocab.h)
Supports multiple tokenizer models:
| Tokenizer Type | Enum | Models |
|---|---|---|
| SentencePiece (Unigram) | LLAMA_VOCAB_TYPE_SPM | LLaMA, Mistral, Gemma |
| BPE (byte-level) | LLAMA_VOCAB_TYPE_BPE | GPT-2, Falcon, Qwen |
| WordPiece | LLAMA_VOCAB_TYPE_WPM | BERT |
| Unigram | LLAMA_VOCAB_TYPE_UGM | T5, Flan-T5 |
| RWKV | LLAMA_VOCAB_TYPE_RWKV | RWKV-6/7 |
| PLaMo-2 | LLAMA_VOCAB_TYPE_PLAMO2 | PLaMo-2 |
Pre-tokenizers (llama_vocab_pre_type): Handle architecture-specific text preprocessing before the main tokenizer — for example, the regex splitting rules used by LLaMA 3, DeepSeek, and GPT-4o-style BPE tokenizers, and byte-fallback handling. There are ~50 pre-tokenizer variants.
Key methods:
- tokenize(): Text → token IDs (with add_special/parse_special flags)
- detokenize(): Token IDs → text (with remove_special/unparse_special)
- token_to_piece(): Single token → text fragment
- byte_to_token() / token_to_byte(): For byte-level tokenizers
Chat template support: The vocab stores Hugging Face-style Jinja2 chat templates, applied via the common/chat.cpp module.
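As an illustration of how a caller applies the stored template, the sketch below uses the llama_chat_apply_template() C API. The signatures here follow a recent llama.h and change between releases; llama_model_chat_template() is assumed to return the model's built-in template string:

```c
// assumes an already-loaded llama_model * model
llama_chat_message msgs[] = {
    { "system", "You are a helpful assistant." },
    { "user",   "Write a haiku about GGUF."    },
};

const char * tmpl = llama_model_chat_template(model, /*name =*/ NULL);

char buf[4096];
int32_t n = llama_chat_apply_template(tmpl, msgs, 2, /*add_ass =*/ true, buf, sizeof(buf));
if (n > 0 && n < (int32_t) sizeof(buf)) {
    // buf now holds the fully formatted prompt, ready for llama_tokenize()
}
```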
4.6 Sampling (src/llama-sampler.cpp, src/llama-sampler.h)
The sampling system is a chain-of-responsibility:
Input: logits (raw model outputs, shape [n_vocab])
↓
Sampler Chain (each step transforms the candidate distribution):
├── TemperatureSampler — Divide logits by temperature
├── TopKSampler — Keep only top-k logits
├── TopPSampler (nucleus) — Keep cumulative probability p
├── MinPSampler — Keep logits ≥ min_p × max_logit
├── TypicalSampler — Keep logits with entropy below threshold
├── XTCSampler — Exclude Top Choices (probabilistically drops the highest-probability tokens)
├── DRYSampler — Context-aware repetition penalty (n-gram blocking)
├── RepetitionPenalty — Penalize recently generated tokens
├── FrequencyPenalty — Penalize high-frequency tokens
├── PresencePenalty — Penalize tokens that have appeared
├── MirostatSampler — Adaptive entropy-based sampling
├── GrammarSampler — Constrain output to a GBNF grammar
├── LevenshteinSampler — Constrain via Levenshtein automata
├── PenalizeNgram — N-gram repetition blocking
├── TailFreeSampler — Remove tail probability mass
├── LocallyTypical — Locally typical sampling
└── GreedySampler — Always pick most likely token (no randomness)
↓
Output: single token ID + optional probabilities
Samplers can be CPU-side (standard C++ implementations) or backend-side (executed as part of the GPU computation graph for zero-copy). Backend sampling is more efficient but requires the sampler to be representable as ggml operations.
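In the public C API, a chain is assembled from individual samplers. A minimal sketch (function names from the current llama.h; the parameter values are arbitrary):

```c
llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, /*min_keep =*/ 1));
llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));  // final random pick

// after llama_decode(), sample from the logits of the last token in the batch
llama_token tok = llama_sampler_sample(chain, ctx, -1);

llama_sampler_free(chain);
```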
4.7 KV Cache (src/llama-kv-cache.cpp, src/llama-kv-cache.h)
The KV cache stores computed Key and Value tensors from previous tokens, avoiding recomputation:
- Standard KV cache: Linear storage of K/V for each layer, each sequence
- iSWA KV cache (llama-kv-cache-iswa.*): Interleaved sliding-window attention cache — stores both full and windowed K/V for models with sliding-window attention
- Memory cost: Typically the largest memory consumer (~2 × n_layers × n_embd_kv × n_ctx × dtype_size; see the worked example below)
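As a rough illustration (assuming full multi-head attention with no GQA, so n_embd_kv = n_embd): a 7B-class model with n_layers = 32, n_embd_kv = 4096, n_ctx = 4096, and F16 K/V needs about 2 × 32 × 4096 × 4096 × 2 bytes ≈ 2 GiB of KV cache. Grouped-query attention models shrink this proportionally to their reduced number of KV heads.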
Key operations:
- set() / get(): Basic K/V storage and retrieval
- cpy_k() / cpy_v(): Efficient copy for sequence management
- head() / size(): Cache capacity tracking
4.8 Recurrent Memory (src/llama-memory-recurrent.cpp)
For state-space models like Mamba and RWKV, there's no KV cache. Instead, a recurrent state is maintained:
- llama_memory_recurrent_context: Manages per-sequence hidden states (shape depends on architecture)
- State copy/update operations during batch processing
- Hybrid models (like Jamba) use both KV cache and recurrent memory
4.9 Grammar Engine (src/llama-grammar.cpp, src/llama-grammar.h)
Implements GBNF (GGML BNF) grammar parsing and guided generation:
- Parses .gbnf grammar files (BNF-like notation with character sets, repetitions, alternatives)
- Maintains parse stacks over the grammar rules (effectively a pushdown automaton) and advances them as tokens are accepted
- During sampling, computes a grammar mask — a boolean vector over the vocabulary indicating which tokens are valid next
- Enables structured output: JSON, code, mathematical expressions, etc.
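For example, a small GBNF grammar that forces the model to answer with a tiny JSON object (illustrative; the grammar syntax is documented in grammars/README.md):

```
# {"answer": "yes"} or {"answer": "no"}
root   ::= "{" ws "\"answer\"" ws ":" ws answer ws "}"
answer ::= "\"yes\"" | "\"no\""
ws     ::= [ \t\n]*
```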
5. Common Utilities (common/)
The common/ directory provides shared functionality used by all tools and examples:
5.1 Chat System (common/chat.cpp, common/chat.h)
- Chat templates: Applies Jinja2-like templates (LLaMA 3, ChatML, etc.) to format conversations
- PEG parser (common/peg-parser.*): A parsing expression grammar engine for output structuring
- Auto parser (common/chat-auto-parser.*): Automatically detects model capabilities and applies appropriate parsing
- Chat diff analyzer (common/chat-diff-analyzer.cpp): Tracks conversation state changes
5.2 Sampling Extensions (common/sampling.cpp, common/sampling.h)
- Convenience wrappers for sampler chain construction
- DRY repetition penalty (n-gram based, llama_sampler_init_dry_testing)
- Builds llama_sampler_chain instances with pre-tuned defaults
5.3 Speculative Decoding (common/speculative.cpp, common/speculative.h)
- Draft-verification: Uses a smaller "draft" model (or same model with fewer layers) to generate candidate tokens, then verifies them with the target model
- Batch acceptance: Process draft tokens in parallel for significant speedup
- Target model and draft model can run on different devices
5.4 Grammar & JSON Tools
- json-schema-to-grammar.*: Converts JSON Schema to GBNF grammar for constrained generation
- json-partial.*: Handles partial/incremental JSON parsing for streaming
- regex-partial.*: Partial regex matching for streaming validation
- PEG parser (common/peg-parser.*): Full PEG parsing engine for complex output constraints
5.5 N-gram Cache (common/ngram-cache.*, common/ngram-map.*, common/ngram-mod.*)
- Prefix tree-based n-gram tracking
- Used by repetition penalties and speculative decoding
5.6 Other Utilities
- common.cpp: Argument parsing, model path resolution, GPU info
- console.cpp: Terminal I/O with color and raw mode
- log.cpp: Logging infrastructure
- hf-cache.cpp: Hugging Face Hub integration (model downloads, caching)
- dl.cpp / download.cpp: HTTP download with resume
- fit.cpp: Parameter fitting (e.g., optimal alpha value for models)
- preset.cpp: Configuration presets
- llguidance.cpp: Integration with Microsoft's llguidance library
- reasoning-budget.cpp: Token budget management for chain-of-thought
- unicode.cpp: Unicode utilities
6. Tools (tools/)
6.1 llama-cli (tools/cli/)
The primary command-line interface for interactive use:
- Conversation mode with chat templates
- Grammar-constrained generation
- Multi-modal input (images for vision models)
- Completion-style interface (for code FIM)
- Presets for quick configuration
- Streaming output
6.2 llama-server (tools/server/)
An OpenAI-compatible HTTP server:
- POST /v1/completions — Text completion
- POST /v1/chat/completions — Chat completion
- POST /v1/embeddings — Embedding extraction
- POST /v1/rerank — Reranking
- GET /v1/models — Model listing
- Multi-user support with queuing (server-queue.*)
- Concurrent decoding with configurable slot count
- Tool/function calling support (server-tools.*)
- CORS proxy
- Web UI at http://localhost:8080 (see the example request below)
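Because the endpoints follow the OpenAI wire format, any OpenAI client works against it. A minimal request against the default port:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user",   "content": "Hello!"}
        ]
      }'
```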
Server components:
- server-http.*: HTTP parsing and routing (based on cpp-httplib)
- server-context.*: Per-request context management
- server-chat.*: Chat history and formatting
- server-models.*: Multi-model management
- server-queue.*: Request queuing and scheduling
- server-task.*: Task lifecycle management
- server-tools.*: Tool/function calling implementation
6.3 llama-perplexity (tools/perplexity/)
Evaluates model quality:
- Perplexity measurement over text corpora
- KL divergence computation
- Cross-entropy scoring
6.4 llama-bench (tools/llama-bench/)
Performance benchmarking:
- Tests prompt processing (pp) and text generation (tg) throughput
- Multi-configuration comparison
- JSON output for automation
6.5 llama-quantize (tools/quantize/)
Model quantization tool:
- Converts F16/BF16 → any quantized format
- Importance matrices for IQ quantization
- Mixed quantization (different types for different layers)
- Output quality estimation
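A typical invocation (file names illustrative) requantizes an F16 GGUF to a medium 4-bit K-quant:

```
llama-quantize ./model-f16.gguf ./model-Q4_K_M.gguf Q4_K_M
```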
6.6 Other Tools
- gguf-split/: Split/merge GGUF files
- imatrix/: Generate importance matrices for quantization
- export-lora/: Export LoRA adapters from training
- tokenize/: Tokenization debugging tool
- results/: Analyze benchmark results
- rpc/: RPC server for remote GPU compute
- mtmd/: Multi-modal processing (CLIP-based vision encoder)
- parser/: Template/debug parser analysis
- completion/: Bash completion script generation
- batched-bench/: Batched inference benchmarking
- fit-params/: Parameter fitting utilities
- cvector-generator/: Control vector generation
- tts/: Text-to-speech integration
7. Examples (examples/)
The examples directory provides reference implementations:
| Example | Description |
|---|---|
| simple/ | Minimal inference: load model, tokenize, decode, sample — the simplest possible usage |
| simple-chat/ | Minimal chat loop with template handling |
| batched/ | Batched inference with multiple sequences |
| parallel/ | Parallel decoding with independent sequences |
| speculative/ | Speculative decoding implementation |
| speculative-simple/ | Minimal speculative decoding |
| server/ → tools/server/ | Production server (moved to tools/) |
| save-load-state/ | Context state serialization |
| embedding/ | Embedding extraction |
| eval-callback/ | Custom evaluation callbacks |
| training/ | Fine-tuning example |
| gguf/ | GGUF file reading/writing |
| diffusion/ | Diffusion model inference |
| retrieval/ | RAG-style retrieval-augmented generation |
| lookahead/ | Lookahead decoding |
| lookup/ | Lookup-based decoding |
| passkey/ | Passkey retrieval task |
| idle/ | Idle/keepalive example |
| llama.android/ | Android app using JNI |
| llama.swiftui/ | SwiftUI iOS/macOS app |
| batched.swift/ | Swift batched inference |
| debug/ | Debug utilities |
| model-conversion/ | Model conversion scripts |
| convert-llama2c-to-ggml/ | Karpathy's llama2.c format converter |
| sycl/ | SYCL backend example |
| simple-cmake-pkg/ | CMake package integration |
| gen-docs/ | Documentation generation |
8. Model Implementations (src/models/)
Each .cpp file in src/models/ implements the graph building logic for a specific architecture. Here are the major families:
8.1 Decoder-Only Transformer (Most common)
| File | Model(s) |
|---|---|
| llama.cpp | LLaMA 1/2/3, Mistral, Yi, Hermes, etc. |
| llama4.cpp | LLaMA 4 (Meta's latest) |
| qwen.cpp | Qwen 1 |
| qwen2.cpp | Qwen 2 |
| qwen3.cpp | Qwen 3 |
| qwen35.cpp | Qwen 3.5 |
| qwen2vl.cpp | Qwen2-VL (vision-language) |
| qwen3vl.cpp | Qwen3-VL (vision-language) |
| qwen3vlmoe.cpp | Qwen3-VL MoE |
| deepseek.cpp | DeepSeek V1 |
| deepseek2.cpp | DeepSeek V2/V3/R1 (with Multi-head Latent Attention) |
| deepseek2ocr.cpp | DeepSeek V2 + OCR |
| phi2.cpp | Phi-2 |
| phi3.cpp | Phi-3 |
| gemma.cpp | Gemma 1 |
| gemma2.cpp | Gemma 2 |
| gemma3.cpp | Gemma 3 (multimodal) |
| gemma3n.cpp | Gemma 3N |
| gemma4.cpp | Gemma 4 |
| mistral3.cpp | Mistral 3 |
| mistral4.cpp | Mistral 4 |
8.2 Mixture-of-Experts (MoE)
| File | Model(s) |
|---|---|
| mixtral.cpp → via llama.cpp | Mixtral 8x7B |
| qwen2moe.cpp | Qwen2 MoE |
| qwen35moe.cpp | Qwen 3.5 MoE |
| qwen3moe.cpp | Qwen 3 MoE |
| phimoe.cpp | PhiMoE |
| dbrx.cpp | DBRX |
| deepseek2.cpp | DeepSeek V3 (MoE) |
| olmoe.cpp | OLMoE |
| openai-moe.cpp | OpenAI MoE |
| arctic.cpp | Snowflake Arctic |
| jamba.cpp | Jamba (hybrid attention + Mamba MoE) |
| exaone-moe.cpp | EXAONE MoE |
| hunyuan-moe.cpp | Hunyuan MoE |
| granite-moe.cpp | Granite MoE |
| bailingmoe.cpp | Bailing MoE |
| bailingmoe2.cpp | Bailing MoE 2 |
| grokmoe.cpp | Grok MoE |
| lfm2moe.cpp | LFM 2 MoE |
| glm4-moe.cpp | GLM4 MoE |
| llada-moe.cpp | LLADA MoE |
| ernie4-5-moe.cpp | ERNIE 4.5 MoE |
| nemotron-h-moe.cpp | Nemotron H MoE |
| llama4.cpp | LLaMA 4 MoE variants |
| afmoe.cpp | AF MoE |
8.3 State-Space / Recurrent
| File | Model(s) |
|---|---|
| mamba.cpp | Mamba |
| mamba2.cpp | Mamba 2 |
| mamba-base.cpp | Base Mamba operations |
| rwkv6.cpp | RWKV-6 |
| rwkv7.cpp | RWKV-7 |
| rwkv6-base.cpp | RWKV-6 base ops |
| rwkv6qwen2.cpp | RWKV-6 + Qwen2 hybrid |
| arwkv7.cpp | Attention + RWKV-7 |
8.4 Encoder Models (BERT, Embeddings)
| File | Model(s) |
|---|---|
| bert.cpp | BERT |
| modern-bert.cpp | ModernBERT |
| nomic-bert.cpp | Nomic BERT |
| nomic-bert-moe.cpp | Nomic BERT MoE |
| jina-bert-v2.cpp | Jina BERT v2 |
| jina-bert-v3.cpp | Jina BERT v3 |
| eurobert.cpp | EuroBERT |
| neo-bert.cpp | NeoBERT |
| llama-embed.cpp | LLaMA Embedding |
| pangu-embed.cpp | Pangu Embedding |
| t5encoder.cpp | T5 Encoder |
| gpt2.cpp | GPT-2 |
| gptneox.cpp | GPT-NeoX, Pythia |
8.5 Encoder-Decoder
| File | Model(s) |
|---|---|
| t5.cpp | T5, Flan-T5 |
| chameleon.cpp | Chameleon (Meta) |
| minicpm.cpp | MiniCPM |
| minicpm3.cpp | MiniCPM 3 |
8.6 Quantized Base Models
| File | Model(s) |
|---|---|
| bitnet.cpp | BitNet b1.58 (1.58-bit ternary) |
| plamo.cpp | PLaMo |
| plamo2.cpp | PLaMo 2 |
| plamo3.cpp | PLaMo 3 |
8.7 Other Notable Architectures
| File | Model(s) |
|---|---|
| chatglm.cpp | ChatGLM 3 |
| glm4.cpp | GLM-4 |
| glm-dsa.cpp | GLM DSA |
| bloom.cpp | BLOOM |
| falcon.cpp | Falcon |
| falcon-h1.cpp | Falcon H1 |
| command-r.cpp | Cohere Command-R |
| cohere2.cpp | Cohere 2 |
| grok.cpp | Grok-1 |
| stablelm.cpp | StableLM |
| starcoder.cpp | StarCoder |
| starcoder2.cpp | StarCoder 2 |
| olmo.cpp | OLMo |
| olmo2.cpp | OLMo 2 |
| openelm.cpp | Apple OpenELM |
| internlm2.cpp | InternLM 2 |
| xverse.cpp | Xverse |
| jais.cpp | Jais |
| jais2.cpp | Jais 2 |
| exaone.cpp | EXAONE |
| exaone4.cpp | EXAONE 4 |
| granite.cpp | IBM Granite |
| granite-hybrid.cpp | IBM Granite Hybrid |
| deci.cpp | Deci |
| codeshell.cpp | CodeShell |
| orion.cpp | Orion |
| nemotron.cpp | NVIDIA Nemotron |
| nemotron-h.cpp | NVIDIA Nemotron H |
| hunyuan-dense.cpp | Tencent Hunyuan Dense |
| hunyuan-vl.cpp | Hunyuan Vision-Language |
| smollm3.cpp | SmolLM 3 |
| lfm2.cpp | Liquid LFM 2 |
| dream.cpp | Dream |
| smallthinker.cpp | SmallThinker |
| llada.cpp | LLADA |
| seed-oss.cpp | Seed OSS |
| dots1.cpp | Dots 1 |
| apertus.cpp | Apertus |
| minimax-m2.cpp | MiniMax M2 |
| cogvlm.cpp | CogVLM |
| rnd1.cpp | RND-1 |
| ernie4-5.cpp | ERNIE 4.5 |
| step35.cpp | Step 3.5 |
| maincoder.cpp | MainCoder |
| kimi-linear.cpp | Kimi K2 (Linear Attention) |
| mimo2.cpp | MIMO 2 |
| paddleocr.cpp | PaddleOCR |
9. Python Infrastructure
9.1 Model Conversion (convert_hf_to_gguf.py)
The massive (668KB) conversion script handles:
- Loading Hugging Face models (PyTorch, SafeTensors)
- Architecture-specific tensor mapping (HF names → GGUF names)
- Vocabulary extraction (tokenizer.model, tokenizer.json, tokenizer_config.json)
- Chat template preservation
- Metadata embedding (model name, description, license, etc.)
- Output to GGUF format
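A typical conversion run (paths illustrative; --outfile and --outtype are the commonly used flags):

```
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16
```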
9.2 GGUF Python Library (gguf-py/)
Official Python library for GGUF manipulation:
- gguf/ — Core GGUF reading/writing
- examples/ — Usage examples
- tests/ — Test suite
- Used by the Hugging Face Spaces ecosystem (GGUF-my-repo, GGUF-editor)
9.3 Other Python Scripts
- convert_llama_ggml_to_gguf.py: Legacy GGML-to-GGUF migration
- convert_lora_to_gguf.py: LoRA adapter conversion
- convert_hf_to_gguf_update.py: Auto-update script for new model support
10. Build System & CI
10.1 CMake (CMakeLists.txt, CMakePresets.json)
The primary build system supporting:
- GPU backend selection (-DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON, etc. — see the example invocations below)
- Quantization kernel selection
- Build type (Release/Debug/RelWithDebInfo)
- Sanitizers (ASan, UBSan, TSAN)
- Static/shared library builds
- macOS framework generation (XCFramework for iOS/tvOS/visionOS)
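Typical builds look like this (standard CMake workflow; the backend flag matches the list above):

```
# CPU-only release build
cmake -B build
cmake --build build --config Release -j

# same, with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```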
10.2 GitHub Actions Workflows (.github/workflows/)
Comprehensive CI covering:
- server.yml: Server build and test across platforms
- build.yml: Main build matrix (Linux, macOS, Windows, various backends)
- models.yml: Model conversion and validation
- codeql.yml: Security scanning
- stale.yml: Issue management
- clang-tidy.yml: Static analysis
- Containerized Snapdragon and CUDA builds
10.3 Devops (devops/)
- Nix flake (flake.nix): Reproducible builds
- Docker: Multi-platform container builds
- NixOS module: Systemd service configuration
11. Data Flow: From User Input to Generated Token
A typical inference call flow through llama-simple:
1. llama_model_load_from_file()
└─ llama_model_loader: parse GGUF, mmap weights, allocate tensors
└─ model-specific loader (e.g., llama_model_llama::model_load())
2. llama_init_from_model()
└─ Create llama_context with KV cache, graph scheduler, output buffers
3. llama_tokenize()
└─ Text → [token IDs] using the vocab's tokenizer
4. llama_decode(ctx, batch)
└─ llama_context::decode()
└─ Split batch into ubatches
└─ llama_context::process_ubatch()
├─ Build/reuse computation graph via llm_graph_context
│ ├─ build_inp_embd() — Token embeddings lookup
│ ├─ build_inp_pos() — Position encoding
│ └─ For each layer:
│ ├─ build_norm() — Layer/RMS norm
│ ├─ build_qkv() — QKV projections
│ ├─ build_attn() — Self-attention + KV cache
│ ├─ build_ffn() — Feed-forward network
│ └─ Residual connections
├─ graph_compute() — Execute via ggml_backend_sched
│ ├─ Dispatch ops to GPU backends (CUDA/Metal/Vulkan)
│ └─ Fallback to CPU for unsupported ops
└─ Extract logits from output buffer
5. llama_sampler_sample(smpl, ctx, -1)
└─ Apply sampler chain to logits
└─ Grammar mask, top-k, top-p, temperature, repetition penalty...
└─ Return next token ID
6. llama_token_to_piece()
└─ Token ID → text character(s)
7. Append token to batch, goto 4 (autoregressive loop)
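The same flow, collapsed into code, looks roughly like the sketch below. It loosely follows examples/simple; the exact function names and signatures track the current llama.h and have changed between releases, so treat them as an approximation rather than a stable API listing:

```cpp
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // (depending on the build, llama_backend_init()/ggml_backend_load_all() may be required first)

    // 1-2. load the model and create a context
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                                     // offload as many layers as fit
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;
    llama_context * ctx = llama_init_from_model(model, cparams);

    // 5. sampler chain (greedy, for simplicity)
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // 3. tokenize the prompt (a negative return value is the required token count)
    std::string prompt = "The capital of France is";
    const int n_tok = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, true);
    std::vector<llama_token> tokens(n_tok);
    llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true);

    // 4-7. autoregressive loop: decode, sample, print, feed the new token back in
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    llama_token tok;
    for (int i = 0; i < 64; i++) {
        if (llama_decode(ctx, batch) != 0) break;
        tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];
        const int n = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        printf("%.*s", n, piece);

        batch = llama_batch_get_one(&tok, 1);
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```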
12. Key Design Decisions & Trade-offs
12.1 Why C/C++ and Not Python?
- Zero dependencies: No Python runtime, no PyTorch, no CUDA toolkit required at deployment
- Portability: Runs on bare-metal, embedded systems, iOS/Android apps
- Performance: Complete control over memory layout, SIMD intrinsics, and threading
- Small binary footprint: A static build is ~3-10 MB
12.2 Why GGUF and Not Hugging Face/SafeTensors?
- Trivially parseable: Binary format with no Python dependency for reading
- MMAP-friendly: Tensor data is a single contiguous block suitable for memory-mapped I/O
- Self-contained: Stores tokenizer, chat template, and all metadata in the same file
- Quantized storage: Natively stores quantized types without conversion on load
12.3 Computation Graph vs. Eager Execution
- Graph approach: Build a DAG of operations, then execute — enables graph optimization, backend scheduling, and memory reuse
- Trade-off: Higher initial latency to build the graph, but subsequent executions with the same topology are fast
- Reuse optimization: The llm_graph_result system avoids rebuilding graphs when the topology hasn't changed
12.4 Quantization Strategy
- K-quants: Block-based quantization (Q2_K through Q6_K) with hierarchical scaling provides excellent accuracy/size trade-offs
- IQ: Importance-aware quantization pushes quality boundaries at very low bitrates (1.5-2.5 bpw)
- No re-quantization at runtime: Weights stay quantized; matmul kernels operate directly on quantized data
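As a rough back-of-the-envelope example of the size trade-off: a 7B-parameter model at Q4_K_M's roughly 4.8 bits per weight takes about 7×10⁹ × 4.8 / 8 ≈ 4.2 GB on disk, versus ~14 GB at F16.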
13. Performance Characteristics
| Operation | Bottleneck | Optimization |
|---|---|---|
| Prompt processing (prefill) | Matrix multiply (GEMM) | GPU tensor cores, quantized matmul, batch parallelism |
| Token generation (decoding) | Memory bandwidth | KV cache reuse, quantized weights, memory-bound matmul |
| Large context (>32K) | Attention computation | Flash attention, paged attention, sliding window |
| Batch inference | Throughput vs. latency | Dynamic batching, concurrent slots, KV cache sharing |
| Multi-modal | Encoder compute | Separable encoder/decoder graphs, cross-attention |
Typical speeds (Q4_K_M, M2 Ultra, 7B model):
- Prompt processing: ~5,000-10,000 tokens/second
- Text generation: ~50-200 tokens/second (varies with context size and quantization)
14. Bindings & Ecosystem
The project has extensive bindings enabling use from virtually any language:
| Language | Package | Maintainer |
|---|---|---|
| Python | llama-cpp-python | abetlen |
| Python | easy-llama | ddh0 |
| Go | go-llama.cpp | go-skynet |
| Go | yzma (no CGo) | hybridgroup |
| Node.js | node-llama-cpp | withcatai |
| Rust | llama_cpp-rs | edgenai/utilityai |
| C# | LLamaSharp | SciSharp |
| Java | java-llama.cpp | kherud |
| Swift | llama-cpp-swift | srgtuszy |
| Ruby | llama_cpp.rb | yoshoku |
| Zig | llama.cpp.zig | deins |
| Flutter/Dart | llama_cpp_dart | netdur |
| Android | llama.android | in-repo |
Popular downstream projects:
- Ollama: The most popular user-facing LLM runner, built on llama.cpp
- LM Studio: GUI application for model exploration
- GPT4All: Desktop LLM client by Nomic AI
- llamafile: Mozilla's single-file executable LLM (combines llama.cpp + Cosmopolitan libc)
- text-generation-webui: oobabooga's powerful web interface
- koboldcpp: Storytelling-focused fork
- LocalAI: Kubernetes-native LLM serving
- Jan: Open-source ChatGPT alternative
15. Development & Contribution
15.1 Code Organization Philosophy
- src/: Core library (libllama) — depends only on ggml
- ggml/: Tensor library (standalone, used by multiple projects)
- common/: Shared utilities for tools — depends on src/
- tools/: Standalone executables — depend on common/
- examples/: Minimal reference implementations
15.2 Adding a New Model
The process (detailed in docs/development/HOWTO-add-model.md):
- Add the architecture enum to llama-arch.h
- Add tensor name mappings in llama-arch.cpp
- Write the model loader and graph builder in src/models/<name>.cpp
- Register the architecture in the model mapping function in llama-model.cpp
- Add a pre-tokenizer type if needed in llama-vocab.cpp
- Add GGUF KV key definitions
- Update the Python converter (convert_hf_to_gguf.py) for HF model mapping
16. Project Stats (Approximate)
| Metric | Value |
|---|---|
| Total source files (C/C++) | ~250+ |
| Model architectures | 150+ |
| GPU backends | 15+ |
| Quantization formats | 30+ |
| Sampler types | 20+ |
| Lines of C/C++ code | ~80,000+ (src + ggml) |
| Python conversion scripts | 4 (major) |
| Bindings (languages) | 15+ |
| Downstream projects | 100+ |
| Commits | 10,000+ |
| Contributors | 1,000+ |
| GitHub stars | 75,000+ |
This document was generated by exploring the repository structure, source code, and documentation of llama.cpp (commit as of May 2026). It represents a snapshot of a rapidly evolving project. For the most current information, refer to the repository and its official documentation.