nb1t.sh

Llama.cpp Architecture

Fri May 08 2026 · Nitin Bansal

1. About llama.cpp

llama.cpp is a high-performance, cross-platform inference engine for Large Language Models (LLMs) written entirely in C and C++. Its primary goal is to enable LLM inference with minimal setup, zero external dependencies, and state-of-the-art performance on a vast range of hardware — from consumer laptops and Raspberry Pis to multi-GPU servers.

Rather than wrapping Python ML frameworks (PyTorch, TensorFlow), llama.cpp is a from-scratch implementation built atop the ggml tensor library. This approach yields:

  • Zero external runtime dependencies — no Python, no PyTorch, no CUDA runtime required
  • First-class Apple Silicon support — ARM NEON, Accelerate, and Metal frameworks
  • Extreme quantization — supports 1.5-bit through 8-bit integer quantization (IQ1_S through Q8_0, including K-quant and importance-aware IQ variants)
  • Hardware diversity — CPU (x86 with AVX/AVX2/AVX512/AMX, ARM NEON, RISC-V with RVV), GPU (CUDA, Metal, Vulkan, SYCL, HIP, MUSA, WebGPU), and NPU/accelerator (CANN, OpenVINO, ZenDNN) backends
  • Server-mode — OpenAI API-compatible HTTP server for production deployments
  • Multimodal support — vision (LLaVA, Qwen2-VL), audio (Whisper-style), and cross-attention encoder-decoder models

The project serves as the primary development playground for the ggml tensor library, driving innovations in quantized matrix multiplication, SIMD kernels, and GPU compute.

  • Repository: github.com/ggml-org/llama.cpp
  • License: MIT (Core C/C++ library under MIT; bindings and examples may vary)
  • Core Language: C/C++ with Python scripts for model conversion

2. Architecture Overview

The system is organized into cleanly separated layers:

┌─────────────────────────────────────────────────────────────┐
│                    Tools & Applications                      │
│  llama-cli │ llama-server │ llama-perplexity │ llama-bench  │
│  llama-quantize │ llama-tokenize │ llama-gguf-split │ ...   │
├─────────────────────────────────────────────────────────────┤
│              Common Library (common/)                        │
│  Sampling │ N-gram cache │ PEG Parser │ JSON Schema → GBNF  │
│  Chat templates │ HTTP client │ Presets │ Speculative        │
├─────────────────────────────────────────────────────────────┤
│              libllama Core (src/)                            │
│  Model Loading │ Context Management │ Graph Building         │
│  KV Cache │ Vocabulary │ Grammar Engine │ LoRA Adapters      │
│  Quantization │ Samplers │ Chat Formats │ Session State      │
├─────────────────────────────────────────────────────────────┤
│         Model Implementations (src/models/)                  │
│  150+ architectures: LLaMA, Mistral, Qwen, DeepSeek,        │
│  Gemma, Phi, Mamba, RWKV, BERT, Falcon, Grok, etc.          │
├─────────────────────────────────────────────────────────────┤
│              ggml Tensor Library (ggml/)                     │
│  Tensor Ops │ Automatic Differentiation │ Optimizers        │
│  Backend Abstraction │ Memory Allocator │ Threading          │
├─────────────────────────────────────────────────────────────┤
│           GPU Backend Implementations (ggml/src/)            │
│  CUDA │ Metal │ Vulkan │ SYCL │ HIP │ CANN │ WebGPU │ RPC   │
│  OpenCL │ OpenVINO │ ZenDNN │ zDNN │ VirtGPU │ Hexagon      │
└─────────────────────────────────────────────────────────────┘

2.1 File Format: GGUF

All models are stored in the GGUF (GGML Universal Format) binary format:

[Magic "GGUF"] [Version] [Tensor Count] [KV Metadata Count]
  → KV Pairs: (key-string, type, value) — architecture, hyperparameters, tokenizer config, etc.
  → Tensor Info: (name, dimensions, dtype, offset in blob)
  → Tensor Data Blob: contiguous binary weight data

GGUF is a self-describing format that evolved from earlier GGML/GGJT formats. Key features:

  • Key-Value metadata: Stores architecture name, model hyperparameters (n_layer, n_head, n_embd, etc.), RoPE config, tokenizer data (vocab, merges, scores, chat template), and arbitrary metadata
  • Alignable data section: Supports configurable alignment (default 32 bytes) for direct memory-mapped access
  • Multi-file splits: Large models can be split across multiple .gguf files (model-00001-of-00003.gguf)
  • Extensible: New architectures add new KV keys and tensor names without breaking backward compatibility

The Python script convert_hf_to_gguf.py (668 KB) handles conversion from Hugging Face PyTorch/SafeTensors checkpoints to GGUF. There are also specialized converters for LoRA adapters, llama2.c models, and legacy GGML-to-GGUF migration.
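
Because the format is self-describing, dumping a model's metadata needs nothing beyond ggml's gguf.h API. A minimal sketch (function names per recent ggml trees; check your version's header for exact signatures):

#include "gguf.h" // bundled with ggml; older trees exposed these via ggml.h
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * g = gguf_init_from_file(argv[1], params);
    if (!g) { fprintf(stderr, "not a GGUF file: %s\n", argv[1]); return 1; }

    // KV metadata: architecture, hyperparameters, tokenizer, chat template, ...
    for (int64_t i = 0; i < gguf_get_n_kv(g); i++) {
        printf("kv[%lld] %s\n", (long long) i, gguf_get_key(g, i));
    }
    // Tensor directory: names resolve to offsets inside the data blob
    for (int64_t i = 0; i < gguf_get_n_tensors(g); i++) {
        printf("tensor[%lld] %s\n", (long long) i, gguf_get_tensor_name(g, i));
    }

    gguf_free(g);
    return 0;
}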


3. The ggml Tensor Library (Core Foundation)

The ggml/ directory contains the foundational tensor computation library. It is a standalone project (also used by whisper.cpp and stable-diffusion.cpp).

3.1 Core Tensor System (ggml.h, ggml.c)

Key features:

  • Multi-dimensional tensors: Up to 4 dimensions, with support for view/reshape/permute
  • Data types: F32, F16, BF16, and ~30 quantized types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K through Q6_K, IQ1_S through IQ4_XS, TQ1_0/TQ2_0 ternary, MXFP4, NVFP4, and more)
  • Computation graph: Build a graph of tensor operations (ggml_cgraph), then compute forward (inference) or backward (gradients)
  • Automatic differentiation: Mark tensors as parameters with ggml_set_param(), then compute gradients via backward pass
  • Optimizers: Built-in Adam, L-BFGS, and SGD for training/fine-tuning
  • SIMD-optimized kernels: Hand-tuned assembly/C for AVX2, AVX512, ARM NEON, RISC-V RVV, WASM SIMD, etc.
  • Thread pool: Parallel computation across multiple CPU threads

Key operations (ggml_ops): MatMul (regular and quantized), convolution, attention (scaled dot-product), RoPE, RMS norm, layer norm, softmax, SiLU/GELU/ReLU activations, ArgMax, cross-entropy loss, and more. Full operation list in docs/ops.md.
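
The build-then-execute pattern looks like this in miniature — a sketch against the C API, assuming a CPU-only build (ggml_graph_compute_with_ctx has moved between ggml.h and ggml-cpu.h across versions):

#include "ggml.h"
#include "ggml-cpu.h" // ggml_graph_compute_with_ctx lives here in recent trees
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024, // arena for tensors + graph metadata
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // A is 4x2, B is 4x3; ggml_mul_mat contracts dimension 0 (length 4)
    struct ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    struct ggml_tensor * B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
    ggml_set_f32(A, 1.0f);
    ggml_set_f32(B, 2.0f);

    struct ggml_tensor * C = ggml_mul_mat(ctx, A, B); // 2x3 result, lazy

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, C);        // record the DAG ending at C
    ggml_graph_compute_with_ctx(ctx, gf, 4); // execute on 4 CPU threads

    printf("C[0][0] = %.1f\n", ggml_get_f32_1d(C, 0)); // 4 * (1 * 2) = 8.0
    ggml_free(ctx);
    return 0;
}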

3.2 Quantization Engine (ggml-quants.c, ggml-quants.h)

The quantization subsystem is arguably the most performance-critical component. It implements:

Block quantization scheme: Weights are divided into small blocks (typically 32 elements). Each block stores:

  • A shared scale (and optionally min value)
  • Each weight is quantized to low-bit representation relative to the block scale
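
Concretely, a Q8_0-style block has the following shape. This is a simplified sketch — the real block_q8_0 in ggml-common.h stores the scale as fp16, and the reference quantizer lives in ggml-quants.c:

#include <cstdint>
#include <cmath>

constexpr int QK = 32; // elements per block

struct block_q8 {
    float  d;       // shared scale for the block (fp16 in the real struct)
    int8_t qs[QK];  // 8-bit quantized weights
};

// quantize_row_* analogue: one block of floats -> one quantized block
void quantize_block(const float * x, block_q8 & b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) amax = std::fmax(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;                  // map [-amax, amax] onto [-127, 127]
    const float id = b.d ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK; i++) b.qs[i] = (int8_t) std::lroundf(x[i] * id);
}

// dequantize_row_* analogue
void dequantize_block(const block_q8 & b, float * y) {
    for (int i = 0; i < QK; i++) y[i] = b.d * b.qs[i];
}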

Quantized types (from lowest to highest precision):

Type     Bits/Weight  Description
IQ1_S    1.56         1-bit importance-weighted
TQ1_0    1.69         Ternary quantization
IQ1_M    1.75         1-bit importance-weighted, medium
TQ2_0    2.06         Ternary quantization
IQ2_XXS  2.06         2-bit, extremely extra-small
IQ2_XS   2.31         2-bit, extra-small
Q2_K     2.56-3.06    2-bit K-quant (small/medium variants)
IQ3_XXS  3.06         3-bit, extremely extra-small
Q3_K     3.50-4.25    3-bit K-quant (small/medium/large)
IQ4_NL   4.50         4-bit non-linear
Q4_K     4.50-5.50    4-bit K-quant (small/medium)
Q5_K     5.50-6.20    5-bit K-quant (small/medium)
Q6_K     6.56         6-bit K-quant
Q8_0     8.00         8-bit block quantization
F16      16.00        Half-precision float
BF16     16.00        Bfloat16
F32      32.00        Full-precision float

"K-quant" (Q2_K through Q6_K) uses a clever super-block scheme: larger blocks (~256 elements) contain multiple sub-blocks with their own scales, plus a super-block scale. This provides superior accuracy vs. size compared to simple per-block quantization.

"IQ" (Importance-aware Quantization) applies non-uniform quantization that assigns more bits to important weights and fewer to unimportant ones, improving perplexity at very low bitrates.

Each quantized type has:

  • quantize_row_*: Convert a row of floats → quantized blocks
  • dequantize_row_*: Convert quantized blocks → a row of floats
  • vec_dot kernels (e.g., ggml_vec_dot_q4_0_q8_0): compute dot products directly on quantized data, so matmuls skip the dequantize → matmul → requantize round trip

3.3 Backend Abstraction (ggml-backend.h, ggml-backend-impl.h)

The backend system provides a device abstraction:

// Simplified sketch — the real definition in ggml-backend-impl.h attaches a
// struct of function pointers (ggml_backend_i) to each backend instance:
struct ggml_backend_i {
    const char * (*get_name)(ggml_backend_t backend);
    // memory management
    ggml_backend_buffer_type_t (*get_default_buffer_type)(ggml_backend_t backend);
    // tensor operations
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
    // synchronization
    void (*synchronize)(ggml_backend_t backend);
};

Supported backends:

Backend   File(s)                  Target
CPU       ggml-cpu.h + ggml.c      All CPUs (x86, ARM, RISC-V)
CUDA      ggml-cuda.h              NVIDIA GPUs
Metal     ggml-metal.h             Apple Silicon (M1/M2/M3/M4)
Vulkan    ggml-vulkan.h            Cross-vendor GPU
SYCL      ggml-sycl.h              Intel & NVIDIA GPU
HIP       ggml-cuda.h (ROCm path)  AMD GPU
MUSA      ggml-cuda.h (MUSA path)  Moore Threads GPU
CANN      ggml-cann.h              Ascend NPU
OpenVINO  ggml-openvino.h          Intel CPU/GPU/NPU
WebGPU    ggml-webgpu.h            Browser (via Emscripten)
RPC       ggml-rpc.h               Remote GPU (network)
BLAS      ggml-blas.h              BLAS libraries
OpenCL    ggml-opencl.h            Adreno GPU
ZenDNN    ggml-zendnn.h            AMD CPU
zDNN      ggml-zdnn.h              IBM Z/LinuxONE
VirtGPU   ggml-virtgpu.h           Virtual GPU API
Hexagon   ggml-hexagon.h           Snapdragon NPU (in progress)

The scheduler (ggml_backend_sched) orchestrates computation across multiple backends simultaneously. It partitions the computation graph and offloads compatible operations to GPU backends while keeping others on CPU — enabling CPU+GPU hybrid inference for models larger than VRAM.

3.4 Memory Allocator (ggml-alloc.h, ggml-alloc.c)

The allocator uses a multi-pool strategy tailored for the computation graph pattern:

  • A linear allocator (ggml_tallocr) for weights and other long-lived tensors
  • A graph allocator (ggml_gallocr) that plans and reuses buffer space for intermediate tensors across the computation graph
  • Measure-then-commit strategy: memory requirements are computed from the graph before buffers are allocated
  • Integrated with backend buffer types, so the same planning serves CPU and GPU allocations

3.5 Optimization Module (ggml-opt.h, ggml-opt.cpp)

Provides training infrastructure:

  • Adam/AdamW: Standard adaptive optimization with weight decay
  • L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno for smaller optimization tasks
  • SGD: Stochastic gradient descent with momentum
  • Datasets: ggml_opt_dataset abstraction for training data batching
  • Used by llama.cpp's training example (examples/training) for fine-tuning

4. Core llama.cpp Library (src/)

4.1 Model Architecture (src/llama-model.cpp + src/models/*.cpp)

The model system supports roughly 150 architectures. Each architecture is defined by:

Architecture enum (llama-arch.h): A single entry for each model type (e.g., LLM_ARCH_LLAMA, LLM_ARCH_QWEN2, LLM_ARCH_DEEPSEEK2, LLM_ARCH_MAMBA2)

Tensor naming (LLM_TN helper): Maps logical tensor roles (e.g., LLM_TENSOR_ATTN_Q, LLM_TENSOR_FFN_GATE) to GGUF tensor names (e.g., blk.0.attn_q.weight, blk.0.ffn_gate.weight).

Model class hierarchy (rooted in src/models/models.h):

llama_model (abstract base)
├── llama_model_llama       — LLaMA 1/2/3, Mistral, Yi, etc.
├── llama_model_qwen2       — Qwen 2
├── llama_model_deepseek2   — DeepSeek V2/V3 (with MLA)
├── llama_model_mamba       — Mamba state-space model
├── llama_model_rwkv6       — RWKV-6
├── llama_model_gemma3      — Gemma 3 (multimodal)
├── llama_model_qwen2vl     — Qwen2-VL (vision-language)
├── ... and 140+ more

Each model class implements:

  • model_load(): Load architecture-specific tensors from GGUF, set up hyperparameters
  • Graph building: Typically build_llama(), build_qwen2(), etc., which construct the computation graph using llm_graph_context primitives (attention, FFN, norms, etc.)

Key hyperparameter structures:

  • llama_hparams: Architecture-level parameters (n_layer, n_head, n_embd, n_ff, etc.)
  • llama_cparams: Context-level parameters (n_ctx, n_batch, n_ubatch, pooling type, etc.)

4.2 Graph Building (src/llama-graph.h)

The graph construction is the heart of inference. The llm_graph_context class provides:

Layer building blocks:

  • build_norm(): Layer norm or RMS norm (with learnable weights)
  • build_qkv(): Compute Q, K, V projections (supports fused and separate paths)
  • build_ffn(): Feed-forward network with various activation styles (SiLU, GELU, ReLU, SwiGLU, GeGLU, etc.)
  • build_moe_ffn(): Mixture-of-Experts FFN with routing
  • build_attn() / build_attn_mha(): Multi-head attention with KV cache support

Input abstractions (llm_graph_input_i): Each input type abstracts a portion of the ubatch data into graph tensors:

  • llm_graph_input_embd: Token embeddings or pre-computed embeddings
  • llm_graph_input_pos: Position IDs
  • llm_graph_input_attn_kv: KV cache indices and attention mask
  • llm_graph_input_rs: Recurrent state copy indices (for Mamba/RWKV)
  • llm_graph_input_attn_no_cache: Full self-attention without caching

Graph reuse optimization: The system tracks whether a new batch would produce the same graph topology as a previous one. If so, it reuses the existing graph and only updates input tensors — avoiding expensive re-graph-building.

4.3 Context Management (src/llama-context.cpp, src/llama-context.h)

The llama_context struct manages the runtime state:

Key responsibilities:

  • Batch decoding: Splits large logical batches into micro-batches (ubatch) that fit in GPU memory
  • Graph scheduling: Reserves backend scheduler, builds/reuses computation graphs
  • Memory management: KV cache allocation, state save/load, per-sequence memory buffers
  • Output buffers: Logits, embeddings, and sampling output management
  • Multi-sequence support: Up to n_seq_max concurrent sequences with independent KV cache slots
  • LoRA adapter switching: Hot-swap LoRA adapters without reloading the base model

Key methods:

  • decode(): Process a batch of tokens through the model
  • encode(): For encoder-decoder models, process the encoder
  • process_ubatch(): The inner loop that feeds one micro-batch through the graph
  • state_save_file() / state_load_file(): Full context state serialization
  • state_seq_*(): Per-sequence state save/load (for speculative decoding rollback)

4.4 Model Loading (src/llama-model-loader.cpp)

The loader handles GGUF file parsing:

  • Multi-file support: Handles sharded models (e.g., <prefix>-00001-of-00003.gguf)
  • Memory mapping: Uses mmap() where supported for zero-copy weight loading
  • Weight encoding: Supports BF16 and quantized tensor format conversion during load
  • Metadata extraction: Reads all KV pairs to configure hyperparameters, tokenizer, and chat template
  • Tensor allocation: Allocates tensor storage via the backend-specific buffer types (CPU RAM, GPU VRAM)
  • Layer offloading: Distributes layers between GPU and CPU (controlled by the GPU layer count, --n-gpu-layers)

4.5 Vocabulary & Tokenization (src/llama-vocab.cpp, src/llama-vocab.h)

Supports multiple tokenizer models:

Tokenizer                Type Enum                Models
SentencePiece (Unigram)  LLAMA_VOCAB_TYPE_SPM     LLaMA, Mistral, Gemma
BPE (byte-level)         LLAMA_VOCAB_TYPE_BPE     GPT-2, Falcon, Qwen
WordPiece                LLAMA_VOCAB_TYPE_WPM     BERT
Unigram                  LLAMA_VOCAB_TYPE_UGM     T5, Flan-T5
RWKV                     LLAMA_VOCAB_TYPE_RWKV    RWKV-6/7
PLaMo-2                  LLAMA_VOCAB_TYPE_PLAMO2  PLaMo-2

Pre-tokenizers (llama_vocab_pre_type): Handle architecture-specific text preprocessing (e.g., the LLaMA 3 BPE regex, byte fallback for DeepSeek, GPT-4o-style regex splitting). There are ~50 pre-tokenizer variants.

Key methods:

  • tokenize(): Text → token IDs (with add_special/parse_special flags)
  • detokenize(): Token IDs → text (with remove_special/unparse_special)
  • token_to_piece(): Single token → text fragment
  • byte_to_token() / token_to_byte(): For byte-level tokenizers

Chat template support: The vocab stores Hugging Face-style Jinja2 chat templates, applied via the common/chat.cpp module.
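
Applying the stored template through the C API looks roughly like this — a sketch; check llama_model_chat_template and llama_chat_apply_template against the llama.h in your checkout:

#include "llama.h"
#include <string>
#include <vector>

std::string format_chat(const llama_model * model) {
    const llama_chat_message msgs[] = {
        { "system", "You are a helpful assistant." },
        { "user",   "Explain GGUF in one sentence." },
    };
    // nullptr selects the default template stored in the model's GGUF metadata
    const char * tmpl = llama_model_chat_template(model, /*name =*/ nullptr);

    std::vector<char> buf(4096);
    const int32_t n = llama_chat_apply_template(tmpl, msgs, 2,
                                                /*add_ass =*/ true, // append assistant prefix
                                                buf.data(), (int32_t) buf.size());
    // A return value larger than the buffer means: resize and retry.
    return (n > 0 && n <= (int32_t) buf.size()) ? std::string(buf.data(), n)
                                                : std::string();
}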

4.6 Sampling (src/llama-sampler.cpp, src/llama-sampler.h)

The sampling system is a chain-of-responsibility:

Input: logits (raw model outputs, shape [n_vocab])
  ↓
Sampler Chain (each step transforms the candidate distribution):
  ├── TemperatureSampler    — Divide logits by temperature
  ├── TopKSampler           — Keep only top-k logits
  ├── TopPSampler (nucleus) — Keep cumulative probability p
  ├── MinPSampler           — Keep tokens with probability ≥ min_p × top probability
  ├── TypicalSampler        — Keep tokens whose surprisal is near the expected entropy
  ├── XTCSampler            — Exclude Top Choices: probabilistically drop top tokens
  ├── DRYSampler            — Context-aware repetition penalty (n-gram blocking)
  ├── RepetitionPenalty     — Penalize recently generated tokens
  ├── FrequencyPenalty      — Penalize high-frequency tokens
  ├── PresencePenalty       — Penalize tokens that have appeared
  ├── MirostatSampler       — Adaptive entropy-based sampling
  ├── GrammarSampler        — Constrain output to a GBNF grammar
  ├── LevenshteinSampler    — Constrain via Levenshtein automata
  ├── PenalizeNgram         — N-gram repetition blocking
  ├── TailFreeSampler       — Remove tail probability mass
  ├── LocallyTypical        — Locally typical sampling
  └── GreedySampler         — Always pick most likely token (no randomness)
  ↓
Output: single token ID + optional probabilities

Samplers can be CPU-side (standard C++ implementations) or backend-side (executed as part of the GPU computation graph for zero-copy). Backend sampling is more efficient but requires the sampler to be representable as ggml operations.
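
Constructing such a chain through the public llama.h API mirrors the diagram above (a sketch; verify parameter lists against your checkout):

#include "llama.h"

llama_sampler * make_chain() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // Each stage transforms the candidate distribution, in order:
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, /*min_keep =*/ 1));
    llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, /*min_keep =*/ 1));
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    // The final stage actually draws a token from what remains:
    llama_sampler_chain_add(chain, llama_sampler_init_dist(/*seed =*/ 1234));
    return chain;
}

// Usage during decoding: llama_token id = llama_sampler_sample(chain, ctx, -1);
// Free with llama_sampler_free(chain) when done.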

4.7 KV Cache (src/llama-kv-cache.cpp, src/llama-kv-cache.h)

The KV cache stores computed Key and Value tensors from previous tokens, avoiding recomputation:

  • Standard KV cache: Linear storage of K/V for each layer, each sequence
  • iSWA KV cache (llama-kv-cache-iswa.*): Interleaved sliding-window attention cache — stores both full and windowed K/V for models with sliding window attention
  • Memory cost: Typically the largest memory consumer (~2 × n_layers × n_embd_kv × n_ctx × dtype_size)
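
Plugging in concrete numbers: a LLaMA-2-7B-style model (n_layers = 32, n_embd_kv = 4096, no GQA) at n_ctx = 4096 in F16 (2 bytes/element) needs 2 × 32 × 4096 × 4096 × 2 bytes = 2 GiB for the cache alone; grouped-query attention shrinks n_embd_kv and reduces this substantially.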

Key operations:

  • set() / get(): Basic K/V storage and retrieval
  • cpy_k() / cpy_v(): Efficient copy for sequence management
  • head() / size(): Cache capacity tracking

4.8 Recurrent Memory (src/llama-memory-recurrent.cpp)

For state-space models like Mamba and RWKV, there's no KV cache. Instead, a recurrent state is maintained:

  • llama_memory_recurrent_context: Manages per-sequence hidden states (shape depends on architecture)
  • State copy/update operations during batch processing
  • Hybrid models (like Jamba) use both KV cache and recurrent memory

4.9 Grammar Engine (src/llama-grammar.cpp, src/llama-grammar.h)

Implements GBNF (GGML BNF) grammar parsing and guided generation:

  • Parses .gbnf grammar files (BNF-like notation with character sets, repetitions, alternatives)
  • Incrementally advances a set of parse stacks (effectively a pushdown automaton — GBNF rules may be recursive, so a plain DFA does not suffice)
  • During sampling, computes a grammar mask — a boolean vector over the vocabulary indicating which tokens are valid next
  • Enables structured output: JSON, code, mathematical expressions, etc.
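
For instance, a small grammar in the style of the repository's grammars/*.gbnf files, constraining output to the JSON object {"answer": "yes"} or {"answer": "no"}:

root   ::= "{" ws "\"answer\"" ws ":" ws answer ws "}"
answer ::= "\"yes\"" | "\"no\""
ws     ::= [ \t\n]*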

5. Common Utilities (common/)

The common/ directory provides shared functionality used by all tools and examples:

5.1 Chat System (common/chat.cpp, common/chat.h)

  • Chat templates: Applies Jinja2-like templates (LLaMA 3, ChatML, etc.) to format conversations
  • PEG parser (common/peg-parser.*): A parsing expression grammar engine for output structuring
  • Auto parser (common/chat-auto-parser.*): Automatically detects model capabilities and applies appropriate parsing
  • Chat diff analyzer (common/chat-diff-analyzer.cpp): Tracks conversation state changes

5.2 Sampling Extensions (common/sampling.cpp, common/sampling.h)

  • Convenience wrappers for sampler chain construction
  • DRY repetition penalty (n-gram based, llama_sampler_init_dry_testing)
  • Frequent use of llama_sampler_chain with pre-tuned defaults

5.3 Speculative Decoding (common/speculative.cpp, common/speculative.h)

  • Draft-verification: Uses a smaller "draft" model (or same model with fewer layers) to generate candidate tokens, then verifies them with the target model
  • Batch acceptance: Process draft tokens in parallel for significant speedup
  • Target model and draft model can run on different devices
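
In outline, the loop looks like this — pseudocode with hypothetical helpers (draft_sample, target_eval, target_sample_at) standing in for the real common/speculative.* machinery:

#include <cstdint>
#include <vector>

using token = std::int32_t;

// hypothetical helpers — stand-ins for the actual draft/target model calls
std::vector<token> draft_sample(const std::vector<token> & ctx, int k);
void target_eval(const std::vector<token> & ctx, const std::vector<token> & draft);
token target_sample_at(int i);

std::vector<token> speculative_step(const std::vector<token> & ctx, int k) {
    // 1. The cheap draft model proposes k candidate tokens autoregressively
    std::vector<token> draft = draft_sample(ctx, k);

    // 2. The target model scores all k candidates in ONE batched forward pass
    target_eval(ctx, draft);

    // 3. Accept the longest prefix on which the target agrees with the draft
    std::vector<token> accepted;
    for (int i = 0; i < k; i++) {
        token t = target_sample_at(i); // sample from target logits at position i
        accepted.push_back(t);         // the target's token is always valid output
        if (t != draft[i]) break;      // first disagreement: discard remaining drafts
    }
    return accepted; // ≥ 1 token per target pass, often several when the draft agrees
}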

5.4 Grammar & JSON Tools

  • json-schema-to-grammar.*: Converts JSON Schema to GBNF grammar for constrained generation
  • json-partial.*: Handles partial/incremental JSON parsing for streaming
  • regex-partial.*: Partial regex matching for streaming validation
  • PEG parser (common/peg-parser.*): Full PEG parsing engine for complex output constraints

5.5 N-gram Cache (common/ngram-cache.*, common/ngram-map.*, common/ngram-mod.*)

  • Prefix tree-based n-gram tracking
  • Used by repetition penalties and speculative decoding

5.6 Other Utilities

  • common.cpp: arg parsing, model path resolution, GPU info
  • console.cpp: Terminal I/O with color and raw mode
  • log.cpp: Logging infrastructure
  • hf-cache.cpp: Hugging Face Hub integration (model downloads, caching)
  • dl.cpp / download.cpp: HTTP download with resume
  • fit.cpp: Parameter fitting (e.g., optimal alpha value for models)
  • preset.cpp: Configuration presets
  • llguidance.cpp: Integration with Microsoft's llguidance library
  • reasoning-budget.cpp: Token budget management for chain-of-thought
  • unicode.cpp: Unicode utilities

6. Tools (tools/)

6.1 llama-cli (tools/cli/)

The primary command-line interface for interactive use:

  • Conversation mode with chat templates
  • Grammar-constrained generation
  • Multi-modal input (images for vision models)
  • Completion-style interface (for code FIM)
  • Presets for quick configuration
  • Streaming output

6.2 llama-server (tools/server/)

An OpenAI-compatible HTTP server:

  • POST /v1/completions — Text completion
  • POST /v1/chat/completions — Chat completion
  • POST /v1/embeddings — Embedding extraction
  • POST /v1/rerank — Reranking
  • GET /v1/models — Model listing
  • Multi-user support with queuing (server-queue.*)
  • Concurrent decoding with configurable slot count
  • Tool/function calling support (server-tools.*)
  • CORS proxy
  • Web UI at http://localhost:8080

Server components:

  • server-http.*: HTTP parsing and routing (based on cpp-httplib)
  • server-context.*: Per-request context management
  • server-chat.*: Chat history and formatting
  • server-models.*: Multi-model management
  • server-queue.*: Request queuing and scheduling
  • server-task.*: Task lifecycle management
  • server-tools.*: Tool/function calling implementation

6.3 llama-perplexity (tools/perplexity/)

Evaluates model quality:

  • Perplexity measurement over text corpora
  • KL divergence computation
  • Cross-entropy scoring

6.4 llama-bench (tools/llama-bench/)

Performance benchmarking:

  • Tests prompt processing (pp) and text generation (tg) throughput
  • Multi-configuration comparison
  • JSON output for automation

6.5 llama-quantize (tools/quantize/)

Model quantization tool:

  • Converts F16/BF16 → any quantized format
  • Importance matrices for IQ quantization
  • Mixed quantization (different types for different layers)
  • Output quality estimation

6.6 Other Tools

  • gguf-split/: Split/merge GGUF files
  • imatrix/: Generate importance matrices for quantization
  • export-lora/: Export LoRA adapters from training
  • tokenize/: Tokenization debugging tool
  • results/: Analyze benchmark results
  • rpc/: RPC server for remote GPU compute
  • mtmd/: Multi-modal processing (CLIP-based vision encoder)
  • parser/: Template/debug parser analysis
  • completion/: Bash completion script generation
  • batched-bench/: Batched inference benchmarking
  • fit-params/: Parameter fitting utilities
  • cvector-generator/: Control vector generation (activation steering)
  • tts/: Text-to-speech integration

7. Examples (examples/)

The examples directory provides reference implementations:

Example Description
simple/ Minimal inference: load model, tokenize, decode, sample — the simplest possible usage
simple-chat/ Minimal chat loop with template handling
batched/ Batched inference with multiple sequences
parallel/ Parallel decoding with independent sequences
speculative/ Speculative decoding implementation
speculative-simple/ Minimal speculative decoding
server/ Production server (moved to tools/server/)
save-load-state/ Context state serialization
embedding/ Embedding extraction
eval-callback/ Custom evaluation callbacks
training/ Fine-tuning example
gguf/ GGUF file reading/writing
diffusion/ Diffusion model inference
retrieval/ RAG-style retrieval augmented generation
lookahead/ Lookahead decoding
lookup/ Lookup-based decoding
passkey/ Passkey retrieval task
idle/ Idle/keepalive example
llama.android/ Android app using JNI
llama.swiftui/ SwiftUI iOS/macOS app
batched.swift/ Swift batched inference
debug/ Debug utilities
model-conversion/ Model conversion scripts
convert-llama2c-to-ggml/ Karpathy's llama2.c format converter
sycl/ SYCL backend example
simple-cmake-pkg/ CMake package integration
gen-docs/ Documentation generation

8. Model Implementations (src/models/)

Each .cpp file in src/models/ implements the graph building logic for a specific architecture. Here are the major families:

8.1 Decoder-Only Transformer (Most common)

File Model(s)
llama.cpp LLaMA 1/2/3, Mistral, Yi, Hermes, etc.
llama4.cpp LLaMA 4 (Meta's latest)
qwen.cpp Qwen 1
qwen2.cpp Qwen 2
qwen3.cpp Qwen 3
qwen35.cpp Qwen 3.5
qwen2vl.cpp Qwen2-VL (vision-language)
qwen3vl.cpp Qwen3-VL (vision-language)
qwen3vlmoe.cpp Qwen3-VL MoE
deepseek.cpp DeepSeek V1
deepseek2.cpp DeepSeek V2/V3/R1 (with Multi-head Latent Attention)
deepseek2ocr.cpp DeepSeek V2 + OCR
phi2.cpp Phi-2
phi3.cpp Phi-3
gemma.cpp Gemma 1
gemma2.cpp Gemma 2
gemma3.cpp Gemma 3 (multimodal)
gemma3n.cpp Gemma 3N
gemma4.cpp Gemma 4
mistral3.cpp Mistral 3
mistral4.cpp Mistral 4

8.2 Mixture-of-Experts (MoE)

File Model(s)
(no mixtral.cpp — handled by llama.cpp) Mixtral 8x7B
qwen2moe.cpp Qwen2 MoE
qwen35moe.cpp Qwen 3.5 MoE
qwen3moe.cpp Qwen 3 MoE
phimoe.cpp PhiMoE
dbrx.cpp DBRX
deepseek2.cpp DeepSeek V3 (MoE)
olmoe.cpp OLMoE
openai-moe.cpp OpenAI MoE
arctic.cpp Snowflake Arctic
jamba.cpp Jamba (Hybrid Attention + Mamba MoE)
exaone-moe.cpp EXAONE MoE
hunyuan-moe.cpp Hunyuan MoE
granite-moe.cpp Granite MoE
bailingmoe.cpp Bailing MoE
bailingmoe2.cpp Bailing MoE 2
grokmoe.cpp Grok MoE
lfm2moe.cpp LFM 2 MoE
glm4-moe.cpp GLM4 MoE
llada-moe.cpp LLADA MoE
ernie4-5-moe.cpp ERNIE 4.5 MoE
nemotron-h-moe.cpp Nemotron H MoE
llama4.cpp LLaMA 4 MoE variants
afmoe.cpp AF MoE

8.3 State-Space / Recurrent

File Model(s)
mamba.cpp Mamba
mamba2.cpp Mamba 2
mamba-base.cpp Base Mamba operations
rwkv6.cpp RWKV-6
rwkv7.cpp RWKV-7
rwkv6-base.cpp RWKV-6 base ops
rwkv6qwen2.cpp RWKV-6 + Qwen2 hybrid
arwkv7.cpp ARWKV-7 (attention + RWKV-7 hybrid)

8.4 Encoder Models (BERT, Embeddings)

File Model(s)
bert.cpp BERT
modern-bert.cpp ModernBERT
nomic-bert.cpp Nomic BERT
nomic-bert-moe.cpp Nomic BERT MoE
jina-bert-v2.cpp Jina BERT v2
jina-bert-v3.cpp Jina BERT v3
eurobert.cpp EuroBERT
neo-bert.cpp NeoBERT
llama-embed.cpp LLaMA Embedding
pangu-embed.cpp Pangu Embedding
t5encoder.cpp T5 Encoder
gpt2.cpp GPT-2
gptneox.cpp GPT-NeoX, Pythia

8.5 Encoder-Decoder

File Model(s)
t5.cpp T5, Flan-T5
chameleon.cpp Chameleon (Meta)
minicpm.cpp MiniCPM
minicpm3.cpp MiniCPM 3

8.6 Quantized Base Models

File Model(s)
bitnet.cpp BitNet b1.58 (1-bit)
plamo.cpp PLaMo
plamo2.cpp PLaMo 2
plamo3.cpp PLaMo 3

8.7 Other Notable Architectures

File Model(s)
chatglm.cpp ChatGLM 3
glm4.cpp GLM-4
glm-dsa.cpp GLM DSA
bloom.cpp BLOOM
falcon.cpp Falcon
falcon-h1.cpp Falcon H1
command-r.cpp Cohere Command-R
cohere2.cpp Cohere 2
grok.cpp Grok-1
stablelm.cpp StableLM
starcoder.cpp StarCoder
starcoder2.cpp StarCoder 2
olmo.cpp OLMo
olmo2.cpp OLMo 2
openelm.cpp Apple OpenELM
internlm2.cpp InternLM 2
xverse.cpp Xverse
jais.cpp Jais
jais2.cpp Jais 2
exaone.cpp EXAONE
exaone4.cpp EXAONE 4
granite.cpp IBM Granite
granite-hybrid.cpp IBM Granite Hybrid
deci.cpp Deci
codeshell.cpp CodeShell
orion.cpp Orion
nemotron.cpp NVIDIA Nemotron
nemotron-h.cpp NVIDIA Nemotron H
hunyuan-dense.cpp Tencent Hunyuan Dense
hunyuan-vl.cpp Hunyuan Vision-Language
smollm3.cpp SmolLM 3
lfm2.cpp Liquid LFM 2
dream.cpp Dream
smallthinker.cpp SmallThinker
llada.cpp LLADA
seed-oss.cpp Seed OSS
dots1.cpp Dots 1
apertus.cpp Apertus
minimax-m2.cpp MiniMax M2
cogvlm.cpp CogVLM
rnd1.cpp RND-1
ernie4-5.cpp ERNIE 4.5
step35.cpp Step 3.5
maincoder.cpp MainCoder
kimi-linear.cpp Kimi K2 (Linear Attention)
mimo2.cpp MIMO 2
paddleocr.cpp PaddleOCR

9. Python Infrastructure

9.1 Model Conversion (convert_hf_to_gguf.py)

The massive (668KB) conversion script handles:

  • Loading Hugging Face models (PyTorch, SafeTensors)
  • Architecture-specific tensor mapping (HF names → GGUF names)
  • Vocabulary extraction (tokenizer.model, tokenizer.json, tokenizer_config.json)
  • Chat template preservation
  • Metadata embedding (model name, description, license, etc.)
  • Output to GGUF format

9.2 GGUF Python Library (gguf-py/)

Official Python library for GGUF manipulation:

  • gguf/ — Core GGUF reading/writing
  • examples/ — Usage examples
  • tests/ — Test suite
  • Used by the Hugging Face Spaces ecosystem (GGUF-my-repo, GGUF-editor)

9.3 Other Python Scripts

  • convert_llama_ggml_to_gguf.py: Legacy GGML to GGUF migration
  • convert_lora_to_gguf.py: LoRA adapter conversion
  • convert_hf_to_gguf_update.py: Auto-update script for new model support

10. Build System & CI

10.1 CMake (CMakeLists.txt, CMakePresets.json)

The primary build system supporting:

  • GPU backend selection (-DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON, etc.)
  • Quantization kernel selection
  • Build type (Release/Debug/RelWithDebInfo)
  • Sanitizers (ASan, UBSan, TSAN)
  • Static/shared library builds
  • macOS framework generation (XCFramework for iOS/tvOS/visionOS)

10.2 GitHub Actions Workflows (.github/workflows/)

Comprehensive CI covering:

  • server.yml: Server build and test across platforms
  • build.yml: Main build matrix (Linux, macOS, Windows, various backends)
  • models.yml: Model conversion and validation
  • codeql.yml: Security scanning
  • stale.yml: Issue management
  • clang-tidy.yml: Static analysis
  • Containerized Snapdragon and CUDA builds

10.3 Devops (devops/)

  • Nix flake (flake.nix): Reproducible builds
  • Docker: Multi-platform container builds
  • NixOS module: Systemd service configuration

11. Data Flow: From User Input to Generated Token

A typical inference call flow through llama-simple:

1. llama_model_load_from_file()
   └─ llama_model_loader: parse GGUF, mmap weights, allocate tensors
      └─ model-specific loader (e.g., llama_model_llama::model_load())

2. llama_init_from_model()
   └─ Create llama_context with KV cache, graph scheduler, output buffers

3. llama_tokenize()
   └─ Text → [token IDs] using the vocab's tokenizer

4. llama_decode(ctx, batch)
   └─ llama_context::decode()
      └─ Split batch into ubatches
         └─ llama_context::process_ubatch()
            ├─ Build/reuse computation graph via llm_graph_context
            │  ├─ build_inp_embd()    — Token embeddings lookup
            │  ├─ build_inp_pos()     — Position encoding
            │  └─ For each layer:
            │     ├─ build_norm()     — Layer/RMS norm
            │     ├─ build_qkv()      — QKV projections
            │     ├─ build_attn()     — Self-attention + KV cache
            │     ├─ build_ffn()      — Feed-forward network
            │     └─ Residual connections
            ├─ graph_compute()        — Execute via ggml_backend_sched
            │  ├─ Dispatch ops to GPU backends (CUDA/Metal/Vulkan)
            │  └─ Fallback to CPU for unsupported ops
            └─ Extract logits from output buffer

5. llama_sampler_sample(smpl, ctx, -1)
   └─ Apply sampler chain to logits
      └─ Grammar mask, top-k, top-p, temperature, repetition penalty...
         └─ Return next token ID

6. llama_token_to_piece()
   └─ Token ID → text character(s)

7. Append token to batch, goto 4 (autoregressive loop)
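
Condensed into code, the same flow looks like this, in the spirit of examples/simple — a sketch; API names have shifted across releases, so treat llama.h as authoritative:

#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s model.gguf prompt\n", argv[0]); return 1; }
    llama_backend_init();

    // 1. Load model (GGUF parse + mmap + tensor allocation)
    llama_model * model = llama_model_load_from_file(argv[1], llama_model_default_params());
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // 2. Create context (KV cache, graph scheduler, output buffers)
    llama_context * ctx = llama_init_from_model(model, llama_context_default_params());

    // 3. Tokenize the prompt (negative return would mean the buffer is too small)
    const std::string prompt = argv[2];
    std::vector<llama_token> toks(prompt.size() + 8);
    const int n = llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                                 toks.data(), (int32_t) toks.size(),
                                 /*add_special =*/ true, /*parse_special =*/ false);
    toks.resize(n);

    // 5. Sampler chain (greedy here, for brevity)
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // 4-7. Autoregressive loop: decode, sample, print, feed back
    llama_batch batch = llama_batch_get_one(toks.data(), (int32_t) toks.size());
    for (int i = 0; i < 64; i++) {
        if (llama_decode(ctx, batch)) break;                    // 4. forward pass
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);  // 5. next token
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];                                        // 6. token -> text
        const int np = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        if (np > 0) fwrite(piece, 1, np, stdout);

        batch = llama_batch_get_one(&tok, 1);                   // 7. goto 4
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}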

12. Key Design Decisions & Trade-offs

12.1 Why C/C++ and Not Python?

  • Zero dependencies: No Python runtime, no PyTorch, no CUDA toolkit required at deployment
  • Portability: Runs on bare-metal, embedded systems, iOS/Android apps
  • Performance: Complete control over memory layout, SIMD intrinsics, and threading
  • Small binary footprint: A static build is ~3-10 MB

12.2 Why GGUF and Not Hugging Face/SafeTensors?

  • Trivially parseable: Binary format with no Python dependency for reading
  • MMAP-friendly: Tensor data is a single contiguous block suitable for memory-mapped I/O
  • Self-contained: Stores tokenizer, chat template, and all metadata in the same file
  • Quantized storage: Natively stores quantized types without conversion on load

12.3 Computation Graph vs. Eager Execution

  • Graph approach: Build a DAG of operations, then execute — enables graph optimization, backend scheduling, and memory reuse
  • Trade-off: Higher initial latency to build the graph, but subsequent executions with the same topology are fast
  • Reuse optimization: The llm_graph_result system avoids rebuilding graphs when the topology hasn't changed

12.4 Quantization Strategy

  • K-quants: Block-based quantization (Q2_K through Q6_K) with hierarchical scaling provides excellent accuracy/size trade-offs
  • IQ: Importance-aware quantization pushes quality boundaries at very low bitrates (1.5-2.5 bpw)
  • No re-quantization at runtime: Weights stay quantized; matmul kernels operate directly on quantized data

13. Performance Characteristics

Operation                    Bottleneck               Optimization
Prompt processing (prefill)  Matrix multiply (GEMM)   GPU tensor cores, quantized matmul, batch parallelism
Token generation (decoding)  Memory bandwidth         KV cache reuse, quantized weights, memory-bound matmul
Large context (>32K)         Attention computation    Flash attention, paged attention, sliding window
Batch inference              Throughput vs. latency   Dynamic batching, concurrent slots, KV cache sharing
Multi-modal                  Encoder compute          Separable encoder/decoder graphs, cross-attention

Indicative throughput for a 7B-class model at Q4_K_M on an M2 Ultra (order-of-magnitude figures — actual numbers depend on build flags, context size, and batch size):

  • Prompt processing: thousands of tokens/second
  • Text generation: tens to low hundreds of tokens/second

14. Bindings & Ecosystem

The project has extensive bindings enabling use from virtually any language:

Language Package Maintainer
Python llama-cpp-python abetlen
Python easy-llama ddh0
Go go-llama.cpp go-skynet
Go yzma (no CGo) hybridgroup
Node.js node-llama-cpp withcatai
Rust llama_cpp-rs edgenai/utilityai
C# LLamaSharp SciSharp
Java java-llama.cpp kherud
Swift llama-cpp-swift srgtuszy
Ruby llama_cpp.rb yoshoku
Zig llama.cpp.zig deins
Flutter/Dart llama_cpp_dart netdur
Android llama.android in-repo

Popular downstream projects:

  • Ollama: The most popular user-facing LLM runner, built on llama.cpp
  • LM Studio: GUI application for model exploration
  • GPT4All: Desktop LLM client by Nomic AI
  • llamafile: Mozilla's single-file executable LLM (combines llama.cpp + Cosmopolitan libc)
  • text-generation-webui: oobabooga's powerful web interface
  • koboldcpp: Storytelling-focused fork
  • LocalAI: Kubernetes-native LLM serving
  • Jan: Open-source ChatGPT alternative

15. Development & Contribution

15.1 Code Organization Philosophy

  • src/: Core library (libllama) — only depends on ggml
  • ggml/: Tensor library (standalone, used by multiple projects)
  • common/: Shared utilities for tools — depends on src/
  • tools/: Standalone executables — depends on common/
  • examples/: Minimal reference implementations

15.2 Adding a New Model

The process (detailed in docs/development/HOWTO-add-model.md):

  1. Add architecture enum to llama-arch.h
  2. Add tensor name mappings in llama-arch.cpp
  3. Write model loader and graph builder in src/models/<name>.cpp
  4. Register the architecture in the model mapping function in llama-model.cpp
  5. Add pre-tokenizer type if needed in llama-vocab.cpp
  6. Add GGUF KV key definitions
  7. Update Python converter (convert_hf_to_gguf.py) for HF model mapping
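
In code terms, the first steps take roughly this shape — illustrative only: "mymodel" is a made-up architecture, and the real tables in llama-arch.h/.cpp use the project's own enum and table conventions:

#include <map>
#include <string>

// Step 1 (llama-arch.h): a new entry in the architecture enum
enum llm_arch   { LLM_ARCH_LLAMA, LLM_ARCH_MYMODEL };
enum llm_tensor { LLM_TENSOR_ATTN_Q, LLM_TENSOR_FFN_GATE };

// Step 2 (llama-arch.cpp): map logical tensor roles to GGUF name patterns
static const std::map<llm_tensor, std::string> MYMODEL_TENSORS = {
    { LLM_TENSOR_ATTN_Q,   "blk.%d.attn_q"   },
    { LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
};

// Step 3 lives in src/models/mymodel.cpp: a graph builder composed from
// llm_graph_context primitives (build_norm, build_attn, build_ffn, ...),
// usually copied from the closest existing architecture and adjusted.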

16. Project Stats (Approximate)

Metric Value
Total source files (C/C++) ~250+
Model architectures 150+
GPU backends 15+
Quantization formats 30+
Sampler types 20+
Lines of C/C++ code ~80,000+ (src + ggml)
Python conversion scripts 4 (major)
Bindings (languages) 15+
Downstream projects 100+
Commits 10,000+
Contributors 1,000+
GitHub stars 75,000+

This document was generated by exploring the repository structure, source code, and documentation of llama.cpp (commit as of May 2026). It represents a snapshot of a rapidly evolving project. For the most current information, refer to the repository and its official documentation.