1. About llama.cpp
llama.cpp is a high-performance, cross-platform inference engine for Large Language Models (LLMs) written entirely in C and C++. Its primary goal is to enable LLM inference with minimal setup, zero external dependencies, and state-of-the-art performance on a vast range of hardware — from consumer laptops and Raspberry Pis to multi-GPU servers.
Rather than wrapping Python ML frameworks (PyTorch, TensorFlow), llama.cpp is a from-scratch implementation built atop the ggml tensor library. This approach yields:
- Zero external runtime dependencies — no Python, no PyTorch, no CUDA runtime required
- First-class Apple Silicon support — ARM NEON, Accelerate, and Metal frameworks
- Extreme quantization — supports 1.5-bit through 8-bit integer quantization (legacy Q4_0/Q5_0/Q8_0 block formats plus K-quant and IQ variants)
- Hardware diversity — CPU (x86 with AVX/AVX2/AVX512/AMX, ARM NEON, RISC-V with RVV), GPU (CUDA, Metal, Vulkan, SYCL, HIP, MUSA, WebGPU), and additional accelerator/library backends (CANN for Ascend NPUs, OpenVINO, ZenDNN)
- Server-mode — OpenAI API-compatible HTTP server for production deployments
- Multimodal support — vision (LLaVA, Qwen2-VL), audio (Whisper-style), and cross-attention encoder-decoder models
The project serves as the primary development playground for the ggml tensor library, driving innovations in quantized matrix multiplication, SIMD kernels, and GPU compute.
- Repository: github.com/ggml-org/llama.cpp
- License: MIT (Core C/C++ library under MIT; bindings and examples may vary)
- Core Language: C/C++ with Python scripts for model conversion
2. Architecture Overview
The system is organized into a clean layering:
┌─────────────────────────────────────────────────────────────┐
│ Tools & Applications │
│ llama-cli │ llama-server │ llama-perplexity │ llama-bench │
│ llama-quantize │ llama-tokenize │ llama-gguf-split │ ... │
├─────────────────────────────────────────────────────────────┤
│ Common Library (common/) │
│ Sampling │ N-gram cache │ PEG Parser │ JSON Schema → GBNF │
│ Chat templates │ HTTP client │ Presets │ Speculative │
├─────────────────────────────────────────────────────────────┤
│ libllama Core (src/) │
│ Model Loading │ Context Management │ Graph Building │
│ KV Cache │ Vocabulary │ Grammar Engine │ LoRA Adapters │
│ Quantization │ Samplers │ Chat Formats │ Session State │
├─────────────────────────────────────────────────────────────┤
│ Model Implementations (src/models/) │
│ 150+ architectures: LLaMA, Mistral, Qwen, DeepSeek, │
│ Gemma, Phi, Mamba, RWKV, BERT, Falcon, Grok, etc. │
├─────────────────────────────────────────────────────────────┤
│ ggml Tensor Library (ggml/) │
│ Tensor Ops │ Automatic Differentiation │ Optimizers │
│ Backend Abstraction │ Memory Allocator │ Threading │
├─────────────────────────────────────────────────────────────┤
│ GPU Backend Implementations (ggml/src/) │
│ CUDA │ Metal │ Vulkan │ SYCL │ HIP │ CANN │ WebGPU │ RPC │
│ OpenCL │ OpenVINO │ ZenDNN │ zDNN │ VirtGPU │ Hexagon │
└─────────────────────────────────────────────────────────────┘
2.1 File Format: GGUF
All models are stored in the GGUF (GGML Universal Format) binary format:
[Magic "GGUF"] [Version] [Tensor Count] [KV Metadata Count]
→ KV Pairs: (key-string, type, value) — architecture, hyperparameters, tokenizer config, etc.
→ Tensor Info: (name, dimensions, dtype, offset in blob)
→ Tensor Data Blob: contiguous binary weight data
GGUF is a self-describing format that evolved from earlier GGML/GGJT formats. Key features:
- Key-Value metadata: Stores architecture name, model hyperparameters (n_layer, n_head, n_embd, etc.), RoPE config, tokenizer data (vocab, merges, scores, chat template), and arbitrary metadata
- Alignable data section: Supports configurable alignment (default 32 bytes) for direct memory-mapped access
- Multi-file splits: Large models can be split across multiple .gguf files (e.g., model-00001-of-00003.gguf)
- Extensible: New architectures add new KV keys and tensor names without breaking backward compatibility
The Python script convert_hf_to_gguf.py (668KB) handles conversion from Hugging Face PyTorch/SafeTensors checkpoints to GGUF. There are also specialized converters for LoRA adapters, llama2.c models, and GGML-to-GGUF migration.
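GGUF files can also be read programmatically through ggml's C API. The sketch below is illustrative only — the header location and exact signatures track the current ggml tree — and opens a file without loading tensor data, then prints its metadata counts and architecture key:

```c
#include "gguf.h"   // shipped with ggml (declared in ggml.h in older trees)
#include <stdio.h>

int main(int argc, char ** argv) {
    if (argc < 2) return 1;

    // no_alloc = true: parse headers and metadata only, do not allocate tensor data
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * gctx = gguf_init_from_file(argv[1], params);
    if (!gctx) return 1;

    printf("kv pairs: %lld, tensors: %lld\n",
           (long long) gguf_get_n_kv(gctx), (long long) gguf_get_n_tensors(gctx));

    const int64_t kid = gguf_find_key(gctx, "general.architecture");
    if (kid >= 0) {
        printf("architecture: %s\n", gguf_get_val_str(gctx, kid));
    }

    gguf_free(gctx);
    return 0;
}
```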
3. The ggml Tensor Library (Core Foundation)
The ggml/ directory contains the foundational tensor computation library. It is a standalone project (also used by whisper.cpp and stable-diffusion.cpp).
3.1 Core Tensor System (ggml.h, ggml.c)
Key features:
- Multi-dimensional tensors: Up to 4 dimensions, with support for view/reshape/permute
- Data types: F32, F16, BF16, and ~30 quantized types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K through Q6_K, IQ1_S through IQ4_XS, TQ1_0/TQ2_0 ternary, MXFP4, NVFP4, and more)
- Computation graph: Build a graph of tensor operations (ggml_cgraph), then compute forward (inference) or backward (gradients)
- Automatic differentiation: Mark tensors as parameters with ggml_set_param(), then compute gradients via the backward pass
- Optimizers: Built-in Adam, L-BFGS, and SGD for training/fine-tuning
- SIMD-optimized kernels: Hand-tuned assembly/C for AVX2, AVX512, ARM NEON, RISC-V RVV, WASM SIMD, etc.
- Thread pool: Parallel computation across multiple CPU threads
Key operations (ggml_ops): MatMul (regular and quantized), convolution, attention (scaled dot-product), RoPE, RMS norm, layer norm, softmax, SiLU/GELU/ReLU activations, ArgMax, cross-entropy loss, and more. Full operation list in docs/ops.md.
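As a concrete illustration of the build-then-compute model, here is a minimal CPU-only sketch. Function names follow the current ggml headers; ggml_graph_compute_with_ctx has moved between ggml.h and ggml-cpu.h across versions, so treat the includes as an assumption:

```c
#include "ggml.h"
#include "ggml-cpu.h"   // ggml_graph_compute_with_ctx lives here in recent trees

int main(void) {
    // small scratch arena; real code sizes this from the graph it plans to build
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // a: 4 rows of 2, b: 3 rows of 2; ggml_mul_mat contracts over the shared
    // inner dimension (2), producing a 4x3 result
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);

    // record the operations needed to produce c
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // ... fill a and b with data, then execute the graph on 4 CPU threads
    ggml_graph_compute_with_ctx(ctx, gf, 4);

    ggml_free(ctx);
    return 0;
}
```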
3.2 Quantization Engine (ggml-quants.c, ggml-quants.h)
The quantization subsystem is arguably the most performance-critical component. It implements:
Block quantization scheme: Weights are divided into small blocks (typically 32 elements). Each block stores:
- A shared scale (and optionally min value)
- Each weight is quantized to low-bit representation relative to the block scale
Quantized types (from lowest to highest precision):
| Type | Bits/Weight | Description |
|---|---|---|
| IQ1_S | 1.56 | 1-bit importance-weighted |
| IQ1_M | 1.75 | 1-bit importance-weighted medium |
| TQ1_0 | 1.69 | Ternary quantization |
| TQ2_0 | 2.06 | Ternary quantization |
| IQ2_XXS | 2.06 | 2-bit extremely extra-small |
| IQ2_XS | 2.31 | 2-bit extra-small |
| Q2_K | 2.56-3.06 | 2-bit K-quant (small/medium variants) |
| IQ3_XXS | 3.06 | 3-bit extremely extra-small |
| Q3_K | 3.50-4.25 | 3-bit K-quant (small/medium/large) |
| IQ4_NL | 4.50 | 4-bit non-linear quantization |
| Q4_K | 4.50-5.50 | 4-bit K-quant (small/medium) |
| Q5_K | 5.50-6.20 | 5-bit K-quant (small/medium) |
| Q6_K | 6.56 | 6-bit K-quant |
| Q8_0 | 8.00 | 8-bit block |
| F16 | 16.00 | Half-precision float |
| BF16 | 16.00 | Bfloat16 |
| F32 | 32.00 | Full-precision float |
"K-quant" (Q2_K through Q6_K) uses a clever super-block scheme: larger blocks (~256 elements) contain multiple sub-blocks with their own scales, plus a super-block scale. This provides superior accuracy vs. size compared to simple per-block quantization.
"IQ" (Importance-aware Quantization) applies non-uniform quantization that assigns more bits to important weights and fewer to unimportant ones, improving perplexity at very low bitrates.
Each quantized type has:
- quantize_row_*: Convert a float row → quantized row (a minimal sketch follows below)
- dequantize_row_*: Convert a quantized row → float row
- Matmul kernels (dispatched from ggml_compute_forward_mul_mat) that operate directly on quantized data — no dequantize → matmul → requantize pipeline
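To make the block layout concrete, here is an illustrative sketch of a Q8_0-style block and its reference quantizer. The real definitions live in ggml's sources and store the scale as an fp16 ggml_half; the names here are simplified:

```c
#include <math.h>
#include <stdint.h>

#define QK8_0 32   // elements per block

// one block: a shared scale plus 32 signed 8-bit codes
typedef struct {
    float  d;            // block scale (fp16 in the real layout)
    int8_t qs[QK8_0];    // quantized weights
} block_q8_0_sketch;

static void quantize_block_q8_0(const float * x, block_q8_0_sketch * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; i++) {
        amax = fmaxf(amax, fabsf(x[i]));
    }
    const float d  = amax / 127.0f;              // largest value maps to +/-127
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = d;
    for (int i = 0; i < QK8_0; i++) {
        y->qs[i] = (int8_t) roundf(x[i] * id);   // dequantize later as qs[i] * d
    }
}
```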
3.3 Backend Abstraction (ggml-backend.h, ggml-backend-impl.h)
The backend system provides a device abstraction:
// simplified sketch — the full interface is defined in ggml-backend-impl.h
struct ggml_backend_i {
    const char * (*get_name)(ggml_backend_t backend);
    // memory management
    ggml_backend_buffer_type_t (*get_default_buffer_type)(ggml_backend_t backend);
    // graph execution
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
    // synchronization
    void (*synchronize)(ggml_backend_t backend);
    // ...
};
Supported backends:
| Backend | File(s) | Target |
|---|---|---|
| CPU | ggml-cpu.h + ggml.c | All CPUs (x86, ARM, RISC-V) |
| CUDA | ggml-cuda.h | NVIDIA GPUs |
| Metal | ggml-metal.h | Apple Silicon (M1/M2/M3/M4) |
| Vulkan | ggml-vulkan.h | Cross-vendor GPU |
| SYCL | ggml-sycl.h | Intel & NVIDIA GPU |
| HIP | ggml-cuda.h (ROCm path) | AMD GPU |
| MUSA | ggml-cuda.h (MUSA path) | Moore Threads GPU |
| CANN | ggml-cann.h | Ascend NPU |
| OpenVINO | ggml-openvino.h | Intel CPU/GPU/NPU |
| WebGPU | ggml-webgpu.h | Browser (via Emscripten) |
| RPC | ggml-rpc.h | Remote GPU (network) |
| BLAS | ggml-blas.h | BLAS libraries |
| OpenCL | ggml-opencl.h | Adreno GPU |
| ZenDNN | ggml-zendnn.h | AMD CPU |
| zDNN | ggml-zdnn.h | IBM Z/LinuxONE |
| VirtGPU | ggml-virtgpu.h | Virtual GPU API |
| Hexagon | ggml-hexagon.h | Snapdragon (in progress) |
The scheduler (ggml_backend_sched) orchestrates computation across multiple backends simultaneously. It partitions the computation graph and offloads compatible operations to GPU backends while keeping others on CPU — enabling CPU+GPU hybrid inference for models larger than VRAM.
3.4 Memory Allocator (ggml-alloc.h, ggml-alloc.c)
The allocator uses a multi-pool strategy tailored for the computation graph pattern:
- Temporary allocator for intermediate tensors
- Persistent allocator for weights and long-lived tensors
- Hash-based (ggml-alloc) and address-based (ggml-alloc-addr) strategies
- Integrated with backend memory management for GPU allocations
3.5 Optimization Module (ggml-opt.h, ggml-opt.cpp)
Provides training infrastructure:
- Adam/AdamW: Standard adaptive optimization with weight decay
- L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno for smaller optimization tasks
- SGD: Stochastic gradient descent with momentum
- Datasets: ggml_opt_dataset abstraction for training data batching
- Used by llama.cpp for embedding model fine-tuning
4. Core llama.cpp Library (src/)
4.1 Model Architecture (src/llama-model.cpp + src/models/*.cpp)
The model system supports ~150+ architectures. Each architecture is defined by:
Architecture enum (llama-arch.h): A single entry for each model type (e.g., LLM_ARCH_LLAMA, LLM_ARCH_QWEN2, LLM_ARCH_DEEPSEEK2, LLM_ARCH_MAMBA2)
Tensor naming (LLM_TN helper): Maps logical tensor roles (e.g., LLM_TENSOR_ATTN_Q, LLM_TENSOR_FFN_GATE) to GGUF tensor names (e.g., blk.0.attn_q.weight, blk.0.ffn_gate.weight).
Model class hierarchy (src/models/models.h as base):
llama_model (abstract base)
├── llama_model_llama — LLaMA 1/2/3, Mistral, Yi, etc.
├── llama_model_qwen2 — Qwen 2
├── llama_model_deepseek2 — DeepSeek V2/V3 (with MLA)
├── llama_model_mamba — Mamba state-space model
├── llama_model_rwkv6 — RWKV-6
├── llama_model_gemma3 — Gemma 3 (multimodal)
├── llama_model_qwen2vl — Qwen2-VL (vision-language)
├── ... and 140+ more
Each model class implements:
- model_load(): Load architecture-specific tensors from GGUF and set up hyperparameters
- Graph building: Typically build_llama(), build_qwen2(), etc., which construct the computation graph using llm_graph_context primitives (attention, FFN, norms, etc.)
Key hyperparameter structures:
- llama_hparams: Architecture-level parameters (n_layer, n_head, n_embd, n_ff, etc.)
- llama_cparams: Context-level parameters (n_ctx, n_batch, n_ubatch, pooling type, etc.)
4.2 Graph Building (src/llama-graph.h)
The graph construction is the heart of inference. The llm_graph_context class provides:
Layer building blocks:
- build_norm(): Layer norm or RMS norm (with learnable weights)
- build_qkv(): Compute Q, K, V projections (supports fused and separate paths)
- build_ffn(): Feed-forward network with various activation styles (SiLU, GELU, ReLU, SwiGLU, GeGLU, etc.)
- build_moe_ffn(): Mixture-of-Experts FFN with routing
- build_attn() / build_attn_mha(): Multi-head attention with KV cache support
Input abstractions (llm_graph_input_i):
Each input type abstracts a portion of the ubatch data into graph tensors:
- llm_graph_input_embd: Token embeddings or pre-computed embeddings
- llm_graph_input_pos: Position IDs
- llm_graph_input_attn_kv: KV cache indices and attention mask
- llm_graph_input_rs: Recurrent state copy indices (for Mamba/RWKV)
- llm_graph_input_no_cache: Full self-attention without caching
Graph reuse optimization: The system tracks whether a new batch would produce the same graph topology as a previous one. If so, it reuses the existing graph and only updates input tensors — avoiding expensive re-graph-building.
4.3 Context Management (src/llama-context.cpp, src/llama-context.h)
The llama_context struct manages the runtime state:
Key responsibilities:
- Batch decoding: Splits large logical batches into micro-batches (ubatch) that fit in GPU memory
- Graph scheduling: Reserves the backend scheduler, builds/reuses computation graphs
- Memory management: KV cache allocation, state save/load, per-sequence memory buffers
- Output buffers: Logits, embeddings, and sampling output management
- Multi-sequence support: Up to n_seq_max concurrent sequences with independent KV cache slots
- LoRA adapter switching: Hot-swap LoRA adapters without reloading the base model
Key methods:
- decode(): Process a batch of tokens through the model
- encode(): For encoder-decoder models, process the encoder
- process_ubatch(): The inner loop that feeds one micro-batch through the graph
- state_save_file() / state_load_file(): Full context state serialization
- state_seq_*(): Per-sequence state save/load (for speculative decoding rollback)
4.4 Model Loading (src/llama-model-loader.cpp)
The loader handles GGUF file parsing:
- Multi-file support: Handles sharded models (e.g., <prefix>-00001-of-00003.gguf)
- Memory mapping: Uses mmap() where supported for zero-copy weight loading
- Weight encoding: Supports BF16 and quantized tensor format conversion during load
- Metadata extraction: Reads all KV pairs to configure hyperparameters, tokenizer, and chat template
- Tensor allocation: Allocates tensor storage via the backend-specific buffer types (CPU RAM, GPU VRAM)
- Layer offloading: Distributes layers across GPU (specified by count) and CPU — see the example below
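In practice, offloading is controlled from the command line. A typical invocation (the model path is illustrative):

```
# offload the first 35 transformer layers to the GPU, keep the rest on the CPU
llama-cli -m ./models/llama-7b-Q4_K_M.gguf -p "Hello" -ngl 35
```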
4.5 Vocabulary & Tokenization (src/llama-vocab.cpp, src/llama-vocab.h)
Supports multiple tokenizer models:
| Tokenizer Type | Enum | Models |
|---|---|---|
| SentencePiece (Unigram) | LLAMA_VOCAB_TYPE_SPM | LLaMA, Mistral, Gemma |
| BPE (byte-level) | LLAMA_VOCAB_TYPE_BPE | GPT-2, Falcon, Qwen |
| WordPiece | LLAMA_VOCAB_TYPE_WPM | BERT |
| Unigram | LLAMA_VOCAB_TYPE_UGM | T5, Flan-T5 |
| RWKV | LLAMA_VOCAB_TYPE_RWKV | RWKV-6/7 |
| PLaMo-2 | LLAMA_VOCAB_TYPE_PLAMO2 | PLaMo-2 |
Pre-tokenizers (llama_vocab_pre_type): Handle architecture-specific text preprocessing before the main tokenizer — for example, the regex splitting rules used by LLaMA 3, DeepSeek, and GPT-4o-style BPE tokenizers, and byte-fallback handling. There are ~50 pre-tokenizer variants.
Key methods:
- tokenize(): Text → token IDs (with add_special/parse_special flags)
- detokenize(): Token IDs → text (with remove_special/unparse_special)
- token_to_piece(): Single token → text fragment
- byte_to_token() / token_to_byte(): For byte-level tokenizers
Chat template support: The vocab stores Hugging Face-style Jinja2 chat templates, applied via the common/chat.cpp module.
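As an illustration of how a caller applies the stored template, the sketch below uses the llama_chat_apply_template() C API. The signatures here follow a recent llama.h and change between releases; llama_model_chat_template() is assumed to return the model's built-in template string:

```c
// assumes an already-loaded llama_model * model
llama_chat_message msgs[] = {
    { "system", "You are a helpful assistant." },
    { "user",   "Write a haiku about GGUF."    },
};

const char * tmpl = llama_model_chat_template(model, /*name =*/ NULL);

char buf[4096];
int32_t n = llama_chat_apply_template(tmpl, msgs, 2, /*add_ass =*/ true, buf, sizeof(buf));
if (n > 0 && n < (int32_t) sizeof(buf)) {
    // buf now holds the fully formatted prompt, ready for llama_tokenize()
}
```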
4.6 Sampling (src/llama-sampler.cpp, src/llama-sampler.h)
The sampling system is a chain-of-responsibility:
Input: logits (raw model outputs, shape [n_vocab])
↓
Sampler Chain (each step transforms the candidate distribution):
├── TemperatureSampler — Divide logits by temperature
├── TopKSampler — Keep only top-k logits
├── TopPSampler (nucleus) — Keep cumulative probability p
├── MinPSampler — Keep logits ≥ min_p × max_logit
├── TypicalSampler — Keep logits with entropy below threshold
├── XTCSampler — Exclude Top Choices (probabilistically drops the highest-probability tokens)
├── DRYSampler — Context-aware repetition penalty (n-gram blocking)
├── RepetitionPenalty — Penalize recently generated tokens
├── FrequencyPenalty — Penalize high-frequency tokens
├── PresencePenalty — Penalize tokens that have appeared
├── MirostatSampler — Adaptive entropy-based sampling
├── GrammarSampler — Constrain output to a GBNF grammar
├── LevenshteinSampler — Constrain via Levenshtein automata
├── PenalizeNgram — N-gram repetition blocking
├── TailFreeSampler — Remove tail probability mass
├── LocallyTypical — Locally typical sampling
└── GreedySampler — Always pick most likely token (no randomness)
↓
Output: single token ID + optional probabilities
Samplers can be CPU-side (standard C++ implementations) or backend-side (executed as part of the GPU computation graph for zero-copy). Backend sampling is more efficient but requires the sampler to be representable as ggml operations.
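In the public C API, a chain is assembled from individual samplers. A minimal sketch (function names from the current llama.h; the parameter values are arbitrary):

```c
llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, /*min_keep =*/ 1));
llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));  // final random pick

// after llama_decode(), sample from the logits of the last token in the batch
llama_token tok = llama_sampler_sample(chain, ctx, -1);

llama_sampler_free(chain);
```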
4.7 KV Cache (src/llama-kv-cache.cpp, src/llama-kv-cache.h)
The KV cache stores computed Key and Value tensors from previous tokens, avoiding recomputation:
- Standard KV cache: Linear storage of K/V for each layer, each sequence
- iSWA KV cache (llama-kv-cache-iswa.*): Interleaved sliding-window attention cache — stores both full and windowed K/V for models with sliding-window attention
- Memory cost: Typically the largest memory consumer (~2 × n_layers × n_embd_kv × n_ctx × dtype_size; see the worked example below)
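As a rough illustration (assuming full multi-head attention with no GQA, so n_embd_kv = n_embd): a 7B-class model with n_layers = 32, n_embd_kv = 4096, n_ctx = 4096, and F16 K/V needs about 2 × 32 × 4096 × 4096 × 2 bytes ≈ 2 GiB of KV cache. Grouped-query attention models shrink this proportionally to their reduced number of KV heads.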
Key operations:
- set() / get(): Basic K/V storage and retrieval
- cpy_k() / cpy_v(): Efficient copy for sequence management
- head() / size(): Cache capacity tracking
4.8 Recurrent Memory (src/llama-memory-recurrent.cpp)
For state-space models like Mamba and RWKV, there's no KV cache. Instead, a recurrent state is maintained:
- llama_memory_recurrent_context: Manages per-sequence hidden states (shape depends on architecture)
- State copy/update operations during batch processing
- Hybrid models (like Jamba) use both KV cache and recurrent memory
4.9 Grammar Engine (src/llama-grammar.cpp, src/llama-grammar.h)
Implements GBNF (GGML BNF) grammar parsing and guided generation:
- Parses .gbnf grammar files (BNF-like notation with character sets, repetitions, alternatives)
- Maintains parse stacks over the grammar rules (effectively a pushdown automaton) and advances them as tokens are accepted
- During sampling, computes a grammar mask — a boolean vector over the vocabulary indicating which tokens are valid next
- Enables structured output: JSON, code, mathematical expressions, etc.
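For example, a small GBNF grammar that forces the model to answer with a tiny JSON object (illustrative; the grammar syntax is documented in grammars/README.md):

```
# {"answer": "yes"} or {"answer": "no"}
root   ::= "{" ws "\"answer\"" ws ":" ws answer ws "}"
answer ::= "\"yes\"" | "\"no\""
ws     ::= [ \t\n]*
```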
5. Common Utilities (common/)
The common/ directory provides shared functionality used by all tools and examples:
5.1 Chat System (common/chat.cpp, common/chat.h)
- Chat templates: Applies Jinja2-like templates (LLaMA 3, ChatML, etc.) to format conversations
- PEG parser (common/peg-parser.*): A parsing expression grammar engine for output structuring
- Auto parser (common/chat-auto-parser.*): Automatically detects model capabilities and applies appropriate parsing
- Chat diff analyzer (common/chat-diff-analyzer.cpp): Tracks conversation state changes
5.2 Sampling Extensions (common/sampling.cpp, common/sampling.h)
- Convenience wrappers for sampler chain construction
- DRY repetition penalty (n-gram based, llama_sampler_init_dry_testing)
- Builds llama_sampler_chain instances with pre-tuned defaults
5.3 Speculative Decoding (common/speculative.cpp, common/speculative.h)
- Draft-verification: Uses a smaller "draft" model (or same model with fewer layers) to generate candidate tokens, then verifies them with the target model
- Batch acceptance: Process draft tokens in parallel for significant speedup
- Target model and draft model can run on different devices
5.4 Grammar & JSON Tools
- json-schema-to-grammar.*: Converts JSON Schema to GBNF grammar for constrained generation
- json-partial.*: Handles partial/incremental JSON parsing for streaming
- regex-partial.*: Partial regex matching for streaming validation
- PEG parser (common/peg-parser.*): Full PEG parsing engine for complex output constraints
5.5 N-gram Cache (common/ngram-cache.*, common/ngram-map.*, common/ngram-mod.*)
- Prefix tree-based n-gram tracking
- Used by repetition penalties and speculative decoding
5.6 Other Utilities
- common.cpp: Argument parsing, model path resolution, GPU info
- console.cpp: Terminal I/O with color and raw mode
- log.cpp: Logging infrastructure
- hf-cache.cpp: Hugging Face Hub integration (model downloads, caching)
- dl.cpp / download.cpp: HTTP download with resume
- fit.cpp: Parameter fitting (e.g., optimal alpha value for models)
- preset.cpp: Configuration presets
- llguidance.cpp: Integration with Microsoft's llguidance library
- reasoning-budget.cpp: Token budget management for chain-of-thought
- unicode.cpp: Unicode utilities
6. Tools (tools/)
6.1 llama-cli (tools/cli/)
The primary command-line interface for interactive use:
- Conversation mode with chat templates
- Grammar-constrained generation
- Multi-modal input (images for vision models)
- Completion-style interface (for code FIM)
- Presets for quick configuration
- Streaming output
6.2 llama-server (tools/server/)
An OpenAI-compatible HTTP server:
- POST /v1/completions — Text completion
- POST /v1/chat/completions — Chat completion
- POST /v1/embeddings — Embedding extraction
- POST /v1/rerank — Reranking
- GET /v1/models — Model listing
- Multi-user support with queuing (server-queue.*)
- Concurrent decoding with configurable slot count
- Tool/function calling support (server-tools.*)
- CORS proxy
- Web UI at http://localhost:8080 (see the example request below)
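Because the endpoints follow the OpenAI wire format, any OpenAI client works against it. A minimal request against the default port:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user",   "content": "Hello!"}
        ]
      }'
```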
Server components:
- server-http.*: HTTP parsing and routing (based on cpp-httplib)
- server-context.*: Per-request context management
- server-chat.*: Chat history and formatting
- server-models.*: Multi-model management
- server-queue.*: Request queuing and scheduling
- server-task.*: Task lifecycle management
- server-tools.*: Tool/function calling implementation
6.3 llama-perplexity (tools/perplexity/)
Evaluates model quality:
- Perplexity measurement over text corpora
- KL divergence computation
- Cross-entropy scoring
6.4 llama-bench (tools/llama-bench/)
Performance benchmarking:
- Tests prompt processing (pp) and text generation (tg) throughput
- Multi-configuration comparison
- JSON output for automation
6.5 llama-quantize (tools/quantize/)
Model quantization tool:
- Converts F16/BF16 → any quantized format
- Importance matrices for IQ quantization
- Mixed quantization (different types for different layers)
- Output quality estimation
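A typical invocation (file names illustrative) requantizes an F16 GGUF to a medium 4-bit K-quant:

```
llama-quantize ./model-f16.gguf ./model-Q4_K_M.gguf Q4_K_M
```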
6.6 Other Tools
- gguf-split/: Split/merge GGUF files
- imatrix/: Generate importance matrices for quantization
- export-lora/: Export LoRA adapters from training
- tokenize/: Tokenization debugging tool
- results/: Analyze benchmark results
- rpc/: RPC server for remote GPU compute
- mtmd/: Multi-modal processing (CLIP-based vision encoder)
- parser/: Template/debug parser analysis
- completion/: Bash completion script generation
- batched-bench/: Batched inference benchmarking
- fit-params/: Parameter fitting utilities
- cvector-generator/: Control vector generation
- tts/: Text-to-speech integration
7. Examples (examples/)
The examples directory provides reference implementations:
| Example | Description |
|---|---|
| simple/ | Minimal inference: load model, tokenize, decode, sample — the simplest possible usage |
| simple-chat/ | Minimal chat loop with template handling |
| batched/ | Batched inference with multiple sequences |
| parallel/ | Parallel decoding with independent sequences |
| speculative/ | Speculative decoding implementation |
| speculative-simple/ | Minimal speculative decoding |
| server/ → tools/server/ | Production server (moved to tools/) |
| save-load-state/ | Context state serialization |
| embedding/ | Embedding extraction |
| eval-callback/ | Custom evaluation callbacks |
| training/ | Fine-tuning example |
| gguf/ | GGUF file reading/writing |
| diffusion/ | Diffusion model inference |
| retrieval/ | RAG-style retrieval-augmented generation |
| lookahead/ | Lookahead decoding |
| lookup/ | Lookup-based decoding |
| passkey/ | Passkey retrieval task |
| idle/ | Idle/keepalive example |
| llama.android/ | Android app using JNI |
| llama.swiftui/ | SwiftUI iOS/macOS app |
| batched.swift/ | Swift batched inference |
| debug/ | Debug utilities |
| model-conversion/ | Model conversion scripts |
| convert-llama2c-to-ggml/ | Karpathy's llama2.c format converter |
| sycl/ | SYCL backend example |
| simple-cmake-pkg/ | CMake package integration |
| gen-docs/ | Documentation generation |
8. Model Implementations (src/models/)
Each .cpp file in src/models/ implements the graph building logic for a specific architecture. Here are the major families:
8.1 Decoder-Only Transformer (Most common)
| File | Model(s) |
|---|---|
| llama.cpp | LLaMA 1/2/3, Mistral, Yi, Hermes, etc. |
| llama4.cpp | LLaMA 4 (Meta's latest) |
| qwen.cpp | Qwen 1 |
| qwen2.cpp | Qwen 2 |
| qwen3.cpp | Qwen 3 |
| qwen35.cpp | Qwen 3.5 |
| qwen2vl.cpp | Qwen2-VL (vision-language) |
| qwen3vl.cpp | Qwen3-VL (vision-language) |
| qwen3vlmoe.cpp | Qwen3-VL MoE |
| deepseek.cpp | DeepSeek V1 |
| deepseek2.cpp | DeepSeek V2/V3/R1 (with Multi-head Latent Attention) |
| deepseek2ocr.cpp | DeepSeek V2 + OCR |
| phi2.cpp | Phi-2 |
| phi3.cpp | Phi-3 |
| gemma.cpp | Gemma 1 |
| gemma2.cpp | Gemma 2 |
| gemma3.cpp | Gemma 3 (multimodal) |
| gemma3n.cpp | Gemma 3N |
| gemma4.cpp | Gemma 4 |
| mistral3.cpp | Mistral 3 |
| mistral4.cpp | Mistral 4 |
8.2 Mixture-of-Experts (MoE)
| File | Model(s) |
|---|---|
| mixtral.cpp → via llama.cpp | Mixtral 8x7B |
| qwen2moe.cpp | Qwen2 MoE |
| qwen35moe.cpp | Qwen 3.5 MoE |
| qwen3moe.cpp | Qwen 3 MoE |
| phimoe.cpp | PhiMoE |
| dbrx.cpp | DBRX |
| deepseek2.cpp | DeepSeek V3 (MoE) |
| olmoe.cpp | OLMoE |
| openai-moe.cpp | OpenAI MoE |
| arctic.cpp | Snowflake Arctic |
| jamba.cpp | Jamba (hybrid attention + Mamba MoE) |
| exaone-moe.cpp | EXAONE MoE |
| hunyuan-moe.cpp | Hunyuan MoE |
| granite-moe.cpp | Granite MoE |
| bailingmoe.cpp | Bailing MoE |
| bailingmoe2.cpp | Bailing MoE 2 |
| grokmoe.cpp | Grok MoE |
| lfm2moe.cpp | LFM 2 MoE |
| glm4-moe.cpp | GLM4 MoE |
| llada-moe.cpp | LLADA MoE |
| ernie4-5-moe.cpp | ERNIE 4.5 MoE |
| nemotron-h-moe.cpp | Nemotron H MoE |
| llama4.cpp | LLaMA 4 MoE variants |
| afmoe.cpp | AF MoE |
8.3 State-Space / Recurrent
| File | Model(s) |
|---|---|
| mamba.cpp | Mamba |
| mamba2.cpp | Mamba 2 |
| mamba-base.cpp | Base Mamba operations |
| rwkv6.cpp | RWKV-6 |
| rwkv7.cpp | RWKV-7 |
| rwkv6-base.cpp | RWKV-6 base ops |
| rwkv6qwen2.cpp | RWKV-6 + Qwen2 hybrid |
| arwkv7.cpp | Attention + RWKV-7 |
8.4 Encoder Models (BERT, Embeddings)
| File | Model(s) |
|---|---|
| bert.cpp | BERT |
| modern-bert.cpp | ModernBERT |
| nomic-bert.cpp | Nomic BERT |
| nomic-bert-moe.cpp | Nomic BERT MoE |
| jina-bert-v2.cpp | Jina BERT v2 |
| jina-bert-v3.cpp | Jina BERT v3 |
| eurobert.cpp | EuroBERT |
| neo-bert.cpp | NeoBERT |
| llama-embed.cpp | LLaMA Embedding |
| pangu-embed.cpp | Pangu Embedding |
| t5encoder.cpp | T5 Encoder |
| gpt2.cpp | GPT-2 |
| gptneox.cpp | GPT-NeoX, Pythia |
8.5 Encoder-Decoder
| File | Model(s) |
|---|---|
| t5.cpp | T5, Flan-T5 |
| chameleon.cpp | Chameleon (Meta) |
| minicpm.cpp | MiniCPM |
| minicpm3.cpp | MiniCPM 3 |
8.6 Quantized Base Models
| File | Model(s) |
|---|---|
| bitnet.cpp | BitNet b1.58 (1.58-bit ternary) |
| plamo.cpp | PLaMo |
| plamo2.cpp | PLaMo 2 |
| plamo3.cpp | PLaMo 3 |
8.7 Other Notable Architectures
| File | Model(s) |
|---|---|
| chatglm.cpp | ChatGLM 3 |
| glm4.cpp | GLM-4 |
| glm-dsa.cpp | GLM DSA |
| bloom.cpp | BLOOM |
| falcon.cpp | Falcon |
| falcon-h1.cpp | Falcon H1 |
| command-r.cpp | Cohere Command-R |
| cohere2.cpp | Cohere 2 |
| grok.cpp | Grok-1 |
| stablelm.cpp | StableLM |
| starcoder.cpp | StarCoder |
| starcoder2.cpp | StarCoder 2 |
| olmo.cpp | OLMo |
| olmo2.cpp | OLMo 2 |
| openelm.cpp | Apple OpenELM |
| internlm2.cpp | InternLM 2 |
| xverse.cpp | Xverse |
| jais.cpp | Jais |
| jais2.cpp | Jais 2 |
| exaone.cpp | EXAONE |
| exaone4.cpp | EXAONE 4 |
| granite.cpp | IBM Granite |
| granite-hybrid.cpp | IBM Granite Hybrid |
| deci.cpp | Deci |
| codeshell.cpp | CodeShell |
| orion.cpp | Orion |
| nemotron.cpp | NVIDIA Nemotron |
| nemotron-h.cpp | NVIDIA Nemotron H |
| hunyuan-dense.cpp | Tencent Hunyuan Dense |
| hunyuan-vl.cpp | Hunyuan Vision-Language |
| smollm3.cpp | SmolLM 3 |
| lfm2.cpp | Liquid LFM 2 |
| dream.cpp | Dream |
| smallthinker.cpp | SmallThinker |
| llada.cpp | LLADA |
| seed-oss.cpp | Seed OSS |
| dots1.cpp | Dots 1 |
| apertus.cpp | Apertus |
| minimax-m2.cpp | MiniMax M2 |
| cogvlm.cpp | CogVLM |
| rnd1.cpp | RND-1 |
| ernie4-5.cpp | ERNIE 4.5 |
| step35.cpp | Step 3.5 |
| maincoder.cpp | MainCoder |
| kimi-linear.cpp | Kimi K2 (Linear Attention) |
| mimo2.cpp | MIMO 2 |
| paddleocr.cpp | PaddleOCR |
9. Python Infrastructure
9.1 Model Conversion (convert_hf_to_gguf.py)
The massive (668KB) conversion script handles:
- Loading Hugging Face models (PyTorch, SafeTensors)
- Architecture-specific tensor mapping (HF names → GGUF names)
- Vocabulary extraction (tokenizer.model, tokenizer.json, tokenizer_config.json)
- Chat template preservation
- Metadata embedding (model name, description, license, etc.)
- Output to GGUF format
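A typical conversion run (paths illustrative; --outfile and --outtype are the commonly used flags):

```
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16
```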
9.2 GGUF Python Library (gguf-py/)
Official Python library for GGUF manipulation:
- gguf/ — Core GGUF reading/writing
- examples/ — Usage examples
- tests/ — Test suite
- Used by the Hugging Face Spaces ecosystem (GGUF-my-repo, GGUF-editor)
9.3 Other Python Scripts
- convert_llama_ggml_to_gguf.py: Legacy GGML-to-GGUF migration
- convert_lora_to_gguf.py: LoRA adapter conversion
- convert_hf_to_gguf_update.py: Auto-update script for new model support
10. Build System & CI
10.1 CMake (CMakeLists.txt, CMakePresets.json)
The primary build system supporting:
- GPU backend selection (-DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON, etc. — see the example invocations below)
- Quantization kernel selection
- Build type (Release/Debug/RelWithDebInfo)
- Sanitizers (ASan, UBSan, TSAN)
- Static/shared library builds
- macOS framework generation (XCFramework for iOS/tvOS/visionOS)
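Typical builds look like this (standard CMake workflow; the backend flag matches the list above):

```
# CPU-only release build
cmake -B build
cmake --build build --config Release -j

# same, with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```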
10.2 GitHub Actions Workflows (.github/workflows/)
Comprehensive CI covering:
- server.yml: Server build and test across platforms
- build.yml: Main build matrix (Linux, macOS, Windows, various backends)
- models.yml: Model conversion and validation
- codeql.yml: Security scanning
- stale.yml: Issue management
- clang-tidy.yml: Static analysis
- Containerized Snapdragon and CUDA builds
10.3 Devops (devops/)
- Nix flake (flake.nix): Reproducible builds
- Docker: Multi-platform container builds
- NixOS module: Systemd service configuration
11. Data Flow: From User Input to Generated Token
A typical inference call flow through llama-simple:
1. llama_model_load_from_file()
└─ llama_model_loader: parse GGUF, mmap weights, allocate tensors
└─ model-specific loader (e.g., llama_model_llama::model_load())
2. llama_init_from_model()
└─ Create llama_context with KV cache, graph scheduler, output buffers
3. llama_tokenize()
└─ Text → [token IDs] using the vocab's tokenizer
4. llama_decode(ctx, batch)
└─ llama_context::decode()
└─ Split batch into ubatches
└─ llama_context::process_ubatch()
├─ Build/reuse computation graph via llm_graph_context
│ ├─ build_inp_embd() — Token embeddings lookup
│ ├─ build_inp_pos() — Position encoding
│ └─ For each layer:
│ ├─ build_norm() — Layer/RMS norm
│ ├─ build_qkv() — QKV projections
│ ├─ build_attn() — Self-attention + KV cache
│ ├─ build_ffn() — Feed-forward network
│ └─ Residual connections
├─ graph_compute() — Execute via ggml_backend_sched
│ ├─ Dispatch ops to GPU backends (CUDA/Metal/Vulkan)
│ └─ Fallback to CPU for unsupported ops
└─ Extract logits from output buffer
5. llama_sampler_sample(smpl, ctx, -1)
└─ Apply sampler chain to logits
└─ Grammar mask, top-k, top-p, temperature, repetition penalty...
└─ Return next token ID
6. llama_token_to_piece()
└─ Token ID → text character(s)
7. Append token to batch, goto 4 (autoregressive loop)
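The same flow, collapsed into code, looks roughly like the sketch below. It loosely follows examples/simple; the exact function names and signatures track the current llama.h and have changed between releases, so treat them as an approximation rather than a stable API listing:

```cpp
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // (depending on the build, llama_backend_init()/ggml_backend_load_all() may be required first)

    // 1-2. load the model and create a context
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                                     // offload as many layers as fit
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    const llama_vocab * vocab = llama_model_get_vocab(model);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;
    llama_context * ctx = llama_init_from_model(model, cparams);

    // 5. sampler chain (greedy, for simplicity)
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // 3. tokenize the prompt (a negative return value is the required token count)
    std::string prompt = "The capital of France is";
    const int n_tok = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), nullptr, 0, true, true);
    std::vector<llama_token> tokens(n_tok);
    llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens.data(), tokens.size(), true, true);

    // 4-7. autoregressive loop: decode, sample, print, feed the new token back in
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    llama_token tok;
    for (int i = 0; i < 64; i++) {
        if (llama_decode(ctx, batch) != 0) break;
        tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];
        const int n = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        printf("%.*s", n, piece);

        batch = llama_batch_get_one(&tok, 1);
    }

    llama_sampler_free(smpl);
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```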
12. Key Design Decisions & Trade-offs
12.1 Why C/C++ and Not Python?
- Zero dependencies: No Python runtime, no PyTorch, no CUDA toolkit required at deployment
- Portability: Runs on bare-metal, embedded systems, iOS/Android apps
- Performance: Complete control over memory layout, SIMD intrinsics, and threading
- Small binary footprint: A static build is ~3-10 MB
12.2 Why GGUF and Not Hugging Face/SafeTensors?
- Trivially parseable: Binary format with no Python dependency for reading
- MMAP-friendly: Tensor data is a single contiguous block suitable for memory-mapped I/O
- Self-contained: Stores tokenizer, chat template, and all metadata in the same file
- Quantized storage: Natively stores quantized types without conversion on load
12.3 Computation Graph vs. Eager Execution
- Graph approach: Build a DAG of operations, then execute — enables graph optimization, backend scheduling, and memory reuse
- Trade-off: Higher initial latency to build the graph, but subsequent executions with the same topology are fast
- Reuse optimization: The llm_graph_result system avoids rebuilding graphs when the topology hasn't changed
12.4 Quantization Strategy
- K-quants: Block-based quantization (Q2_K through Q6_K) with hierarchical scaling provides excellent accuracy/size trade-offs
- IQ: Importance-aware quantization pushes quality boundaries at very low bitrates (1.5-2.5 bpw)
- No re-quantization at runtime: Weights stay quantized; matmul kernels operate directly on quantized data
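As a rough back-of-the-envelope example of the size trade-off: a 7B-parameter model at Q4_K_M's roughly 4.8 bits per weight takes about 7×10⁹ × 4.8 / 8 ≈ 4.2 GB on disk, versus ~14 GB at F16.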
13. Performance Characteristics
| Operation | Bottleneck | Optimization |
|---|---|---|
| Prompt processing (prefill) | Matrix multiply (GEMM) | GPU tensor cores, quantized matmul, batch parallelism |
| Token generation (decoding) | Memory bandwidth | KV cache reuse, quantized weights, memory-bound matmul |
| Large context (>32K) | Attention computation | Flash attention, paged attention, sliding window |
| Batch inference | Throughput vs. latency | Dynamic batching, concurrent slots, KV cache sharing |
| Multi-modal | Encoder compute | Separable encoder/decoder graphs, cross-attention |
Typical speeds (Q4_K_M, M2 Ultra, 7B model):
- Prompt processing: ~5,000-10,000 tokens/second
- Text generation: ~50-200 tokens/second (varies with context size and quantization)
14. Bindings & Ecosystem
The project has extensive bindings enabling use from virtually any language:
| Language | Package | Maintainer |
|---|---|---|
| Python | llama-cpp-python | abetlen |
| Python | easy-llama | ddh0 |
| Go | go-llama.cpp | go-skynet |
| Go | yzma (no CGo) | hybridgroup |
| Node.js | node-llama-cpp | withcatai |
| Rust | llama_cpp-rs | edgenai/utilityai |
| C# | LLamaSharp | SciSharp |
| Java | java-llama.cpp | kherud |
| Swift | llama-cpp-swift | srgtuszy |
| Ruby | llama_cpp.rb | yoshoku |
| Zig | llama.cpp.zig | deins |
| Flutter/Dart | llama_cpp_dart | netdur |
| Android | llama.android | in-repo |
Popular downstream projects:
- Ollama: The most popular user-facing LLM runner, built on llama.cpp
- LM Studio: GUI application for model exploration
- GPT4All: Desktop LLM client by Nomic AI
- llamafile: Mozilla's single-file executable LLM (combines llama.cpp + Cosmopolitan libc)
- text-generation-webui: oobabooga's powerful web interface
- koboldcpp: Storytelling-focused fork
- LocalAI: Kubernetes-native LLM serving
- Jan: Open-source ChatGPT alternative
15. Development & Contribution
15.1 Code Organization Philosophy
- src/: Core library (libllama) — depends only on ggml
- ggml/: Tensor library (standalone, used by multiple projects)
- common/: Shared utilities for tools — depends on src/
- tools/: Standalone executables — depend on common/
- examples/: Minimal reference implementations
15.2 Adding a New Model
The process (detailed in docs/development/HOWTO-add-model.md):
- Add the architecture enum to llama-arch.h
- Add tensor name mappings in llama-arch.cpp
- Write the model loader and graph builder in src/models/<name>.cpp
- Register the architecture in the model mapping function in llama-model.cpp
- Add a pre-tokenizer type if needed in llama-vocab.cpp
- Add GGUF KV key definitions
- Update the Python converter (convert_hf_to_gguf.py) for HF model mapping
16. Project Stats (Approximate)
| Metric | Value |
|---|---|
| Total source files (C/C++) | ~250+ |
| Model architectures | 150+ |
| GPU backends | 15+ |
| Quantization formats | 30+ |
| Sampler types | 20+ |
| Lines of C/C++ code | ~80,000+ (src + ggml) |
| Python conversion scripts | 4 (major) |
| Bindings (languages) | 15+ |
| Downstream projects | 100+ |
| Commits | 10,000+ |
| Contributors | 1,000+ |
| GitHub stars | 75,000+ |
This document was generated by exploring the repository structure, source code, and documentation of llama.cpp (commit as of May 2026). It represents a snapshot of a rapidly evolving project. For the most current information, refer to the repository and its official documentation.