MatMul Proof-of-Work
Specification of the BTX MatMul proof-of-work consensus algorithm: finite-field matrix multiplication, transcript compression, Freivalds verification, and DoS mitigation.
Overview
BTX replaces traditional hash-based proof-of-work with MatMul PoW, an AI-native proof-of-work system based on matrix multiplication over a finite field. The protocol, derived from the cuPOW construction (Komargodski, Schen, Weinstein 2025), achieves a multiplicative overhead of only ~16.5% above a bare matrix multiply at the production parameters. Miners perform real linear-algebra computation rather than brute-force hashing, making the work inherently useful as a GPU compute benchmark and a stepping stone toward externally useful matrix workloads in a future v2.
Consensus Parameters
| Parameter | Symbol | Value | Role |
|---|---|---|---|
| Matrix dimension | n | 512 | Compute cost O(n³); ~134M field multiplications per attempt |
| Transcript block size | b | 16 | Hashing granularity; (n/b)³ = 32,768 intermediates per attempt |
| Noise rank | r | 8 | Security parameter; denoising cost O(n²r) |
| Field modulus | q | 2³¹ − 1 (2,147,483,647) | Mersenne prime M31; GPU-friendly int32 arithmetic |
| Pre-hash epsilon bits | ε | 10 (18 after height 55,000) | Sigma gate: pre-hash must satisfy H < (target << ε) before the expensive MatMul path |
| Freivalds rounds | k | 2 | False-positive probability < 2⁻⁶² |
| Validation window | — | 1,000 blocks | Depth beyond which full Phase 2 revalidation is not required |
| Block time | — | 90 s (steady-state); 250 ms (fast-mine, h < 50,000) | Target spacing for difficulty adjustment |
Note: b (transcript block size) and r (noise rank) are independent parameters. Changing b affects transcript size and cache behavior; changing r affects security conjecture strength and denoising overhead.
Finite-Field Arithmetic (FM31)
All arithmetic operates in the Mersenne prime field Fq where q = 2³¹ − 1. This prime was chosen for GPU friendliness: reduction is a single add-and-mask on int32 hardware, and the field supports native Apple Metal and CUDA int32 pipelines.
Reduction
The reduce64 function uses a double Mersenne fold, proven safe for all uint64 inputs (the single-fold variant is buggy for x ≥ 2⁶²). The dot() accumulator is the only sanctioned path, applying per-step reduction with a length-independent safety proof.
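The double fold and the per-step-reducing accumulator can be sketched in Python (illustrative only; function names follow the text, but this is not the consensus implementation):

```python
M31 = (1 << 31) - 1  # q = 2^31 - 1, the Mersenne prime M31

def reduce64(x: int) -> int:
    """Reduce a uint64 value mod M31 via a double Mersenne fold.

    Since 2^31 ≡ 1 (mod M31), x = hi·2^31 + lo ≡ hi + lo. One fold brings
    any 64-bit input below 2^34; a second brings it to at most 2^31 + 6,
    so a single conditional subtract finishes. (A single fold alone is
    insufficient for x ≥ 2^62, as noted above.)
    """
    x = (x & M31) + (x >> 31)   # first fold: result < 2^34
    x = (x & M31) + (x >> 31)   # second fold: result ≤ 2^31 + 6
    if x >= M31:
        x -= M31
    return x

def dot(a, b):
    """Inner product in F_q with per-step reduction.

    With acc < 2^31 and each product below 2^62, the pre-reduction sum
    stays below 2^63, inside reduce64's proven uint64 input range, for
    any vector length.
    """
    acc = 0
    for ai, bi in zip(a, b):
        acc = reduce64(acc + ai * bi)
    return acc
```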
Matrix Generation from Seeds
Matrices A and B are deterministically expanded from 32-byte seeds carried in the block header. The expansion uses SHA-256 as a PRF with LE32 counters and rejection sampling for uniform distribution mod M31. Domain separation tags prevent cross-component collisions:
- matmul_noise_EL_v1 — left noise factor EL
- matmul_noise_ER_v1 — right noise factor ER
- matmul_noise_FL_v1 — left noise factor FL
- matmul_noise_FR_v1 — right noise factor FR
- matmul_compress_v1 — compression vector σ
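A minimal sketch of the seed expansion, assuming a SHA-256(tag || seed || LE32 counter) PRF layout and a 31-bit-mask rejection rule; the exact byte layout here is an assumption, not the normative wire format:

```python
import hashlib
import struct

M31 = (1 << 31) - 1

def expand_elements(seed: bytes, tag: bytes, count: int):
    """Deterministically expand `count` uniform F_q elements from a seed.

    Each SHA-256 call over tag || seed || LE32 counter yields eight 4-byte
    words. Masking to 31 bits gives a value in [0, M31]; rejecting the
    single value M31 leaves a uniform sample of [0, M31), i.e. mod-q
    without bias (assumed rejection rule for illustration).
    """
    out = []
    ctr = 0
    while len(out) < count:
        block = hashlib.sha256(tag + seed + struct.pack("<I", ctr)).digest()
        ctr += 1
        for i in range(0, 32, 4):
            v = int.from_bytes(block[i:i + 4], "little") & 0x7FFFFFFF
            if v == M31:          # rejection sampling: discard the one biased value
                continue
            out.append(v)
            if len(out) == count:
                break
    return out
```

An n×n matrix is then just `expand_elements(seed, tag, n * n)` reshaped row-major.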
Mining Algorithm (Solve)
The mining loop for a single nonce attempt proceeds as follows:
- Sigma gate (pre-hash): Compute a cheap SHA-256 pre-hash from the candidate header. If the pre-hash does not satisfy H < (target << ε), skip the expensive MatMul entirely. This filters out ~99.9% of nonces at ε=10.
- Noise generation: Derive four low-rank factor matrices from the oracle seed σ: EL ∈ Fq^(n×r), ER ∈ Fq^(r×n), FL ∈ Fq^(n×r), FR ∈ Fq^(r×n). Form E = EL·ER and F = FL·FR (each rank ≤ r).
- Noisy MatMul: Compute C' = (A + E)·(B + F) using the canonical block decomposition with block size b=16. The computation iterates over (n/b)³ = 32,768 intermediate b×b partial-sum blocks in a fixed order (i, j, l).
- Transcript compression: Each intermediate block is compressed to a single field element via a random inner-product derived from σ. The compressed elements are fed into a rolling SHA-256d hasher, reducing hashed data from ~33.5 MB (naive) to ~131 KB.
- Difficulty check: If H(compressed transcript) < target, the nonce is valid.
- Denoise and emit: Compute C = C' − A·F − E·(B + F). Emit the block with matmul_digest, seed_a, seed_b, and the product matrix C'.
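The sigma-gate predicate from the first step can be sketched as follows (function name and header encoding are illustrative assumptions):

```python
import hashlib

def sigma_gate_passes(header: bytes, target: int, epsilon_bits: int = 10) -> bool:
    """Cheap pre-hash filter applied before the O(n^3) MatMul path.

    The SHA-256 pre-hash of the candidate header, read as a big-endian
    256-bit integer, must fall below target << epsilon_bits; otherwise
    the nonce is skipped without touching the matrix pipeline.
    """
    pre = int.from_bytes(hashlib.sha256(header).digest(), "big")
    return pre < (target << epsilon_bits)
```

At ε=10 this relaxes the comparison by a factor of 2¹⁰ relative to the final difficulty check, so only a small fraction of nonces reach the MatMul.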
Transcript Compression
Naive full-block hashing would require SHA-256 over ~33.5 MB per nonce attempt, adding 30–100% overhead on GPU miners. BTX uses field-algebraic inner-product compression to reduce this by ~256×:
| Approach | Bytes Hashed | SHA-256 Time (SHA-NI) | Overhead vs MatMul |
|---|---|---|---|
| Naive full-block | 33.5 MB | ~67 ms | 30–100% |
| Field-algebraic compression | ~131 KB | ~0.26 ms | < 1% |
Security is preserved: the random inner-product family is pairwise independent, so forging a different b×b block that produces the same compressed element requires guessing a field element (probability 1/q ≈ 2⁻³¹ per intermediate). A single σ-derived compression vector is reused across all 32,768 intermediates; Carter-Wegman per-intermediate binding plus σ-unpredictability makes per-intermediate vectors unnecessary.
On GPU architectures, the compression dot-product is fused into the matmul kernel to avoid the ~33.5 MB data transfer penalty. Compressed elements (~128 KiB total) are written to a ring buffer and consumed by a CPU SHA-256 thread.
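The per-block compression and the resulting data reduction can be illustrated directly from the parameters (the function shape is a sketch; the σ-derived coefficient vector would come from the matmul_compress_v1 expansion):

```python
M31 = (1 << 31) - 1

def compress_block(block, sigma_vec):
    """Compress one flattened b×b partial-sum block to a single field
    element via a random inner product with σ-derived coefficients.
    Pairwise independence of the family bounds forgery at 1/q per block.
    """
    acc = 0
    for entry, coeff in zip(block, sigma_vec):
        acc = (acc + entry * coeff) % M31
    return acc

# Transcript size arithmetic at production parameters (n=512, b=16):
intermediates = (512 // 16) ** 3            # 32,768 b×b partial-sum blocks
naive_bytes = intermediates * 16 * 16 * 4   # full-block hashing: 33,554,432 B (~33.5 MB)
compressed_bytes = intermediates * 4        # one field element each: 131,072 B (~131 KB)
```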
Verification (Freivalds' Algorithm)
Full verification requires recomputing the entire O(n³) transcript — the dominant engineering constraint. BTX uses a two-phase validation model:
Phase 1: Header Checks (O(1))
- Validate header format, timestamp, and nBits target.
- Verify matmul_digest matches the claimed SHA-256d of the compressed transcript.
- Check proof-of-work: H(header) < target.
Phase 2: Transcript Recomputation (O(n³))
- Reconstruct A, B from seeds. Regenerate noise. Recompute the canonical block MatMul.
- Verify the compressed transcript matches matmul_digest.
- Apply Freivalds' algorithm as a fast product check: verify (A+E)·((B+F)·r) = C'·r for k=2 random vectors r. Each round has false-positive probability 1/(2³¹−1); with k=2, error < 2⁻⁶².
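The Freivalds check can be sketched in plain Python. The consensus check is over (A+E), (B+F), and C'; the sketch below shows the generic A·B == C form with dense list-of-list matrices:

```python
import random

M31 = (1 << 31) - 1  # field modulus q

def freivalds_check(A, B, C, k=2, rng=None):
    """Probabilistically verify A·B == C over F_q without the O(n^3) product.

    Each round draws a random vector r and compares A·(B·r) against C·r,
    three matrix-vector products costing O(n^2). A wrong C survives one
    round with probability 1/q, so k=2 rounds give error below 2^-62.
    """
    rng = rng or random.Random()
    n = len(A)
    for _ in range(k):
        r = [rng.randrange(M31) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) % M31 for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) % M31 for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) % M31 for i in range(n)]
        if ABr != Cr:
            return False
    return True
```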
Graduated Peer Punishment
Phase 1 pass followed by Phase 2 fail triggers graduated response: disconnect →
discourage → ban (after 3rd offense in 24h). The fMatMulStrictPunishment
flag enables immediate bans for post-stabilization networks.
DoS Mitigation
Phase 2 verification at n=512 costs approximately 5 ms per block. Rate limits bound CPU exposure:
- Per-peer budget: 32 Phase 2 verifications per minute per peer.
- Global budget: 512 Phase 2 verifications per minute across all peers (~2.56 s CPU/min).
- Validation window: Blocks deeper than 1,000 from the tip skip Phase 2 (assumevalid model for IBD).
- Fast-phase scheduling: During the 250 ms fast-mine phase, Phase 2 is deferred to a bounded queue that must drain completely after the transition to 90 s blocks.
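The per-peer and global budgets above amount to a sliding-window rate limiter; a minimal sketch (class name and window mechanics are assumptions, not the node's actual accounting):

```python
import time
from collections import defaultdict, deque

class Phase2Budget:
    """Sliding-window limiter: 32 Phase 2 verifications per minute per
    peer and 512 per minute globally, matching the stated budgets."""

    def __init__(self, per_peer=32, global_cap=512, window=60.0):
        self.per_peer = per_peer
        self.global_cap = global_cap
        self.window = window
        self.peer_hits = defaultdict(deque)   # peer_id -> timestamps
        self.global_hits = deque()

    def _prune(self, dq, now):
        # Drop timestamps that have aged out of the window.
        while dq and now - dq[0] >= self.window:
            dq.popleft()

    def try_acquire(self, peer_id, now=None):
        """Return True and record the hit if both budgets allow it."""
        now = time.monotonic() if now is None else now
        dq = self.peer_hits[peer_id]
        self._prune(dq, now)
        self._prune(self.global_hits, now)
        if len(dq) >= self.per_peer or len(self.global_hits) >= self.global_cap:
            return False
        dq.append(now)
        self.global_hits.append(now)
        return True
```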
Block Header Fields
The MatMul block header extends the standard Bitcoin header with:
| Field | Size | Description |
|---|---|---|
| nNonce64 | 8 bytes | 64-bit mining nonce |
| matmul_digest | 32 bytes | SHA-256d of compressed transcript |
| matmul_dim | 2 bytes | Matrix dimension (512) |
| seed_a | 32 bytes | Deterministic seed for matrix A |
| seed_b | 32 bytes | Deterministic seed for matrix B |
Block size with seed-derived matrices is approximately 200 KiB, compared to ~2.2 MiB if full matrices were included. All serialization uses little-endian byte order; big-endian architectures are unsupported.
Overhead Analysis
At production parameters (n=512, b=16, r=8), total protocol overhead is ~16.5% above a bare n×n matrix multiply:
| Component | Cost | % of Baseline |
|---|---|---|
| Noisy MatMul (A+E)·(B+F) | 134M muls | baseline |
| Transcript compression dot-products | 8.4M muls | 6.3% |
| Denoising (rank-r factor products) | 8.4M muls | 6.3% |
| Noise generation + matrix additions | 4.7M ops | 3.5% |
| Rolling SHA-256d on compressed elements | 131 KB hashed | < 0.1% |
| Total overhead | ~22M muls | ~16.5% |
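The mul counts in the table follow directly from the parameters. The per-component accounting below is a plausible reconstruction that reproduces the table's figures (e.g. denoising as four n²r factor products: A·FL, (A·FL)·FR, ER·(B+F), EL·(ER·(B+F))), not the normative cost model:

```python
n, b, r = 512, 16, 8

baseline = n ** 3                  # noisy MatMul (A+E)·(B+F): 134,217,728 muls
compress = (n // b) ** 3 * b * b   # one b·b inner product per intermediate block
denoise = 4 * n * n * r            # four rank-r factor products, exploiting low rank
noise = 2 * n * n * r + 2 * n * n  # EL·ER, FL·FR products plus A+E, B+F additions
```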
On GPUs where the compression dot-product is fused into the matmul kernel, measured total overhead drops to ~10–12%.
Difficulty Adjustment Interaction
MatMul PoW difficulty is governed by the ASERT algorithm (see the Difficulty Adjustment specification). The sigma pre-hash gate interacts with the difficulty target: at higher difficulty, the pre-hash filter rejects more nonces before the expensive MatMul path, keeping average GPU utilization efficient. The pre-hash epsilon upgrade at height 55,000 (from 10 to 18 bits) further reduces wasted compute as the chain matures.
GPU Optimization
The field choice (M31 = 2³¹ − 1) was selected specifically for GPU friendliness. Key optimization points:
- M31 reduction is a single add-and-mask on int32 ALUs, native on Apple Metal and CUDA.
- The compression dot-product must be fused into the matmul kernel to avoid a ~33.5 MB device-to-host transfer per attempt.
- Compressed elements (~128 KiB) are written to a ring buffer consumed asynchronously by a CPU SHA-256 thread.
- The 512×512 working set (~2 MiB) fits comfortably in GPU L2 cache on modern hardware.
- Denoising runs only once per found block (not per attempt), so its O(n²r) cost is negligible in practice.
Security Properties
- Completeness: An honest miner performing the full MatMul has each attempt accepted with the probability set by the difficulty target.
- Hardness: Under the direct-product conjecture for random rank-r matrix products, no adversary gains meaningful advantage with less than a full matrix multiplication.
- Collision resistance: Per-intermediate collision probability is 2⁻³¹; the per-block union bound gives ≤ 2⁻¹⁶. A dual-projection upgrade path exists for 2⁻⁶² per intermediate if future risk assessment warrants.
- ASIC resistance: ASICs are not cryptographically impossible but are economically misaligned with commodity AI/GPU hardware. The MatMul workload is identical to standard GEMM operations that AI accelerators are optimized for.
Monetary Policy
| Parameter | Value |
|---|---|
| Total supply | 21,000,000 BTX |
| Initial block subsidy | 20 BTX |
| Halving interval | 525,000 blocks |
| Premine | None |
| Fast-mine emission (blocks 0–49,999) | 1,000,000 BTX (4.76% of supply) |
The issuance follows an infinite geometric series whose sum is the hard cap: 20 BTX × 525,000 blocks × (1 + 1/2 + 1/4 + …) = 20 × 525,000 × 2 = 21M BTX.
MoneyRange() and GetBlockSubsidy() enforce the hard cap at consensus level.
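A subsidy schedule matching these parameters can be sketched as follows, modeled on Bitcoin's GetBlockSubsidy. The base-unit constant COIN and the right-shift halving are assumptions for illustration:

```python
COIN = 100_000_000            # assumed satoshi-style base units per BTX
HALVING_INTERVAL = 525_000
INITIAL_SUBSIDY = 20 * COIN

def get_block_subsidy(height: int) -> int:
    """Subsidy starts at 20 BTX and halves every 525,000 blocks.

    The right shift eventually reaches zero, so total emission is the
    (truncated) geometric series bounded by the 21M BTX hard cap.
    """
    halvings = height // HALVING_INTERVAL
    if halvings >= 64:        # shifting past 63 bits is undefined in C++; clamp
        return 0
    return INITIAL_SUBSIDY >> halvings
```

Summing one full interval per halving epoch shows the emission never exceeds 21M BTX, with integer truncation leaving it slightly under the cap, as in Bitcoin.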