k-tuplet-search

HOWTO — GPU k-Tuplet Search (kt_filter_v8)

kt_filter_v8 is the production GPU engine for searching large prime k-tuplets. It uses a staged CUDA filter cascade (wheel → L2/ext-L2 → line-sieve → Fermat-2) followed by host-side BPSW proving. The practical record-search examples focus on k ∈ {19, 20, 21}; k ≥ 22 is exploratory frontier search.

kt_filter_v5 is the oracle engine used for cross-checking correctness. Both are built from src/cuda/.


Prerequisites

Requirement     Notes
NVIDIA GPU      Tested on RTX 5090 (sm_120). Override arch for other cards (see below).
CUDA toolkit    CUDA 13.x for sm_120; CUDA 11.8+ for sm_89; set NVCC_ARCH for the target GPU
libgmp-dev      sudo apt install libgmp-dev
gcc / g++       Standard C11/C++17 build tools

Build

git clone https://github.com/nmicic/k-tuplet-search.git
cd k-tuplet-search/src/cuda

make kt_filter_v8          # production engine
make kt_filter_v5          # oracle / correctness reference
make clean

Override GPU architecture (default is sm_120 for RTX 5090 / Blackwell):

make kt_filter_v8 NVCC_ARCH=sm_89   # RTX 4090 (Ada Lovelace)
make kt_filter_v8 NVCC_ARCH=sm_86   # RTX 3090 (Ampere)
make kt_filter_v8 NVCC_ARCH=sm_80   # A100 (Ampere)

If nvcc reports Value 'sm_120' is not defined, the toolkit is too old for Blackwell. Use CUDA 13.x for RTX 5090, or override NVCC_ARCH for the GPU you actually have. For RTX 4090 / sm_89, use CUDA 11.8 or newer.
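If you are unsure which NVCC_ARCH value matches your card, the mapping from compute capability to arch name is mechanical. The helper below is a minimal sketch and not part of the repository; the name cap_to_arch is hypothetical.

```shell
# Hypothetical helper (not shipped with the repo): turn a compute
# capability such as "8.9" into the matching NVCC_ARCH value ("sm_89").
cap_to_arch() {
  # Strip the dot (and any stray spaces) and prefix with "sm_".
  printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '. ')"
}

cap_to_arch 12.0   # sm_120
cap_to_arch 8.9    # sm_89
```

On drivers whose nvidia-smi supports the compute_cap query field, the input can likely be read from the device with nvidia-smi --query-gpu=compute_cap --format=csv,noheader and passed on as make kt_filter_v8 NVCC_ARCH=$(cap_to_arch ...). Treat that as an assumption to check against your driver version.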

Both binaries land in src/cuda/.


Verify the build

# Print version + build SHA
./kt_filter_v8 -V

# Run embedded test suite
./kt_filter_v8 --test

# List all supported admissible patterns
./kt_filter_v8 --list-patterns

--list-patterns prints the pattern names accepted by --pattern, plus each pattern’s k, diameter, and offsets. For the long-form option reference, see docs/CUDA_CLI_REFERENCE.md.

Release validation snapshot, 2026-05-18: the production sm_120 build was checked on an RTX 5090 host with CUDA 13.2. ./kt_filter_v8 --test completed with All 52 tests passed, including known-record replay, wheel parity, 47# wheel oracle checks, checkpoint/resume, and pattern-list coverage.


Primorial reference

The --primorial N flag selects the sieve wheel by the 0-indexed position of its largest prime (2 is index 0, so index 11 is 37):

Flag                          Wheel   Notes
(default / --primorial 11)    37#     Default; works for K ≤ 19 at low bits
--primorial 13                43#     Recommended for K=19 and K=21_P1
--primorial 14                47#     Recommended for K=20 and K=21_P0/K=22+

A higher primorial means fewer candidates reach the Fermat stage, which speeds up per-candidate work once the longer one-time wheel init (~2–3 min for 47#) has finished.
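To double-check which wheel a given --primorial index refers to, the 0-indexed mapping can be reproduced with a few lines of shell. This is an illustrative sketch only; wheel_label is a hypothetical name, and the engine's own lookup is presumably a table.

```shell
# Hypothetical helper (not part of kt_filter_v8): map a 0-indexed prime
# index N, as passed to --primorial, to the wheel label "P#".
wheel_label() {
  n=$1          # 0-indexed position of the wheel's largest prime
  count=0       # number of primes seen so far
  cand=1        # current candidate integer
  while :; do
    cand=$((cand + 1))
    is_prime=1
    d=2
    # Trial division up to sqrt(cand); fine for tiny primes like these.
    while [ $((d * d)) -le "$cand" ]; do
      if [ $((cand % d)) -eq 0 ]; then is_prime=0; break; fi
      d=$((d + 1))
    done
    if [ "$is_prime" -eq 1 ]; then
      if [ "$count" -eq "$n" ]; then
        printf '%s#\n' "$cand"
        return 0
      fi
      count=$((count + 1))
    fi
  done
}

wheel_label 11   # 37#
wheel_label 13   # 43#
wheel_label 14   # 47#
```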

--wheel-expr supports non-plain-primorial wheels:

./kt_filter_v8 --pattern KT19_P0 --bits 100 --wheel-expr '47#/31/17' --random
./kt_filter_v8 --pattern KT20_P1 --bits 102 --wheel-expr '47#/31' --random
./kt_filter_v8 --pattern KT19_P0 --bits 100 --wheel-expr '43#' --random

X#/Y/Z means “all primes through X, except Y and Z”. Dropped primes must be smaller than X; use 43#, not 47#/47, for the lower plain wheel. The engine normalizes drop order in logs and checkpoints, so 47#/31/17 is recorded as 47#/17/31, and resume works with either order. This is a wheel-shape campaign feature, not the full CC-style fingerprint-pool engine.
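The drop-order normalization described above can be sketched in shell. This is an assumption-laden illustration of the rule as stated here (sort the dropped primes ascending, reject drops that are not smaller than X), not the engine's actual parser; normalize_wheel_expr is a hypothetical name.

```shell
# Hypothetical sketch of wheel-expression normalization for X#/Y/Z.
# The real parser inside kt_filter_v8 may behave differently.
normalize_wheel_expr() {
  wexpr=$1
  top=$(printf '%s' "$wexpr" | cut -d'#' -f1)    # X: largest wheel prime
  rest=$(printf '%s' "$wexpr" | cut -d'#' -f2)   # "/Y/Z", empty for plain wheels
  # One dropped prime per line, sorted numerically.
  drops=$(printf '%s' "$rest" | tr '/' '\n' | grep -v '^$' | sort -n)
  for d in $drops; do
    # Dropped primes must be smaller than X (use 43#, not 47#/47).
    if [ "$d" -ge "$top" ]; then
      echo "invalid drop $d in $wexpr" >&2
      return 1
    fi
  done
  out="${top}#"
  for d in $drops; do out="$out/$d"; done
  printf '%s\n' "$out"
}

normalize_wheel_expr '47#/31/17'   # 47#/17/31
normalize_wheel_expr '43#'         # 43#
```

A helper like this can be handy for naming log or checkpoint files consistently across a campaign, since both spellings of the same wheel map to one string.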

Operational notes:

Example release-candidate measurements on RTX 5090, CUDA 13.2, sm_120:

Pattern / wheel     Stage-0 admissible residues   Predicted wheel alloc
KT19_P0 @ 43#/31    10,036,224                    79.8 MiB
KT19_P0 @ 47#/31    281,014,272                   2220.5 MiB
KT19_P0 @ 47#/29    357,654,528                   2826.1 MiB
KT20_P1 @ 47#/31    123,941,664                   980.6 MiB
KT20_P1 @ 47#/29    146,476,512                   1158.9 MiB
KT20_P1 @ 47#/23    268,540,272                   2124.7 MiB

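The /23 variant is valid, but can be heavier than /29 or /31 for these patterns: KT20_P1 @ 47#/23 has 268,540,272 admissible residues, versus 146,476,512 for 47#/29 and 123,941,664 for 47#/31. Choose wheel-expression cells from measured admissible counts, not from the dropped prime value alone.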


Validate known records (correctness gate)

Replays records from the corpus by narrowing search to the known prefix. Use this after any engine change to confirm correctness is preserved.

# K=16 — 5 records, ~10s total (37# default)
./kt_filter_v8 --validate-known 16

# K=17 — 5 records, ~30s total (37# default)
./kt_filter_v8 --validate-known 17

# K=18 — 5 records (37# default)
./kt_filter_v8 --validate-known 18

# K=19 — 5 records, ~2 min total (must specify 43#)
./kt_filter_v8 --validate-known 19 --primorial 13

# K=20 — 5 records (~2 min total; 47# wheel init takes 2–3 min before first check)
./kt_filter_v8 --validate-known 20 --primorial 14 --validate-per-record-budget 90

Exit code 0 means gate=PASS. A non-zero exit code means one or more records were not reproduced within the configured per-record budget.

The K=16–19 gates are the fastest correctness check. K=20/21 records are larger and can need a higher --validate-per-record-budget, especially with 47# wheel initialization.
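Because the gates report through the exit code, the fast K=16 to K=19 checks chain naturally into a single pass/fail step. The wrapper below is a sketch; run_gates and the KT_BIN variable are hypothetical names, and only the flags shown earlier in this section are used.

```shell
# Hypothetical wrapper: run the fast correctness gates in order and stop
# at the first failure. KT_BIN is an assumed override for the binary path.
run_gates() {
  bin=${KT_BIN:-./kt_filter_v8}
  # K=16..18 use the default 37# wheel.
  for k in 16 17 18; do
    "$bin" --validate-known "$k" || { echo "gate K=$k FAILED"; return 1; }
  done
  # K=19 needs the 43# wheel.
  "$bin" --validate-known 19 --primorial 13 \
    || { echo "gate K=19 FAILED"; return 1; }
  echo "gate=PASS for K=16..19"
}
```

Calling run_gates from src/cuda/ after a rebuild returns 0 only if every gate passed, which makes it usable as a pre-commit or CI step.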


Search — K=19 (KT19_P0, 43#)

K=19 is an active frontier. Known records sit at ~89–102 bits. KT19_P0 is the best-characterized pattern; KT19_P1/KT19_P2/KT19_P3 also have records.

# Sequential search — 30-minute cell at 100 bits
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --max-time 1800

# Random search (recommended for distributed/long runs — each instance picks
# a random starting point from /dev/urandom, no coordination needed)
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random --max-time 1800

# Save found records to a file
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random \
    --max-time 1800 --output found_kt19.txt

# Quiet mode (suppress progress lines)
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random \
    --max-time 1800 --full-quiet

Progress reporter output format:

[reporter] t=10.03s batches=8356 cand=6.20e+16 surv=0 hits=0 cand/s=6.01e+15
[reporter] kills: stage0=216082944 L2=204535799 ext=9887215 linesieve=1659020 fermat2=910

A healthy kill funnel satisfies: stage0 > L2 > ext > linesieve > fermat2. Any inversion signals a wheel-alignment regression.
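The funnel ordering can be checked mechanically instead of by eye. The awk filter below is a minimal sketch that assumes the kills: line format shown above; check_funnel is a hypothetical name.

```shell
# Hypothetical checker: reads "[reporter] kills:" lines on stdin and
# verifies that the key=value counts strictly decrease left to right
# (stage0 > L2 > ext > linesieve > fermat2).
check_funnel() {
  awk '
    /kills:/ {
      prev = -1
      for (i = 1; i <= NF; i++) {
        n = split($i, kv, "=")
        if (n != 2) continue            # skip "[reporter]" and "kills:"
        if (prev >= 0 && kv[2] + 0 >= prev) {
          printf "inversion at %s on input line %d\n", kv[1], NR
          bad = 1
        }
        prev = kv[2] + 0
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

Piping a run into it, for example ./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --max-time 60 2>&1 | check_funnel, yields a non-zero exit status if any reporter line shows an inversion.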


Search — K=20 (KT20_P1, 47#)

K=20 is also an active frontier. Known records are at ~93–100 bits. The 47# wheel-CRT join takes ~2–3 minutes before the first batch; this is normal.

# Random search — 30-minute cell at 102 bits
./kt_filter_v8 --pattern KT20_P1 --primorial 14 --bits 102 --random --max-time 1800

# KT20_P0 is equally valid
./kt_filter_v8 --pattern KT20_P0 --primorial 14 --bits 102 --random --max-time 1800

Search — K=21 (super-tuple of K=19 / K=20)

KT21_P1 ⊃ KT19_P0 (extends by offsets +82, +84). KT21_P0 ⊃ KT20_P1 (extends by offset +84). Any K=21 find is simultaneously a K=19 or K=20 find.

# KT21_P1 — use 43# (inherits from KT19_P0)
./kt_filter_v8 --pattern KT21_P1 --primorial 13 --bits 102 --random --max-time 1800

# KT21_P0 — use 47# (inherits from KT20_P1)
./kt_filter_v8 --pattern KT21_P0 --primorial 14 --bits 104 --random --max-time 1800

Search — K=22+ (frontier / exploratory)

Luhn’s record pages currently maintain large known examples through k=21; k≥22 searches here are exploratory. Use --smoke to verify the engine accepts a pattern before a long run.

# Sanity smoke (a few seconds, GPU at 100% = engine accepts the pattern)
./kt_filter_v8 --pattern KT22_P0 --primorial 14 --bits 104 \
    --max-batches 200 --smoke

# Production run
./kt_filter_v8 --pattern KT22_P0 --primorial 14 --bits 104 --random --max-time 1800
./kt_filter_v8 --pattern KT22_P1 --primorial 14 --bits 104 --random --max-time 1800
./kt_filter_v8 --pattern KT22_P2 --primorial 14 --bits 104 --random --max-time 1800
./kt_filter_v8 --pattern KT22_P3 --primorial 14 --bits 104 --random --max-time 1800

The engine catalog extends through KT28_P9; inspect the full list with ./kt_filter_v8 --list-patterns.


Multi-GPU

Each GPU runs an independent instance. Use --gpu-device N to pin each instance to a specific GPU. In --random mode each instance draws its starting point from /dev/urandom independently, so no explicit sharding is needed.

# 4-GPU launch (backgrounded with &; alternatively run each command in its own terminal or tmux pane)
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random \
    --max-time 1800 --gpu-device 0 &
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 101 --random \
    --max-time 1800 --gpu-device 1 &
./kt_filter_v8 --pattern KT20_P1 --primorial 14 --bits 102 --random \
    --max-time 1800 --gpu-device 2 &
./kt_filter_v8 --pattern KT20_P1 --primorial 14 --bits 103 --random \
    --max-time 1800 --gpu-device 3 &

Wheel-expression campaign example:

./kt_filter_v8 --pattern KT19_P0 --bits 99  --wheel-expr '47#/31' --random \
    --max-time 43200 --gpu-device 0 --gpu-batch-size 2097152 --gpu-streams 3 &
./kt_filter_v8 --pattern KT19_P0 --bits 99  --wheel-expr '47#/29' --random \
    --max-time 43200 --gpu-device 1 --gpu-batch-size 2097152 --gpu-streams 3 &
./kt_filter_v8 --pattern KT20_P1 --bits 100 --wheel-expr '47#/29' --random \
    --max-time 43200 --gpu-device 2 --gpu-batch-size 2097152 --gpu-streams 3 &
./kt_filter_v8 --pattern KT20_P1 --bits 100 --wheel-expr '47#/23' --random \
    --max-time 43200 --gpu-device 3 --gpu-batch-size 2097152 --gpu-streams 3 &

For exhaustive sequential coverage across GPUs, use prefix-lane sharding:

# Divide [2^100, 2^101) into 4 lanes
for id in 0 1 2 3; do
  ./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 \
      --prefix-lanes 4 --prefix-lane-id $id --gpu-device $id \
      --exhaustive --max-time 86400 &
done

Checkpoint / resume

# Run with checkpoint (saved every 60s by default)
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random \
    --max-time 3600 --checkpoint /tmp/kt19_ckpt.json

# Resume from checkpoint
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random \
    --max-time 3600 --checkpoint /tmp/kt19_ckpt.json --resume

# Adjust checkpoint interval (default 60s)
./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 \
    --checkpoint /tmp/ckpt.json --ckpt-interval 300

Checkpoints include the normalized wheel expression. A checkpoint from --wheel-expr '47#/31' is not compatible with a plain --primorial 14 run, and pre---wheel-expr checkpoints are intentionally not resumed under --wheel-expr.
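For unattended restarts it can help to add --resume only when a checkpoint file already exists. The helper below is a sketch under the assumption that --resume expects an existing, non-empty checkpoint file, which this HOWTO does not state explicitly; ckpt_args is a hypothetical name.

```shell
# Hypothetical helper: print the checkpoint flags for a run, adding
# --resume only if the checkpoint file exists and is non-empty.
ckpt_args() {
  ckpt=$1
  if [ -s "$ckpt" ]; then
    printf '%s\n' "--checkpoint $ckpt --resume"
  else
    printf '%s\n' "--checkpoint $ckpt"
  fi
}
```

Used with unquoted command substitution so the flags split into separate words, for example ./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 --random --max-time 3600 $(ckpt_args /tmp/kt19_ckpt.json). This simple form does not handle checkpoint paths that contain spaces.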


Output — found records

When a k-tuplet survives all stages (GPU filter + host BPSW prove), the binary:

  1. Prints *** FOUND k=N ... *** to stdout
  2. Appends a JSON line to novel_records_gpu{N}.jsonl in the working directory (in the file name, N is the GPU device index, not k)
  3. If --output FILE is given, also appends to that file

# Inspect any finds (multi-GPU run, from each process's working directory)
cat novel_records_gpu0.jsonl
cat novel_records_gpu1.jsonl

Correctness oracle — v5 vs v8

kt_filter_v5 is the reference implementation. Running the same search on both engines and comparing kill-funnel ratios is a quick sanity check after large v8 changes. Treat throughput as configuration-dependent; compare correctness signals first (surv, kill ratios, and found-record replay).

# Build v5
make kt_filter_v5

# Run both on the same small bit range for a few seconds
./kt_filter_v5 --pattern KT19_P0 --primorial 13 --bits 98 \
    --max-time 30 2>&1 | grep 'kills:\|surv='

./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 98 \
    --max-time 30 2>&1 | grep 'kills:\|surv='

Compare the L2/stage0 and ext/L2 kill ratios between the two runs. Any divergence > 1% in these ratios is a correctness signal worth investigating.
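The ratio comparison can be scripted. The sketch below assumes the kills: line format shown earlier and interprets the 1% threshold as a relative difference, which is an assumption; funnel_ratios and ratios_agree are hypothetical names.

```shell
# Hypothetical helper: print "L2/stage0 ext/L2" for the first kills line
# found on stdin.
funnel_ratios() {
  awk '/kills:/ {
    for (i = 1; i <= NF; i++) if (split($i, kv, "=") == 2) v[kv[1]] = kv[2]
    printf "%.6f %.6f\n", v["L2"] / v["stage0"], v["ext"] / v["L2"]
    exit
  }'
}

# Hypothetical helper: succeed only if both ratios of run A ($1) are
# within 1% (relative) of the corresponding ratios of run B ($2).
ratios_agree() {
  printf '%s %s\n' "$1" "$2" | awk '{
    for (i = 1; i <= 2; i++) {
      d = ($i - $(i + 2)) / $(i + 2)
      if (d < 0) d = -d
      if (d > 0.01) bad = 1
    }
    exit (bad ? 1 : 0)
  }'
}
```

With the two grep outputs saved to files (v5.log and v8.log are placeholder names), ratios_agree "$(funnel_ratios < v5.log)" "$(funnel_ratios < v8.log)" returns non-zero when either ratio diverges by more than 1%.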

v5 also supports random oracle runs for sampling-path comparisons:

./kt_filter_v5 --pattern KT19_P0 --bits 98 --random \
    --random-seed 0x1234 --chunk-tiles 1 --max-time 30

Unlike v8, v5 samples raw consecutive candidate batches with stride=1; the v8 primorial-alignment bug class does not apply to v5 random cursors.


Performance reference (RTX 5090, sm_120)

Throughput depends strongly on pattern, wheel, bit range, batch size, stream count, GPU clocks, and whether you are reading raw coverage or useful wheel-admissible candidate rate. The public tree does not ship historical benchmark histories; run a local 30-second anchor on the target machine:

./kt_filter_v8 --pattern KT19_P0 --primorial 13 --bits 100 \
    --max-time 30 --full-quiet

Note: cand/s counts all wheel-admissible candidates before the filter cascade, so a higher number is not necessarily better; use kill-funnel ratios as the correctness signal.

Benchmark helpers write generated JSONL under ignored bench/ paths. See TESTING.md for the validation and benchmark workflow.


Known issues

No release-blocking correctness issues are currently documented here.


CPU engine

The CPU engine (kt_gmp_v1) lives in src/cpu/ with its own HOWTO_RUN.md. It is slower but requires no GPU and serves as an independent correctness reference.