This is the long-form reference for src/cuda/kt_filter_v8, the production GPU
engine. The built-in --help output is intentionally compact; this file
explains the operational meaning of each public option.
Build first:

    cd src/cuda
    make kt_filter_v8

Then run commands from src/cuda/ unless the example says otherwise.
| Option | Meaning |
|---|---|
| --help, -h | Print compact CLI help and exit. |
| --version, -V | Print binary name, build SHA, and compile date. |
| --list-patterns | Print the compiled admissible-pattern catalog and exit. |
| --test | Run the embedded CUDA self-test suite. |
| --smoke | Run a short one-batch sanity check and exit when counters prove the kernel ran. |
Use --list-patterns to discover legal --pattern names:

    ./kt_filter_v8 --list-patterns

Output columns are:
| Column | Meaning |
|---|---|
| pattern name | Name passed to --pattern, for example KT19_P0. |
| k | Tuple length. |
| diameter | Last offset in the pattern. |
| offsets | Additive shifts {b_i}; a hit proves all n + b_i prime. |
The list is sorted by pattern name and currently contains 97 CUDA-engine
patterns across k=3..28. The catalog is generated from
tools/patterns/catalog/*.json; use Norman Luhn’s
https://pzktupel.de/ktpatt_hl.php page as the mathematical reference before
making external claims.
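To make the offsets column concrete, the sketch below checks the hit condition on a toy pattern. It is an illustration only: `TOY_OFFSETS`, `is_prime`, and `is_hit` are hypothetical names, the prime triplet (0, 2, 6) is not claimed to be a catalog entry, and trial division stands in for the engine's real proving path.

```python
from typing import Sequence


def is_prime(n: int) -> bool:
    """Trial division; adequate for tiny illustrative values only."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def is_hit(n: int, offsets: Sequence[int]) -> bool:
    """True when every n + b_i is prime for the given offsets."""
    return all(is_prime(n + b) for b in offsets)


# Hypothetical toy pattern (a prime triplet), not taken from the catalog.
TOY_OFFSETS = (0, 2, 6)
k = len(TOY_OFFSETS)        # the "k" column: tuple length
diameter = TOY_OFFSETS[-1]  # the "diameter" column: last offset

hits = [n for n in range(2, 120) if is_hit(n, TOY_OFFSETS)]
print(k, diameter, hits)
```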
| Option | Meaning |
|---|---|
| --pattern NAME | Select an exact compiled pattern, e.g. KT19_P0. Prefer this for real searches. |
| --k N, --target N | Select tuple length. If no pattern is given, the engine chooses the first matching pattern for that k; this is mainly a compatibility path. |
| --bits N | Search bases in the bit band [2^(N-1), 2^N). Required for normal search. |
| --primorial N | Select the wheel by upper-prime index in the built-in prime table. Default is 11, meaning 37#. |
| --wheel-expr X#[/Y...] | Select a structural wheel expression. Examples: 47#, 47#/31, 47#/17/31. Overrides --primorial. |
| --threads N | Parsed for CPU-CLI compatibility. The current GPU survivor-prove path is effectively serial; do not use this as a performance knob. |
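The bit band selected by --bits is the set of integers with exactly N binary digits. A small sketch of that arithmetic follows; `bit_band` and `in_band` are hypothetical helper names, not part of the engine.

```python
from typing import Tuple


def bit_band(bits: int) -> Tuple[int, int]:
    """Return (lo, hi) for the half-open band [2^(bits-1), 2^bits)."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    return 1 << (bits - 1), 1 << bits


def in_band(n: int, bits: int) -> bool:
    """True when n lies in the band, i.e. n has exactly `bits` binary digits."""
    lo, hi = bit_band(bits)
    return lo <= n < hi


lo, hi = bit_band(99)
print(lo.bit_length(), (hi - 1).bit_length())
```

Both ends of the band have the same bit length, which is why the upper bound is exclusive.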
Useful primorial values:
| Flag | Wheel | Typical use |
|---|---|---|
| --primorial 11 | 37# | Default; fast wheel build, useful for replay and lower-memory checks. |
| --primorial 12 | 41# | Intermediate wheel for experiments. |
| --primorial 13 | 43# | Recommended for KT19_P0 and related k=19/21 examples. |
| --primorial 14 | 47# | Recommended for k=20, k=21_P0, and exploratory k>=22 examples. |
Values below the implementation floor or above 14 are clamped with a warning.
Larger wheels reduce candidate density but take more startup time and memory.
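The index-to-wheel mapping in the table is consistent with a zero-based index into the primes up to 47. The sketch below assumes exactly that; the zero-based convention is inferred from the table rather than confirmed from the source, and `WHEEL_PRIMES` and `wheel_for_index` are hypothetical names. Clamping is not modelled because the implementation floor is not stated here.

```python
import math
from typing import Tuple

# Primes up to 47. Assumption: --primorial N is a zero-based index into this list.
WHEEL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]


def wheel_for_index(index: int) -> Tuple[str, int]:
    """Return the wheel label and its modulus (the primorial) for an index."""
    if not 0 <= index < len(WHEEL_PRIMES):
        raise ValueError("index outside the prime table")
    top = WHEEL_PRIMES[index]
    modulus = math.prod(WHEEL_PRIMES[: index + 1])
    return f"{top}#", modulus


for idx in (11, 12, 13, 14):
    print(idx, *wheel_for_index(idx))
```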
Wheel expressions use the built-in prime catalog up to 47. X# means all
primes up to X; /Y drops a smaller prime from that wheel. For example,
47#/17/31 builds the Stage-0 wheel from all primes up to 47, except
17 and 31. Use 43#, not 47#/47, for the lower plain wheel.
Dropped primes are not lost as correctness checks: they are simply no longer
part of Stage-0. Dropped primes that also appear in the L2 sieve are still
caught there; with the current 47# ceiling that means 41 or 43. Dropped
primes <=37 bypass the GPU sieve stages and are rejected only by Fermat-2
and/or host GMP/BPSW proving, which is correct but more expensive. This is
useful for non-plain-primorial campaign cells and future wheel-pool runners,
but small-prime drops can cause a throughput cliff.
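The throughput cliff follows from standard sieve arithmetic: a wheel prime p that covers v distinct offset residues admits only p - v of every p residues, so dropping it multiplies the candidates leaving Stage-0 by p / (p - v). The sketch below combines that factor with the stage classification described above. The helper names and the toy offsets are hypothetical, and the factor is an estimate from the arithmetic, not a measured engine figure.

```python
from fractions import Fraction
from typing import Sequence

WHEEL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)


def catching_stage(dropped_prime: int) -> str:
    """First place a dropped prime is still rejected, per the text above."""
    if dropped_prime not in WHEEL_PRIMES or dropped_prime == 47:
        raise ValueError("not a droppable wheel prime under the 47# ceiling")
    if dropped_prime in (41, 43):
        return "L2 sieve"
    return "Fermat-2 / host prove"  # primes <= 37 bypass the GPU sieve stages


def density_factor(dropped_prime: int, offsets: Sequence[int]) -> Fraction:
    """Growth in Stage-0 output when dropped_prime leaves the wheel."""
    covered = len({b % dropped_prime for b in offsets})
    if covered == dropped_prime:
        raise ValueError("pattern is inadmissible modulo this prime")
    return Fraction(dropped_prime, dropped_prime - covered)


TOY_OFFSETS = (0, 2, 6)  # hypothetical toy pattern, not a catalog entry
for p in (5, 43):
    print(p, catching_stage(p), density_factor(p, TOY_OFFSETS))
```

For long patterns v is close to p for small primes, which is where the factor becomes large.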
Grammar and validation:
| Form | Meaning |
|---|---|
| X# | Plain primorial wheel through prime X; equivalent to the matching --primorial index. |
| X#/Y | Wheel through X, with prime Y omitted from Stage-0. |
| X#/Y/Z/... | Wheel through X, with multiple distinct primes omitted from Stage-0. |
X and every dropped prime must be present in the built-in wheel catalog
through 47. Drops must be distinct and strictly smaller than X; drops
outside X# and dropping X itself are rejected. Expressions that would leave
an empty wheel or overflow the implementation modulus are rejected before search
starts. The parser normalizes drop order in logs and checkpoints; for example
47#/31/17 is stored as 47#/17/31, so resume works with either input order.
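The grammar and validation rules above can be sketched as a small parser. This is an illustrative reimplementation, not the engine's parser: `parse_wheel_expr` and `normalize_wheel_expr` are hypothetical names, and the modulus-overflow check is not modelled.

```python
from typing import Tuple

WHEEL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)


def parse_wheel_expr(expr: str) -> Tuple[int, Tuple[int, ...]]:
    """Parse X#[/Y...] into (X, sorted drops), enforcing the documented rules."""
    head, *drop_parts = expr.strip().split("/")
    if not head.endswith("#"):
        raise ValueError(f"expected X# at the start of {expr!r}")
    try:
        top = int(head[:-1])
        drops = [int(part) for part in drop_parts]
    except ValueError:
        raise ValueError(f"malformed wheel expression {expr!r}") from None
    if top not in WHEEL_PRIMES:
        raise ValueError(f"{top} is not in the wheel catalog")
    if len(set(drops)) != len(drops):
        raise ValueError("dropped primes must be distinct")
    for d in drops:
        if d not in WHEEL_PRIMES or d >= top:
            raise ValueError(f"cannot drop {d} from {top}#")
    return top, tuple(sorted(drops))


def normalize_wheel_expr(expr: str) -> str:
    """Canonical form with drops in ascending order."""
    top, drops = parse_wheel_expr(expr)
    return "/".join([f"{top}#"] + [str(d) for d in drops])


print(normalize_wheel_expr("47#/31/17"))
```

Sorting the drops is what makes 47#/31/17 and 47#/17/31 compare equal once normalized.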
Only one wheel expression is active in a process. To cover a pool such as
47#/31, 47#/29, and 47#/23, run separate processes, typically pinned to
separate GPUs with --gpu-device.
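One way to launch such a pool is to generate one command line per expression. The sketch below only assembles argument lists from flags documented in this file; assigning expression i to GPU i is an assumption, and `pool_commands` is a hypothetical helper.

```python
import shlex
from typing import List, Sequence


def pool_commands(
    pattern: str,
    bits: int,
    wheel_exprs: Sequence[str],
    binary: str = "./kt_filter_v8",
) -> List[List[str]]:
    """Build one argv per wheel expression, pinned to GPU 0, 1, 2, ..."""
    commands = []
    for gpu, expr in enumerate(wheel_exprs):
        commands.append([
            binary,
            "--pattern", pattern,
            "--bits", str(bits),
            "--wheel-expr", expr,
            "--gpu-device", str(gpu),
        ])
    return commands


cmds = pool_commands("KT19_P0", 99, ["47#/31", "47#/29", "47#/23"])
for argv in cmds:
    # shlex.join quotes the '#' so the line is safe to paste into a shell.
    print(shlex.join(argv))
```

Each argv can be passed to `subprocess.Popen` unchanged, one process per GPU.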
Memory and startup time depend on the pattern-specific admissible residue count. For a new expression, run a bounded smoke first:

    ./kt_filter_v8 --pattern KT19_P0 --bits 99 --wheel-expr '47#/31' \
      --smoke --max-batches 1 --gpu-device 0

The startup log prints the predicted peak allocation and final Stage-0 count:

    [wheel] KT19_P0 @ 47#/31 streaming: predicted peak alloc = 2220.539 MiB
    Stage 0: 47#/31 wheel admissibility, n_primes=14, n_admissible=281014272

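The admissible residue count grows quickly with the wheel, which is why memory and startup time follow it. The textbook count is the product over wheel primes of p minus the number of residues forbidden by the offsets. The sketch below applies it to a toy pattern; `admissible_count` is a hypothetical name, and the engine's n_admissible may include constraints this sketch does not model.

```python
from typing import Sequence


def admissible_count(offsets: Sequence[int], wheel_primes: Sequence[int]) -> int:
    """Count residues r mod prod(wheel_primes) with no r + b divisible by a wheel prime."""
    total = 1
    for p in wheel_primes:
        forbidden = {(-b) % p for b in offsets}  # residues where some r + b == 0 mod p
        total *= p - len(forbidden)
    return total


TOY_OFFSETS = (0, 2, 6)  # hypothetical toy pattern, not a catalog entry
full = admissible_count(TOY_OFFSETS, (2, 3, 5, 7))  # toy analogue of a plain X#
dropped = admissible_count(TOY_OFFSETS, (2, 3, 7))  # toy analogue of X#/5
print(full, dropped)
```

A result of zero means the pattern is not admissible for that wheel.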
| Option | Meaning |
|---|---|
| --sequential | Compatibility flag for the default sequential search mode. |
| --random | Alias for --prefix-mode random; randomizes chunk anchors for broad sampling. |
| --prefix 0bXXX | Restrict search to bases whose high bits match the binary prefix. |
| --prefix-mode sequential | Walk the selected range deterministically. |
| --prefix-mode random | Draw deterministic or random chunk anchors inside the selected range. |
| --random-seed HEX, --seed HEX | Set the random-mode seed explicitly. Use this for reproducible runs. |
| --chunk-tiles N | Number of tiles per random anchor interval/chunk. Default is 500 in random mode. |
| --prefix-lanes N | Split one prefix/range into N non-overlapping lanes for multi-GPU coverage. |
| --prefix-lane-id ID | Select this process’s lane, from 0 to N-1. |
| --exhaustive | Sequentially walk the selected range and print a PREFIX EXHAUSTED banner when the lane completes. |
Use --random for open-ended sampling. Use --exhaustive only when the range
is intentionally small enough to finish or when you are running a controlled
prefix-lane coverage job.
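The following sketch shows one way a binary prefix and a lane split can carve up a bit band. It rests on two assumptions that this file does not confirm: that the prefix covers the top bits of the N-bit base including the leading 1, and that lanes are contiguous equal slices. `prefix_range` and `lane_range` are hypothetical helpers, so treat the numbers as a model of the idea rather than the engine's actual partitioning.

```python
from typing import Tuple


def prefix_range(bits: int, prefix: str) -> Tuple[int, int]:
    """Half-open range of `bits`-bit bases whose top bits equal `prefix`."""
    if not prefix.startswith("0b"):
        raise ValueError("prefix must look like 0b101")
    digits = prefix[2:]
    if not digits or set(digits) - {"0", "1"}:
        raise ValueError("prefix must contain only binary digits")
    if len(digits) > bits or digits[0] != "1":
        # Assumption: the prefix includes the leading 1 of the bit band.
        raise ValueError("prefix does not fit the bit band")
    shift = bits - len(digits)
    value = int(digits, 2)
    return value << shift, (value + 1) << shift


def lane_range(lo: int, hi: int, lanes: int, lane_id: int) -> Tuple[int, int]:
    """Contiguous, non-overlapping slice of [lo, hi) for one lane."""
    if lanes < 1 or not 0 <= lane_id < lanes:
        raise ValueError("lane_id must be in 0..lanes-1")
    span = hi - lo
    return lo + span * lane_id // lanes, lo + span * (lane_id + 1) // lanes


lo, hi = prefix_range(8, "0b101")
slices = [lane_range(lo, hi, 3, i) for i in range(3)]
print((lo, hi), slices)
```

Computing both slice ends with the same integer formula guarantees that adjacent lanes share a boundary, so the lanes cover the range with no gaps or overlap even when the span is not divisible by the lane count.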
| Option | Meaning |
|---|---|
| --max-time SEC | Stop after this wall-clock budget. |
| --max-batches N | Stop after this many GPU batches. Useful for smoke tests. |
| --checkpoint FILE | Periodically write an atomic checkpoint with cursor, counters, identity, and seed. |
| --resume [FILE] | Resume from a checkpoint. If no file follows --resume, the engine uses --checkpoint FILE. |
| --ckpt-interval N | Checkpoint interval in seconds. Default is 60; minimum is clamped to 1. |
A checkpoint is resumed only when its identity matches the current command’s search identity: pattern, bits, prefix, lane count, lane id, normalized wheel expression, and a compatible mode and seed. On a mismatch the engine prints a clear message and starts fresh rather than silently resuming the wrong search.
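The two checkpoint behaviours described here, atomic writes and identity-gated resume, follow a common pattern that can be sketched as below. This is not the engine's checkpoint format: the JSON layout, the field names in `IDENTITY_FIELDS`, and both helper functions are hypothetical, and "compatible mode and seed" is simplified to plain equality.

```python
import json
import os
import tempfile
from typing import Optional

# Hypothetical field names standing in for the identity fields listed above.
IDENTITY_FIELDS = ("pattern", "bits", "prefix", "lanes", "lane_id",
                   "wheel_expr", "mode", "seed")


def write_checkpoint_atomic(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_if_compatible(path: str, identity: dict) -> Optional[dict]:
    """Return the saved state only when every identity field matches."""
    try:
        with open(path) as f:
            saved = json.load(f)
    except FileNotFoundError:
        return None
    saved_identity = saved.get("identity", {})
    mismatched = [k for k in IDENTITY_FIELDS if saved_identity.get(k) != identity.get(k)]
    if mismatched:
        print("checkpoint identity mismatch on", ", ".join(mismatched), "- starting fresh")
        return None
    return saved
```

Creating the temp file in the target directory keeps the final rename on one filesystem, which the rename needs in order to be atomic.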
| Option | Meaning |
|---|---|
| --output FILE | Append human-readable found-record lines to FILE. |
| --log-file FILE | Alias for --output. |
| --quiet | Reduce nonessential output. |
| --full-quiet | Suppress progress reporter output; useful for scripts. |
| --report N | Set progress report interval in seconds. |
| --report-interval-sec N | Same as --report; 0 disables periodic reports. |
| --bench-jsonl FILE | In --validate-known, emit one JSON row per replay record. |
Confirmed finds are always printed to stdout and persisted as
novel_records_gpu<N>.jsonl in the working directory, where N is the CUDA
device index. Novel-record JSONL includes wheel_expr and primorial_n so a
find can be attributed to the exact wheel campaign without cross-referencing
stdout. Files under bench/ are generated runtime artifacts and are ignored by
Git.
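Because each novel-record row carries wheel_expr and primorial_n, finds can be attributed to a campaign with a few lines of scripting. The sketch below relies only on those two documented fields and assumes one JSON object per line; `finds_per_wheel` is a hypothetical helper, and no other field names are assumed.

```python
import json
from collections import Counter


def finds_per_wheel(path: str) -> Counter:
    """Count records per (wheel_expr, primorial_n) pair in a JSONL file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[(record.get("wheel_expr"), record.get("primorial_n"))] += 1
    return counts
```

A typical call would be `finds_per_wheel("novel_records_gpu0.jsonl")` from the working directory of the run on device 0.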
| Option | Meaning |
|---|---|
| --validate-known [k] | Replay known records from known/records.json / tools/records_manifest.tsv. With no k, defaults to the fast replay scope. |
| --validate-known --k N | Equivalent way to select the replay tuple length. |
| --validate-known-require-coverage | Fail if records are skipped because the selected wheel cannot cover them under current limits. Use this when validating a production primorial. |
| --validate-per-record-budget SEC | Override the default 60-second per-record replay budget. Also available through KT_VALIDATE_PER_RECORD_BUDGET_SEC. |
| --kpi-target-base DEC | KPI harness mode: stop when the specified base is emitted. Mutually exclusive with --validate-known. |
For correctness checks, prefer --validate-known first. For production-path
validation after engine changes, use longevity_gpu/scripts/validate_records_external.sh;
that script exercises both sequential and random prefix modes against known
records.
| Option | Meaning |
|---|---|
| --gpu-device N | Select CUDA device. Default is 0. |
| --gpu-batch-size N | Requested threads per kernel launch. Default is 524288; RTX 5090 runs generally use larger values such as 2097152. |
| --gpu-streams N | Concurrent CUDA streams. Default is 3; maximum is 8. |
| --gpu-arch sm_NN | Informational label only. Actual architecture is chosen at compile time by NVCC_ARCH. |
The stage toggles below are mostly for debugging and A/B checks. Normal searches should leave all stages enabled.
| Option | Meaning |
|---|---|
| --no-stage-l2 | Disable the first post-wheel forbidden-residue stage. |
| --no-stage-ext-l2 | Disable the extended L2 forbidden-residue stage. |
| --no-stage-line | Disable the line-sieve stage. |
| --no-stage-fermat | Disable the Fermat-2 prefilter before host proving. |
The normal cascade is:

    wheel -> L2 -> ext-L2 -> line-sieve -> Fermat-2 -> host GMP/BPSW prove

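A toy model helps show why disabling a stage changes cost but not results: every filter only removes candidates that the final prove step would reject anyway. The sketch below uses two stand-in stages (a tiny residue sieve and a base-2 Fermat test) and trial division as the prover. All names, the toy offsets, and the sieve primes are hypothetical; none of this is the engine's kernel code.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

TOY_OFFSETS = (0, 2, 6)      # hypothetical toy pattern, not a catalog entry
TOY_SIEVE_PRIMES = (11, 13)  # stand-in for the forbidden-residue stages


def is_prime(n: int) -> bool:
    """Trial division; plays the role of the final host prove step."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def passes_sieve(n: int, offsets: Sequence[int]) -> bool:
    # Valid only for candidates larger than every sieve prime.
    return all((n + b) % p != 0 for b in offsets for p in TOY_SIEVE_PRIMES)


def passes_fermat2(n: int, offsets: Sequence[int]) -> bool:
    return all(pow(2, n + b - 1, n + b) == 1 for b in offsets)


def run_cascade(
    candidates: Iterable[int],
    offsets: Sequence[int],
    disabled: FrozenSet[str] = frozenset(),
) -> Tuple[List[int], int]:
    """Return (hits, number of candidates that reached the prove step)."""
    stages = (("sieve", passes_sieve), ("fermat", passes_fermat2))
    proved, hits = 0, []
    for n in candidates:
        if any(name not in disabled and not check(n, offsets) for name, check in stages):
            continue
        proved += 1
        if all(is_prime(n + b) for b in offsets):
            hits.append(n)
    return hits, proved


full = run_cascade(range(101, 200, 2), TOY_OFFSETS)
bare = run_cascade(range(101, 200, 2), TOY_OFFSETS, frozenset({"sieve", "fermat"}))
print(full, bare)
```

Both runs return the same hits, but the run with stages disabled sends every candidate to the prover.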
The flags below are default-off instrumentation or host-side validators. They are useful for diagnostics but should not be treated as performance improvements.
| Option | Meaning |
|---|---|
| --enable-f18-yield | Write per (line_prime, residue) yield counters to bench/yield_counters_<bin>_<pid>.jsonl. |
| --enable-f22-reservoir | Write a 1000-slot post-Fermat survivor reservoir to bench/reservoir_<bin>_<pid>.jsonl. |
| --enable-f23-texture | Write a 128-slot survivor texture reservoir to bench/texture_<bin>_<pid>.jsonl. |
| --enable-f24-cascade | Enumerate and score k-2 sub-tuplets of the active pattern at startup; host-side diagnostic. |
| --enable-f1-validator | Validate baked KT19_P0/37# masks against runtime-generated masks, then exit. This is a correctness check, not a hot-kernel optimization. |
| --enable-f2-rcu | Enable dual-bank constant-memory mirror infrastructure for RCU-style filter-table updates; currently infrastructure/diagnostic scope. |
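A fixed-slot reservoir is usually filled with reservoir sampling, which keeps a uniform sample of a stream of unknown length in constant memory. The sketch below shows the classic Algorithm R as a reference point. Whether the engine's reservoirs use this exact scheme is an assumption, and `reservoir_sample` is a hypothetical name.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], slots: int, rng: random.Random) -> List[T]:
    """Algorithm R: keep `slots` items, each stream item equally likely to remain."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < slots:
            reservoir.append(item)  # fill phase
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < slots:
                reservoir[j] = item  # replace with probability slots / (i + 1)
    return reservoir


sample = reservoir_sample(range(10_000), 1000, random.Random(1234))
print(len(sample))
```

Passing an explicit `random.Random` instance keeps the sample reproducible for a given seed.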
The following CPU flags are accepted so wrapper scripts can share arguments between CPU and GPU binaries. They are echoed or stored but do not change the GPU hot path:

    --no-line-sieve
    --no-bitvec
    --bitvec
    --opt-fermat
    --no-opt-fermat
    --opt-mont-fermat
    --no-opt-mont-fermat
    --opt-prefetch
    --no-opt-prefetch
    --opt-bitscan
    --no-opt-bitscan
    --opt-line-cap N
    --pin
    --pin-base N
    --sieve-only

Debug-only options:
| Option | Meaning |
|---|---|
| --verbose-rotation | Print high-volume random-anchor rotation diagnostics. |
| --inject-cursor-offset HEX | Misalignment-injection regression-test hook. Do not use for normal searches. |
Unknown options are rejected with a non-zero exit code.