k-tuplet-search

GPU Filter Engines

Two GPU engines live here, built from the same Makefile:

kt_filter_v8 (production)

kt_filter_v8.cu is the production search engine. It is decomposed into multiple modules:

| Module | File | Role |
| --- | --- | --- |
| Main kernel | kt_filter_v8.cu | 2-D wheel-offset launch, filter cascade |
| Lanes | kt_lanes.c | Per-GPU prefix-lane assignment |
| Checkpoint | kt_checkpoint.c | Active checkpoint / resume |
| Novel record | kt_novel_record.c | fsync’d hit persistence |
| Records | kt_records.c | Record corpus loader |
| CLI | kt_cli.c | Argument parser |
| Signal | kt_signal.cu | SIGINT / SIGTERM handler |
| Reporter | kt_reporter.cu | Periodic progress reporter |
| Tests | kt_tests.cu | Embedded unit test suite |

Key features:

Build: make kt_filter_v8

Command-line options are documented in ../../docs/CUDA_CLI_REFERENCE.md. Use --wheel-expr to have one process search a specific quotient wheel such as 47#/31; for a wheel pool, run multiple GPU-pinned processes. Before a long run, measure any unfamiliar expression with --smoke --max-batches 1, because Stage-0 table size and downstream Fermat-2 load depend on the pattern.
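The smoke-then-launch workflow above can be sketched as a small planner that only prints command lines, so it is safe to run without a GPU. Only --wheel-expr, --smoke and --max-batches come from this README; the second wheel expression, the binary path, and the use of CUDA_VISIBLE_DEVICES for pinning are assumptions made for illustration. Check CUDA_CLI_REFERENCE.md for the engine's own device-selection options.

```sh
#!/bin/sh
# Print (do not run) one GPU-pinned command line per wheel expression.
# Assumption: pinning via CUDA_VISIBLE_DEVICES; the binary may offer its own flag.
KT_BIN=./kt_filter_v8

# plan_pool MODE EXPR...   MODE is "smoke" or "run"
plan_pool() {
    mode="$1"; shift
    gpu=0
    for expr in "$@"; do
        extra=""
        if [ "$mode" = "smoke" ]; then
            # One-batch measurement before committing to a long run.
            extra=" --smoke --max-batches 1"
        fi
        echo "CUDA_VISIBLE_DEVICES=$gpu $KT_BIN --wheel-expr '$expr'$extra"
        gpu=$((gpu + 1))
    done
}

# Hypothetical two-wheel pool: smoke pass first, then the long-run commands.
plan_pool smoke '47#/31' '47#/37'
plan_pool run   '47#/31' '47#/37'
```

Review the printed lines, then start each one as its own process once the smoke numbers look acceptable.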

kt_filter_v5 (oracle / reference)

kt_filter_v5.cu is the initial GPU implementation. It uses a streams-based, double-buffered launch (similar to the CC v15.cu ancestor) and builds as a single module. Its aggregate throughput is similar to CC v15, but it lacks most v8 features, including exhaustive mode, checkpoint / resume, prefix lanes, and extended or quotient wheels.

v5 is kept as an oracle for cross-checking v8 filter correctness. It is not used for production campaigns.

Build: make kt_filter_v5

Difference summary

| Feature | v5 | v8 |
| --- | --- | --- |
| Launch style | streams + double-buffer | 2-D wheel-offset |
| Throughput (RTX 5090) | benchmark locally | benchmark locally |
| Random sampling | raw-batch chunks | wheel-offset chunks |
| Exhaustive mode | no | yes |
| Checkpoint / resume | no | yes |
| Prefix lanes | no | yes |
| Extended / quotient wheels | no | 47#, 47#/Y[/Z...] |
| Novel record persistence | shared JSONL | per-GPU JSONL |
| Embedded tests | legacy suite | current suite |
| Module decomposition | single file | 8 modules |
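To make the wheel notation in the table concrete, here is a minimal arithmetic sketch. It assumes the usual reading: P# is the primorial (the product of all primes up to P), and each /Y divides one prime back out. The function names are hypothetical, and the grammar the engine's --wheel-expr parser actually accepts may differ.

```sh
#!/bin/sh
# is_prime N: succeed if N is prime (trial division, fine for small N).
is_prime() {
    n=$1
    [ "$n" -ge 2 ] || return 1
    d=2
    while [ $((d * d)) -le "$n" ]; do
        [ $((n % d)) -ne 0 ] || return 1
        d=$((d + 1))
    done
    return 0
}

# wheel_modulus EXPR: print the modulus for "P#" or "P#/Y[/Z...]".
# Uses 64-bit shell arithmetic, so it is only valid up to 47#.
wheel_modulus() {
    top=${1%%#*}      # text before '#', e.g. 47
    rest=${1#*#}      # text after '#',  e.g. /31/37 (may be empty)
    m=1
    p=2
    while [ "$p" -le "$top" ]; do
        if is_prime "$p"; then m=$((m * p)); fi
        p=$((p + 1))
    done
    old_ifs=$IFS; IFS=/
    for y in $rest; do
        [ -n "$y" ] || continue          # skip the empty field before the first '/'
        if [ $((m % y)) -ne 0 ]; then
            IFS=$old_ifs
            echo "error: $y does not divide the wheel" >&2
            return 1
        fi
        m=$((m / y))
    done
    IFS=$old_ifs
    echo "$m"
}

wheel_modulus '47#'      # prints 614889782588491410
wheel_modulus '47#/31'   # prints 19835154277048110
```

Under this reading, the quotient wheel 47#/31 keeps every prime up to 47 except 31, which makes the modulus 31 times smaller than the full 47# wheel.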

Building

```sh
cd src/cuda
make kt_filter_v8          # production binary
make kt_filter_v5          # oracle binary
make test                  # build v8 + run --test + external smokes
make clean
```

Default GPU architecture: sm_120 (RTX 5090 / Blackwell). Override with:

```sh
make kt_filter_v8 NVCC_ARCH=sm_89   # RTX 4090
```
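If you are unsure which value to pass, the following sketch derives an NVCC_ARCH string from the compute capability reported by the driver. The helper name cap_to_arch is hypothetical, and the compute_cap query field needs a reasonably recent nvidia-smi; on older drivers, look the capability up by hand.

```sh
#!/bin/sh
# cap_to_arch CAP: turn a compute capability "X.Y" into "sm_XY".
cap_to_arch() {
    case "$1" in
        *.*) ;;
        *) echo "error: expected X.Y, got '$1'" >&2; return 1 ;;
    esac
    major=${1%%.*}
    minor=${1#*.}
    echo "sm_${major}${minor}"
}

# Suggest a make line for the first visible GPU, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
          | head -n 1 | tr -d ' ')
    if [ -n "$cap" ] && arch=$(cap_to_arch "$cap" 2>/dev/null); then
        echo "make kt_filter_v8 NVCC_ARCH=$arch"
    fi
fi
```

For example, `cap_to_arch 8.9` prints sm_89 and `cap_to_arch 12.0` prints sm_120, matching the two architectures named in this section.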

sm_120 requires a CUDA toolkit that supports Blackwell (CUDA 12.8 or newer); the release candidate was validated with CUDA 13.2 on RTX 5090. Older toolkits can still be used for syntax and build checks by overriding NVCC_ARCH to an older architecture. For RTX 4090 / sm_89, use CUDA 11.8 or newer.
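The toolkit requirements above can be checked before building with a sketch like the one below. The function names are hypothetical. The 11.8 floor for sm_89 comes from this README; the 12.8 floor for sm_120 is the first CUDA release with Blackwell support and is not something the Makefile is known to enforce.

```sh
#!/bin/sh
# version_ge A B: succeed if "major.minor" version A >= B.
version_ge() {
    a_major=${1%%.*}; a_minor=${1#*.}
    b_major=${2%%.*}; b_minor=${2#*.}
    [ "$a_major" -gt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# min_toolkit_for ARCH: print the oldest CUDA release assumed to know ARCH.
min_toolkit_for() {
    case "$1" in
        sm_120) echo 12.8 ;;   # assumption: first Blackwell-capable toolkit
        sm_89)  echo 11.8 ;;   # from the README text
        *)      return 1 ;;
    esac
}

# check_toolkit ARCH VERSION: report whether VERSION can target ARCH.
check_toolkit() {
    need=$(min_toolkit_for "$1") || { echo "unknown arch $1"; return 2; }
    if version_ge "$2" "$need"; then
        echo "ok: CUDA $2 can target $1"
    else
        echo "too old: CUDA $2 < $need required for $1"
        return 1
    fi
}

# Check the installed nvcc, if any ("release 12.8," -> 12.8).
if command -v nvcc >/dev/null 2>&1; then
    ver=$(nvcc --version | sed -n 's/.*release \([0-9]*\.[0-9]*\).*/\1/p')
    [ -n "$ver" ] && check_toolkit sm_120 "$ver"
fi
```

If the check reports "too old" for sm_120, building with NVCC_ARCH=sm_89 is still useful as a syntax and build check, as described above.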