Two GPU engines live here, built from the same Makefile:
kt_filter_v8.cu is the production search engine. It is decomposed into
multiple modules:
| Module | File | Role |
|---|---|---|
| Main kernel | kt_filter_v8.cu | 2-D wheel-offset launch, filter cascade |
| Lanes | kt_lanes.c | Per-GPU prefix-lane assignment |
| Checkpoint | kt_checkpoint.c | Active checkpoint / resume |
| Novel record | kt_novel_record.c | fsync’d hit persistence |
| Records | kt_records.c | Record corpus loader |
| CLI | kt_cli.c | Argument parser |
| Signal | kt_signal.cu | SIGINT / SIGTERM handler |
| Reporter | kt_reporter.cu | Periodic progress reporter |
| Tests | kt_tests.cu | Embedded unit test suite |
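As an illustration of what "fsync’d hit persistence" means in practice, here is a minimal POSIX sketch that appends one JSONL line and forces it to stable storage before returning. The function name `kt_append_line_durable` is hypothetical; the real interface of kt_novel_record.c is not shown in this README and may differ.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not taken from kt_novel_record.c.
 * Appends json_line plus a newline to path, then fsyncs so the record
 * survives a crash or power loss. Returns 0 on success, -1 on error. */
int kt_append_line_durable(const char *path, const char *json_line) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;

    const char *p = json_line;
    size_t len = strlen(json_line);
    int rc = 0;

    /* write() may be short or interrupted, so loop until all bytes are out. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            rc = -1;
            break;
        }
        p += n;
        len -= (size_t)n;
    }

    if (rc == 0 && write(fd, "\n", 1) != 1) rc = -1;
    /* fsync flushes file data and metadata to the device. */
    if (rc == 0 && fsync(fd) != 0) rc = -1;
    if (close(fd) != 0) rc = -1;
    return rc;
}
```

Opening with O_APPEND keeps each record at the end of the file even if the process restarts, and syncing once per hit is affordable when hits are rare compared with the candidates filtered.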
Key features:

- --wheel-expr X#[/Y...] for non-plain-primorial Stage-0 wheels
- --exhaustive mode with exact coverage tracking
- --prefix-mode {sequential|random} with per-GPU lane assignment
- --test suite and --validate-known gate

Build: make kt_filter_v8
Command-line options are documented in
../../docs/CUDA_CLI_REFERENCE.md.
Use --wheel-expr when you want one process to search a specific quotient
wheel such as 47#/31; run multiple GPU-pinned processes for a wheel pool.
Measure unfamiliar expressions with --smoke --max-batches 1 before long runs
because Stage-0 table size and downstream Fermat-2 load are pattern-dependent.
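To see why cost depends on the expression, the sketch below evaluates a wheel expression into its modulus and the number of residues coprime to that modulus. It assumes that X# means the product of all primes up to X and that each /Y removes one prime factor; the function name `kt_wheel_eval` is hypothetical and is not taken from kt_cli.c. The coprime count is only a rough proxy for wheel size, since the actual Stage-0 table layout is not described here.

```c
#include <stdint.h>
#include <stdlib.h>

#define KT_MAX_WHEEL_PRIME 255u  /* illustrative bound, an assumption */

static int is_prime(unsigned long n) {
    if (n < 2) return 0;
    for (unsigned long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Hypothetical helper: parses "X#[/Y...]" and computes
 *   *mod = (product of primes <= X) / (Y * ...)
 *   *phi = number of residues in [0, *mod) coprime to *mod
 * Returns 0 on success, -1 on malformed input or 64-bit overflow. */
int kt_wheel_eval(const char *expr, uint64_t *mod, uint64_t *phi) {
    unsigned char dropped[KT_MAX_WHEEL_PRIME + 1] = {0};
    char *end;

    if (expr[0] < '0' || expr[0] > '9') return -1;
    unsigned long x = strtoul(expr, &end, 10);
    if (*end != '#' || x < 2 || x > KT_MAX_WHEEL_PRIME) return -1;
    end++;

    /* Each "/Y" must name a distinct prime that divides X#. */
    while (*end == '/') {
        const char *digits = end + 1;
        if (*digits < '0' || *digits > '9') return -1;
        unsigned long y = strtoul(digits, &end, 10);
        if (y > x || !is_prime(y) || dropped[y]) return -1;
        dropped[y] = 1;
    }
    if (*end != '\0') return -1;

    uint64_t m = 1, t = 1;
    for (unsigned long p = 2; p <= x; p++) {
        if (!is_prime(p) || dropped[p]) continue;
        if (m > UINT64_MAX / p) return -1;  /* modulus would overflow */
        m *= p;
        t *= p - 1;  /* Euler's totient is multiplicative over distinct primes */
    }
    *mod = m;
    *phi = t;
    return 0;
}
```

Under these assumptions, `kt_wheel_eval("47#/31", &mod, &phi)` yields a modulus of 19835154277048110, which is 47# divided by 31. Dropping a prime p shrinks the modulus by a factor of p and the coprime count by a factor of p - 1, which is one reason different quotient wheels can behave quite differently.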
kt_filter_v5.cu is the initial GPU implementation. It uses a streams-based
double-buffered launch (similar to the CC v15.cu ancestor) and a single-module
build. It achieves similar aggregate throughput to CC v15, but lacks most v8
features:

- Random sampling draws raw-batch chunks at stride=1; it does not use v8’s wheel-offset tile geometry.
- No --exhaustive mode.

v5 is kept as an oracle for cross-checking v8 filter correctness. It is not used for production campaigns.

Build: make kt_filter_v5
| Feature | v5 | v8 |
|---|---|---|
| Launch style | streams + double-buffer | 2-D wheel-offset |
| Throughput (RTX 5090) | benchmark locally | benchmark locally |
| Random sampling | raw-batch chunks | wheel-offset chunks |
| Exhaustive mode | no | yes |
| Checkpoint / resume | no | yes |
| Prefix lanes | no | yes |
| Extended / quotient wheels | no | 47#, 47#/Y[/Z...] |
| Novel record persistence | shared JSONL | per-GPU JSONL |
| Embedded tests | legacy suite | current suite |
| Module decomposition | single file | 8 modules |

```bash
cd src/cuda
make kt_filter_v8    # production binary
make kt_filter_v5    # oracle binary
make test            # build v8 + run --test + external smokes
make clean
```

Default GPU architecture: sm_120 (RTX 5090 / Blackwell). Override with:

```bash
make kt_filter_v8 NVCC_ARCH=sm_89    # RTX 4090
```

sm_120 requires a CUDA toolkit with Blackwell support (CUDA 12.8 or newer); the
release candidate was validated with CUDA 13.2 on RTX 5090. Older toolkits can
still be used for syntax and build checks against older architectures by
overriding NVCC_ARCH. For RTX 4090 / sm_89, use CUDA 11.8 or newer.