The published search engine is a staged heterogeneous pipeline: GPU filtering over the first d projected chain positions, followed by CPU confirmation of survivors. The speed comes from avoiding expensive primality work almost all the time.
For a first-kind Cunningham chain, a root p generates:
p, 2p+1, 4p+3, 8p+7, ...
For each small prime q, the host precomputes the root residues that would force one of the first d projected chain positions to be divisible by q. These forbidden-residue tables are built once at startup and then uploaded to the GPU.
In the published v13 engine, the filtering primes are organized in three layers:
- 5, 7, 11, 13, 17, 19 (the CRT lattice primes)
- 23, 29, 31 (the wheel primes)
- 37..863 (the GPU filter stages)

This is the logic behind --depth: the code is not asking only whether p is prime, but whether any of the first d chain positions are already doomed modulo the tracked small primes.
Implementation: src/cuda/cc18_filter_cuda_CpC_v13.cu
The engine does not scan all odd integers.
First, it enumerates CRT-surviving base residues modulo:
LATTICE_M = 5 * 7 * 11 * 13 * 17 * 19 = 1,616,615
Then, for each surviving base, it generates a wheel over:
WHEEL_PERIOD = 23 * 29 * 31 = 20,677
So the concrete candidate parameterization is:
n = base + (tile * WHEEL_PERIOD + k_offset) * LATTICE_M
where:
- base is one of the CRT-surviving residues
- tile is the coarse batch/sweep coordinate
- k_offset is a wheel position surviving the wheel primes

The code also partitions wheel positions into mod 6 buckets so each (tile, base) pair only scans the bucket that can produce candidates in the desired n ≡ 5 (mod 6) class.
Two coverage modes exist:
- full: all CRT-surviving bases
- legacy: only the older r % 6 == 5 subset

The published release defaults to full.
Before each run, the lattice can be narrowed by:
- a candidate prefix (--prefix)
- a prefix mode (sequential or random)
- lane partitioning (--prefix-lanes, --prefix-lane-id)

This is how the same engine is used for both exhaustive sweeps and randomized sampling.
Sequential prefix mode stops after one complete sweep of its assigned prefix range. Random mode keeps sampling until a time limit or manual stop.
The hot path is Kernel 1 in src/cuda/cc18_filter_cuda_CpC_v13.cu.
Each block works on one (tile, base) work unit. Candidate wheel positions then pass through three rejection stages:
- 37..61
- 67..97
- 101..863

Only candidates that pass all three stages are emitted as survivors.
Important implementation details:

- wheel positions are pre-bucketed mod 6, as described above
- the kM_mod lookup table is prefetched only after the first two filter stages pass

This is the stage that delivers the campaign-scale throughput.
The engine uses two CUDA streams and two copies of the batch/survivor buffers.
This lets it overlap filter-kernel execution on one stream with batch setup and survivor copy-back on the other. In practice, while one batch is being filtered on the GPU, the previous batch's survivors are already being copied back to the host. That overlap is essential for keeping the GPU saturated.
Two proving paths exist in v13.
The CPU path does the following for each survivor:
- reconstructs n from (base_idx, tile_idx, k_offset)
- confirms primality of n and the subsequent chain positions
- checks whether (n-1)/2 is also prime, to find the true root

This is also where non-roots are identified and logged.
The published campaign path favors CPU confirmation because it preserves the logging and inspection data needed for post-campaign analysis.
Implementations:

- src/cuda/cc18_filter_cuda_CpC_v13.cu
- src/cpu/cc_gmp_v33_03.c

Confirmed roots are written out and later turned into the released analysis material, including:
- p+1 breaker analysis

This is why the published code should be read as a campaign-and-analysis version, not only as a raw speed benchmark.
The released code reflects the campaign version used to generate the published snapshot and derived analysis.
Some later or alternate optimizations were explored separately. One example is a reverse-check proving path (cc18_filter_cuda_CpC_v14.cu), which can reduce CPU prover work by checking from higher chain positions inward, leveraging lower prime density in the outer shells. That path was not used for the published snapshot because the forward path was better suited for logging, inspection, and post-campaign analysis.
Another later branch is the experimental throughput-focused cc18_filter_cuda_CpC_v15.cu. It keeps the same three-stage filter structure, but changes the hot path with more cache-friendly data layouts: interleaved kmod tables for the L2 and ext-L2 stages, constant-memory ext-L2 bitmasks, packed line-sieve kill masks, and cooperative per-prime residue setup inside the filter kernel. On Blackwell / RTX 5090 hardware, this branch has reached roughly 96-98 B candidates/sec in some CC19-style runs, but it should still be treated as an optimization track rather than the published reference baseline.
| Metric | Value |
|---|---|
| GPU throughput | 57-65 B candidates/sec |
| Experimental v15 throughput | ~96-98 B candidates/sec on RTX 5090 in some CC19-style runs |
| GPU survivor rate | 0.0012% |
| GPU rejection rate | 99.9988% |
| Primary search band | 89-91 bit |
| Published data snapshot | 929,574 roots |