cunningham-chain-search

Search Pipeline

Overview

The published search engine is a staged heterogeneous pipeline:

  1. precompute forbidden residues for the first d chain positions
  2. build a CRT-and-wheel search lattice
  3. reject almost all candidates on the GPU with a 3-stage modular sieve
  4. hand the tiny survivor set to the proving path
  5. record confirmed roots and derive analysis datasets

The speed comes from rejecting almost every candidate cheaply, so the expensive primality work is almost never reached.

Phase 1: Precomputed Forbidden Residues

For a first-kind Cunningham chain, a root p generates:

p, 2p+1, 4p+3, 8p+7, ...

so position i of the chain is 2^i * p + (2^i - 1) = 2^i * (p + 1) - 1.

For each small prime q, the host precomputes the root residues that would force one of the first d projected chain positions to be divisible by q. These forbidden-residue tables are built once at startup and then uploaded to the GPU.

In the published v13 engine, the filtering primes are organized in three layers, corresponding to the three rejection stages of the GPU kernel described in Phase 4: primes 37..61, primes 67..97, and primes 101..863.

This is the logic behind --depth: the code is not asking only whether p is prime, but whether any of the first d chain positions are already doomed modulo tracked small primes.
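
As a concrete illustration, here is a minimal host-side sketch of how such a table can be built; the helper names (pow_mod, forbidden_residues) are illustrative, not the repo's. Using the identity above, position i of the chain rooted at p is 2^i * (p + 1) - 1, so q divides that position exactly when p ≡ inv(2^i) - 1 (mod q).

    // Hypothetical sketch (not the repo's actual code): the forbidden root
    // residues modulo a small odd prime q for the first d chain positions.
    #include <cstdint>
    #include <vector>

    // Modular exponentiation; q is small, so 64-bit products are safe.
    static uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t m) {
        uint64_t r = 1 % m;
        b %= m;
        while (e) {
            if (e & 1) r = r * b % m;
            b = b * b % m;
            e >>= 1;
        }
        return r;
    }

    // Position i is 2^i * (p + 1) - 1, so it is divisible by q exactly when
    // p ≡ inv(2^i) - 1 (mod q); inv comes from Fermat's little theorem.
    std::vector<uint64_t> forbidden_residues(uint64_t q, int d) {
        std::vector<uint64_t> out;
        for (int i = 0; i < d; ++i) {
            uint64_t inv2i = pow_mod(pow_mod(2, i, q), q - 2, q);
            out.push_back((inv2i + q - 1) % q);
        }
        return out;
    }

For q = 5 and d = 2 this yields the residues 0 and 2: p ≡ 0 would make p itself divisible by 5, and p ≡ 2 would make 2p+1 divisible by 5.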

Implementation: src/cuda/cc18_filter_cuda_CpC_v13.cu

Phase 2: CRT Bases and Wheel Lattice

The engine does not scan all odd integers.

First, it enumerates CRT-surviving base residues modulo:

LATTICE_M = 5 * 7 * 11 * 13 * 17 * 19 = 1,616,615

Then, for each surviving base, it generates a wheel over:

WHEEL_PERIOD = 23 * 29 * 31 = 20,677

So the concrete candidate parameterization is:

n = base + (tile * WHEEL_PERIOD + k_offset) * LATTICE_M

where:

  base     - one of the CRT-surviving residues modulo LATTICE_M
  tile     - the index of a WHEEL_PERIOD-sized block of wheel positions
  k_offset - the wheel position within that block

The code also partitions wheel positions into mod 6 buckets so each (tile, base) pair only scans the bucket that can produce candidates in the desired n ≡ 5 (mod 6) class.
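
A minimal sketch of this parameterization and the mod-6 bucketing follows; the function names are illustrative, and since the published band is 89-91 bits, the sketch uses the GCC/Clang 128-bit extension rather than the engine's own arithmetic.

    // Hypothetical sketch of the candidate parameterization; not repo code.
    #include <cstdint>

    using u128 = unsigned __int128; // candidates are ~90 bits, beyond uint64_t

    constexpr uint64_t LATTICE_M    = 5ull * 7 * 11 * 13 * 17 * 19; // 1,616,615
    constexpr uint64_t WHEEL_PERIOD = 23ull * 29 * 31;              // 20,677

    // n = base + (tile * WHEEL_PERIOD + k_offset) * LATTICE_M
    u128 candidate(uint64_t base, uint64_t tile, uint64_t k_offset) {
        return (u128)base + ((u128)tile * WHEEL_PERIOD + k_offset) * LATTICE_M;
    }

    // Mod-6 bucketing: LATTICE_M ≡ 5 ≡ -1 (mod 6) and WHEEL_PERIOD ≡ 1 (mod 6),
    // so n ≡ base - tile - k_offset (mod 6). For the target class n ≡ 5 (mod 6),
    // only the bucket k_offset ≡ base - tile + 1 (mod 6) can produce candidates.
    bool in_target_bucket(uint64_t base, uint64_t tile, uint64_t k_offset) {
        return (uint64_t)(candidate(base, tile, k_offset) % 6) == 5;
    }

Because n ≡ base - tile - k_offset (mod 6), each (tile, base) pair determines a single productive mod-6 bucket, which is why the kernel only needs to scan one bucket per work unit.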

Two coverage modes exist:

The published release defaults to full.

Phase 3: Search Partitioning

Before each run, the lattice can be narrowed, for example to a specific prefix range of the search space.

This is how the same engine is used for both systematic prefix sweeps and randomized sampling runs. Sequential prefix mode stops after one complete sweep of its assigned prefix range; random mode keeps sampling until a time limit or manual stop.

Phase 4: GPU Filter Kernel

The hot path is Kernel 1 in src/cuda/cc18_filter_cuda_CpC_v13.cu.

Each block works on one (tile, base) work unit. Candidate wheel positions then pass through three rejection stages:

  1. L2 bitmask filter: small-prime bitmask tests for primes 37..61
  2. ext-L2 byte filter: byte-table tests for primes 67..97
  3. line sieve: a deeper per-candidate sieve for primes 101..863

Only candidates that pass all three stages are emitted as survivors.
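
As an illustration of the stage-1 idea, here is a scalar sketch (the real kernel is warp-parallel CUDA; the BitmaskFilter struct and passes_stage1 name are hypothetical): each layer-1 prime q carries a precomputed q-bit mask of forbidden residues, and a candidate survives only if its residue bit is clear for every q.

    // Hypothetical scalar sketch of the L2 bitmask stage; not the kernel code.
    #include <cstdint>

    struct BitmaskFilter {
        uint64_t mask; // bit r set => residue r (mod q) dooms a chain position
        uint32_t q;    // a layer-1 prime in 37..61, so the mask fits in 64 bits
    };

    // A 64-bit n stands in for the ~90-bit candidate here.
    bool passes_stage1(uint64_t n, const BitmaskFilter *filters, int count) {
        for (int i = 0; i < count; ++i) {
            if ((filters[i].mask >> (n % filters[i].q)) & 1)
                return false; // some tracked position would be divisible by q
        }
        return true;
    }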

Important implementation details:

This is the stage that delivers the campaign-scale throughput.

Phase 5: Double-Buffered Handoff

The engine uses two CUDA streams and two copies of the batch/survivor buffers.

This lets it overlap GPU filtering of one batch with the device-to-host survivor copy and CPU-side proving of the previous batch.

In practice, the pipeline runs as a ping-pong: while the GPU filters batch N+1 through one buffer set, the host drains and proves batch N's survivors from the other.

That overlap is essential for keeping the GPU saturated.
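
A minimal structural sketch of that ping-pong, assuming illustrative names (filter_kernel, prove_on_cpu) and toy sizes; the real engine's batching and buffer layout differ.

    // Hypothetical double-buffered loop; illustrative names, not the repo's.
    #include <cuda_runtime.h>
    #include <cstdint>

    __global__ void filter_kernel(uint32_t *survivors) { /* stand-in for Kernel 1 */ }
    static void prove_on_cpu(const uint32_t * /*survivors*/) { /* stand-in for the GMP path */ }

    int main() {
        constexpr size_t SURV = 4096; // toy survivor-buffer size
        cudaStream_t stream[2];
        uint32_t *d_surv[2], *h_surv[2];
        for (int i = 0; i < 2; ++i) {
            cudaStreamCreate(&stream[i]);
            cudaMalloc(&d_surv[i], SURV * sizeof(uint32_t));
            cudaMallocHost(&h_surv[i], SURV * sizeof(uint32_t)); // pinned, for async copies
        }
        for (uint64_t batch = 0; batch < 100; ++batch) {
            int cur = batch & 1;
            filter_kernel<<<256, 256, 0, stream[cur]>>>(d_surv[cur]);
            cudaMemcpyAsync(h_surv[cur], d_surv[cur], SURV * sizeof(uint32_t),
                            cudaMemcpyDeviceToHost, stream[cur]);
            if (batch > 0) {
                int prev = 1 - cur;
                cudaStreamSynchronize(stream[prev]); // prev survivors are on the host
                prove_on_cpu(h_surv[prev]);          // CPU proves while the GPU filters
            }
        }
        cudaDeviceSynchronize();
        return 0;
    }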

Phase 6: Proving and Chain Confirmation

Two proving paths exist in v13: a GPU path and a CPU path.

The CPU path does the following for each survivor:

  1. reconstruct n from (base_idx, tile_idx, k_offset)
  2. run trial division with the chain-screen primes
  3. run GMP probable-prime testing on the candidate
  4. walk backward if (n-1)/2 is also prime, to find the true root
  5. follow the chain forward from the true root to measure the full chain length

This is also where non-roots are identified and logged.
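
Steps 4 and 5 can be sketched with GMP as follows; the helper names and the probable-prime repetition count are illustrative, not the repo's.

    // Hypothetical sketch of the backward/forward walk; not the repo code.
    #include <gmp.h>

    // Probable-prime test; 25 rounds is an illustrative choice.
    static bool is_prp(const mpz_t n) { return mpz_probab_prime_p(n, 25) > 0; }

    // Step 4: walk backward while (n - 1) / 2 is itself a chain member,
    // i.e. an odd probable prime. (The tiny root p = 2 is ignored here;
    // the search band is ~90 bits.)
    void find_true_root(mpz_t root, const mpz_t n_in) {
        mpz_t prev;
        mpz_init(prev);
        mpz_set(root, n_in);
        for (;;) {
            mpz_sub_ui(prev, root, 1);
            mpz_fdiv_q_2exp(prev, prev, 1);      // prev = (root - 1) / 2
            if (mpz_even_p(prev) || !is_prp(prev))
                break;
            mpz_set(root, prev);
        }
        mpz_clear(prev);
    }

    // Step 5: follow the chain forward from the root and count primes in
    // root, 2*root + 1, 4*root + 3, ...
    int chain_length(const mpz_t root) {
        mpz_t cur;
        mpz_init_set(cur, root);
        int len = 0;
        while (is_prp(cur)) {
            ++len;
            mpz_mul_2exp(cur, cur, 1);
            mpz_add_ui(cur, cur, 1);             // cur = 2 * cur + 1
        }
        mpz_clear(cur);
        return len;
    }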

The published campaign path favors CPU confirmation because it preserves detailed logging, easy inspection, and the records needed for post-campaign analysis.

Implementations:

Phase 7: Output and Analysis

Confirmed roots are written out and later turned into the released analysis material, including:

This is why the published code should be read as a campaign-and-analysis version, not only as a raw speed benchmark.

Published vs Alternate Variants

The released code reflects the campaign version used to generate the published snapshot and derived analysis.

Some later or alternate optimizations were explored separately. One example is a reverse-check proving path (cc18_filter_cuda_CpC_v14.cu), which can reduce CPU prover work by checking from higher chain positions inward, leveraging lower prime density in the outer shells. That path was not used for the published snapshot because the forward path was better suited for logging, inspection, and post-campaign analysis.

Another later branch is the experimental throughput-focused cc18_filter_cuda_CpC_v15.cu. It keeps the same three-stage filter structure, but changes the hot path with more cache-friendly data layouts: interleaved kmod tables for the L2 and ext-L2 stages, constant-memory ext-L2 bitmasks, packed line-sieve kill masks, and cooperative per-prime residue setup inside the filter kernel. On Blackwell / RTX 5090 hardware, this branch has reached roughly 96-98 B candidates/sec in some CC19-style runs, but it should still be treated as an optimization track rather than the published reference baseline.

Performance

Metric                         Value
-----------------------------  -------------------------------------------------------
GPU throughput                 57-65 B candidates/sec
Experimental v15 throughput    ~96-98 B candidates/sec (RTX 5090, some CC19-style runs)
GPU survivor rate              0.0012%
GPU rejection rate             99.9988%
Primary search band            89-91 bits
Published data snapshot        929,574 roots