Why 52?

Fast modular squaring with AVX512IFMA seems to be the entrypoint here, I found it simply by googling the instruction VPMADD52LUQ. It cites Efficient Software Implementations of Modular Exponentiation stating

Our developments build on top of the Almost Montgomery Square (AMS) optimization of [8]

But also cites Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures as the reference for the 52-bit representation:

These instructions multiply eight 52-bit unsigned integers residing in wide 512-bit registers, produce the low (VPMADD52LUQ) and high (VPMADD52HUQ) halves of the 104-bit products, and add the results to 64-bit accumulators (i. e., SIMD elements), placing them in the destination register. They are designed for supporting big number multiplications, when the inputs are stored in a ”redundant representation” using radix 252 (as explained in [9]).

This paper though, cites Parallel Cryptographic Arithmetic Using a Redundant Montgomery Representation as a starting place.

For example, Page and Smart [19] suggested using SIMD architectures to calculate several exponentiations in parallel, and using a “redundant Montgomery representation” (which we call here Non Reduced) to avoid conditional final subtractions in Montgomery Multiplications.

This seems to the the origin of the “redundant” terminology we find all over the code, but the Gueron paper states “which we call here Non Reduced”, so perhaps we’ll find that Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures is the right place to start for the current RSA implementation.

It does look like that’s the case. The windows—which they refer to as digits—of large numbers does come from Software Implementations of Modular Exponentiations, which I’ve transcluded below for simplicity’s sake.

Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures [dsp-9X1T]

This was published in Arithmetic of Finite Fields, but I’ve grabbed just this chapter here.

This paper goes into detail about the algorithms used, originally for AVX2, the x86 instructions that those algorithms employ, and so on, but what I was really looking for is the origin of the 52-bit representation found in the current AVX-512 implementation found in OpenSSL, soon AWS-LC, and maybe some day in BoringSSL. It turns out, this paper contains that origin, only here we’re talking about the conversion of 32-bit numbers to 29-bit numbers (arbitrarily, I think) and are given the following equation as the way in which to convert numbers from one radix to another.

\[ A = \sum^{k-1}_{i=0} X_i \times 2^{m \times i} \]

Where \(A\) is the value we wish to represent, \(X_i\) is the value of the \(i\)th window, or as the paper calls it, the \(i\)th digit, and \(m\) is the window size in bits.

So, say we want to represent the 16-bit number 24023 in windows of 5 bits.

0101110111010111

We think of as the following windows:

0 10111 01110 10111

Where \(m = 5\) and \(k = 4\), giving us

0   10111              01110             10111
0 + (23 * 2^(5 * 2)) + (14 *2^(5 * 1)) + (23 * 2^(5 * 0))
0 + 23552            + 448             + 23               = 24023

This also holds if we choose 3-bit windows:

0   101               110               111               010               111
0 + (5 * 2^(3 * 4)) + (6 * 2^(3 * 3)) + (7 * 2^(3 * 2)) + (2 * 2^(3 * 1)) + (7 * 2^(3 * 0))
0 + 20480           + 3072            + 448             + 16              + 7               = 24023

This representation is the one used in the AVX-512 implementation where \(m = 52\), carving up, for instance, the 1024-bit bignums into 52-bit windows.

Why 52?

Related