*Fast Modular Squaring with AVX512IFMA* seems to be the entry point
here; I found it simply by googling the instruction `VPMADD52LUQ`. It
cites *Efficient Software Implementations of Modular Exponentiation*,
stating:

Our developments build on top of the Almost Montgomery Square (AMS) optimization of [8]

But also cites *Software Implementation of Modular Exponentiation,
Using Advanced Vector Instructions Architectures* as
the reference for the 52-bit representation:

These instructions multiply eight 52-bit unsigned integers residing in wide 512-bit registers, produce the low (VPMADD52LUQ) and high (VPMADD52HUQ) halves of the 104-bit products, and add the results to 64-bit accumulators (i.e., SIMD elements), placing them in the destination register. They are designed for supporting big number multiplications, when the inputs are stored in a “redundant representation” using radix \(2^{52}\) (as explained in [9]).
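
The per-lane semantics the quote describes can be emulated in a few lines of Python (my own scalar sketch, not the actual intrinsics or OpenSSL code): each 64-bit lane accumulates one half of a 52 × 52 → 104-bit product.

```python
# Scalar emulation of the per-lane behavior of VPMADD52LUQ/VPMADD52HUQ
# as described in the quote above (my own sketch, not real intrinsics).

MASK52 = (1 << 52) - 1

def vpmadd52luq(acc, a, b):
    """acc[i] += low 52 bits of (a[i] * b[i]), per lane, mod 2^64."""
    return [(c + (((x & MASK52) * (y & MASK52)) & MASK52)) & ((1 << 64) - 1)
            for c, x, y in zip(acc, a, b)]

def vpmadd52huq(acc, a, b):
    """acc[i] += high 52 bits of (a[i] * b[i]), per lane, mod 2^64."""
    return [(c + (((x & MASK52) * (y & MASK52)) >> 52)) & ((1 << 64) - 1)
            for c, x, y in zip(acc, a, b)]

print(vpmadd52luq([0], [6], [7]))                # [42]
print(vpmadd52huq([0], [1 << 51], [1 << 51]))    # [1125899906842624] = [2^50]
```

The real instructions do this across eight lanes of a 512-bit register at once; the 12 slack bits in each 64-bit accumulator are what make the redundant radix-\(2^{52}\) representation attractive, since carries can be deferred.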

This paper, though, cites *Parallel Cryptographic Arithmetic Using a
Redundant Montgomery Representation* as a starting place:

For example, Page and Smart [19] suggested using SIMD architectures to calculate several exponentiations in parallel, and using a “redundant Montgomery representation” (which we call here Non Reduced) to avoid conditional final subtractions in Montgomery Multiplications.

This seems to be the origin of the “redundant” terminology we find
all over the code. But the Gueron paper states “which we call here Non
Reduced”, so perhaps we’ll find that *Software Implementation of
Modular Exponentiation, Using Advanced Vector Instructions
Architectures* is the right place to start for the current RSA
implementation.

It does look like that’s the case. The windows (which they refer to as
*digits*) of large numbers do come from *Software Implementation of
Modular Exponentiation, Using Advanced Vector Instructions
Architectures*, which I’ve transcluded below for simplicity’s sake.

Software Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures [dsp-9X1T]

This was published in Arithmetic of Finite Fields, but I’ve grabbed just this chapter here.

This paper goes into detail about the algorithms used (originally for AVX2), the x86 instructions those algorithms employ, and so on. But what I was really looking for is the origin of the 52-bit representation found in the current AVX-512 implementation in OpenSSL, soon AWS-LC, and maybe some day BoringSSL. It turns out this paper contains that origin, only here we’re talking about converting numbers from a 32-bit representation to a 29-bit one (the choice of 29 seems somewhat arbitrary to me), and we’re given the following equation for converting numbers from one radix to another.

\[ A = \sum^{k-1}_{i=0} X_i \times 2^{m \times i} \]

Where \(A\) is the value we wish to represent, \(X_i\) is the value of
the \(i\)th *window* (or, as the paper calls it, the \(i\)th digit),
\(k\) is the number of windows, and \(m\) is the window size in bits.

So, say we want to represent the 16-bit number 24023 in windows of 5 bits.

`0101110111010111`

We can think of this as the following windows:

`0 10111 01110 10111`

Where \(m = 5\) and \(k = 4\), giving us

```
0 10111 01110 10111
0 + (23 * 2^(5 * 2)) + (14 * 2^(5 * 1)) + (23 * 2^(5 * 0))
0 + 23552 + 448 + 23 = 24023
```

This also holds if we choose 3-bit windows:

```
0 101 110 111 010 111
0 + (5 * 2^(3 * 4)) + (6 * 2^(3 * 3)) + (7 * 2^(3 * 2)) + (2 * 2^(3 * 1)) + (7 * 2^(3 * 0))
0 + 20480 + 3072 + 448 + 16 + 7 = 24023
```
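
As a sanity check, the decomposition and reconstruction above can be sketched in a few lines of Python (my own illustration, not code from the paper):

```python
def to_windows(a, m):
    """Split integer a into little-endian m-bit windows (digits)."""
    windows = []
    while a > 0:
        windows.append(a & ((1 << m) - 1))  # take the low m bits
        a >>= m
    return windows or [0]

def from_windows(windows, m):
    """Recombine: A = sum of X_i * 2^(m * i)."""
    return sum(x << (m * i) for i, x in enumerate(windows))

print(to_windows(24023, 5))   # [23, 14, 23]   (little-endian order)
print(to_windows(24023, 3))   # [7, 2, 7, 6, 5]
print(from_windows(to_windows(24023, 5), 5))   # 24023
```

Note the windows come out low-order first here, which is the reverse of how the binary strings above read.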

This representation is the one used in the AVX-512 implementation
where \(m = 52\), carving up, for instance, 1024-bit `bignum`s
into 52-bit windows.
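
With \(m = 52\), a 1024-bit number decomposes into \(\lceil 1024 / 52 \rceil = 20\) windows, each of which fits in a 64-bit SIMD lane with 12 bits to spare for deferred carries. A quick sketch (again my own, not the OpenSSL code):

```python
MASK52 = (1 << 52) - 1

def to_radix52(a, nbits=1024):
    """Split a into ceil(nbits / 52) little-endian 52-bit limbs."""
    nlimbs = -(-nbits // 52)  # ceiling division: 20 limbs for 1024 bits
    return [(a >> (52 * i)) & MASK52 for i in range(nlimbs)]

a = (1 << 1024) - 1                      # largest 1024-bit value
limbs = to_radix52(a)
print(len(limbs))                        # 20
print(all(x <= MASK52 for x in limbs))   # True
print(sum(x << (52 * i) for i, x in enumerate(limbs)) == a)   # True
```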