Cryptography 🤝 Vectorization

Using AES-XTS as a lens into vectorization techniques.

Agenda

AES

AES is the Advanced Encryption Standard, defined in FIPS 197 published in 2001.

AES

AES comes in three flavors, distinguished by key size: 128, 192, or 256 bits.

Before encryption or decryption, the key is expanded into “round keys”, leaving the number of rounds as the only difference between AES-128, AES-192, and AES-256:

* AES-128: 10 rounds
* AES-192: 12 rounds
* AES-256: 14 rounds
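
The round count follows a simple rule from FIPS 197: Nr = Nk + 6, where Nk is the key length in 32-bit words. A minimal Haskell sketch of that rule (the name numRounds is mine, not part of the implementation shown later):

-- Number of AES rounds as a function of key size in bits.
-- FIPS 197: Nr = Nk + 6, where Nk is the key length in 32-bit words.
numRounds :: Int -> Int
numRounds keyBits = keyBits `div` 32 + 6

-- numRounds 128 == 10, numRounds 192 == 12, numRounds 256 == 14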

AES

What is meant by “round”?

Each round applies 4 transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The algorithm runs these transformations on the 16-byte state block for \(n\) rounds, with the final round skipping MixColumns.

-- One full round: SubBytes, ShiftRows, MixColumns, then AddRoundKey
round :: [Word32] -> [Word32] -> [Word32]
round rk = addRoundKey rk . mixColumns . shiftRows . subBytes

-- The final round skips MixColumns
roundLast :: [Word32] -> [Word32] -> [Word32]
roundLast rk = addRoundKey rk . shiftRows . subBytes

encrypt_ :: [Word32] -> [Word32] -> [Word32]
encrypt_ plain expandedKey =
  go (addRoundKey (take 4 expandedKey)
     (matrify plain)) (drop 4 expandedKey) 1
  where
    go :: [Word32] -> [Word32] -> Int -> [Word32]
    -- 14 rounds means AES-256
    go state ek 14 = matrify $ roundLast ek state
    go state ek i = go (round (take 4 ek) state) (drop 4 ek) (i + 1)

AES

What is meant by “round”?

Intel AES-NI implements a single round as one instruction, vaesenc (vaesdec for decryption):

# initial AddRoundKey with round key 0
vpxor	($key2), $state, $state
vaesenc	0x10($key2), $state, $state
vaesenc	0x20($key2), $state, $state
vaesenc	0x30($key2), $state, $state
vaesenc	0x40($key2), $state, $state
vaesenc	0x50($key2), $state, $state
vaesenc	0x60($key2), $state, $state
vaesenc	0x70($key2), $state, $state
vaesenc	0x80($key2), $state, $state
vaesenc	0x90($key2), $state, $state
vaesenc	0xa0($key2), $state, $state
vaesenc	0xb0($key2), $state, $state
vaesenc	0xc0($key2), $state, $state
vaesenc	0xd0($key2), $state, $state
# final round skips MixColumns
vaesenclast	0xe0($key2), $state, $state

XTS

XTS (the XEX-based tweaked-codebook mode with ciphertext stealing, where XEX is XOR-Encrypt-XOR) is defined in IEEE 1619 and is a confidentiality-only AES mode of operation.

It is the de facto standard for encrypting data at rest, e.g. in full-disk encryption.

XTS

XTS sits on top of AES to chain together many 16-byte blocks. In its simplest form, it is literally XOR, AESEncrypt, XOR:

x  = zipWith xor twk (take 4 pt)  -- XOR the tweak into the plaintext block
e  = AE.encrypt_ x key            -- plain AES encryption
x' = zipWith xor twk e            -- XOR the tweak into the result

XTS

XTS sits on top of AES to chain together many 16-byte blocks. In its simplest form, it is literally XOR, AESEncrypt, XOR:

# XOR with the tweak
vpxor	$tw, $state, $state

vpxor	($key2), $state, $state
vaesenc	0x10($key2), $state, $state
vaesenc	0x20($key2), $state, $state
vaesenc	0x30($key2), $state, $state
vaesenc	0x40($key2), $state, $state
vaesenc	0x50($key2), $state, $state
vaesenc	0x60($key2), $state, $state
vaesenc	0x70($key2), $state, $state
vaesenc	0x80($key2), $state, $state
vaesenc	0x90($key2), $state, $state
vaesenc	0xa0($key2), $state, $state
vaesenc	0xb0($key2), $state, $state
vaesenc	0xc0($key2), $state, $state
vaesenc	0xd0($key2), $state, $state
vaesenclast	0xe0($key2), $state, $state

# XOR with the tweak again
vpxor	$tw, $state, $state

XTS

XTS’s initialization vector (IV) is also called the tweak. It is updated between each block’s encryption or decryption:

\[ T' = \begin{cases} (T \ll 1) \oplus \texttt{0x87} & T \gg 127 = 1 \\ T \ll 1 & \text{otherwise} \end{cases} \]
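
In code, the update is a left shift plus a conditional XOR with the reduction polynomial 0x87 of GF(2^128). A minimal Haskell sketch, modelling the 128-bit tweak as an Integer (the name nextTweak and the Integer representation are illustrative assumptions, not the implementation discussed here):

import Data.Bits (shiftL, shiftR, xor, (.&.))

-- Multiply the tweak by alpha in GF(2^128), reduction polynomial
-- x^128 + x^7 + x^2 + x + 1 (i.e. 0x87).
nextTweak :: Integer -> Integer
nextTweak t
  | t `shiftR` 127 == 1 = ((t `shiftL` 1) .&. mask128) `xor` 0x87
  | otherwise           = t `shiftL` 1
  where
    mask128 = (1 `shiftL` 128) - 1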

XTS

When the input length is not a multiple of 16 bytes, ciphertext stealing is used.

XTS

When the input length is not a multiple of 16 bytes, ciphertext stealing is used.
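
Roughly: the last full plaintext block is encrypted, the final partial block is padded with the tail of that ciphertext and encrypted in turn, and the two resulting blocks swap positions in the output. A hedged Haskell sketch of the idea (the names csEncryptTail and encryptBlock, and the explicit tweak arguments, are illustrative assumptions rather than the exact IEEE 1619 procedure):

import qualified Data.ByteString as BS

-- Ciphertext stealing for the tail of an XTS message.
-- encryptBlock stands in for one tweaked block encryption
-- (XOR tweak, AES-encrypt, XOR tweak); tw1 and tw2 are consecutive tweaks.
csEncryptTail :: (BS.ByteString -> BS.ByteString -> BS.ByteString)
              -> BS.ByteString -> BS.ByteString  -- tweaks tw1, tw2
              -> BS.ByteString -> BS.ByteString  -- last full block, partial block
              -> (BS.ByteString, BS.ByteString)  -- (second-to-last, final) ciphertext
csEncryptTail encryptBlock tw1 tw2 lastFull partial =
  let cc     = encryptBlock tw1 lastFull               -- encrypt the last full block
      b      = BS.length partial
      cLast  = BS.take b cc                            -- truncate: the short final block
      stolen = BS.drop b cc                            -- bytes "stolen" to pad the partial block
      cPrev  = encryptBlock tw2 (BS.append partial stolen) -- encrypt the padded partial block
  in (cPrev, cLast)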

Recap

Algorithmic Characteristics

Vectorization 101

Vectorization is the idea of packing a set of scalars into a “vector”, and then using Single Instruction, Multiple Data (SIMD) instructions to do a computation on the elements of that vector in parallel.

\[ \mathbf{v} = \begin{bmatrix} 7 \\ 7 \\ 7 \\ 7 \end{bmatrix} \]

Vectorization 101

On x86_64, the %xmm (128-bit), %ymm (256-bit), and %zmm (512-bit) registers are the vector registers.

Vectorization 101

The vectorized instruction set is composed of a big pile of mnemonics.

Vectorization 101

vpaddq %ymm0,%ymm1,%ymm1

Where %ymm0 and %ymm1 contain \(\mathbf{v}\) and \(\mathbf{w}\) respectively:

\[ \mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \]

\[ \mathbf{w} = \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} \]

Vectorization 101

vpaddq %ymm0,%ymm1,%ymm1

%ymm1 ends up holding:

\[ \begin{bmatrix} 6 \\ 8 \\ 10 \\ 12 \end{bmatrix} \]

Vectorization 101

* Data Independence

* Linearity

Vectorization in XTS

AES blocks in XTS are independent.

vbroadcasti32x4 ($key1), $t0

# ARK: AddRoundKey + tweak XOR fused into one three-way XOR
vpternlogq    \$0x96, $t0, $tw1, $st1

# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc  $t0, $st1, $st1

# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc  $t0, $st1, $st1

# Continued for 14 rounds...

vmovdqu8 	 $st1,($output)

Vectorization in XTS

AES blocks in XTS are independent.

vbroadcasti32x4 ($key1), $t0

# ARK: AddRoundKey + tweak XOR fused into one three-way XOR
vpternlogq    \$0x96, $t0, $tw1, $st1
vpternlogq    \$0x96, $t0, $tw2, $st2

# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2

# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2

# Continued for 14 rounds...

vmovdqu8 	 $st1,($output)
vmovdqu8 	 $st2,0x40($output)

Vectorization in XTS

AES blocks in XTS are independent.

vbroadcasti32x4 ($key1), $t0

# ARK: AddRoundKey + tweak XOR fused into one three-way XOR
vpternlogq    \$0x96, $t0, $tw1, $st1
vpternlogq    \$0x96, $t0, $tw2, $st2
vpternlogq    \$0x96, $t0, $tw3, $st3
vpternlogq    \$0x96, $t0, $tw4, $st4

# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2
vaesenc  $t0, $st3, $st3
vaesenc  $t0, $st4, $st4

# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2
vaesenc  $t0, $st3, $st3
vaesenc  $t0, $st4, $st4

# Continued for 14 rounds...

vmovdqu8 	 $st1,($output)
vmovdqu8 	 $st2,0x40($output)
vmovdqu8 	 $st3,0x80($output)
vmovdqu8 	 $st4,0xc0($output)

Vectorization in XTS

Tweak generation in XTS does have “some” linearity.

If:

\[ T' = \begin{cases} (T \ll 1) \oplus \texttt{0x87} & T \gg 127 = 1 \\ T \ll 1 & \text{otherwise} \end{cases} \]

Then, composing two updates (a carry-less multiply folds both possible reductions into one term):

\[ T'' = (T \ll 2) \oplus \big( (T \gg 126) \cdot \texttt{0x87} \big) \]

where \(\cdot\) is carry-less (XOR) multiplication, exactly what vpclmulqdq computes.
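
This generalizes to jumping several tweaks ahead at once: \(T \cdot \alpha^i\) is one shift pair plus one carry-less multiply. A rough Haskell sketch over Integer (the names tweakAhead and clmul are mine, and the scalar clmul stands in for vpclmulqdq):

import Data.Bits (shiftL, shiftR, xor, testBit, (.&.))

-- Carry-less (XOR) multiplication of two polynomials over GF(2).
clmul :: Integer -> Integer -> Integer
clmul a b = foldl xor 0 [ a `shiftL` i | i <- [0 .. 127], testBit b i ]

-- Jump i tweaks ahead: T * alpha^i in GF(2^128), reduced by 0x87.
-- Valid for small i, where one reduction step is enough.
tweakAhead :: Int -> Integer -> Integer
tweakAhead i t =
  ((t `shiftL` i) .&. mask128) `xor` clmul (t `shiftR` (128 - i)) 0x87
  where
    mask128 = (1 `shiftL` 128) - 1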

Vectorization in XTS

Tweak generation in XTS does have “some” linearity.

# Tweaks 0-3
vpsllvq const_dq3210(%rip),%zmm0,%zmm4
vpsrlvq const_dq5678(%rip),%zmm1,%zmm2
vpclmulqdq 	 \$0x0,$ZPOLY,%zmm2,%zmm3
vpxorq 	 %zmm2,%zmm4,%zmm4{%k2}
vpxord 	 %zmm4,%zmm3,%zmm9

Bonus Optimizations

Bonus Optimizations

Instruction-Level Parallelism

vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2
vaesenc  $t0, $st3, $st3
vaesenc  $t0, $st4, $st4

Bonus Optimizations

Why stop at 16?

The Front-end of the pipeline on recent Intel microarchitectures can allocate four µOps per cycle, while the Back-end can retire four µOps per cycle.

(VTune TMA docs)

Bonus Optimizations

Instruction-Level Parallelism

# round 3
vbroadcasti32x4 0x30($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2
vaesenc  $t0, $st3, $st3
vaesenc  $t0, $st4, $st4

# Generate next 4 tweaks
vpsrldq		\$0xf, $t0, %zmm13
vpclmulqdq	\$0x0,$ZPOLY, %zmm13, %zmm14
vpslldq		\$0x1, $t0, %zmm16
vpxord		%zmm14, %zmm16, %zmm16

# round 4
vbroadcasti32x4 0x40($key1), $t0
vaesenc  $t0, $st1, $st1
vaesenc  $t0, $st2, $st2
vaesenc  $t0, $st3, $st3
vaesenc  $t0, $st4, $st4

Bonus Optimizations

Instruction-Level Parallelism

# Tweaks 0-3
vpsllvq const_dq3210(%rip),%zmm0,%zmm4
vpsrlvq const_dq5678(%rip),%zmm1,%zmm2
vpclmulqdq 	 \$0x0,$ZPOLY,%zmm2,%zmm3
vpxorq 	 %zmm2,%zmm4,%zmm4{%k2}
vpxord 	 %zmm4,%zmm3,%zmm9

# Tweaks 4-7
vpsllvq const_dq7654(%rip),%zmm0,%zmm5
vpsrlvq const_dq1234(%rip),%zmm1,%zmm6
vpclmulqdq 	 \$0x0,$ZPOLY,%zmm6,%zmm7
vpxorq 	 %zmm6,%zmm5,%zmm5{%k2}
vpxord 	 %zmm5,%zmm7,%zmm10

References