Using AES-XTS as a lens into vectorization techniques.
AES is the Advanced Encryption Standard, defined in FIPS 197 published in 2001.
AES comes in three flavors, distinguished by key size: 128, 192, or 256 bits.
Before encryption or decryption, keys are expanded into “round keys”, leaving the only difference between AES-128, AES-192, and AES-256 to be the number of rounds: 10, 12, and 14 respectively.
What is meant by “round”?
Each round has 4 transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The algorithm runs each of these transformations on the 16-byte state block for \(n\) rounds.
round :: [Word32] -> [Word32] -> [Word32]
round rk = addRoundKey rk . mixColumns . shiftRows . subBytes
roundLast :: [Word32] -> [Word32] -> [Word32]
roundLast rk = addRoundKey rk . shiftRows . subBytes
encrypt_ :: [Word32] -> [Word32] -> [Word32]
encrypt_ plain expandedKey =
  go (addRoundKey (take 4 expandedKey) (matrify plain)) (drop 4 expandedKey) 1
  where
    go :: [Word32] -> [Word32] -> Int -> [Word32]
    go state ek 14 = matrify $ roundLast ek state
    go state ek i  = go (round (take 4 ek) state) (drop 4 ek) (i + 1)
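(The 14 here is AES-256’s round count; an AES-128 or AES-192 variant would stop at 10 or 12 instead.)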
Intel AES-NI implements a single round with vaesenc and vaesdec:
vpxor       ($key2), $state, $state
vaesenc     0x10($key2), $state, $state
vaesenc     0x20($key2), $state, $state
vaesenc     0x30($key2), $state, $state
vaesenc     0x40($key2), $state, $state
vaesenc     0x50($key2), $state, $state
vaesenc     0x60($key2), $state, $state
vaesenc     0x70($key2), $state, $state
vaesenc     0x80($key2), $state, $state
vaesenc     0x90($key2), $state, $state
vaesenc     0xa0($key2), $state, $state
vaesenc     0xb0($key2), $state, $state
vaesenc     0xc0($key2), $state, $state
vaesenc     0xd0($key2), $state, $state
vaesenclast 0xe0($key2), $state, $state
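A single vaesenc performs ShiftRows, SubBytes, MixColumns, and the round-key XOR in one instruction; vaesenclast skips MixColumns, mirroring round and roundLast above.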
XOR-Encrypt-XOR with Cipher Stealing (XTS) is defined in IEEE 1619 and is a confidentiality-only AES mode of operation.
It is the standard encryption used for data at rest.
XTS sits on top of AES to chain together many 16-byte blocks. In its simplest form, it is literally XOR, AESEncrypt, XOR:
x  = zipWith xor twk (take 4 pt)  -- XOR with the tweak
e  = AE.encrypt_ x key            -- AES-encrypt the masked block
x' = zipWith xor twk e            -- XOR with the tweak again
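For context, a minimal sketch of how those bindings might sit in a complete single-block function; the encryptBlock name, the qualified AES module, and the argument order are assumptions rather than the author’s API:

import Data.Bits (xor)
import Data.Word (Word32)
import qualified AES as AE  -- hypothetical module exporting encrypt_

encryptBlock :: [Word32] -> [Word32] -> [Word32] -> [Word32]
encryptBlock twk key pt = x'
  where
    x  = zipWith xor twk (take 4 pt)  -- XOR with the tweak
    e  = AE.encrypt_ x key            -- AES-encrypt the masked block
    x' = zipWith xor twk e            -- XOR with the tweak again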
The same XOR, AESEncrypt, XOR shape in assembly:
vpxor       $tw, $state, $state
vpxor       ($key2), $state, $state
vaesenc     0x10($key2), $state, $state
vaesenc     0x20($key2), $state, $state
vaesenc     0x30($key2), $state, $state
vaesenc     0x40($key2), $state, $state
vaesenc     0x50($key2), $state, $state
vaesenc     0x60($key2), $state, $state
vaesenc     0x70($key2), $state, $state
vaesenc     0x80($key2), $state, $state
vaesenc     0x90($key2), $state, $state
vaesenc     0xa0($key2), $state, $state
vaesenc     0xb0($key2), $state, $state
vaesenc     0xc0($key2), $state, $state
vaesenc     0xd0($key2), $state, $state
vaesenclast 0xe0($key2), $state, $state
vpxor       $tw, $state, $state
XTS’s initialization vector (IV) is also called a tweak. It is updated between each block’s encryption or decryption:
\[ T' = \begin{cases} (T \ll 1) \oplus \texttt{0x87} & T \gg 127 = 1 \\ T \ll 1 & \text{otherwise} \end{cases} \]
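A minimal Haskell sketch of this update, assuming the 128-bit tweak is held in an Integer (nextTweak is a hypothetical name, not from the text above):

import Data.Bits (shiftL, shiftR, xor, (.&.))

-- GF(2^128) doubling with the XTS reduction constant 0x87.
nextTweak :: Integer -> Integer
nextTweak t
  | t `shiftR` 127 == 1 = ((t `shiftL` 1) `xor` 0x87) .&. mask
  | otherwise           = (t `shiftL` 1) .&. mask
  where
    mask = (1 `shiftL` 128) - 1  -- keep the tweak to 128 bits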
When the input length is not a multiple of 16, cipher stealing is used.
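A sketch of the stealing step as IEEE 1619 defines it, with the two tweaked block encryptions passed in as closures (the name steal and its shape are illustrative, not the author’s code): the last full block is encrypted, the short final ciphertext “steals” a prefix of that output, and the leftover bytes pad the partial block before it is encrypted in the penultimate position.

import qualified Data.ByteString as BS

steal :: (BS.ByteString -> BS.ByteString)  -- tweaked encryption for block n-1
      -> (BS.ByteString -> BS.ByteString)  -- tweaked encryption for block n
      -> BS.ByteString                     -- last full plaintext block (16 bytes)
      -> BS.ByteString                     -- trailing partial block (1..15 bytes)
      -> (BS.ByteString, BS.ByteString)    -- (penultimate, final) ciphertext
steal encPrev encLast pFull pPart = (cPrev, cLast)
  where
    m     = BS.length pPart
    cc    = encPrev pFull          -- encrypt the last full block
    cLast = BS.take m cc           -- the short final block steals this prefix
    pp    = pPart <> BS.drop m cc  -- pad the partial block with the leftovers
    cPrev = encLast pp             -- re-encrypt the padded block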
Vectorization is the idea of creating a “vector” of data from a set of scalars, and then using Single Instruction, Multiple Data (SIMD) to do a computation on the elements of that vector in parallel.
On x86_64, the %[xyz]mm registers are vector registers:
%xmm registers can hold 128 bits.
%ymm registers can hold 256 bits.
%zmm registers can hold 512 bits.
Vectorized instructions are composed of a big pile of mnemonics.
vpaddq %ymm0,%ymm1,%ymm1
Where %ymm0 and %ymm1 contain \(\mathbf{v}\) and \(\mathbf{w}\) respectively:
\[ \mathbf{v} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix} \]
%ymm1 ends up holding:
\[ \mathbf{v} + \mathbf{w} = \begin{bmatrix} 6 \\ 8 \\ 10 \\ 12 \end{bmatrix} \]
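In the document’s Haskell register, this is just a lane-wise zipWith; a tiny sketch (reusing the mnemonic as a function name purely for illustration):

import Data.Word (Word64)

-- Roughly what vpaddq does across the four 64-bit lanes of a %ymm register.
vpaddq :: [Word64] -> [Word64] -> [Word64]
vpaddq = zipWith (+)

-- vpaddq [1,2,3,4] [5,6,7,8] == [6,8,10,12]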
AES blocks in XTS are independent. One block at a time looks like this:
vbroadcasti32x4 ($key1), $t0
# ARK (vpternlogq \$0x96 is a three-way XOR: state ^ tweak ^ round key)
vpternlogq      \$0x96, $t0, $tw1, $st1
# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc         $t0, $st1, $st1
# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc         $t0, $st1, $st1
# Continued for 14 rounds...
vmovdqu8        $st1,($output)
Because the blocks are independent, two can be kept in flight at once:
vbroadcasti32x4 ($key1), $t0
# ARK
vpternlogq      \$0x96, $t0, $tw1, $st1
vpternlogq      \$0x96, $t0, $tw2, $st2
# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
# Continued for 14 rounds...
vmovdqu8        $st1,($output)
vmovdqu8        $st2,0x40($output)
And four:
vbroadcasti32x4 ($key1), $t0
# ARK
vpternlogq      \$0x96, $t0, $tw1, $st1
vpternlogq      \$0x96, $t0, $tw2, $st2
vpternlogq      \$0x96, $t0, $tw3, $st3
vpternlogq      \$0x96, $t0, $tw4, $st4
# round 1
vbroadcasti32x4 0x10($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
vaesenc         $t0, $st3, $st3
vaesenc         $t0, $st4, $st4
# round 2
vbroadcasti32x4 0x20($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
vaesenc         $t0, $st3, $st3
vaesenc         $t0, $st4, $st4
# Continued for 14 rounds...
vmovdqu8        $st1,($output)
vmovdqu8        $st2,0x40($output)
vmovdqu8        $st3,0x80($output)
vmovdqu8        $st4,0xc0($output)
Tweak generation in XTS does have “some” linearity.
If:
\[ T' = \begin{cases} (T \ll 1) \oplus \texttt{0x87} & T \gg 127 = 1 \\ T \ll 1 & \text{otherwise} \end{cases} \]
Then:
\[ T'' = \begin{cases} (T \ll 2) \oplus \texttt{0x87} & T \gg 126 = 1 \\ T \ll 2 & \text{otherwise} \end{cases} \]
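In other words, any later tweak is a direct function of the current one. A toy check against the nextTweak sketch above (tweakAhead is a hypothetical name; real vectorized code computes the jump with carry-less multiplies rather than by iterating):

-- Jump the tweak forward k positions by repeated GF(2^128) doubling.
tweakAhead :: Int -> Integer -> Integer
tweakAhead k t = iterate nextTweak t !! k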
That linearity means a whole register of tweaks can be computed at once, four per %zmm:
# Tweaks 0-3
vpsllvq    const_dq3210(%rip),%zmm0,%zmm4
vpsrlvq    const_dq5678(%rip),%zmm1,%zmm2
vpclmulqdq \$0x0,$ZPOLY,%zmm2,%zmm3
vpxorq     %zmm2,%zmm4,%zmm4{%k2}
vpxord     %zmm4,%zmm3,%zmm9
Each round of the four-block version then issues four independent vaesenc instructions:
vaesenc $t0, $st1, $st1
vaesenc $t0, $st2, $st2
vaesenc $t0, $st3, $st3
vaesenc $t0, $st4, $st4
The Front-end of the pipeline on recent Intel microarchitectures can allocate four µOps per cycle, while the Back-end can retire four µOps per cycle. Interleaving the tweak generation between AES rounds keeps those slots full:
# round 3
vbroadcasti32x4 0x30($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
vaesenc         $t0, $st3, $st3
vaesenc         $t0, $st4, $st4
# Generate next 4 tweaks
vpsrldq         \$0xf, $t0, %zmm13
vpclmulqdq      \$0x0,$ZPOLY, %zmm13, %zmm14
vpslldq         \$0x1, $t0, %zmm16
vpxord          %zmm14, %zmm16, %zmm16
# round 4
vbroadcasti32x4 0x40($key1), $t0
vaesenc         $t0, $st1, $st1
vaesenc         $t0, $st2, $st2
vaesenc         $t0, $st3, $st3
vaesenc         $t0, $st4, $st4
And eight tweaks at a time, feeding two state registers:
# Tweaks 0-3
vpsllvq    const_dq3210(%rip),%zmm0,%zmm4
vpsrlvq    const_dq5678(%rip),%zmm1,%zmm2
vpclmulqdq \$0x0,$ZPOLY,%zmm2,%zmm3
vpxorq     %zmm2,%zmm4,%zmm4{%k2}
vpxord     %zmm4,%zmm3,%zmm9
# Tweaks 4-7
vpsllvq    const_dq7654(%rip),%zmm0,%zmm5
vpsrlvq    const_dq1234(%rip),%zmm1,%zmm6
vpclmulqdq \$0x0,$ZPOLY,%zmm6,%zmm7
vpxorq     %zmm6,%zmm5,%zmm5{%k2}
vpxord     %zmm5,%zmm7,%zmm10