Communications in Cryptology IACR CiC


Dates are inconsistent
3 results sorted by publication date
Vincent Hwang
Published 2024-07-08 PDFPDF

We survey various mathematical tools used in software works multiplying polynomials in \[ \frac{\mathbb{Z}_q[x]}{\left\langle {x^n - \alpha x - \beta} \right\rangle}. \] In particular, we survey implementation works targeting polynomial multiplications in lattice-based cryptosystems Dilithium, Kyber, NTRU, NTRU Prime, and Saber with instruction set architectures/extensions Armv7-M, Armv7E-M, Armv8-A, and AVX2.

There are three emphases in this paper: (i) modular arithmetic, (ii) homomorphisms, and (iii) vectorization. For modular arithmetic, we survey Montgomery, Barrett, and Plantard multiplications. For homomorphisms, we survey (a) various homomorphisms such as Cooley–Tukey FFT, Good–Thomas FFT, Bruun's FFT, Rader's FFT, Karatsuba, and Toom–Cook; (b) various algebraic techniques for adjoining nice properties to the coefficient rings, including localization, Schönhage's FFT, Nussbaumer's FFT, and coefficient ring switching; and (c) various algebraic techniques related to the polynomial moduli, including twisting, composed multiplication, evaluation at $\infty$, truncation, incomplete transformation, striding, and Toeplitz matrix-vector product. For vectorization, we survey the relations between homomorphisms and vector arithmetic.

We then go through several case studies: We compare the implementations of modular multiplications used in Dilithium and Kyber, explain how the matrix-to-vector structure was exploited in Saber, and review the design choices of transformations for NTRU and NTRU Prime with vectorization. Finally, we outline several interesting implementation projects.

Décio Luiz Gazzoni Filho, Tomás S. R. Silva, Julio López
Published 2024-07-08 PDFPDF

We present a solution to the open problem of designing a linear-time, unbiased and timing attack-resistant shuffling algorithm for fixed-weight sampling. Although it can be implemented without timing leakages of secret data in any architecture, we illustrate with ARMv7-M and ARMv8-A implementations; for the latter, we take advantage of architectural features such as NEON and conditional instructions, which are representative of features available on architectures targeting similar systems, such as Intel. Our proposed algorithm improves asymptotically upon the current approach based on constant-time sorting networks ($O(n)$ versus $O(n \log^2 n)$), and an implementation of the new algorithm applied to NTRU is also faster in practice, by a factor of up to $6.91\ (591\%)$ on ARMv8-A cores and $12.89\ (1189\%)$ on the Cortex-M4; it also requires fewer uniform random bits. This translates into performance improvements for NTRU encapsulation, compared to state-of-the-art implementations, of up to 50% on ARMv8-A cores and 72% on the Cortex-M4, and small improvements to key generation (up to 2.7% on ARMv8-A cores and 6.1% on the Cortex-M4), with negligible impact on code size and a slight improvement in RAM usage for the Cortex-M4.

Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López
Published 2024-04-09 PDFPDF

Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by $1.54$–$3.07\times$, $1.08$–$1.33\times$, $1.11$–$1.50\times$ and $1.20$–$1.98\times$, respectively, over the previous state-of-the-art.