Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs

Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs).


Introduction
In the 1990s, Shor [Sho97] described an efficient quantum algorithm for problems that are hard for classical computers (integer factorization and discrete logarithms), compromising many existing cryptosystems. Post-quantum cryptography (PQC) seeks to develop new public-key cryptosystems based on hard computational problems resistant to quantum attacks, such as those based on lattices. Research into efficient implementation flourished as multiprecision integer arithmetic gave way to polynomial multiplication modulo "small" (CPU word size) integers for lattice-based schemes, introducing new challenges and performance tradeoffs.
In their Turing lecture [HP19], computer architecture pioneers Hennessy and Patterson claim that "(a)n era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year", suggesting that further hardware improvements must come from domain-specific architectures tailored to specific applications. This is seen in the addition of SIMD extensions and special-purpose instructions (such as for symmetric cryptography and binary polynomial multiplication) to CPUs, as well as in increasingly popular accelerators such as programmable graphics processing units (GPUs), FPGAs and others. Such developments have opened up new research agendas for efficient implementation of cryptosystems.
The meteoric rise of artificial intelligence, machine learning and deep learning in the 2010s, and their demand for teraflops-level linear algebra performance, led to the introduction of domain-specific architectures such as tensor cores in NVIDIA's GPUs [MCL+18], IBM Power ISA's Matrix-Multiply Assist (MMA) [MBB+21], Intel's Advanced Matrix Extensions (AMX) [Int22], and ARMv9-A's Scalable Matrix Extensions (SME) [WMS22]. The first three launched in production hardware in 2017, 2021 and 2023, respectively, whereas the latter, to our knowledge, has not shipped as of 2023. Although less publicized, a matrix multiplication accelerator has shipped in hundreds of millions of mobile phones, tablets and personal computers, specifically those made by Apple, starting with the 2019 A13 system-on-chip (SoC) used in the iPhone 11 [Rod20, Section 7.6]. It is a coprocessor named AMX (no relationship to Intel AMX), also present in newer generations of Apple SoCs, such as the M-series powering ARM-based Apple personal computers.
Our work investigates repurposing AMX to accelerate the core polynomial multiplication routine of lattice-based post-quantum cryptosystems, by implementing NTRU [HPS98], specifically the proposal [CDH+20] that advanced to round 3 of NIST's PQC standardization process [Nat17], for Apple's M1 and M3 SoCs. Despite NIST's preference for Kyber [SAB+20], NTRU is standardized by IEEE Std 1363.1 [Ins09] and ASC X9.98 [Ame17]. It is also representative of a wider class of lattice-based schemes based on NTT-unfriendly rings, such as SABER [DKR+20] and FrodoKEM [BCD+16, NAB+20]; the latter is recommended by the German BSI [Bun21] and under standardization by ISO [Int23c]. Should cryptanalytic attacks against Kyber emerge, NTRU may become the target of renewed interest.
A potential roadblock is that AMX is undocumented by Apple; programmers are intended to access it via calls to Apple's Accelerate framework [App23], which provides accelerated versions of the industry-standard BLAS library [LHKK79] and the proprietary BNNS neural network API. Fortunately, extensive reverse-engineering efforts [Joh22b, Han23, Caw24] document AMX's instruction set and performance characteristics, allowing us to develop a full NTRU implementation with AMX-accelerated polynomial multiplication routines, as presented in this paper. The authors were not directly involved in these reverse-engineering efforts, working only with the information made public in these references; in the interest of full disclosure, the first author has corresponded with one of the reverse-engineering researchers (Peter Cawley), in particular providing him access to M3 hardware to investigate functional changes in its AMX unit over previous-generation SoCs.
Related works. Google's Tensor Processing Unit is described in [JYP+17]. It achieves a peak throughput of 92 TOPS and 15-30× faster neural network inference than CPUs and GPUs of that era. [MBB+21] reports a 2× speedup from using MMA over regular vector code for double-precision matrix multiplication on the POWER10 CPU.
Some works demonstrate speedups from NVIDIA tensor cores, including for PQC schemes [MCL+18, WZF+22, LSH+22, LSZH22]. However, GPU architectures, being high-throughput but high-latency, need aggressive batching (thousands to tens of thousands of parallel KEM/signature operations) to realize their performance potential, severely limiting applicable use cases. We view direct comparisons of batch and online implementations (for CPUs and, as we shall show, AMX) as unfair, and refrain from doing so in this paper.
We found no references to Apple AMX in the scientific literature, and Apple does not officially document AMX functionality or performance characteristics. We resort to the reverse-engineering efforts of [Joh22b, Han23, Caw24], which we summarize in Section 3.
Lastly, we consider ARMv8-A implementations of NTRU. Karatsuba and Toom-Cook algorithms are implemented in [NG21], achieving speedups on the M1 of up to 6.68× and 8.49× for encapsulation and decapsulation, respectively, versus the reference implementation. [CCHY23] set new speed records for the HPS2048677 and HRSS701 parameter sets using the TMVP algorithm. They also optimized polynomial inversion and constant-time sorting, speeding up key generation, encapsulation and decapsulation by 7.67×, 2.48× and 1.77×, respectively, compared to [NG21] on the Cortex-A72 core.
Our contributions. We present the first (to our knowledge) implementation targeting AMX in the scientific literature, and also the first cryptographic implementation on a CPU-coupled matrix multiplication accelerator; previous works targeted GPU tensor cores.
Unlike GPUs, CPU-coupled matrix multiplication accelerators (coprocessors or instruction set extensions) can achieve high throughput without sacrificing latency. Indeed, we set new NTRU speed records on Apple M1/M3 SoCs, outperforming previous state-of-the-art NEON implementations in polynomial multiplication, key generation, encapsulation and decapsulation by 1.54-3.07×, 1.08-1.33×, 1.11-1.50× and 1.20-1.98×, respectively. Notably, we achieve such speedups under conditions previously seen only in CPU implementations: latency (cycle count) for a single execution of the scheme's operations, with no batching at all. Usually, accelerator (e.g. GPU) implementations require enormous batching levels to achieve claimed speedups, with high latencies for a single execution.
SIMD extensions required a paradigm shift from a scalar to a vector (1D) view of computation; matrix multiplication accelerators further shift towards a matrix (2D) view. AMX's peak throughput is only achieved for its main matrix (outer product) operation; vector operation throughput is similar to NEON's. Also, its instruction set is rather limited, seemingly designed to implement only a restricted set of algorithms. Our main contribution is repurposing AMX for a new application (polynomial multiplication) while realizing its performance potential; this required a deep rethinking of polynomial multiplier architectures to expose two-dimensional parallelism, maximizing the ratio of matrix to vector operations to reap as much of its throughput as possible. We conjecture that the techniques we propose for AMX may be applicable to many similar accelerators soon to reach the market.
Our implementation is available at https://github.com/dgazzoni/NTRU-AMX under various licenses (such as Apache and MIT) for existing code reused in our implementation, and Creative Commons CC0 ("No Rights Reserved") for the authors' own code. It also includes an optimized implementation of NIST's randombytes() function based on AES-CTR-DRBG, using the ARMv8 Cryptographic Extensions' AES instructions, which may be of independent interest for other NTRU implementations and PQC schemes.

Preliminaries
Notation. Let a(x) = a_0 + a_1 x + ... + a_{n−1} x^{n−1}. A "slice" of the i-th through j-th coefficients (j ≥ i) of a(x) is written a_{i:j}(x) = a_i x^i + a_{i+1} x^{i+1} + ... + a_j x^j. Boldface variable names refer to the associated row vector representation: a_{i:j} = [a_i, a_{i+1}, ..., a_j]. Let x_{i:j} = [x^i, x^{i+1}, ..., x^j]; then a_{i:j}(x) = a_{i:j} x_{i:j}^T. Finally, i:k:j represents ranges with a non-unit step of k; e.g. a_{i:k:j} = [a_i, a_{i+k}, a_{i+2k}, ..., a_j]. We also combine slice notation with C's array indexing notation, i.e. X[i : j] selects the elements of index i, ..., j of X.
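As an illustration only (not part of the paper's code), the slice notation a_{i:j} and the strided range i:k:j can be modeled by the following hypothetical C helpers:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy the slice a[i..j] (inclusive) into out; returns its length.
 * Models the row vector a_{i:j} = [a_i, a_{i+1}, ..., a_j]. */
static size_t slice(const int16_t *a, size_t i, size_t j, int16_t *out) {
    for (size_t t = i; t <= j; t++) out[t - i] = a[t];
    return j - i + 1;
}

/* Copy the strided slice a_{i:k:j} = [a_i, a_{i+k}, a_{i+2k}, ...]
 * into out; returns the number of elements selected. */
static size_t strided_slice(const int16_t *a, size_t i, size_t k, size_t j,
                            int16_t *out) {
    size_t n = 0;
    for (size_t t = i; t <= j; t += k) out[n++] = a[t];
    return n;
}
```

For example, `strided_slice(a, 0, 2, 62, out)` selects the even-index coefficients a_{0:2:62}, the stride pattern used later for the even rows of Z.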
The NTRU cryptosystem. Based on the pioneering work of [HPS98], NTRU [CDH+20] is a key encapsulation mechanism (KEM) merging the previous NTRUEncrypt [ZCH+19] and NTRU-HRSS-KEM [SHRS17] proposals, which provide the suggested NTRU-HPS and NTRU-HRSS parameter sets, respectively. Based on structured lattices, it uses polynomial arithmetic modulo x^n − 1 for prime n, with coefficients reduced either modulo 3 or modulo a power of two, q. The original cryptosystem [HPS98] is a partially correct, probabilistic public-key encryption (PKE) scheme. The NIST submission uses the techniques of [HPS96] to obtain a perfectly correct, deterministic PKE, and constructs a KEM using a generic transformation from this PKE [HHK17] which is IND-CCA2 secure in the random oracle model (ROM), and in the quantum-accessible ROM under a non-standard assumption.

Let Φ_1 = x − 1 and Φ_n = (x^n − 1)/(x − 1). We define the canonical representative of an element of S_3, denoted S_3(·), as the polynomial of degree at most n − 2 with coefficients in {−1, 0, 1}. Algorithms 2.1, 2.2 and 2.3 define the NTRU PKE, where most of the execution time is spent; for a full description of the scheme, including the Sample_{f,g} routine, see [CDH+20]. Lift(m), for a polynomial m(x), is defined as S_3(m) for the HPS parameter sets and Φ_1 · S_3(m/Φ_1) for HRSS. Algorithms 2.2 and 2.3 require 1 and 3 polynomial multiplications (respectively) in the HPS parameter sets, and Lift adds an extra multiplication for HRSS. At first glance, Algorithm 2.1 uses 5 polynomial multiplications, but polynomial inversion modulo q in line 3 is usually realized as modulo-2 inversion followed by 4 Newton iterations to lift the result to Z_q[x]/(Φ_1 Φ_n), which require an extra 8 multiplications, for a total of 13. The submission includes four parameter sets: HPS2048509, HPS2048677, HPS4096821, and HRSS701. The first three use fixed-weight sample spaces proposed by Hoffstein, Pipher, and Silverman [HPS96, HPS98]; the last uses arbitrary-weight sample spaces proposed by Hülsing, Rijneveld, Schanck, and Schwabe [HRSS17]. Table 1 lists parameters and sizes.
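The count of 4 Newton iterations can be sanity-checked with an integer analogue, sketched below under the assumption (standard for Hensel lifting) that each iteration x ← x(2 − ax) doubles the number of correct low-order bits and costs 2 multiplications; the function names are ours, for illustration only:

```c
#include <stdint.h>

/* Number of Newton iterations needed to lift an inverse mod 2 to an
 * inverse mod 2^k: correct bits go 1 -> 2 -> 4 -> ... until >= k. */
static int newton_iterations(int k) {
    int iters = 0, bits = 1;
    while (bits < k) { bits *= 2; iters++; }
    return iters;
}

/* Integer analogue of the lifting: inverse of an odd a modulo 2^16,
 * via the Newton step x <- x * (2 - a*x), 2 multiplications per step. */
static uint16_t inv_mod_2_16(uint16_t a) {
    uint32_t x = 1; /* a * 1 = 1 (mod 2), since a is odd */
    for (int i = 0; i < newton_iterations(16); i++)
        x = (x * (2u - (uint32_t)a * x)) & 0xFFFFu;
    return (uint16_t)x;
}
```

For q = 2048 = 2^11, `newton_iterations(11)` returns 4, so inversion costs 5 + 4 · 2 = 13 multiplications, matching the count above.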

Apple's AMX coprocessor
In this section, we compile and summarize information about AMX from the reverse-engineering efforts and analyses of Apple patents in [Joh22b, Han23, Caw24]. Such information is, by its nature, conjectural, although backed by extensive functional and performance experiments. This section seeks to make this paper as self-contained as possible, and is also the first (to our knowledge) presentation of AMX in the scientific literature.
We do not lay claim to any of these discoveries; our main contribution lies in cataloguing techniques to repurpose AMX to efficiently perform polynomial multiplication modulo "small" integers, and in its application to NTRU. Other lattice-based PQC cryptosystems should also benefit, a task left as future work in Section 6.
AMX is a coprocessor from Apple that accelerates matrix multiplication operations, first introduced in the Apple A13 SoC powering the iPhone 11 [Rod20]. Functionally, the M1 and M3 AMX units are almost identical; the M3's has two added features which we did not find useful for our implementation. Performance-wise, our results in Section 5 indicate that the M3 delivers improvements over the M1.

Programmer's model
AMX exposes 80 64-byte registers, split into eight X and eight Y registers (X_0 to X_7 and Y_0 to Y_7), and 64 Z registers better viewed as rows of a matrix, denoted Z[0], ..., Z[63]. Figure 1, redrawn from an Apple patent [SBG+16, Figure 2], is a visualization of the intended register file organization. Some instructions can concatenate either the X or the Y registers for bytewise addressing as 512-byte circular buffers, for which we use the slice notation defined in Section 2. Figure 2 shows the different X/Y register addressing notations. X and Y are inputs and Z are outputs for most instructions, with a few exceptions. Data cannot be moved between CPU and AMX registers directly; it must go through memory.
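The circular-buffer view of the X (or Y) registers can be sketched in plain C. This is an illustrative model based on the reverse-engineered behavior reported in [Caw24], not AMX code: byte-addressed reads of a 64-byte operand wrap around modulo 512.

```c
#include <stdint.h>

/* Model of the X register pool viewed as a 512-byte circular buffer
 * (8 registers x 64 bytes each); register X_r spans bytes 64r..64r+63. */
typedef struct { uint8_t bytes[512]; } amx_x_pool;

/* Read a 64-byte operand starting at an arbitrary byte offset; offsets
 * near the end of the pool wrap around to X_0, per the circular view. */
static void x_read_64(const amx_x_pool *x, unsigned offset, uint8_t out[64]) {
    for (unsigned i = 0; i < 64; i++)
        out[i] = x->bytes[(offset + i) % 512];
}
```

For instance, a read at byte offset 480 takes the last 32 bytes of X_7 followed by the first 32 bytes of X_0.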
Different data types can be operated on (8-, 16- or 32-bit integers, or 16-, 32- or 64-bit floating-point values). Input/output lane widths may be identical or mixed in specific combinations [Caw24], but we only use 16-bit integer inputs and outputs. The inputs to outer product operations (X_i and Y_i) are 32-element vectors of 16 bits each, resulting in a 32 × 32 output matrix of 16-bit elements; this is smaller than Z's available storage space (16,384 out of an available 32,768 bits). As such, the rows of the output matrix are mapped to either the even or the odd rows of Z, which are fully populated with output coefficients, as the size of a row of Z (512 bits) matches the size of a row of the output matrix.
CPUs in a cluster can share an AMX unit through per-CPU replication of architectural state [Joh22a, Han23]. Instructions are tagged with their source CPU to identify its copy of the state; thus, multiple CPUs may interleave instruction execution [Han23]. Special instructions (set/clr) control AMX's enabled/disabled status for each thread [Caw24].

Programming interface
AMX instructions are inserted into the CPU's instruction stream and, once no longer speculative, are dispatched to AMX via the CPU's store units [Han23]. They are encoded as A64 instructions within a reserved opcode space [Caw24]; given A64's fixed 32-bit instruction encoding, only 10 bits remain, 5 of which encode the instruction's opcode. The remaining 5 bits either encode the index of a scalar 64-bit register through which parameters are passed, or an immediate. This layout is shown in Figure 3. [Caw24] provides preprocessor macros to emit AMX instructions, with an intrinsics-like syntax.
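The 5+5-bit layout can be sketched as a small encoder. This follows the encoding reported by the reverse-engineering effort [Caw24] (base word 0x00201000, opcode in bits 9-5, operand register index or immediate in bits 4-0); the base constant and the opcode numbers below are reverse-engineered, not officially documented:

```c
#include <stdint.h>

/* Opcode numbers as reported in [Caw24] (reverse-engineered). */
enum { AMX_OP_LDX = 0, AMX_OP_LDY = 1, AMX_OP_MAC16 = 14 };

/* Encode one AMX instruction word: reserved base | opcode | operand,
 * where operand is a 5-bit GPR index (parameter register) or immediate. */
static uint32_t amx_encode(uint32_t opcode, uint32_t operand) {
    return 0x00201000u | ((opcode & 0x1Fu) << 5) | (operand & 0x1Fu);
}
```

The macros of [Caw24] emit exactly such words via inline assembly `.word` directives, since assemblers do not recognize the reserved opcode space.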

Instruction set
We now list all known AMX instructions, reviewing those used by our polynomial multiplication implementation, as well as parameters of interest; for an exhaustive specification, see [Caw24]. We propose a taxonomy of instructions in Table 2, categorized by functionality.
We first consider loads and stores. The least significant 56 bits of their 64-bit argument are treated as a pointer to the source/target memory address; the remaining bits encode parameters as per Table 3. While many other instructions can address X/Y registers at arbitrary byte positions, loads and stores require alignment to the start of a register (a 64-byte boundary); register indices are encoded in bits 58-56 of the argument (or 61-56 for Z rows). Memory accesses can be 64 bytes (bit 62 = 0) or 128 bytes (bit 62 = 1) wide. We illustrate the macros of [Caw24] with an example: loading from an array with base address v to register X_4. We show a version using just their macros, a second using helper macros defined by us, and a simplified notation used in the algorithms of Section 4.
We briefly review functionalities of interest for other instructions. Due to the default role of X and Y as input and Z as output registers, data may need to be moved from Z to X or Y, for which we use the extrh instruction. Its basic operation is copying a row of Z to a byte-addressable 64-byte slice of either X or Y. For a visualization of extrh in use, refer to Figure 4.
Our algorithmic notation covers, for example, the operation of extracting Z[63] to Y_0.
The main AMX arithmetic instruction we use is mac16, which supports vector and matrix modes. We start by reviewing the vector-mode version. Let x, y, z be 32-element vectors of 16-bit integers. Vector-mode mac16 computes pointwise multiply-accumulates (MACs): z ← z + x • y, where + is vector addition and • is the Hadamard (pointwise) product. A row of Z is used for z, and x and y are 32-element slices of X and Y. Each of x, y, z can be optionally "skipped" to realize different operations: vector addition (z ← z + x or z ← z + y), Hadamard product without accumulation (z ← x • y), or copying x or y to z, i.e. the extrh instruction in reverse. Our algorithmic notation assumes, for the sake of example, the computation of the vector addition z ← z + x (thus, the y input is skipped), in which we choose z to be the first row (i.e. that of index 0) of Z and x to be elements 64 to 95 of X (i.e. bytes 128 to 191, or equivalently, the full X_2 register).
We now turn to matrix-mode mac16. Let Z be a 32 × 32 matrix, and x and y be 32-element (row) vectors. Matrix-mode mac16 realizes an outer product operation with accumulation, as in Figure 1: Z ← Z + y^T x. Concretely, the operands x, y, Z are mapped to slices of X and Y, and even or odd rows of Z, respectively. As with vector-mode mac16, operands can be skipped, although skipping Z appears to be the only sensible possibility, which disables accumulation. It is also possible to update only a subset of output rows and columns, effectively realizing outer products of smaller dimensions. Consider an example operation computing the outer product Y_0^T X_0 (i.e. elements 0 to 31 of each of these registers), accumulated with, and saved to, the even rows of Z, while updating only columns 0 to 15 of the output matrix. When the "columns" parameter is omitted, it is understood that all columns of the
output matrix are updated. Our notation does not distinguish between vector- and matrix-mode mac16 directly; rather, the mode is implied by the operands of the instruction. A caveat: although AMX's Y registers are best understood as column vectors, our notation assumes vectors (say v) to be row vectors, while column vectors are denoted v^T. We extend this convention to AMX's X and Y registers, writing outer products as Y_i^T X_j. Finally, vecint is a vector (pointwise) instruction which generalizes vector-mode mac16; in particular, a single instruction can compute z ← z + x + y, attaining twice the throughput of vector-mode mac16, a fact which we use to speed up some of our algorithms. As an example, consider the 16-bit integer addition of elements 16 to 47 of the X register (i.e. bytes 32 to 95 of X, incorporating elements of both X_0 and X_1) with elements 0 to 31 of the Y register (i.e. Y_0), accumulating to Z[1].
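The three arithmetic operations just described can be summarized by plain-C reference models. These are illustrative only (ordinary C on arrays, not AMX instructions); the even-row mapping in the matrix mode follows the description in Section 3.1:

```c
#include <stdint.h>

#define LANES 32

/* Vector-mode mac16: z <- z + x . y (pointwise multiply-accumulate). */
static void mac16_vector(int16_t z[LANES], const int16_t x[LANES],
                         const int16_t y[LANES]) {
    for (int i = 0; i < LANES; i++)
        z[i] = (int16_t)(z[i] + x[i] * y[i]);
}

/* Matrix-mode mac16: Z <- Z + y^T x (outer product with accumulation);
 * row r of the 32x32 result maps to the even Z row 2*r. */
static void mac16_matrix(int16_t Z[64][LANES], const int16_t x[LANES],
                         const int16_t y[LANES]) {
    for (int r = 0; r < LANES; r++)
        for (int c = 0; c < LANES; c++)
            Z[2 * r][c] = (int16_t)(Z[2 * r][c] + y[r] * x[c]);
}

/* vecint three-operand addition: z <- z + x + y in one instruction. */
static void vecint_add2(int16_t z[LANES], const int16_t x[LANES],
                        const int16_t y[LANES]) {
    for (int i = 0; i < LANES; i++)
        z[i] = (int16_t)(z[i] + x[i] + y[i]);
}
```

The last model makes the throughput argument concrete: one vecint performs the work of two vector-mode additions.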

Performance considerations
The M1 has two AMX coprocessors, one per core cluster (performance and efficiency), implementing a second generation of AMX instructions [Caw24]. AMX runs at the same clock speed as the cores and accesses memory via the L2 cache [Joh22b]. It supports out-of-order execution independently of the CPU, with buffers for 28 to 32 operations [Joh22b].
Outer product latency and throughput are discussed in [Joh22b] for floating-point operations. The performance cluster's AMX unit has an array of pipelined MAC units with 4-cycle latency, capable of executing one 32- or 64-bit outer product operation per cycle, or a 16-bit operation every two cycles. To achieve maximum performance, inter-instruction data dependencies must be avoided, either by using multiple accumulators, i.e. different subsets of Z as a destination (such as even and odd rows), or by issuing instructions from different CPUs, due to independent per-CPU AMX register files (see Section 3.1).
AMX microbenchmarking code is supplied in [Caw24]. We have run this code on an M1-based laptop, and report selected throughput figures, restricted to single-threaded results, presumably running on the performance cluster's AMX unit. Multiply-accumulates (MACs) count as 2 operations, and we report best-case scenarios, often using multiple accumulators; performance may degrade, significantly in certain cases, if fewer accumulators are used.
Outer products achieve up to 3.05 TFLOPS throughput with 16-bit floating-point arithmetic, or 3.05 TOPS with 8-bit integers; throughput drops to 1.53 TOPS for 16-bit integer arithmetic. In vector mode, up to 381 GFLOPS (or GOPS) is achievable with floating-point or 8-bit integer MACs, dropping to 191 GOPS for 16-bit integer arithmetic. Replacing MACs with additions or multiplications alone further halves throughput (due to performing half as many operations per instruction). As a comparison, peak NEON throughput on M1 performance cores for 16-bit integer data is 204.8 GOPS for MACs and 102.4 GOPS for additions or multiplications alone, based on the microarchitectural investigations of [Joh22a].
Thus, AMX's peak throughput is ≈ 8× higher for matrix (outer product) operations than for vector operations, whose peak performance merely approaches NEON's. That is, outer products are much cheaper than vector operations. Throughout the paper, we consistently strive to reduce the vector operation count, which is the bottleneck of our computations.
While our implementation takes the usual constant-time precautions (foregoing branches and memory accesses conditional/indexed on secret data), it implicitly relies on constant-time execution of AMX instructions. Recently, ARM and Intel have started guaranteeing data-independent timing for some instructions [ARM, Int23b, Int23a]. Given AMX's lack of official documentation, it is impossible to obtain such assurances, although the experiments in Section 5.1 suggest AMX displays data-independent timing. The previous lack of constant-time guarantees for CPUs has not prevented implementers from striving for, and achieving in practice, implementations with apparent constant-time characteristics, and may have played a key part in incentivizing manufacturers to provide such guarantees. Our paper is a first step in that direction for CPU-coupled matrix-multiplication accelerators.

Implementing polynomial multiplication on AMX
We now describe our AMX implementation of schoolbook multiplication in Z_{2^16}[x]/(x^n − 1).

Challenges and strategies for efficient implementation
We start by summarizing the strategies that were evaluated for efficient implementation of polynomial multiplication on AMX, why we discarded some of them, and how we evolved and optimized our implementation to eventually achieve new speed records on Apple SoCs.
Many techniques for efficient polynomial multiplication on ARMv8 are based on vector (1D) operations, a natural match for the NEON SIMD instruction set. However, as stated in Section 3.4, vector operations achieve a fraction of AMX's peak throughput, comparable to or even slower than the CPU's NEON units. There is little hope for AMX implementations to outperform NEON if restricted to vector operations; matrix (2D) operations must be used to approach AMX's peak theoretical performance.
The Toeplitz Matrix-Vector Product (TMVP) algorithm [CCHY23] is the current state-of-the-art in NEON performance. By casting polynomial multiplication in the language of linear algebra, it may appear perfectly suited to matrix accelerators such as AMX. However, its main operation is a matrix-vector product. AMX's outer product instruction computes matrix-matrix products very efficiently; however, the cost of a 32 × 32 by 32 × 1 matrix-vector product (the natural AMX block size for 16-bit integers) using this instruction is essentially identical to that of a 32 × 32 by 32 × 32 matrix-matrix product. We thoroughly investigated other approaches, but found no strategy that outperforms NEON for matrix-vector products. Further evidence comes from our benchmarks of (floating-point) matrix-vector products using Apple's BLAS library, whose performance does not exceed the peak theoretical throughput of NEON. As such, we also disregard the use of TMVP.
Our key observation leading to efficient polynomial multiplication on AMX was the similarity between outer products and operand-scanning multiplication, resulting in the algorithm of Section 4.2 for multiplying 32-coefficient slices of polynomials, using the natural 32 × 32 AMX block size. However, for each outer product operation, dozens of vector operations are needed to shift partial products and perform sum-reduction to arrive at the polynomial product; as the throughput of matrix AMX operations is ≈ 8× higher than that of vector operations (see Section 3.4), it is imperative to increase the ratio of matrix to vector operations to approach the peak theoretical AMX throughput. Extrapolating the performance of a basic block (≥ 90 clock cycles) to the polynomial degrees used in NTRU shows that this technique alone is not competitive with state-of-the-art NEON implementations.
Next, we shifted to a size-32 blocked view of polynomial multiplication. We noticed that a blockwise product-scanning approach allows computing and accumulating all partial products in each block column, deferring all vector operations for shifting and sum-reduction of that block column to the end. This reduces the vector operation count asymptotically from O(n²) to O(n), improving the matrix/vector instruction balance and thus AMX throughput. This strategy and the resulting algorithms are described in Section 4.3.
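The asymptotic gain can be made concrete with a simple operation-count model. In this sketch (our own illustration, not the paper's code), inputs of n′ = 32⌈n/32⌉ coefficients split into B = n′/32 blocks require B² flattenings when every slice product is flattened eagerly, against 2B − 1 flattenings (one per block column of the product) under lazy sum-reduction:

```c
/* Number of 32-coefficient blocks after zero-padding n to n'. */
static int blocks(int n) { return (n + 31) / 32; }

/* Eager strategy: one shift/flatten per pair of slices, B^2 total. */
static int flattenings_eager(int n) { int B = blocks(n); return B * B; }

/* Lazy blockwise product scanning: one per block column, 2B - 1 total. */
static int flattenings_lazy(int n) { return 2 * blocks(n) - 1; }
```

For HRSS701 (n = 701, B = 22) the count drops from 484 to 43, an 11× reduction in the vector-operation term.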
The product of two degree-n polynomials, as in the previous paragraph, results in a degree-2n polynomial; however, in NTRU and other schemes, a reduction is applied to obtain a degree-n output. In Section 4.4, we show how to perform reduction concurrently, in an integrated fashion, with blockwise product scanning; this reduces by approximately half the number of vector operations for shifts and sum-reduction compared to the strategy of the previous paragraph, again improving the matrix/vector instruction balance. The importance of this optimization is highlighted by the results of Table 4, comparing its performance to polynomial multiplication without reduction (Section 4.3); for NTRU, of course, polynomial reduction must be implemented, further widening the performance gap.
While the previous strategy achieves competitive and even record-setting performance, we noted a claim in [Han23] of performance degradation from concurrent accesses by AMX and the CPU to the same memory page. In Section 4.5, we investigate allocation of input arrays in separate memory pages via the POSIX mmap() function.
Performance results are summarized in Table 5, for both AMX and NEON, from geometric means of the measurements presented later in Table 6. Briefly, per-page memory allocation benefits AMX considerably (the M1 more so than the M3), while NEON displays mixed results regardless of SoC. This also suggests that concurrent processing of the same dataset with NEON and AMX may, counterintuitively, have detrimental performance effects.
All implementations of Section 4 are based on the O(n²) schoolbook method. In Section 5.1, we investigate the performance of subquadratic algorithms, showing experimental evidence that the Karatsuba algorithm with a single recursion level does not outperform schoolbook multiplication on AMX until well outside the range of polynomial degrees of cryptographic interest. We refer to that section for hypotheses explaining this somewhat counterintuitive result, and for evidence that other strategies based on subquadratic algorithms are also not expected to outperform schoolbook multiplication on AMX for the polynomial degrees used in NTRU.

Basic block: multiplication of 32-coefficient slices
The basic block of our implementation multiplies slices of 32 coefficients from each input polynomial. We define a specific notation for polynomial slices starting at indices that are a multiple of 32: a_{32k:32k+31}(x) = a_k(x) (note the boldface k). We also define a notation for the product of 32-coefficient polynomial slices: c_{k,l}(x) = a_k(x) b_l(x). To clarify, we have c_{k,l}(x) = Σ_{m=0}^{62} c_m x^{32(k+l)+m}, where c_m = Σ_{i+j=m} a_{32k+i} b_{32l+j}. As before, c_{k,l} is the associated row vector representation. Throughout the rest of Section 4, we resort to examples using a hypothetical 1/8-size AMX unit for space reasons. Boldface indices will then refer to 4-coefficient slices as appropriate for these examples, rather than 32-coefficient slices for full AMX; thus, a_k would refer to a_{4k:4k+3}. It will be clear from context whether we refer to 4- or 32-coefficient slices.
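A plain-C reference for the slice product (the schoolbook definition of c_m above, with all arithmetic implicitly mod 2^16 as in the implementation) can be written as follows; this is a correctness model, not AMX code:

```c
#include <stdint.h>
#include <string.h>

/* Schoolbook product of two 32-coefficient slices: the 63-coefficient
 * result c, with c[m] = sum over i + j = m of a[i] * b[j] (mod 2^16). */
static void slice_product(const int16_t a[32], const int16_t b[32],
                          int16_t c[63]) {
    memset(c, 0, 63 * sizeof(int16_t));
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            c[i + j] = (int16_t)(c[i + j] + a[i] * b[j]);
}
```

This is the function the AMX basic block of Algorithm 4.1 computes, with the i-loop realized as one outer product and the j-accumulation as shifts and flattenings.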
We first note that the outer product b_l^T a_k generates the required partial products; shifting each row produces the usual multiplication parallelogram. We then perform sum-reduction of the partial products in each column (an operation which we refer to, throughout the rest of Section 4, as flattening, to avoid confusion with polynomial reduction) to obtain c_{k,l}. This can be visualized by considering the analogous operation in a hypothetical 1/8-size AMX unit, operating on 4-coefficient slices starting at the constant coefficient: the outer product b_0^T a_0 is a 4 × 4 matrix whose row i holds the partial products b_i a_0, ..., b_i a_3; shifting row i by i positions yields the parallelogram, which we then flatten by sum-reduction of columns and multiplication by the corresponding power of x to obtain c_{0,0}. On AMX, outer products are easily realized by loading a_k and b_l into slices of the X or Y registers and executing the matrix-mode mac16 instruction. By default, results are accumulated with the current values of Z, although there is a flag to skip the current contents of Z, so as to realize multiplication without accumulation. The result occupies half of the Z register file, either the odd or the even rows of Z; we assume the latter for the sake of example.
To realize shifts and flattenings, we initialize the two ends of a 3-register-wide slice of X (say X_0 and X_2), as well as Z[1], the first odd row of Z, with zeros. For each even row of Z except the first, we use the extrh instruction to extract it to the remaining X register of the slice (X_1 in this example). Recall that AMX allows addressing any 64-byte slice of the full 512-byte X register; we use this feature to select slices containing the appropriate number of zero 16-bit values to realize the desired shifting. These slices are used as input to a pair of vecint or vector-mode mac16 instructions, performing additions only, in order to accumulate the least significant coefficients to Z[0], and the most significant coefficients to Z[1]. Figure 4 illustrates this operation, again on a hypothetical 1/8-size AMX unit.
Performance is further improved by working on pairs of even rows of Z, extracting one to X and the other to Y. As discussed in Section 3.3, a single vecint can accumulate slices of both the X and Y registers to the desired Z row. Microbenchmarks indicate that vecint, on its own, achieves twice the arithmetic throughput of addition with X or Y alone; however, this improvement is not fully realized, as there appears to be contention with resources used to execute extrh, and moreover, microbenchmarks suggest there is a penalty for unaligned accesses to the X and Y registers, as required to shift the input operands.
Algorithm 4.1 formalizes this procedure for polynomial multiplication of 32-coefficient slices. The reader may wish to review the notations in Sections 2 and 3, in particular Figure 2. Note that since AMX stores 64 bytes (in this case, 32 coefficients of 16 bits each) at a time, the output array must be allocated with enough space for 64 coefficients.
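The outer-product-then-flatten procedure can be simulated in plain C on the hypothetical 1/8-size unit of the running example. This is our own illustrative model of the data flow (outer product, per-row shift, columnwise sum-reduction), not AMX code:

```c
#include <stdint.h>
#include <string.h>

#define N 4  /* slice size of the hypothetical 1/8-size AMX unit */

/* Basic-block model: outer product b^T a, shift row i right by i
 * positions (the multiplication parallelogram), then flatten by
 * sum-reducing each column into the 2N - 1 output coefficients. */
static void outer_shift_flatten(const int16_t a[N], const int16_t b[N],
                                int16_t c[2 * N - 1]) {
    int16_t P[N][N];
    /* Outer product, as computed by matrix-mode mac16:
     * row i holds the partial products b[i] * a[0..N-1]. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            P[i][j] = (int16_t)(b[i] * a[j]);
    /* Shifting row i by i makes P[i][j] land on coefficient i + j;
     * summing columns is the "flattening" step. */
    memset(c, 0, (2 * N - 1) * sizeof(int16_t));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i + j] = (int16_t)(c[i + j] + P[i][j]);
}
```

On real AMX, the shift is realized by reading zero-padded unaligned slices of X/Y, and the column sums by vecint/mac16 additions into Z[0] and Z[1], as described above.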

Polynomial multiplication by product-scanning of basic blocks
We turn to the issue of realizing a full multiplication of n-coefficient inputs. Since NTRU requires n to be prime, and the basic block of Section 4.2 works on 32-coefficient slices, we work around this issue by zero-padding input polynomials to n′ = 32⌈n/32⌉ coefficients.
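For instance, HRSS701's n = 701 pads to n′ = 704. A one-line sketch of this padding (the helper name is ours):

```python
# Zero-pad a coefficient list to the next multiple of 32, i.e. to
# n' = 32 * ceil(n / 32) coefficients.
def pad32(poly):
    n_pad = 32 * ((len(poly) + 31) // 32)
    return poly + [0] * (n_pad - len(poly))
```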
In principle, we could apply Algorithm 4.1 to each slice and accumulate results using vector instructions to realize a full-size multiplication. However, we propose a different architecture, based on blockwise product scanning with lazy sum-reduction, to increase the ratio of faster matrix (outer product) to slower vector (extraction/addition) operations. We again illustrate with a small example for a hypothetical 1/8-size AMX unit, multiplying 8-coefficient input polynomials split into 4-coefficient slices as in Section 4.2. As seen in Figure 5(b), b_1^T a_0 and b_0^T a_1 are aligned columnwise and can be accumulated (by matrix addition) prior to shifting and flattening; that is, shifting and flattening are performed lazily. In general, outer products of the form b_j^T a_i and b_l^T a_k are aligned, and can be accumulated, if i + j = k + l. Note that matrix addition does not need to be performed explicitly in AMX, as the outer product instructions accumulate with the existing contents of the Z registers at no extra cost. The accumulation procedure is shown as Algorithm 4.2, a subroutine of our polynomial multiplication algorithm given later. Algorithm 4.2 AccumulateOuterProducts(a, b, j, r): accumulate outer products b_l^T a_k such that k + l = j into Z[r : 2 : 62 + r], where r = 0 or 1.
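The lazy sum-reduction idea can be sketched in our simplified Python model: all aligned outer products (those with k + l = j) are first summed as matrices, and only the resulting sum is shifted and flattened. This is a didactic sketch, not the paper's Algorithm 4.2:

```python
# Sketch of blockwise product scanning with lazy sum-reduction: outer
# products b_l^T a_k with k + l = j are accumulated as matrices first
# (matrix addition is free on AMX, as mac16 accumulates into Z), and only
# the summed matrix M_j is shifted and flattened, reducing the number of
# slow vector operations.
def polymul_product_scanning(a, b, s):
    """Multiply polynomials given as flat lists of s-coefficient slices."""
    na, nb = len(a) // s, len(b) // s      # number of slices per input
    c = [0] * (len(a) + len(b))
    for j in range(na + nb - 1):
        # M_j = sum over k + l = j of the outer products b_l^T a_k.
        M = [[0] * s for _ in range(s)]
        for k in range(max(0, j - nb + 1), min(na, j + 1)):
            l = j - k
            for row in range(s):
                for col in range(s):
                    M[row][col] += b[l * s + row] * a[k * s + col]
        # One shift-and-flatten per M_j instead of one per outer product:
        # entry (row, col) of M_j contributes to coefficient j*s + row + col.
        for row in range(s):
            for col in range(s):
                c[j * s + row + col] += M[row][col]
    return c[: len(a) + len(b) - 1]
```

The outer loop runs O((n/s)^2) cheap matrix accumulations but only O(n/s) expensive flattenings, which is the balance the blockwise architecture aims for.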
We now consider a full-size AMX unit and realistic polynomial sizes, say n′ coefficients. The naïve approach suggested at the beginning of this section would call for (n′/32)^2 = O(n^2) applications of Algorithm 4.1, that is, O(n^2) outer products and shifts/flattenings, plus extra operations to combine the results into the desired polynomial multiplication. Moving on to practical implementation issues, the Z register file, where outer products accumulate, can only store two 32 × 32 matrices, one in the even rows and another in the odd rows. Careful sequencing of shift and flattening operations is needed to avoid spills and reloads of Z's contents. Consider the flattened parallelogram in Figure 5(d), which is split into three sub-parallelograms, each corresponding to one of the three matrices of Figure 5(c), formed by accumulating outer products; we denote them, from right to left, as M_0, M_1 and M_2, respectively. Working through the final flattening step (i.e. sum-reduction of columns, not shown in the figure), from the rightmost column towards the leftmost one, we see that the first columns (0 to 3) contain only elements from M_0. As we move left, columns 4 to 6 contain elements from M_0 and M_1; column 7 from M_1 only; columns 8 to 10 from M_1 and M_2; and columns 11 to 14 from M_2 only. A general pattern emerges: each column contains either elements from M_j only, or from M_j and M_{j+1}.
This suggests a structure for our polynomial multiplication algorithm, described in terms of full-size (n′-coefficient) input polynomials. Formalizing the concept from the previous paragraph, we define M_j = Σ_{k+l=j} b_l^T a_k; the computation of each M_j is performed by Algorithm 4.2. We first compute M_0 and M_1, storing them in the even and odd rows of Z, respectively, as needed to perform shifts and flattenings to obtain the first 64 coefficients c_0, ..., c_63 of c(x) = a(x)b(x), a procedure we formalize as Algorithm 4.3. We then compute M_2, overwriting the even rows of Z, and use both M_1 (still available in the odd rows of Z) and M_2 to obtain the next 32 coefficients of c(x), i.e. c_64, ..., c_95, by the procedure of Algorithm 4.4. The pattern continues for j = 3, ..., 2(n′/32) − 3, with M_j overwriting the even or odd rows of Z, according to whether j is even or odd, respectively, and using M_{j−1} and M_j to compute c_{32j}, ..., c_{32j+31} with Algorithm 4.4. Finally, the case j = 2(n′/32) − 2 is handled by a third procedure, Algorithm 4.5, which computes the final batch of coefficients of c(x) from M_{j−1} and M_j. We formalize the complete procedure as Algorithm 4.6.
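The Z-resident scheduling described above can be modeled with a two-slot buffer standing in for the even and odd rows of Z. This is a sketch of the scheduling structure under our simplified model, not the paper's Algorithms 4.3 through 4.6:

```python
# Sketch of the driver structure of Algorithms 4.3-4.6: each M_j
# alternately overwrites the "even" or "odd" half of Z (a 2-slot list
# here), and each s-coefficient output block is flattened from the
# currently resident pair M_{j-1}, M_j.
def polymul_scheduled(a, b, s):
    na, nb = len(a) // s, len(b) // s

    def M(j):  # accumulated outer products with k + l = j (Algorithm 4.2)
        m = [[0] * s for _ in range(s)]
        for k in range(max(0, j - nb + 1), min(na, j + 1)):
            l = j - k
            for r in range(s):
                for col in range(s):
                    m[r][col] += b[l * s + r] * a[k * s + col]
        return m

    z = [None, None]          # models the even/odd halves of Z
    c = []
    for j in range(na + nb - 1):
        z[j % 2] = M(j)       # overwrite one half, keep the other resident
        prev = z[(j - 1) % 2] if j > 0 else None
        block = [0] * s
        for t in range(s):
            for r in range(s):
                if 0 <= t - r < s:                 # low part of M_j
                    block[t] += z[j % 2][r][t - r]
                if prev is not None and 0 <= s + t - r < s:
                    block[t] += prev[r][s + t - r]  # high part of M_{j-1}
        c += block
    # Final block: only the most significant part of the last M_j remains.
    last = z[(na + nb - 2) % 2]
    c += [sum(last[r][s + t - r] for r in range(s) if 0 <= s + t - r < s)
          for t in range(s)]
    return c[: len(a) + len(b) - 1]
```

Each M_j is flattened exactly once, together with the still-resident M_{j-1}, so Z's contents never need to be spilled and reloaded.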

Integrated reduction modulo x^n − 1
The algorithms of Section 4.3, culminating in Algorithm 4.6, multiply two n-coefficient polynomials and return the full result with 2n − 1 coefficients. In NTRU, the result must be reduced modulo x^n − 1, a special form which allows efficient implementation as a postprocessing step using e.g. NEON instructions, but which also allows reduction to be merged with multiplication in an integrated procedure, reducing the number of shifting/flattening operations by ≈ 50% compared to Algorithm 4.6 and improving performance considerably.
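As a reference point, the postprocessing form of the reduction is a simple wrap-around, since x^n ≡ 1. A sketch (our own helper, with coefficients taken modulo 2^16 to match mac16 arithmetic; reduction modulo NTRU's q, a power of two, can be deferred):

```python
# Standalone reduction modulo x^n - 1: coefficient i of the double-length
# product wraps around to position i mod n, since x^n ≡ 1. Coefficient
# arithmetic is modulo 2^16, matching AMX's mac16 lanes.
def reduce_mod_xn_minus_1(c, n):
    r = [0] * n
    for i, ci in enumerate(c):
        r[i % n] = (r[i % n] + ci) & 0xFFFF
    return r
```

The integrated procedure of this section achieves the same result without ever materializing the upper n − 1 coefficients, which is where the savings in shift/flattening operations come from.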
To exemplify the integrated procedure, we again resort to a hypothetical 1/8-size AMX unit, now performing polynomial multiplication modulo x^10 − 1, which illustrates the main issues faced with larger polynomials on a full-size AMX unit. Although 10 coefficients are enough to represent the polynomial inputs and outputs in this case, note that each block in this hypothetical 1/8-size AMX unit performs a 4 × 4 outer-product operation. Thus, we work with the next multiple of 4 coefficients, which is 12. We refer to Figure 6 as we work through the procedure, and assume input polynomials have a_i = b_j = 0 for i, j ≥ 10.
The results of all required outer product calculations are shown explicitly in Figure 6(a). The actual procedure accumulates 4 × 4 outer product results of a given color (white, light gray or gray) into a single per-color 4 × 4 matrix, as in Figure 5(c); for didactic reasons, we omit this step in the course of the explanation. Zero partial products are struck out in the figure; for some of these, this is a natural result of having a_i or b_j as input for i, j ≥ 10, and for the others, we explicitly disable some columns during outer product computations.
Viewing Figure 6(a) as a block matrix of 4 × 4 submatrices, note that the submatrices in and above the secondary (block) diagonal are constructed identically to the example of Figure 5, as formalized in Algorithm 4.2. Notice the pattern followed by the partial products a_i b_j in each row: those to the left and right of a_i b_j are of the form a_{i+1} b_j and a_{i−1} b_j, respectively. The rightmost columns of each submatrix in the secondary (block) diagonal have i = 0, and the pattern continues to their right, with i − 1 taken modulo n (10 in the example). Each partial product a_i b_j has a row index modulo 4, denoted k, and a column index, denoted l, as indicated to the left of and below the matrix in Figure 6(a).

Algorithm 4.5 FlattenLastTwoBlocks(c, j): sum-reduction of columns from the two most significant multiplication sub-parallelograms.
We now present Algorithm 4.7, a counterpart of Algorithm 4.2 for the integrated polynomial multiplication and reduction procedure. It follows Algorithm 4.2 in accumulating outer products with matching columns (i.e. the submatrices of the same color in Figure 6(a)), a step which we recall was omitted from the example for ease of exposition.
We transform Figure 6(a) into Figure 6(b) by removing the struck-out partial products and shifting the rows of Figure 6(a) by the row index modulo 4; the indices i, j of the partial products a_i b_j in each column then add exactly to the column index modulo 10. As we are working modulo x^10 − 1, the final result must be restricted to indices from 0 to 9; we mark this boundary with a thick vertical line in the figure. Some partial products remain in columns with index k ≥ 10, i.e. to the left of this line. We move them to the column with index k′ = k mod 10 in the same row to arrive at Figure 6(c). We leverage the algorithms of this section and of Section 4.3 in a procedure for integrated polynomial multiplication and reduction, formalized as Algorithm 4.9, which we use as the main polynomial multiplication routine in NTRU, i.e. the poly_Rq_mul() function.

Working around memory access slowdowns
An initial implementation of polynomial multiplication modulo x^n − 1, applying all optimizations mentioned in Sections 4.3 and 4.4, outperforms the analogous routines from the

Experimental results
We benchmark our implementation on a 2020 Apple MacBook Air laptop with the M1 SoC (P-core clock speed of 3200 MHz), and a 2023 Apple MacBook Pro laptop with the M3 Max SoC (P-core clock speed of 4064 MHz). All implementations are compiled with Apple clang 15 using the -O3 optimization flag. Following guidance by ARM [ARM], and taking into account the recent GoFetch microarchitectural attack on Apple M-series SoCs [CWS+24], we measure execution times with the DIT (data-independent timing) bit set.
We compare our implementation to the state-of-the-art ones of [CCHY23], which implements only HPS2048677 and HRSS701 (and holds the current speed records for these sets), and [NG21], which implements all parameter sets. We backport the optimized polynomial inversion and constant-time sorting routines from [CCHY23] to [NG21], seeking to highlight differences in polynomial multiplication performance only; we include this modified implementation in our GitHub repository. We also optimize NIST's AES-256 CTR-DRBG randombytes() function with AES instructions from ARMv8-A's Cryptographic Extensions, using software tests to verify that its outputs are bit-identical to the NIST version.
All of these optimizations (polynomial inversion, constant-time sorting and random number generation) are also applied to our AMX implementation.We ensure Known Answer Tests for all implementations match those provided in the NTRU reference code.
Recall from Section 4.5 that our implementation uses a specific memory allocation strategy for arrays of polynomial coefficients.To ensure a fair comparison, we create stack and mmap() allocation variants for every implementation and report performance figures for both, computing speedups by comparing the fastest strategy for each implementation.
Performance measurements use the same cycle-counting harness found in [NG21] and [CCHY23]. To reduce timing variability, each routine is run 1,024 times in a loop, the cycle count of each run is measured individually, and the average value is reported.

Performance measurements
We present NTRU performance results in Table 6. We omit the tc variant of HPS2048677 included in [CCHY23], as the tmvp variant always outperforms it. In general, NEON implementations perform similarly regardless of memory allocation strategy (with a few outliers), but AMX consistently favors mmap() allocation, more so on the M1; recall that speedup calculations pick the best-performing memory allocation method for each implementation. The M3's AMX unit is faster than the M1's; in particular, microbenchmarks show that the M3 executes some vector AMX operations at a rate of 2 instructions/cycle, a phenomenon we did not observe on the M1. The most representative comparisons are with the HPS2048677 and HRSS701 TMVP NEON implementations of [CCHY23]. AMX polynomial multiplication achieves significant speedups of 1.54× (M1) / 1.72× (M3) for the former and 1.79× (M1) / 1.93× (M3) for the latter. These translate into smaller speedups for KEM operations (1.08× to 1.28× on the M1 and 1.11× to 1.34× on the M3), since the time spent in other routines (such as polynomial inversion for key generation, and constant-time sorting for key generation and encapsulation) remains identical between our implementation and previous ones. Nevertheless, our implementation consistently outperforms all others. For other parameter sets, our gains are even more pronounced, but we expect the gap would shrink if TMVP were applied to them.
Scaling and subquadratic algorithms. In Figure 7(a), we provide polynomial multiplication scaling results as a function of polynomial degree, either with or without reduction modulo x^n − 1. We also perform a least-squares fit to an equation of the form ax^b, and note that in this range the algorithm displays subquadratic scaling; this can be attributed to the O(n^2) scaling of the (cheaper) outer product operations versus the O(n) scaling of the (more expensive) shifts and flattenings. We note that the version with reduction outperforms the one without, as it requires only about half as many of the slower shift/flattening operations.
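A least-squares fit to ax^b is usually performed as ordinary least squares on log-log data; a minimal sketch of this standard technique (our own implementation, not the paper's fitting script):

```python
import math

# Fit y = a * x**b by ordinary least squares on log-transformed data:
# log(y) = log(a) + b * log(x) is linear in log(x).
def fit_power_law(xs, ys):
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of the OLS regression line gives the exponent b.
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)   # intercept gives log(a)
    return a, b
```

A fitted exponent b < 2 over the measured range is what the text describes as subquadratic scaling, even though the operation count is asymptotically O(n^2).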
We also implemented Karatsuba on AMX, for a single recursion level with our routine of Section 4.3 as the base case, and compare it to the routine of Section 4.4 in Figure 7(b). Karatsuba only outperforms schoolbook for n ≥ 4480, far outside the range of cryptographic interest. We give numerical results for all NTRU parameter sets in Table 7, and also compute lower bounds (accounting for the cost of pointwise multiplications only) for single-level Karatsuba and subquadratic algorithms across all possible multi-level recursion strategies, using either schoolbook, Karatsuba, Toom-3 or Toom-4 at each level (the bounds are computed by a Jupyter notebook included in our GitHub repository). We conclude that schoolbook outperforms any subquadratic strategy for all NTRU parameter sets.

Constant-time experiments. We empirically tested whether AMX instructions execute with data-independent timing by benchmarking polynomial multiplication routines with either zero or random polynomials as inputs; the former were chosen as the most likely case for hypothetical data-dependent hardware optimizations. Experiments were run with the same combinations of parameter set, memory allocation and implementation as in Table 6, including NEON implementations (presumed constant-time) for comparison purposes. Our experimental design uses paired measurements of cycle counts for multiplying either zero polynomials or randomly generated polynomials (generated once at the beginning). A paired Student's t-test would evaluate the null hypothesis that the means of both distributions are identical, i.e. that the difference of cycle counts between paired observations is zero. Given the large sample size of 1,000,000 iterations, even very small deviations from zero, attributable to measurement noise, may lead to a spurious rejection of the null hypothesis. Additionally, even if the null hypothesis is not rejected, one cannot conclude that there is no effect; in other words, "absence of evidence is not evidence of absence" [AB95]. Equivalence tests [Lak17] are based on the concept of an equivalence range, such that any effects within that range are considered negligible or of no practical interest. The null hypothesis is that there is an effect outside of the equivalence range, and thus a rejection of the null hypothesis can be properly interpreted as evidence for equivalence. We have chosen to use such a test, in particular the TOST (Two One-Sided Tests) procedure with Student's t-test as the underlying hypothesis test, and an equivalence range of ±1 clock cycle, which is the smallest meaningful effect in a clocked digital circuit such as an SoC.
A potential objection to the use of Student's t-test is that it is parametric and thus assumes a normal distribution, which is not the case for some of our measurements, as seen in Figure 8. However, it is more accurate to say that the test assumes that the outcome variable (i.e. the mean of the difference between paired measurements) is normally distributed, which follows for large sample sizes from the central limit theorem [LDEC02].
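The TOST procedure itself is simple to state in code. The following is our own minimal sketch, not the paper's analysis harness: paired differences in cycle counts, an equivalence margin of ±1 cycle, and a normal approximation to the t distribution, which is accurate at sample sizes like 10^6:

```python
import math

# Minimal TOST (Two One-Sided Tests) on paired differences, with the
# normal approximation to the t distribution (valid for large samples).
def tost_p_value(diffs, margin=1.0):
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)            # standard error of the mean

    def norm_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # Two one-sided tests of H0: |true mean difference| >= margin.
    p_lower = 1.0 - norm_cdf((mean + margin) / se)  # tests mean <= -margin
    p_upper = norm_cdf((mean - margin) / se)        # tests mean >= +margin
    # Equivalence is concluded when BOTH one-sided tests reject, i.e.
    # when the larger of the two p-values is below the significance level.
    return max(p_lower, p_upper)
```

A small returned p-value is evidence that the mean timing difference lies inside the ±1 cycle equivalence range, which is the interpretation used for Table 8.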
In Table 8, we report confidence intervals and p-values at a significance level α = 10^−6. In all cases (whether NEON or AMX implementations), the confidence intervals are contained in the equivalence range [−1, 1], and p < α. We conclude that all implementations deviate by less than 1 clock cycle between zero and random inputs, and are thus constant-time. We also measure the latency of matrix-mode mac16 for both zero and random inputs. In a loop, we run mac16 and extract the results back from Z to X and Y, creating an inter-iteration data dependency which serializes all operations, so the code is latency-bound. Running this on the M1 (M3) for 100,000 iterations, we get 2,699,699 (2,299,566) cycles for zero inputs and 2,699,688 (2,299,666) cycles for random inputs, a difference of 11 (100) cycles. A variable-time ALU would show a latency difference of at least 1 cycle between the two cases, so one would expect a minimum difference of 1 cycle/iteration, i.e., 100,000 cycles.
In summary, we see no evidence of data-dependent timing in AMX.

Conclusion
We report on an implementation of polynomial multiplication, applied to the NTRU lattice-based PQC scheme, leveraging a matrix-multiplication coprocessor which, as of 2024, ships in hundreds of millions of devices. Our implementation sets new speed records on two representative devices featuring this coprocessor, the Apple M1 and M3 SoCs, outperforming state-of-the-art conventional implementations for the same platforms. CPU-coupled matrix multiplication accelerators will soon flood the market, bringing large leaps in processing power, but implementers must adapt to their new architectural paradigms to harness the added performance. Our most surprising conclusion stems from matrix multiplication (2D outer products) being much cheaper on AMX than vector (1D) operations, as the latter must be applied per row or column, increasing the instruction count. Since subquadratic algorithms trade multiplications for extra vector operations, AMX's architectural tradeoffs favor schoolbook multiplication for polynomial degrees of cryptographic interest.
Future work.We suggest several avenues for future work based on this paper's results.
The GoFetch attack [CWS+24] requires expensive countermeasures on the M1 and M2 SoCs, but we note that AMX may be unaffected by this attack. Should this be confirmed, AMX implementations could serve as a countermeasure while even bringing performance gains.
Other schemes based on NTT-unfriendly rings may see similar speedups from AMX as NTRU. It is unclear whether NTT-based schemes such as Kyber [SAB+20] and Dilithium [LDK+22] are a good match for AMX, but the question merits investigation.
While we have argued that schoolbook multiplication should outperform subquadratic algorithms for NTRU, we assumed that recursion is performed in a "black-box" fashion, calling self-contained functions to compute multiplications of smaller degree. We encourage investigating whether an integrated approach, as in Sections 4.3 and 4.4, can improve performance.
Finally, we are witnessing the initial generations of matrix-multiplication accelerators. As they evolve, implementation techniques must adapt in turn: faster vector operations could render subquadratic algorithms viable, and the introduction of instructions to accelerate matrix-vector products could favor the use of TMVP. In this vein, other accelerators such as Intel AMX and ARMv9-A SME undoubtedly differ to some degree from Apple's AMX in available instructions and performance characteristics, and warrant careful study aiming to exploit their performance potential for post-quantum cryptography.

Figure 2: Relationship between bytes, registers and slices in the X and Y AMX registers.

Figure 4: Full multiplication process of 4-coefficient polynomial slices on a hypothetical 1/8-size AMX unit. The sequence of operations starts from the leftmost state of the Z registers (result of the outer product operation) and follows the arrows, representing different states of Z and the relevant part of X. Thick arrows indicate an extrh operation of the thick-bordered row of Z to X, while dotted arrows indicate an accumulation of each dotted slice of X with the corresponding row (0 or 1) of the target Z registers.

Algorithm 4.1 PolyMul32×32(c_{k,l}, a_k, b_l): modulo-2^16 multiplication of two 32-coefficient polynomials using AMX. Input: a_k, b_l (arrays of 32-coefficient slices of a and b). Output: c_{k,l} (array of 63 coefficients of c_{k,l}(x) = a_k(x) × b_l(x)).

The parallelogram of Figure 5 consists of 4 outer products: b_0^T a_0, b_1^T a_0, b_0^T a_1 and b_1^T a_1.

Figure 5: Transformations of the multiplication parallelogram for 8-coefficient polynomials for efficient implementation on a hypothetical 1/8-size AMX unit. Dotted lines enclose each outer product of 4-coefficient slices. Starting from the regular parallelogram (a), the i-th row is shifted right by i mod 4 positions (i.e. the reverse of the shifting step of Figure 4) to obtain (b), revealing how two outer products are aligned. In (c), these two outer products are summed prior to left-shifting the i-th row by i positions in the last state shown, (d). The final result is obtained by flattening each column (not shown).

Figure 6: Sequence of operations to implement integrated polynomial multiplication and reduction of 12-coefficient polynomials modulo x^10 − 1 on a hypothetical 1/8-size AMX unit. See text for a description of the transformations from each subfigure to the next.

Algorithm 4.8 MergeFirstAndLastBlocks(c, j, r, n): sum-reduction of columns from the two most significant multiplication sub-parallelograms. Input: j (index of current sub-parallelogram). Input: r ∈ {0, 1} (indicates relative order of even and odd rows of Z). Input: n (reduction polynomial is x^n − 1). Output: c[0 : 31] (coefficients of x^0 through x^31 of the polynomial multiplication result). Notes: assumes that Z[1 − r : 2 : 63 − r] and Z[r : 2 : 62 + r] contain the output of Algorithm 4.7 for sub-parallelogram indices j − 1 and j, respectively.

Figure 7: Cycle counts for AMX polynomial multiplication routines on the M3 as a function of n: (a) schoolbook algorithm with and without reduction modulo x^n − 1; (b) schoolbook and (single recursion level) Karatsuba algorithms with reduction.

Table 2: A taxonomy of AMX instructions. Opcodes given in base 10.

Table 3: Parameters for AMX load and store instructions.

Table 4: Cycle counts for AMX polynomial multiplication without reduction (Section 4.3) and polynomial multiplication with integrated reduction (Section 4.4).

Table 5: Geometric mean of speedups (×) for mmap() compared to stack memory allocation.

Table 6: Cycle counts for NTRU KEM operations and polynomial multiplication. Speedups computed as the ratio of the best cycle counts for previous implementations and our implementation (in both cases, across different memory allocation types).

Table 8: Confidence intervals and p-values for the statistical hypothesis test of equivalence of cycle counts between zero and random inputs.