Efficient isochronous fixed-weight sampling with applications to NTRU

Abstract. We present a solution to the open problem of designing an efficient, unbiased and timing attack-resistant shuffling algorithm for NTRU fixed-weight sampling. Although it can be implemented without timing leakages of secret data in any architecture, we illustrate with ARMv7-M and ARMv8-A implementations; for the latter, we take advantage of architectural features such as NEON and conditional instructions, which are representative of features available on architectures targeting similar systems, such as Intel. Our proposed algorithm improves asymptotically upon the current approach, which is based on constant-time sorting networks (O(n) versus O(n log² n)), and an implementation of the new algorithm is also faster in practice, by a factor of up to 6.91 (591%) on ARMv8-A cores and 12.58 (1158%) on the Cortex-M4; it also requires fewer uniform random bits. This translates into performance improvements for NTRU encapsulation, compared to state-of-the-art implementations, of up to 50% on ARMv8-A cores and 71% on the Cortex-M4, and small improvements to key generation (up to 2.7% on ARMv8-A cores and 6.1% on the Cortex-M4), with negligible impact on code size and a slight improvement in RAM usage for the Cortex-M4.


Introduction
In the late 1990s, the rise of quantum algorithms for database search and factorization [Gro96, Sho97] posed a threat to public-key cryptosystems based on integer factorization and/or discrete logarithms. Even though quantum computers capable of efficiently performing such computations do not exist yet, growing concern within the community led to seeking alternative cryptographic primitives capable of resisting attacks from quantum algorithms. Thus, Post-Quantum Cryptography (PQC) arises as an attempt to counter these attacks by developing new public-key cryptographic algorithms built on problems believed to be resistant to quantum attacks, such as lattice-based problems.
One of the oldest lattice-based cryptosystems is NTRU, first presented in the rump session of CRYPTO '96 [HPS96]. It remains relevant, as shown by advancing to the third round of the NIST PQC standardization process [CDH + 20, Nat17], and its standardization in other forums [Ins09, Ame17]. A performance bottleneck of NTRU is fixed-weight sampling of polynomials, i.e. those with a prescribed number of randomly permuted −1, 0 and 1 coefficients, employed in key generation and encapsulation. Unless carefully optimized, this sampling incurs a significant runtime cost, particularly for encapsulation.
Shuffling algorithms appear perfectly suited to solve the problem of fixed-weight sampling; however, there is no known linear-time algorithm that is resistant to timing attacks for this problem. Instead, constant-time sorting is used to generate random permutations, as mandated by the NTRU submission to the NIST PQC contest [CDH + 20]. We propose a new, timing attack-resistant shuffling algorithm to replace the sorting-based approach, with improved asymptotic running time and large performance improvements in actual implementations, especially for embedded architectures.
Prior to our work, the main proposal to avoid the cost of fixed-weight sampling for NTRU was NTRU-HRSS [HRSS17]. Its technique was later merged into the NTRU proposal for NIST's PQC standardization process [CDH + 20]. Due to larger key and ciphertext sizes, it was adopted for only one out of the four suggested parameter sets.
There exist many shuffling algorithms, such as Fisher-Yates [FY38, Dur64, Knu97], Rao-Sandelius [Rao61, San62] and MergeShuffle [BBHL18]. Algorithms in the coin tossing model, aimed at minimizing the consumption of random bits, are reviewed in [BBHT17]. However, none of these are designed to resist side-channel attacks. Indeed, [Dan19] remarks that Fisher-Yates is the most straightforward implementation of fixed-weight sampling, but cautions that "implementing Fisher-Yates in such a way that there is no side channel is difficult." They opt for a constant-time sorting network [Bat68], as proposed by Bernstein, Chou and Schwabe for use with the McEliece cryptosystem [BCS13], and Bernstein et al. for the NTRU Prime cryptosystem [BCLv17].
Sendrier [Sen21] proposed a constant-time Fisher-Yates variant for binary vectors for use with the BIKE cryptosystem. Its running time is O(w²), where w is the weight (count of non-zero elements) of an n-element vector; since the weight in BIKE grows as O(√n), this quadratic cost remains acceptable there. In NTRU, ternary vectors are used, and moreover a dense vector is sampled, i.e. w = O(n); thus, Sendrier's algorithm cannot compete with the O(n log² n) running time of the fastest practical sorting networks.
Our contributions. In §3, we solve the open problem of designing an unbiased linear-time shuffling algorithm resistant to timing attacks for the NTRU fixed-weight sampling problem. It is a drop-in replacement for NTRU's current sampling-by-sorting approach, improving the running time from O(n log² n) for the best practical sorting networks to O(n), without impacting security. We also discuss its application to other post-quantum cryptosystems. In §4, we discuss implementation aspects. We show in §5 that an implementation of our proposed approach is considerably faster for the fixed-weight sampling step, by factors of up to 6.91 (591%) on ARMv8-A cores and 12.89 (1189%) on the Cortex-M4. This translates into considerable improvements for the KEM encapsulation operation (up to 50% on ARMv8-A cores and 72% on the Cortex-M4) and smaller improvements for key generation (up to 2.7% and 6.1% on the same respective platforms), with little effect on code size, and small gains in memory usage, for embedded architectures. We illustrate how to implement its main operations efficiently in the ARMv8-A and ARMv7-M architectures, as well as generic operations suitable for any architecture, and discuss possible implementations for Intel architectures. Our implementations are available under an open source license in two different GitHub repositories: https://github.com/dgazzoni/NTRU-sampling for a reference version with ARMv8-A optimizations, and https://github.com/dgazzoni/NTRU-sampling-pqm4 for a Cortex-M4 implementation.

NTRU random sampling
NTRU is a post-quantum public-key cryptosystem whose security relies on the difficulty of finding short vectors in high-dimensional lattices [Ajt96, MG02]. It is based on a polynomial ring over a finite field, and some of its parameters are random ternary polynomials 1 , i.e., with coefficients in {−1, 0, +1}. A subset of these are restricted to being fixed-weight, i.e. with a prescribed number of non-zero coefficients. Indeed, NTRU's specification [CDH + 20] requires an even stronger property, defining T (d) as the set of ternary "polynomials that have exactly d/2 coefficients equal to +1 and d/2 coefficients equal to −1".
The straightforward approach to sample from T (d) is to fix a representative of T (d) (e.g., −1 for the first d/2 coefficients, +1 for the next d/2 coefficients, and 0 for the remaining ones), and randomly permute its coefficients using a shuffling algorithm. However, known shuffling algorithms are not timing-attack resistant [Dan19]. The usual alternative, mandated by the NTRU specification [CDH + 20], is based on constant-time sorting. Briefly, an array of key-value pairs is created, using uniformly random samples as keys, while values are coefficients from the chosen fixed representative of T (d). Sorting the random keys induces a random permutation of the coefficients. This approach is illustrated in Algorithm 1. While most classical sorting algorithms are variable-time, sorting networks [Bat68], [Knu98, §5.3.4] are constant-time and shown to be efficient in practice [BCS13, BCLv17].

Algorithm 1 (header excerpt)
Output: v (an (n − 1)-coefficient polynomial in T (q/16 − 1))
Notes: We denote by Int(x 0 , . . ., x k−1 ) the unsigned integer with x j (0 ≤ j ≤ k − 1) at the j-th bit of its binary representation.
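As a concrete illustration of this principle, the C sketch below attaches a random key to each coefficient of a fixed representative and sorts by key. Note two deliberate simplifications: libc's qsort is variable-time and stands in for the constant-time sorting network actually mandated, and the xorshift PRNG is a placeholder for the scheme's randomness source.

```c
#include <stdint.h>
#include <stdlib.h>

/* Key-value pair: uniformly random key, ternary coefficient value. */
typedef struct { uint32_t key; int8_t val; } kv_t;

static int cmp_kv(const void *a, const void *b) {
    uint32_t ka = ((const kv_t *)a)->key, kb = ((const kv_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Illustrative xorshift32 PRNG, standing in for the scheme's DRBG. */
static uint32_t sort_rng = 0x12345678u;
static uint32_t xorshift32(void) {
    sort_rng ^= sort_rng << 13;
    sort_rng ^= sort_rng >> 17;
    sort_rng ^= sort_rng << 5;
    return sort_rng;
}

/* Sample an n-coefficient ternary vector with d/2 coefficients +1 and
   d/2 coefficients -1: attach a random key to each coefficient of a
   fixed representative and sort by key. qsort is variable-time; the
   NTRU specification mandates a constant-time sorting network instead. */
void sample_by_sorting(int8_t *v, int n, int d) {
    kv_t *a = malloc((size_t)n * sizeof(kv_t));
    for (int i = 0; i < n; i++) {
        a[i].key = xorshift32();
        a[i].val = (int8_t)((i < d / 2) ? 1 : (i < d) ? -1 : 0);
    }
    qsort(a, (size_t)n, sizeof(kv_t), cmp_kv);
    for (int i = 0; i < n; i++) v[i] = a[i].val;
    free(a);
}
```

Whatever sorting routine is used, the weight of the output is fixed by construction; only the positions of the non-zero coefficients depend on the random keys.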

Shuffling algorithms
Fisher-Yates. The Fisher-Yates shuffle algorithm, also known as Knuth's shuffle [FY38, Dur64, Knu97], is a classical technique for randomly and unbiasedly permuting elements in a collection. It is displayed in Algorithm 2.
Algorithm 2 Fisher-Yates(a, n)
Input: An array a of n elements
Output: A random permutation of a
1: for i = n − 1 downto 1 do
2:    j ← uniformly random integer such that 0 ≤ j ≤ i
3:    Exchange a[j] and a[i]
4: return a

1 Given that these are ternary polynomials, the coefficient 2 may be used interchangeably with −1.
Fisher-Yates has favorable performance characteristics: O(n) running time with small constants. However, array accesses indexed by secret data are susceptible to timing attacks, due to variabilities induced by the presence or absence of data in CPU caches. This was observed in [Koc96], and demonstrated in practice in [TTMM02, TSS + 03, Ber04] and many other works. This principle can be applied to Algorithm 2 to recover the indices j in the accesses to a[j] in line 3, allowing an attacker to reconstruct the permutation.
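For reference, Algorithm 2 transcribes directly to C as below; both the modulo-biased index draw and the secret-dependent access to a[j] make this exactly the insecure baseline just described (the PRNG is an illustrative stand-in):

```c
#include <stdint.h>

/* Illustrative xorshift32 PRNG; a stand-in for a real randomness source. */
static uint32_t fy_rng = 0xCAFEBABEu;
static uint32_t fy_rand(void) {
    fy_rng ^= fy_rng << 13;
    fy_rng ^= fy_rng >> 17;
    fy_rng ^= fy_rng << 5;
    return fy_rng;
}

/* Fisher-Yates shuffle (Algorithm 2). The modulo draw is slightly
   biased and a[j] is a secret-index-dependent memory access: both are
   tolerable only because this is the insecure baseline under discussion. */
void fisher_yates(int8_t *a, int n) {
    for (int i = n - 1; i > 0; i--) {
        int j = (int)(fy_rand() % (uint32_t)(i + 1));
        int8_t t = a[j];   /* line 3: exchange a[j] and a[i] */
        a[j] = a[i];
        a[i] = t;
    }
}
```

Applied to a fixed-weight representative, this yields exactly the sampling procedure that [Dan19] cautions against implementing.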
Rao-Sandelius. The Rao-Sandelius algorithm [Rao61, San62] shuffles by divide and conquer: each element a[i] is directed to one of two initially empty arrays, A 0 or A 1 , according to a random bit, and both halves are then shuffled recursively, down to a base case of at most two elements (Algorithm 3; line 10 is the random assignment step). The case n = 2 can be made constant-time using standard techniques. Line 10 directs each element a[i] to a different array depending on a random bit; by evicting both arrays from the cache for later probing, an attacker can find which array was written to. This can be countered by writing to both arrays regardless of the random bit drawn, but only incrementing the correct pointer. However, the random choice of array for assignment may lead to uneven growth of the arrays. We are unaware of any concrete analyses in the literature, but conjecture that this leaks enough data to mount a cache timing attack.
MergeShuffle. Finally, MergeShuffle, introduced in [BBHL18], "is an (easy to implement) extremely efficient algorithm to generate random permutations (or to randomly permute an existing array)". As with the Rao-Sandelius algorithm, MergeShuffle uses a divide-and-conquer strategy and is amenable to a parallel implementation.
Let k be a cut-off threshold to switch to Fisher-Yates. MergeShuffle splits an input array (a 0 , a 1 , . . ., a n−1 ) into 2^k blocks to be shuffled using Fisher-Yates (Algorithm 2), and then merges the resulting permutations as presented in Algorithm 4. The merging procedure is similar in spirit to that of e.g. mergesort, but it is performed in-place and uses a random bit to choose whether to swap elements from the two input arrays.
The use of Fisher-Yates as a subroutine of MergeShuffle renders it equally susceptible to cache timing attacks. It is also unclear whether the merging step can be vectorized, to attain competitive performance, and implemented in constant time.

Fixed-weight sampling by constant-time shuffling
As just discussed, while shuffling is the natural solution to the fixed-weight sampling problem in NTRU, we are unaware of any shuffling algorithm resistant to side-channel attacks. In this section, we propose a linear-time, unbiased and timing attack-resistant shuffling algorithm suitable for NTRU fixed-weight sampling. Throughout this section, n is defined as in the NTRU specification and assumes values of either 509, 677 or 821.
We first describe a subroutine (Algorithm 5) to generate an array of random integers si such that si is unbiasedly sampled from [0, n − 1 − i), based on the nearly divisionless rejection sampling technique of Lemire [Lem19, Algorithm 5]. While other approaches exist to achieve the same result, some of which are discussed in the same paper, this method achieves the best performance among all methods we experimented with, while restricting costly (and, in all CPUs we are familiar with, variable-time) divisions by non-power-of-two integers to a pre-computation step. Its correctness rests on the following lemma (Lemma 1): ∀s ∈ (0, 2^L) and ∀y ∈ [0, s), with s, y integers, there are exactly ⌊2^L/s⌋ integers x ∈ [0, 2^L) such that ⌊x · s/2^L⌋ = y and x · s mod 2^L ≥ 2^L mod s.

Algorithm 5 RejSamplingMod(n): Generation of an array of unbiased uniformly random integers modulo n − 1, n − 2, . . ., 1
Proof sketch. Consider the interval [2^L mod s, 2^L), whose length is 2^L − (2^L mod s), which is a multiple of s. There are (2^L − (2^L mod s))/s = ⌊2^L/s⌋ representatives of each residue class modulo s in this interval, and in particular multiples of s of the form x • s.
Lemma 1 implies that rejecting values such that x • s mod 2^L < 2^L mod s ensures that ⌊x • s/2^L⌋ is unbiasedly sampled from [0, s). We discuss issues of timing attack resistance, as well as the choice of the performance-critical parameter L, in §4. Algorithm 6 is our proposed shuffling approach for ternary fixed-weight sampling.
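The rejection criterion can be exercised in isolation. The sketch below (function names are ours) applies the Lemma 1 test to one candidate x with L = 16; exhaustively enumerating all 2^16 candidates confirms that every residue is produced exactly ⌊2^16/s⌋ times.

```c
#include <stdint.h>

#define L_BITS 16

/* One step of Lemire-style rejection sampling (cf. Lemma 1): given a
   candidate x in [0, 2^L) and a modulus s, accept iff
   x*s mod 2^L >= 2^L mod s, in which case floor(x*s / 2^L) is an
   unbiased sample from [0, s). Returns 1 on acceptance, 0 on rejection. */
int lemire_accept(uint32_t x, uint32_t s, uint32_t *out) {
    uint32_t m = x * s;                  /* < 2^32: no overflow for L = 16 */
    uint32_t lo = m & 0xFFFFu;           /* x*s mod 2^L */
    uint32_t thr = (1u << L_BITS) % s;   /* 2^L mod s */
    if (lo < thr) return 0;
    *out = m >> L_BITS;                  /* floor(x*s / 2^L) */
    return 1;
}
```

In Algorithm 5, the thresholds 2^L mod s are precomputed once (line 2), since division by a non-power-of-two is both costly and variable-time.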
Algorithm 6 Shuffle(n, c 0 , c 1 ): Ternary fixed-weight sampling by shuffling
Input: n
Input: c 0 , c 1 (prescribed number of coefficients equal to 0, resp. 1)
Output: v (output array of n − 1 integers)

Firstly, we note that, while NTRU prescribes an identical count of 1 and −1 elements in the sampled vector, Algorithm 6 imposes no such restriction: the user is free to choose the parameters c 0 and c 1 (and, implicitly, the count of −1 elements, n − 1 − c 0 − c 1 ). Evidently, the main loop of Algorithm 6, as presented, does not execute in constant time due to the use of branches. However, architecture-agnostic standard techniques, as well as architecture-specific conditional instructions, can be used to obtain a branchless, constant-time implementation; see §4. Moreover, all accesses to the arrays si and v are performed sequentially. We exploit the fact that O(1) distinct values need to be shuffled (indeed, only 3: −1, 0 and 1), a situation not considered in the usual shuffling algorithms. Intuitively, one could draw an analogy between Fisher-Yates shuffling and selection sort, and by replacing the latter with counting sort, arrive at our proposed algorithm.
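In the same spirit, the counting-sort-style selection can be sketched as plain (branching) C. Here rand_below stands in for Algorithm 5's unbiased draw, and the branches are exactly what the constant-time techniques of §4 replace:

```c
#include <stdint.h>

/* Illustrative xorshift32 PRNG plus an unbiased draw from [0, s); the
   real algorithm uses the much faster Lemire-style Algorithm 5 instead. */
static uint32_t sh_rng = 0x9E3779B9u;
static uint32_t sh_rand(void) {
    sh_rng ^= sh_rng << 13;
    sh_rng ^= sh_rng >> 17;
    sh_rng ^= sh_rng << 5;
    return sh_rng;
}
static uint32_t rand_below(uint32_t s) {
    uint32_t lim = (0xFFFFFFFFu / s) * s, x;
    do { x = sh_rand(); } while (x >= lim);
    return x % s;
}

/* Branching form of the shuffle (spirit of Algorithm 6): at step i,
   s is uniform over the n-1-i symbols still to be placed; comparing s
   against the remaining counts c0 and c0+c1 selects the next symbol.
   2 represents -1, as in the reference code (coefficients mod 3). */
void shuffle_sample(uint8_t *v, int n, int c0, int c1) {
    int c01 = c0 + c1;
    for (int i = 0; i <= n - 2; i++) {
        uint32_t s = rand_below((uint32_t)(n - 1 - i));
        if (s < (uint32_t)c0)       { v[i] = 0; c0--; c01--; }
        else if (s < (uint32_t)c01) { v[i] = 1; c01--; }
        else                          v[i] = 2;
    }
}
```

At each step the probability of emitting 0 is (remaining zeros)/(remaining slots), and similarly for the other two symbols, which is precisely the urn process underlying an unbiased shuffle of a fixed-weight representative.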
We now argue that Algorithm 6 executes in expected time O(n). Proof. The loop of line 2 clearly executes in time O(n). Thus, the remaining work consists in analyzing Algorithm 5. The outer loop (line 3) consists of n − 1 iterations. Noticing that t[i] < n, ∀i, the condition in line 9 will be satisfied with probability 1 − n/2^L > 1/2. Thus, the expected number of iterations in the inner loop (line 4) is less than 2, so Algorithm 5 also executes in time O(n) on average.
We remark that the O(n) running time of Algorithm 6 improves upon the O(n log² n) running time of sorting networks typically used for constant-time sorting implementations [BCS13, BCLv17], such as Batcher's odd-even merge sort [Bat68, Knu98].
The algorithm necessarily consumes at least n − 1 random L-bit integers and may, in principle, consume an unbounded number of them due to rejections; however, in §4, we show that, for L = 16, generating just 4.1% to 5.5% extra random integers is sufficient in practice.
Application to other PQC cryptosystems. While we have directed our efforts to a single scheme (NTRU) to showcase the potential of our approach through highly optimized code, many other PQC schemes also rely on fixed-weight sampling. We briefly mention some relevant schemes, commenting on the applicability of our techniques to them; note that this is not an exhaustive list of potential applications.
For lattice-based schemes, we are only aware of applications of fixed-weight sampling to NTRU variants, such as NTRU LPRime and Streamlined NTRU Prime [BCLv17, BBC + 20], also submitted to the NIST PQC contest. As in NTRU, these variants sample vectors in {−1, 0, 1} with a fixed weight, i.e. count of non-zero elements; however, we recall that, in NTRU, the count of elements equal to 1 and −1 in a sampled vector is identical, a constraint which is lifted in these variants 2 . For concreteness, we consider the parameter sets sntrup761 and ntrulpr761; in the notation of Algorithm 6, we have n = 761, and c 0 = 761 − 286 = 475 for sntrup761 and c 0 = 761 − 250 = 511 for ntrulpr761. Whereas NTRU fixes c 1 = (n − c 0 )/2, in these variants c 1 is freely chosen, although there are attacks for the case where c 1 is too small [BCLv19, §6]. Consider the process of uniform sampling of ternary vectors of size 761 with c 0 elements equal to zero. Then, c 1 follows a binomial distribution with n − c 0 trials and identical success/failure probabilities of 0.5 (mapping wlog "success" and "failure" to sampling 1 and −1, respectively). We can efficiently sample from this distribution by a simple procedure: sample an array of n − c 0 random bits and set c 1 to the population count of this random bit array 3 . We believe that this extra step does not add too much cost, and that these NTRU variants should still benefit considerably from our approach; we encourage others to investigate this possibility.
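The binomial sampling step described above reduces to a population count over the random bit array; a minimal sketch (the helper name is ours):

```c
#include <stdint.h>

/* Population count over the first nbits bits of a byte array: with
   k = n - c0 uniformly random bits, the result is distributed as
   Binomial(k, 1/2), which is exactly the distribution required for c1. */
int popcount_bits(const uint8_t *bits, int nbits) {
    int c = 0;
    for (int i = 0; i < nbits; i++)
        c += (bits[i >> 3] >> (i & 7)) & 1;
    return c;
}
```

A production version would use a hardware population count over whole words (e.g. NEON's per-byte CNT instruction followed by a horizontal add); the bit-by-bit loop keeps the sketch obvious.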
Another class of PQC schemes that require fixed-weight sampling are code-based cryptosystems, such as Classic McEliece [ABC + 22], BIKE [ABB + 22] and HQC [AAB + 22], all of which are contenders in the fourth round of NIST's PQC contest. They require binary, rather than ternary, fixed-weight vectors; Algorithm 6 is easily modified to handle this case. Nevertheless, our approach is unlikely to benefit these schemes, due to their use of low weight (sparse) vectors: typically w = O(√n). These schemes perform fixed-weight sampling using distinct approaches: [Sen21, Algorithm 5] for BIKE and HQC, and rejection sampling of indices of non-zero elements (rejecting in the case of index collisions) for McEliece. In both cases, the running time, memory usage and randomness consumption are a function of the weight w rather than the vector length n, whereas these parameters are a function of n in Algorithm 6. D. J. Bernstein, in private communication with the authors, indicates that he has devised a new algorithm with favorable performance characteristics for the sparse setting 4 .

WAVE [BCC + 23] is a code-based signature scheme submitted to NIST's additional digital signatures contest, which calls for random permutations of fixed-weight ternary vectors. Despite similarities to the fixed-weight sampling problem treated in this paper, an important distinction is a requirement for computing the explicit permutation mapping the input vector to the randomly permuted output vector. We have superficially investigated a modified version of Algorithm 6 to fit this scenario, by keeping track of the indices of each element during the shuffling process, but this version departs significantly from the efficiency and constant-time characteristics of Algorithm 6. Lacking further progress, it appears that sorting remains better suited to this scenario for now.

Implementation aspects
Architectural guarantees regarding constant-time execution. Both ARMv8-A and Intel architectures have recently introduced hardware flags that, when set, guarantee constant-time execution of a subset of CPU instructions, which should generally be sufficient to implement most cryptographic algorithms: FEAT_DIT for ARMv8-A [ARM23, §A2.6.1, B1.3.6, C5.2.4] and DOIT for Intel [Int23a, Int23b]. We verified that all instructions handling secret data in our ARMv8-A implementations are included in the affected subset.
These new features do not imply that CPUs launched prior to the introduction of these flags execute these instructions in variable time. Indeed, ARM claims to be unaware of older CPUs with variable timing for instructions now covered by FEAT_DIT [ARM] 5 ; and Intel advises developers to assume older microarchitectures behave as if DOIT is enabled [Int23a].
This issue has garnered attention at the beginning of 2024, as Apple ARMv8-A cores (which are designed by Apple and not ARM) are subject to a microarchitectural attack called GoFetch [CWS + 24]; setting the FEAT_DIT bit on the M3 disables the data memory-dependent prefetchers targeted by the attack, rendering it ineffective, while for the M1 there appears to be no countermeasure at the hardware level.
Resistance against timing attacks of Algorithm 5. There are some possible sources of timing leaks in Algorithm 5, which we enumerate and analyze.
The integer multiplication in line 7 must execute in constant time, which is the norm in modern CPUs 6 , although there are rare exceptions, such as the ARM Cortex-M3 for 32 × 32 = 64-bit multiplications; however, 32 × 32 = 32-bit multiplication suffices for the purposes of this algorithm, and there is evidence that it executes in constant time on the Cortex-M3 [dG15, Por18].
Array accesses in line 9 use sequential indices; thus, secret data is not leaked. The loop in lines 4 to 9 performs rejection sampling based on public data, precomputed in line 2: the remainder of 2^L divided by integers in the sequence n − 1, n − 2, . . ., 1, where L and n are public parameters. Nevertheless, given the attack of Guo et al. [GHJ + 22] targeting rejection sampling in fixed-weight sampling algorithms for the BIKE and HQC code-based cryptosystems, it is worth analyzing whether a similar attack could apply here. We note that their attack relies on two key assumptions:

1. A high rejection rate, leading to multiple calls to the seedexpander routine (equivalently in our case, the randombytes routine), which creates a timing distinguisher. As discussed next, the rejection rate for our chosen parameter L = 16 is sufficiently small that e.g. a full run of Algorithm 5 in the case n = 509 has > 40% probability of no rejections at all. Due to this low rejection rate, and using a batch random number generation idea discussed later, we sample enough uniform random integers from the outset so that the probability that extra samples are required is deemed negligible (< 2^−74, which can be further reduced at little extra cost). This allows calling randombytes only once, generating a fixed number of random integers, while introducing a negligible overhead.
2. Derivation of the random seed for fixed-weight sampling from secret data, namely the output of decryption from the reencryption step of decapsulation, as required by the Fujisaki-Okamoto transform for IND-CCA security of the KEM. The attack starts by trial encrypting many candidate messages until finding an m that requires multiple calls to seedexpander, which gives rise to a timing distinguisher (a possibility ruled out by the first point above). Carefully constructed perturbations of the resulting ciphertext c are fed to the decapsulation procedure, while using the timing distinguisher to determine whether the decryption step of reencryption outputs the same m or a different message, allowing the attacker to learn information about the secret key. Repeated application of this procedure extracts the vast majority of key material, and the remaining bits are easily found. However, we note that NTRU does not require reencryption due to the rigidity of the NTRU DPKE [CDH + 20, Figures 9 and 10]; indeed, the fixed-weight sampling algorithm is not executed at all during either decapsulation or decryption.
Thus, we conclude that Algorithm 5 does not render NTRU vulnerable to the attack of Guo et al.

Choosing the parameter L. The choice of L in Algorithm 5 is a tradeoff between the cost of random number generation and the frequency of rejections; the latter lead to branch mispredictions and costly pipeline flushes in modern, highly-pipelined superscalar CPUs such as some of the ARMv8-A cores considered in this work. If samples are rarely rejected, a SIMD implementation of the algorithm becomes feasible; one can keep track of which lanes were rejected and resample them later (usually with scalar code). To minimize rejection, one must choose L such that 2^L ≫ n − i, but this translates into added cost for random number generation, and thus 2^L should not be unreasonably larger than n − i.
We propose L = 16 as a natural choice, supported by all scalar and SIMD instruction sets we are aware of. The next smaller size, 8 bits, is insufficient for half or more of the values to be sampled in the standard NTRU parameter sets, and for most of the intervals where it is sufficient, it would lead to a high rejection rate, running counter to the SIMD philosophy. By exactly matching an available lane size, no bit shifts/masks/permutations are required to load random integers into SIMD registers, further improving performance. It is also the natural choice for storing the 11- or 12-bit NTRU polynomial coefficients; indeed, it is the representation used by the reference code and the state-of-the-art implementations we chose for performance comparisons, requiring no size conversions.
Finally and most importantly, rejections are relatively rare: a block of 16 samples is fully accepted (zero rejections) with probability at least 94.2%, 91.6% and 90.1% for n = 509, 677 and 821, respectively. These are minimum figures, and as n − i decreases, the acceptance probability increases even further. Furthermore, the probability of accepting all n − 1 samples (i.e., no rejections at all during a complete execution of the algorithm) is 40.2%, 18.9% and 8.6% for n = 509, 677 and 821, respectively. These figures are obtained by modeling the number of required samples as a sum of geometric random variables, and are displayed in a Jupyter notebook accompanying the source code of our implementation.
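The no-rejection probabilities follow directly from the per-step rejection rate (2^L mod s)/2^L. The helper below (ours; double precision, so the last digits may differ slightly from the notebook's exact computation) multiplies the per-step acceptance probabilities over a full run:

```c
#include <stdint.h>

/* Probability that a complete run of Algorithm 5, with moduli
   s = n-1, n-2, ..., 1, sees no rejection at all: the product of the
   per-step acceptance probabilities 1 - (2^L mod s)/2^L. */
double p_no_rejection(int n, int L) {
    double p = 1.0, two_L = (double)(1u << L);
    for (uint32_t s = 1; s < (uint32_t)n; s++)
        p *= 1.0 - (double)((1u << L) % s) / two_L;
    return p;
}
```

For L = 16 this reproduces the roughly 40%, 19% and 9% figures quoted above for n = 509, 677 and 821; increasing L drives all three probabilities toward 1.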
Due to the low rejection probability, it is sufficient to generate just a few extra random integers over the lower bound of n. For each n, we computed the cumulative distribution function P(x ≤ k) and sought the minimum k such that 1 − P(x ≤ k) < 2^−74, enough to sample 2^10 > n integers for each of 2^64 key exchanges. For L = 16, and rounding up to the next multiple of 8 (the number of 16-bit lanes in a NEON register), we find that 536, 704 and 856 random 16-bit integers are sufficient (i.e., an overhead of 5.5%, 4.1% and 4.4%) for n = 509, 677 and 821, respectively. This calculation is included in the aforementioned Jupyter notebook, which can be adapted to other choices of L if desired.
One might argue that L = 16 is a "wasteful" choice, as it requires 123%, 109% and 103% more bits than the (unattainable) lower bound of log2(n!) bits for n = 509, 677 and 821, respectively. Still, we note this is slightly more than half as many random bits as the approach dictated by the NTRU specification [CDH + 20], which calls for 30 × n bits.
Taking L > 16 appears counterproductive, e.g. due to reduced computational throughput from using larger SIMD lanes. On the other hand, in scenarios where pseudo-random number generation is expensive, SIMD is not available and pipeline flushes have less performance impact (i.e. deeply embedded cores such as the Cortex-M4), choosing L < 16 (say, 12 or 10) may result in better overall performance. One might even conceive of an adaptive choice, decreasing L along with n − i, although this results in more complex code.

D. J. Bernstein, in private communication with the authors, pointed us to an unpublished paper of his [Ber18] which quantifies the security loss due to foregoing rejection sampling in Algorithm 5, with obvious performance benefits due to the avoidance of branches. For the choice L = 16, it leads to 1.52, 2.61 and 3.84 bits of security loss, respectively, for n = 509, 677 and 821. Alternatively, a larger value of L can be chosen to reduce the security loss, but then the performance gains due to branchless code must be weighed against extra randomness consumption and reduced arithmetic throughput. We have experimented with L = 20, which reduces security by 0.09, 0.16 and 0.24 bits, respectively, for n = 509, 677 and 821. Benchmarks show a speedup of 9.7% to 11.5% in the performance of fixed-weight sampling on the Cortex-M4; however, due to the increased randomness consumption, encapsulation performance actually decreased by 0.6% to 0.7%. Preliminary experiments on a NEON version with L = 22 resulted in a small slowdown, but we cannot rule out a crossover to a small improvement through careful optimization.
Batch random number generation in Algorithm 5. For ease of understanding, we displayed uniform random numbers as being sampled online in line 5 of Algorithm 5. However, this leads to high overheads in actual implementations. Instead, we sample the required count of uniform random numbers (see the previous discussion about the choice of L) into an array ahead of time, and consume the randomness directly from this array.
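Concretely, batching amounts to one up-front fill plus a cursor. In this sketch, randombytes (the NIST-style API name used by reference implementations) is replaced by a deterministic stub, and the pool size of 536 entries corresponds to the n = 509 figure discussed above:

```c
#include <stddef.h>
#include <stdint.h>

/* Deterministic stub standing in for the NIST randombytes API used by
   the reference code; a real implementation calls the actual DRBG. */
static void randombytes(uint8_t *out, size_t len) {
    uint32_t s = 0x01234567u;
    for (size_t i = 0; i < len; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        out[i] = (uint8_t)s;
    }
}

/* One up-front call fills a pool of 16-bit samples (536 entries,
   matching the n = 509 overhead computation above); individual draws
   then just advance a cursor, so randombytes is invoked exactly once. */
typedef struct { uint16_t pool[536]; size_t next; } rnd_pool_t;

void pool_init(rnd_pool_t *p) {
    randombytes((uint8_t *)p->pool, sizeof p->pool);
    p->next = 0;
}

uint16_t pool_draw(rnd_pool_t *p) {
    return p->pool[p->next++];
}
```

Because the pool size is chosen so that exhaustion has probability < 2^−74, the single randombytes call also removes the call-count timing distinguisher discussed earlier.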

SIMD implementation of Algorithm 5.
To minimize the execution time of Algorithm 6, we seek to implement Algorithm 5 using SIMD instructions. At first glance, it is unsuitable for SIMD, as some lanes may be rejected while others are accepted during sampling. However, it is possible to sample a whole SIMD register and take note of which lanes, if any, were rejected, to be fixed up later using scalar code (recall that an adequate choice of L ensures that rejections occur with low probability, so the performance impact of this fixup procedure is limited). However, when the batch random number generation idea is used, rejections induce an irregular access pattern to the array of random numbers. We avoid this issue by using disjoint ranges of the array for SIMD sampling (indices 0 to n − 2) and the fixup procedure (n − 1 onwards). These ideas are captured in Algorithm 7.
In addition to previously discussed issues of timing attack resistance of Algorithm 5, we note that any non-sequential accesses to the array rnd arise from switching between the ranges of indices 0 ≤ i + k < n − 1 and j ≥ n − 1, that is, they are due to rejections and thus do not leak secret data; accesses within each range are sequential.
Algorithm 7 SIMD-RejSamplingMod(n, rnd): SIMD version of Algorithm 5
Input: n
Input: rnd (array of random L-bit integers; refer to previous discussion about its length)
Output: si (output array of (W + 1)⌊(n − 1)/W⌋ integer elements, of which only the first n − 1 entries are valid)

Line 10 should use SIMD comparison instructions (e.g. NEON's CMHI or AVX2's VPCMPGT). These create a mask with all bits set or clear in the corresponding lane, while Algorithm 7 as written calls for setting and clearing individual bits, a choice made purely for ease of exposition. Actual implementations are advised to tweak the representation to employ groups of bits instead, so as to achieve an efficient implementation of the inner loop of line 6. For instance, VPMOVMSKB is a natural choice in AVX2, resulting in 2-bit mask groups for 16-bit lanes. In NEON, we extract 8-bit masks with UZP1, and reduce them to 4-bit masks using SHRN by 4. NEON's 128-bit registers suggest a choice of W = 8 if L = 16. However, we achieved better performance by taking W = 16, implemented as an unrolled 2-iteration loop processing 8-element vectors. We attribute this to the fact that converting a mask with UZP1 and SHRN costs the same for 8 or 16 values.
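The lane-masking and fixup logic of Algorithm 7 can be emulated in portable scalar C, which may help when porting to targets without NEON or AVX2; W, the helper names and the byte-sized mask are our choices:

```c
#include <stdint.h>

#define L_BITS 16
#define W 8  /* lanes per block; NEON with 16-bit lanes uses 8 or 16 */

/* Lemma 1 acceptance test for one candidate (cf. Algorithm 5). */
static int accept(uint16_t x, uint32_t s, uint32_t *out) {
    uint32_t m = (uint32_t)x * s;
    if ((m & 0xFFFFu) < ((1u << L_BITS) % s)) return 0;
    *out = m >> L_BITS;
    return 1;
}

/* Scalar emulation of one W-wide block of Algorithm 7: lane k samples
   modulo n-1-(i+k) from rnd[i+k]; rejected lanes are recorded in a
   bitmask and repaired afterwards from indices j >= n-1 of rnd, so
   accesses within each of the two ranges remain sequential.
   Returns the updated fixup cursor j. */
int sample_block(uint32_t *si, const uint16_t *rnd, int n, int i, int j) {
    uint8_t mask = 0;
    for (int k = 0; k < W; k++)
        if (!accept(rnd[i + k], (uint32_t)(n - 1 - (i + k)), &si[i + k]))
            mask |= (uint8_t)(1u << k);
    for (int k = 0; k < W; k++)           /* scalar fixup of rejected lanes */
        while (mask & (1u << k))
            if (accept(rnd[j++], (uint32_t)(n - 1 - (i + k)), &si[i + k]))
                mask &= (uint8_t)~(1u << k);
    return j;
}
```

In the SIMD version, the first loop is a single vector multiply plus compare, and the mask is extracted with VPMOVMSKB or UZP1/SHRN as described above; only the (rare) fixup loop remains scalar.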
Constant-time implementation of Algorithm 6. We now discuss how to implement Algorithm 6 in constant time. First, we rewrite it using the C language's ternary operator, as shown in Algorithm 8, and then discuss strategies to implement this operator in constant time, firstly as an architecture-agnostic solution, and then considering the conditional instructions present in the ARMv8-A, ARMv7-M and Intel architectures. Note that this version replaces −1 coefficients by 2; this is not an issue, as the sampled polynomial has coefficients in Z/3Z, and indeed, the reference NTRU code employs the same representation.
Algorithm 8 CT-Shuffle(n, c0, c1, rnd): Fixed-weight sampling by shuffling, implemented in constant time
Input: n
Input: c0, c1 (prescribed number of coefficients equal to 0, resp. 1)
Input: rnd (array of random L-bit integers; refer to the previous discussion about its length)
Output: v (output array of n − 1 integers)
Notes: We employ the C language ternary operator ? to denote constant-time selection between two values based on a condition. See the text for implementation possibilities.
1: si ← SIMD-RejSamplingMod(n, rnd)
2: c01 ← c0 + c1 ▷ Note this invariant is maintained in the loop body
3: for i = 0 to n − 2 do
4:   t0 ← (si[i] < c0) ? −1 : 0; c0 ← c0 + t0
5:   t1 ← (si[i] < c01) ? −1 : 0; c01 ← c01 + t1
6:   v[i] ← 2 + t0 + t1

Expressions of the form (x < y) ? −1 : 0, in lines 4 and 5 of Algorithm 8, can be made constant-time by noticing that, in two's complement integer arithmetic (used in nearly all modern architectures), −1 and 0 have all bits set and cleared, respectively. The sign (most significant) bit of mask = x − y is 1 if x < y and 0 otherwise; an arithmetic right shift by w − 1 bits, where w is the word size, replicates the sign bit across the entire word. Concretely, line 4 can be implemented for 16-bit signed integer variables by computing the difference si[i] − c0 and arithmetically shifting it right by 15 bits to obtain the mask, which is then added to c0.

While already efficient, better performance is achievable. To that end, we analyze the critical path of the main loop of Algorithm 8, shown in Figure 1. We disregard memory loads and stores, which can be removed from the critical path by proper scheduling. For any mobile-, desktop- or server-class modern CPU, one can assume at least a 2-way superscalar pipeline and single-cycle latency for all the operations used, in which case the critical path of lines 4 and 5 from one iteration to the next (the bold arrows in the figure) takes 3 cycles. In ARMv8-A, arithmetic instructions can be encoded so that one of the input operands is shifted; thus, a single instruction can compute both t0 = t0 ≫ (w − 1) and c0 = c0 + t0. Unfortunately, the ARMv8-A CPUs considered in this work, such as the Apple M1 [Joh22] and the Cortex-A72 [ARM15], execute these instructions with a 2-cycle latency, offering no gain in performance (but a slight reduction in code size).
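As a concrete sketch of this trick, one iteration of the loop body can be written in C for 16-bit signed integers as follows; the function and variable names are illustrative, not taken from any reference listing:

```c
#include <stdint.h>

/* Sketch of one CT-Shuffle loop iteration; r stands for si[i]. Counters
 * are assumed small enough (a few hundred, as in NTRU) that the
 * subtractions cannot overflow, and '>>' on negative values is assumed
 * to be arithmetic, as on all compilers/targets considered in the text. */
static int16_t ct_shuffle_step(int16_t r, int16_t *c0, int16_t *c01) {
    int16_t t0 = (int16_t)((int16_t)(r - *c0) >> 15);  /* -1 if r < c0, else 0 */
    int16_t t1 = (int16_t)((int16_t)(r - *c01) >> 15); /* -1 if r < c01, else 0 */
    *c0 += t0;   /* conditionally decrement the count of 0 coefficients */
    *c01 += t1;  /* conditionally decrement the combined count */
    return (int16_t)(2 + t0 + t1); /* coefficient in {0, 1, 2}; 2 encodes -1 */
}
```

No branch depends on the secret values r, c0 or c01, so the sequence executes in the same time regardless of the comparison outcomes.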
By employing ARMv8-A conditional instructions such as CINC and CSET, it is possible to reduce the critical path to 2 cycles. However, Algorithm 8 calls for decrementing c 0 and c 01 , and there is no CDEC instruction in ARMv8-A; we modify the algorithm to use negative values for c 0 , c 01 and si[i], so that we can increment c 0 and c 01 using CINC instead. Thus, we arrive at the code of Listing 4.1 for the algorithm's main loop.
Listing 4.1: Main loop of Algorithm 8 in the ARMv8-A architecture.
There are two critical paths: one from cmp c0, r to cinc c0, c0, lt to the next iteration's cmp c0, r; and a second for the same instructions involving c01. In all the considered ARM CPUs, all instructions in this code fragment have single-cycle latency, and thus the loop has the potential to execute in 2 cycles/iteration.
Unfortunately, we run into throughput issues: on the Apple M1, reverse engineering efforts [Joh22] indicate that, although the core is capable of executing 6 scalar instructions/cycle, only 3 execution units can execute flag-setting and conditional instructions, i.e. all instructions in the above code fragment. While theoretically sufficient to run the code at maximum throughput, we have observed instruction scheduling issues while attempting to software-pipeline Algorithms 7 and 8, preventing execution at maximum throughput. The instruction sequence of Listing 4.2 requires more µops, but performs better on the M1. We use 32-bit registers (w0, w1, etc.) and initialize two of them with the constant 2. It is also advantageous for the Cortex-A72, since the add instruction with a shifted argument executes in the M pipeline, whereas all other instructions execute in the I0/I1 pipelines. While other bottlenecks come into play in the Cortex-A72, notably its 3-wide instruction decoder, this alternative instruction sequence performs better than the original.
Intel has conditional instructions for conditional moves (CMOVcc) and sets (SETcc), where cc denotes a condition code, but no conditional increments or decrements. For positive values of c 0 and c 01 , as in the original version of Algorithm 8, an alternative is to decrement c0 and c01 and use CMOV to select between the original and decremented values; the decrements can execute in parallel with the comparisons, thus the critical path is not lengthened.
Unfortunately, Intel instructions do not offer the three-operand form of ARMv8-A and other RISC architectures, so an extra MOV is required to create a copy prior to decrementing, in order to avoid overwriting the original values; this does not necessarily increase the critical path, due to MOV elimination [Fog22], but it does increase front-end pressure. Implementers are advised to keep in mind the achievable performance given the critical path, to benchmark and analyze compiler-generated code if employing a high-level language, and to consider inline assembly (or a full assembly language implementation) to emit instructions that are well-matched to the decoder restrictions.
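The decrement-and-select pattern can be expressed in portable C as in the sketch below; whether the compiler actually emits CMOVcc rather than a branch should be verified by inspecting the generated code (names are illustrative):

```c
#include <stdint.h>

/* Conditional decrement via select, for positive counters as in the
 * original form of Algorithm 8. The speculative decrement can execute in
 * parallel with the comparison, and on x86-64 the selection is a
 * candidate for CMOVcc; the extra copy corresponds to the MOV discussed
 * in the text. */
static inline int32_t cond_dec(int32_t c, int32_t r) {
    int32_t dec = c - 1;       /* speculatively decremented copy */
    return (r < c) ? dec : c;  /* select decremented or original value */
}
```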
For the ARMv7-M architecture, a straightforward implementation of Algorithm 8, implementing lines 4 and 5 using the arithmetic right shift trick, works very well; this is aided by the ability to shift one of the input operands of data-processing (logical and arithmetic) instructions. We have experimented with ARMv7-M's conditional execution (IT) instruction, but were unable to improve performance compared to the straightforward implementation. However, we did find an especially compact instruction sequence, devoid of IT instructions, to implement the main loop of Algorithm 8, shown in Listing 4.3.
Listing 4.3: Main loop of Algorithm 8 optimized for the Cortex-M4.
We set the register one to the constant 1. As the straightforward implementation is already efficient, this alternative saves just one clock cycle per loop iteration, i.e. < 1000 cycles for the full algorithm. As fixed-weight sampling is performed only once during key generation and encapsulation, the speedup is < 0.02% for the former and ≈ 0.15% for the latter.
Software pipelining of Algorithms 7 and 8. Modern superscalar CPUs use distinct execution units for scalar and SIMD instructions. Most of the execution time of Algorithm 7 is spent in SIMD code, while Algorithm 8 is strictly scalar. This makes the pair amenable to software pipelining [Lam88]. In the best-case scenario, one can achieve execution time close to the maximum, rather than the sum, of the execution times of Algorithms 7 and 8.
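Schematically, such a fusion can be sketched in C as follows; the kernels are placeholder stubs standing in for Algorithms 7 and 8 (in a real implementation the first is NEON rejection sampling and the second the scalar shuffling loop), and n is assumed to be a multiple of W:

```c
#include <stddef.h>
#include <stdint.h>

#define W 8 /* entries per fused outer iteration (illustrative) */

/* Placeholder kernels; these stubs only illustrate the loop structure. */
static void rejsample_block(int16_t *si, size_t i) {
    for (size_t j = i; j < i + W; j++) si[j] = (int16_t)j; /* placeholder */
}
static void shuffle_block(int16_t *v, const int16_t *si, size_t i) {
    for (size_t j = i; j < i + W; j++) v[j] = si[j];       /* placeholder */
}

/* Fused loop: block i+W of sampling is issued alongside block i of
 * shuffling, so SIMD and scalar execution units can overlap their work. */
static void fused(int16_t *v, int16_t *si, size_t n) {
    rejsample_block(si, 0);          /* prologue: produce the first block */
    size_t i = 0;
    for (; i + W < n; i += W) {
        rejsample_block(si, i + W);  /* produce the next block (SIMD) */
        shuffle_block(v, si, i);     /* consume the current block (scalar) */
    }
    shuffle_block(v, si, i);         /* epilogue: consume the last block */
}
```

The key property is that each outer iteration issues independent SIMD and scalar work, giving the scheduler the opportunity to overlap them.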
Concretely, we inline Algorithm 7 into Algorithm 8, strip-mine the main loop of the latter, and then fuse the outer loops of both algorithms, processing W entries at a time. With this approach, we were able to achieve, on the Apple M1, execution times only ≈ 12% slower than the lower bound (2 cycles/iteration) for the main loop of Algorithm 8 alone. This includes all overhead, such as function calls and returns, prologue and epilogue, initialization, and of course the execution of Algorithm 7 itself, as seen in Table 4. The narrow (3-wide) decoder of the Cortex-A72 precludes achieving a result similar to the M1's, but by interleaving instructions of both algorithms to improve scheduling, we achieved results not far from the limit dictated by the decoder bandwidth bottleneck.

Known Answer Tests. We note that the Known Answer Tests (KATs) in NTRU's specification [CDH + 20] are tightly coupled to the fixed-weight sampling by sorting approach mandated there. Therefore, an implementation employing Algorithm 6 will fail these KATs for key generation and encapsulation. However, our sampled polynomials meet the fixed-weight requirement imposed by NTRU and are in principle indistinguishable from those generated by the existing approach. Thus, keys generated using our algorithm are valid, and an encapsulation employing our algorithm will produce a correct decapsulation even by an unmodified implementation of the current NTRU proposal.
Given the simplicity and improved performance and code size characteristics of Algorithm 6, we suggest that future standardization attempts of NTRU specify our approach instead of sampling by sorting, and generate KATs accordingly. Implementers attempting to replicate our results, whether on ARMv8-A or other architectures, can use the unofficial KATs generated by us, included in our source code package.

Experimental results
We now present experimental results for implementations of our proposed approach for various 64-bit ARMv8-A cores, as well as the 32-bit ARMv7-M Cortex-M4 core.

Methodology
We implemented reference versions of Algorithms 5 and 6, and optimized versions for ARMv7-M and ARMv8-A by replacing Algorithm 6 with Algorithm 8; for ARMv8-A specifically, we also replaced Algorithm 5 by a NEON version of Algorithm 7. We integrated the reference and optimized implementations with existing state-of-the-art implementations of NTRU: pqm4 [KRSS19] for ARMv7-M and [GFBL24, NG21, CCHY24] for ARMv8-A. KATs were generated using the reference implementation and compared against the optimized implementations; we added tests to ensure interoperability between a conventional implementation (using sampling by sorting) and our proposed approach.
Testbeds and measurement methods. Our testbeds for performance measurement, with their corresponding CPU cores, are:
• Apple M1 P-core at 3200 MHz in an Apple MacBook Air laptop running macOS;
• Apple M3 P-core at 4064 MHz in an Apple MacBook Pro laptop running macOS;
• Cortex-A72 at 1500 MHz in a Raspberry Pi 4 single-board computer running Linux;
• Cortex-A57 at 1430 MHz in an Nvidia Jetson Nano single-board computer running Linux;
• Cortex-A53 at 1400 MHz in a Raspberry Pi 3 single-board computer running Linux;
• Cortex-M4 at 24 MHz in an STM32F4DISCOVERY development board.
Save for the ARMv7-M Cortex-M4 core, the remaining testbeds are ARMv8-A, running in 64-bit mode. Of these, the Apple M1, M3 and Cortex-A57 cores feature the ARMv8-A Cryptographic Extensions, but the Cortex-A72 and the Cortex-A53 do not.
Our ARMv8-A performance measurements use the cycle counting routines originally introduced in [NG21]. Each routine is executed 1,024 times and the average cycle count is reported. ARMv7-M measurements employ the pqm4 [KRSS19] benchmarking harness, which counts cycles using the Cortex-M4 SysTick timer. The number of iterations is set to 10, and the mean of the results is reported; although this is a small number, the Cortex-M4 core is much simpler and more deterministic than the large out-of-order ARMv8-A cores, thus exhibiting little run-to-run variability.
While, to a first approximation, cycle counts are not influenced by CPU clock speed, there may be second-order effects such as the decoupling of CPU and bus/RAM/cache clocks. Thus, we take precautions to maximize the likelihood that benchmarks are performed at the nominal clock speeds quoted above. For Linux systems, we use the performance scaling governor. On Apple systems, as far as we are aware, there is no control over clock speeds, and there is no TurboBoost-like feature. In both cases, we try to avoid thermal throttling by inserting delays between benchmark runs to allow the systems to cool down. The Cortex-M4 core does not automatically boost/throttle clock speeds; pqm4 configures it to 24 MHz at startup, ensuring all benchmarks run at that fixed clock speed.
ARMv8-A binaries were compiled with Apple clang 15.0 (Apple M1 and M3), clang 17.0 (Cortex-A72 and Cortex-A53), and clang 10.0 (Cortex-A57), with -O3 and core-specific -mcpu optimization flags.ARMv7-M binaries were compiled with gcc 13.2.1, passing the -o speed flag to the pqm4 benchmark script.We enable the FEAT_DIT bit on ARMv8-A cores where it is available (in the case of our testbeds, only the Apple M1 and M3).
ARMv8-A implementation. Our implementation is based on the source code provided by [GFBL24], which contains their AMX implementation and the NEON implementations of [CCHY24, NG21]. As [CCHY24] is the state-of-the-art NEON implementation but targets only the HPS2048677 and HRSS701 parameter sets, [NG21] is included to provide HPS2048509 and HPS4096821 results. Importantly, [GFBL24] backports optimized auxiliary routines of [CCHY24] to [NG21] (in particular, a NEON implementation of constant-time sorting) and provides an optimized implementation of NIST's randombytes() AES-CTR-DRBG pseudo-random number generator (PRNG), using the ARMv8-A Cryptographic Extensions. These routines are critical to the performance of fixed-weight sampling.
For CPUs that do not feature the ARMv8-A Cryptographic Extensions, the ChaCha20 PRNG of [CCHY24] is used. As KATs are incompatible across different PRNGs, we supply two KAT sets for validation, using the ChaCha20 and AES-CTR-DRBG generators. We ensure that the latter matches the KATs provided in the NTRU specification, which uses the same PRNG.
ARMv7-M implementation. pqm4 [KRSS19] is the gold standard for Cortex-M4 implementations of PQC schemes. While its NTRU implementation has highly optimized polynomial multiplication and inversion routines, the constant-time sorting routine in use is the portable3 variant of djbsort [Ber19], which uses an architecture-agnostic implementation of the core minimum/maximum operation of the sorting network. Inspection of the binaries reveals that the minimum/maximum idiom was not recognized by the compiler, which thus generated suboptimal code that does not use e.g. conditional instructions. We performed some optimization work on this routine, so as to avoid casting our proposed approach in an excessively favorable light. We switched to the more efficient portable4 variant of djbsort, wrote inline assembly versions of the core minimum/maximum operation using conditional operations and a reduced number of memory accesses, and replaced all long long (64-bit) variables by 32-bit long variables to avoid unnecessary use of multi-precision arithmetic, given that ARMv7-M is a 32-bit architecture. This range reduction does not present an issue in NTRU due to the small lengths (hundreds of elements) of the arrays to be sorted. While it is certainly possible to further optimize this routine, our further experiments resulted in code size increases, which are undesirable in deeply embedded environments.
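For illustration, the core minimum/maximum operation can be written branchlessly in C as in the sketch below; this is representative of, though not necessarily identical to, the djbsort code, and the subtraction is assumed not to overflow, which holds for the small value ranges sorted in NTRU:

```c
#include <stdint.h>

/* Branchless compare-and-swap on 32-bit values, the core operation of a
 * constant-time sorting network. No branch depends on the (secret)
 * values being sorted. */
static void ct_minmax(int32_t *a, int32_t *b) {
    int32_t x = *a, y = *b;
    int32_t diff = y - x;
    int32_t mask = diff >> 31; /* -1 if y < x, else 0 (arithmetic shift) */
    diff &= mask;              /* y - x if a swap is needed, else 0 */
    *a = x + diff;             /* min(x, y) */
    *b = y - diff;             /* max(x, y) */
}
```

A compiler that recognizes this idiom can map it to conditional (or shifted-operand) instructions; when it does not, inline assembly such as that described above recovers the lost performance.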
Table 1 compares the performance, code size and stack memory usage of encapsulation in the existing version of NTRU (using sampling by sorting), for the original pqm4 implementation and our optimized version, on our STM32F4DISCOVERY testbed; we denote these as "[KRSS19] original" and "[KRSS19] optimized", respectively, in Table 1. Our optimizations result in large speedups (43-47%) with negligible effect on code size and none at all on stack usage. While we omit the corresponding figures for key generation, our optimizations also outperform the original implementation there, although by smaller amounts (5.4-6.0%); code size and stack usage differences are similar. Results for decapsulation and for the HRSS701 parameter set are not shown, as they do not call the constant-time sorting routine.

Performance figures and analysis
We present performance figures for NTRU KEM key generation and encapsulation in Tables 2 (for Apple SoCs) and 3 (for ARM Cortex cores); decapsulation does not employ fixed-weight sampling, and thus its performance is unaffected by our proposed approach. We present NEON results from the implementations of [NG21] for the HPS2048509 and HPS4096821 parameter sets, and of [CCHY24] for the HPS2048677 and HRSS701 parameter sets. AMX results are from the implementation of [GFBL24]. We emphasize that all ARMv8-A implementations use the NEON-optimized constant-time sorting routine of [CCHY24]. For the Cortex-M4 core, we use the implementation of [KRSS19], incorporating our optimizations for constant-time sorting. We present performance results as cycle counts, calculating speedups as c_sorting/c_shuffling − 1. Results for the shuffling approach are obtained by replacing the sample_fixed_type routine with our proposed algorithms, and adjusting the amount of uniform random bits to match the requirements of the shuffling algorithms, as discussed in §4.
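The speedup convention used throughout the tables can be stated as a short sketch (names are illustrative); under it, a factor of 6.91x reads as 591%:

```c
/* Speedup convention: cycle count of the sorting approach divided by
 * that of the shuffling approach, minus one, as a percentage. */
static double speedup_pct(double c_sorting, double c_shuffling) {
    return (c_sorting / c_shuffling - 1.0) * 100.0;
}
```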
We also present performance figures for fixed-weight sampling, by measuring calls to the sample_fixed_type routine, whose results are presented in Table 4. Finally, we present code size (Flash) and stack (RAM) usage figures for the Cortex-M4 in Table 5.
Key generation and encapsulation. Our proposed approach achieves performance improvements across the board, for both key generation and encapsulation, save for a few outliers in the former. For the Cortex-M4, these improvements come at a negligible cost in code size (Flash), and even a slight improvement in stack (RAM) usage, as seen in Table 5.
With regard to key generation, we see improvements of up to 2.7% for ARMv8-A cores and 6.1% for the Cortex-M4. We recall that NTRU key generation is computationally expensive; disregarding simpler operations, it requires a modulo-q inversion (usually realized by a modulo-2 inversion followed by 8 multiplications), a modulo-3 inversion, 5 extra multiplications, 2 different types of sampling (including sample_fixed_type) and pseudo-random number generation. It is therefore not surprising that optimizing a single sampling routine results in limited performance improvements.
Results are more significant for encapsulation, which is arguably of more interest than key generation, seeing as, for most cryptographic applications, the former is run far more often than the latter. We see improvements of up to 44% and 50% for the NEON and AMX implementations on ARMv8-A, respectively, and 72% for the Cortex-M4. Improvements correlate well with polynomial multiplication performance, which is fastest for NEON in the HPS2048677 parameter set (based on the faster TMVP approach of [CCHY24]) and in the AMX implementations; this is expected due to Amdahl's law.
Fixed-weight sampling. Table 4 shows that our shuffling approach significantly improves the performance of fixed-weight sampling compared to the sampling by sorting approach of previous works. We see very significant speedups on all platforms: factors of up to 6.91 (591%) on ARMv8-A cores and 12.89 (1189%) on the Cortex-M4. Measurements do not include the cost of pseudo-random number generation (i.e. the randombytes routine), which is highly platform-dependent; recall that our approach requires slightly more than half as many pseudo-random bytes as sampling by sorting.
Effect of the FEAT_DIT bit. Recall that, out of all our ARMv8-A testbeds, only the Apple M1 and M3 SoCs implement the FEAT_DIT bit for data-independent timing. In order to ascertain the performance impact of setting this bit, we have also run benchmarks with FEAT_DIT unset on those Apple SoCs. Although we omit the full dataset, it is available as part of our GitHub repository. Briefly, for fixed-weight sampling alone (i.e. benchmarks equivalent to those of Table 4, across all parameter sets, both SoCs and both the sorting and shuffling approaches), we see differences of 0 to at most 3 clock cycles (corresponding to at most a 0.14% difference), in both directions: faster with FEAT_DIT set or unset. On average, the difference is 0.01% in favor of FEAT_DIT unset. Given the magnitude and varying direction of the differences, we conjecture that they are merely due to measurement noise, which is to be expected when running benchmarks on such a complex core.
Protocol-level benchmarks display similar behavior, but with a slightly greater magnitude (from 0.14% faster to 0.37% slower with FEAT_DIT set), averaging 0.004% slower across all parameter sets and the different SoCs, implementations (AMX/NEON) and shuffling/sorting approaches. There is no clear performance trend either in favor of or against setting FEAT_DIT, and the increased magnitude appears to be due to benchmarking more complex code, which further taxes the cache/memory subsystem and the AMX coprocessor.
We thus conclude that, at least for Apple SoCs, setting FEAT_DIT has at most a negligible impact on performance, and most likely no impact at all.

Comparison with NTRU-HRSS.
It is instructive to compare NTRU-HPS2048677 to NTRU-HRSS701, as both are designed for the same NIST security level. Fortunately, the state-of-the-art NEON implementation of [CCHY24] implements both parameter sets, allowing for a fair comparison. Tables 2 and 3 include rows marked "Slowdown vs. 677 sorting" and "Slowdown vs. 677 shuffling", computed as c_701/c_677 − 1; thus, positive values indicate that HRSS701 is slower than HPS2048677, and negative values the contrary. Even with the sampling by sorting approach, HPS2048677 is usually faster than HRSS701 for key generation, with the exception of the Cortex-A53 and Cortex-M4 cores; with the shuffling approach, HPS2048677 key generation also outperforms HRSS701 on the Cortex-M4, and the gap is reduced on the Cortex-A53. As for encapsulation, HPS2048677 using sampling by sorting was significantly slower than HRSS701 in all cases: by up to 35%, 36% and 33% on Apple SoCs, ARMv8-A Cortex cores and the Cortex-M4, respectively. The shuffling approach closes this gap, with HPS2048677 slower by at most 4.4% on Apple SoCs and 8.5% on the Cortex-A53; on the other ARMv8-A cores, HPS2048677 is actually faster, by up to 18%, and on the Cortex-M4 it is also faster, by 13%.

Conclusion
In this work, we showed that timing attack-resistant fixed-weight sampling can be performed without constant-time sorting. We have proposed a new algorithm (Algorithm 6) which achieves a running time of O(n), an improvement over the O(n log² n) of previous, sorting network-based approaches. This results in performance improvements in actual implementations across a range of different platforms, from deeply embedded to high-performance laptop CPUs. Additionally, the amount of random data needed for sampling is reduced by almost half, which is advantageous for architectures without instructions to accelerate cryptographically secure PRNGs. Moreover, our proposed method may be simpler to implement in an optimized fashion than constant-time sorting networks. This solves a long-standing open problem: to date, the best alternative was the NTRU-HRSS variant, which also seeks to eliminate the cost of the constant-time sorting required for sampling fixed-weight polynomials. As discussed in §5, a modified NTRU-HPS2048677, using our proposed approach, nearly closes the performance gap to NTRU-HRSS701 in some cases, and actually outperforms it in others (recalling that both are designed for the same NIST security level). We also note that key and ciphertext sizes for NTRU-HPS2048677 are smaller: 930 (resp. 1138) bytes for the public key and ciphertext, and 1234 (resp. 1450) bytes for the private key, for NTRU-HPS2048677 (resp. NTRU-HRSS701). Finally, the need to support both NTRU-HPS and NTRU-HRSS to achieve different security levels results in increased implementation complexity, e.g. due to the HRSS-specific version of Lift [CDH + 20, §1.9.3] and the additional Ternary_Plus sampling routine [CDH + 20, §1.10.4]. In light of these arguments, we call into question the need for a separate NTRU-HRSS parameter set.

Future work.
Although NTRU is no longer being considered by NIST, we recall that it has been standardized in other forums [Ins09, Ame17]. Since our proposed Algorithm 6 improves upon the existing fixed-weight sampling by sorting approach mandated by the NTRU specification submitted to NIST [CDH + 20], we suggest amending NTRU specifications to use Algorithm 6, and incorporating it into any future standardization efforts (for instance, we note that FrodoKEM [BCD + 16] is also no longer under consideration by NIST, but is being considered for standardization by ISO [Int23c]). Similarly, as discussed in §3, other PQC schemes such as Streamlined NTRU Prime and NTRU LPRime are likely to benefit from our approach, and we encourage investigations of its use in these schemes. We also suggest developing implementations for other widely-used architectures, in particular Intel (using the AVX2 and AVX-512 SIMD extensions) and the recently released ARMv8.1-M Helium SIMD instruction set for deeply embedded systems [Dir19].
Algorithm 8, as stated, is not amenable to vectorization, due to a loop-carried dependency between iterations of its main loop. Using an idea similar to the initial step of MergeShuffle (Algorithm 4), vectorization becomes possible; we developed a prototype implementation that confirms its potential for large speedups, especially on wide CPUs such as the M1 and M3. However, without applying the remaining steps of MergeShuffle, the resulting permutation is biased, which may create an avenue of attack. An alternative we envisioned involves sampling from the hypergeometric distribution; however, this is an uncommon distribution in cryptography, and we were unable to find any efficient, constant-time algorithms for it. We invite future work either to modify MergeShuffle to be constant-time, or to propose efficient, constant-time hypergeometric sampling algorithms.
While Algorithm 5 already consumes fewer random bits than the approach mandated by the NTRU specification, there is room for improvement, as discussed in §4. This is particularly relevant for architectures without instructions to accelerate cryptographically secure PRNGs, in which case the performance bottleneck may be the generation of the uniformly random samples consumed by Algorithm 5. We encourage further research into parameter choices and new approaches that may reduce randomness requirements.

Figure 1: Critical path of the main loop of Algorithm 8.

Table 1: Comparison of the original pqm4 [KRSS19] NTRU implementation and our optimized version for encapsulation. Code size and stack usage are in bytes. For differences, positive values denote an increase in the optimized version relative to the original one.

Table 4: Cycle counts (in kilocycles) for fixed-weight sampling, excluding the cost of uniform random number generation.

Table 5: Code size (Flash) and stack (RAM) usage, in bytes, for ARMv7-M binaries. Statically allocated data (.data and .bss sections) were reported as zero in all cases. "Diff." refers to the percentage difference between implementations; positive values denote an increase in our version relative to [KRSS19].