Search results

Implicit Factorization with Shared Any Bits

Chunzhi Zhao, Junqi Zhang, Jinzheng Cao, Qingfeng Cheng, Fushan Wei

Published 2024-10-07 Show abstract

PDF

At PKC 2009, May and Ritzenhofen proposed the implicit factorization problem (IFP). They showed that it is undemanding to factor two h-bit RSA moduli N1=p1q1, N2=p2q2 where q1, q2 are both αh-bit, and p1, p2 share uh>2αh the least significant bits (LSBs). Subsequent works mainly focused on extending the IFP to the cases where p1, p2 share some of the most significant bits (MSBs) or the middle bits (MBs). In this paper, we propose a novel generalized IFP where p1 and p2 share an arbitrary number of bit blocks, with each block having a consistent displacement in its position between p1 and p2, and we solve it successfully based on Coppersmith’s method. Specifically, we generate a new set of shift polynomials to construct the lattice and optimize the structure of the lattice by introducing a new variable z=p1. We derive that we can factor the two moduli in polynomial time when u>2(n+1)α(1−α^1/(n+1)) with p1, p2 sharing n blocks. Further, no matter how many blocks are shared, we can theoretically factor the two moduli as long as u>2αln(1/α). In addition, we consider two other cases where the positions of the shared blocks are arbitrary or there are k>2 known moduli. Meanwhile, we provide the corresponding solutions for the two cases. Our work is verified by experiments.

The Perils of Limited Key Reuse: Adaptive and Parallel Mismatch Attacks with Post-processing Against Kyber

Qian Guo, Erik Mårtensson, Adrian Åström

Published 2024-10-07 Show abstract

PDF

The Module Learning With Errors (MLWE)-based Key Encapsulation Mechanism (KEM) Kyber is NIST's new standard scheme for post-quantum encryption. As a building block, Kyber uses a Chosen Plaintext Attack (CPA)-secure Public Key Encryption (PKE) scheme, referred to as Kyber.CPAPKE. In this paper we study the robustness of Kyber.CPAPKE against key mismatch attacks.

We demonstrate that Kyber's security levels can be compromised if having access to a few mismatch queries of Kyber.CPAPKE, by striking a balance between the parallelization level and the cost of lattice reduction for post-processing. This highlights the imperative need to strictly prohibit key reuse in Kyber.CPAPKE.

We further propose an adaptive method to enhance parallel mismatch attacks, initially proposed by Shao et al. at AsiaCCS 2024, thereby significantly reducing query complexity. This method combines the adaptive attack with post-processing via lattice reduction to retrieve the final secret key entries. Our method proves its efficacy by reducing query complexity by 14.6 % for Kyber512 and 7.5 % for Kyber768/Kyber1024.

Furthermore, this approach has the potential to improve multi-value Plaintext-Checking (PC) oracle-based side-channel attacks and fault-injection attacks against Kyber itself.

Efficient Boolean-to-Arithmetic Mask Conversion in Hardware

Aein Rezaei Shahmirzadi, Michael Hutter

Published 2024-10-07 Show abstract

PDF

Masking schemes are key in thwarting side-channel attacks due to their robust theoretical foundation. Transitioning from Boolean to arithmetic (B2A) masking is a necessary step in various cryptography schemes, including hash functions, ARX-based ciphers, and lattice-based cryptography. While there exists a significant body of research focusing on B2A software implementations, studies pertaining to hardware implementations are quite limited, with the majority dedicated solely to creating efficient Boolean masked adders. In this paper, we present first- and second-order secure hardware implementations to perform B2A mask conversion efficiently without using masked adder structures. We first introduce a first-order secure low-latency gadget that executes a B2A2k in a single cycle. Furthermore, we propose a second-order secure B2A2k gadget that has a latency of only 4 clock cycles. Both gadgets are independent of the input word size k. We then show how these new primitives lead to improved B2Aq hardware implementations that perform a B2A mask conversion of integers modulo an arbitrary number. Our results show that our new gadgets outperform comparable solutions by more than a magnitude in terms of resource requirements and are at least 3 times faster in terms of latency and throughput. All gadgets have been formally verified and proven secure in the glitch-robust PINI security model. We additionally confirm the security of our gadgets on an FPGA platform using practical TVLA tests.

Unpacking Needs Protection - A Single-Trace Secret Key Recovery Attack on Dilithium

Ruize Wang, Kalle Ngo, Joel Gärtner, Elena Dubrova

Published 2024-10-07 Show abstract

PDF

Most of the previous attacks on Dilithium exploit side-channel information which is leaked during the computation of the polynomial multiplication cs1, where s1 is a small-norm secret and c is a verifier's challenge. In this paper, we present a new attack utilizing leakage during secret key unpacking in the signing algorithm. The unpacking is also used in other post-quantum cryptographic algorithms, including Kyber, because inputs and outputs of their API functions are byte arrays. Exploiting leakage during unpacking is more challenging than exploiting leakage during the computation of cs1 since c varies for each signing, while the unpacked secret key remains constant. Therefore, post-processing is required in the latter case to recover a full secret key. We present two variants of post-processing. In the first one, a half of the coefficients of the secret s1 and the error s2 is recovered by profiled deep learning-assisted power analysis and the rest is derived by solving linear equations based on t = As1 + s2, where A and t are parts of the public key. This case assumes knowledge of the least significant bits of t, t0. The second variant uses lattice reduction to derive s1 without the knowledge of t0. However, it needs a larger portion of s1 to be recovered by power analysis. We evaluate both variants on an ARM Cortex-M4 implementation of Dilithium-2. The experiments show that the attack assuming the knowledge of t0 can recover s1 from a single trace captured from a different from profiling device with a non-negligible probability.

Memory adds no cost to lattice sieving for computers in 3 or more spatial dimensions

Samuel Jaques

Published 2024-10-07 Show abstract

PDF

The security of lattice-based crytography (LWE, NTRU, and FHE) depends on the hardness of the shortest-vector problem (SVP). Sieving algorithms give the lowest asymptotic runtime to solve SVP, but depend on exponential memory. Memory access costs much more in reality than in the RAM model, so we consider a computational model where processors, memory, and meters of wire are in constant proportions to each other. While this adds substantial costs to route data during lattice sieving, we modify existing algorithms to amortize these costs and find that, asymptotically, a classical computer can achieve the previous RAM model cost of $2^{0.2925d+o(d)}$ to sieve a $d$-dimensional lattice for a computer existing in 3 or more spatial dimensions, and can reach $2^{0.3113d+o(d)}$ in 2 spatial dimensions, where “spatial dimensions” are the dimensions of the physical geometry in which the computer exists.

Since this result implies an increased cost in 2 spatial dimensions, we make several assumptions about the constant terms of memory access and lattice attacks so that we can give bit security estimates for Kyber and Dilithium. These estimates support but do not increase the security categories claimed in the Kyber and Dilithium specifications, at least with respect to known attacks.

Information Theoretic Evaluation of Raccoon's Side-Channel Leakage

Dinal Kamel, François-Xavier Standaert, Olivier Bronchain

Published 2024-10-07 Show abstract

PDF

Raccoon is a lattice-based scheme submitted to the NIST 2022 call for additional post-quantum signatures. One of its main selling points is that its design is intrinsically easy to mask against side-channel attacks. So far, Raccoon's physical security guarantees were only stated in the abstract probing model. In this paper, we discuss how these probing security results translate into guarantees in more realistic leakage models. We also highlight that this translation differs from what is usually observed (e.g., in symmetric cryptography), due to the algebraic structure of Raccoon's operations. For this purpose, we perform an in-depth information theoretic evaluation of Raccoon's most innovative part, namely the AddRepNoise function which allows generating its arithmetic shares on-the-fly. Our results are twofold. First, we show that the resulting shares do not enforce a statistical security order (i.e., the need for the side-channel adversary to estimate higher-order moments of the leakage distribution), as usually expected when masking. Second, we observe that the first-order leakage on the (large) random coefficients manipulated by Raccoon cannot be efficiently turned into leakage on the (smaller) coefficients of its long-term secret. Concretely, our information theoretic evaluations for relevant leakage functions also suggest that Raccoon's masked implementations can ensure high security with less shares than suggested by a conservative analysis in the probing model.

Provably Secure and Area-Efficient Modular Addition over Boolean Shares

Guilhèm Assael, Philippe Elbaz-Vincent

Published 2024-07-08 Show abstract

PDF

Several cryptographic schemes, including lattice-based cryptography and the SHA-2 family of hash functions, involve both integer arithmetic and Boolean logic. Each of these classes of operations, considered separately, can be efficiently implemented under the masking countermeasure when resistance against vertical attacks is required. However, protecting interleaved arithmetic and logic operations is much more expensive, requiring either additional masking conversions to switch between masking schemes, or implementing arithmetic functions as nonlinear operations over a Boolean masking. Both solutions can be achieved by providing masked arithmetic addition over Boolean shares, which is an operation with relatively long latency and usually high area utilization in hardware. A further complication arises when the arithmetic performed by the scheme is over a prime modulus, which is common in lattice-based cryptography. In this work, we propose a first-order masked implementation of arithmetic addition over Boolean shares occupying a very small area, while still having reasonable latency. Our proposal is specifically tuned for efficient addition and subtraction modulo an arbitrary integer, but it can also be configured at runtime for power-of-two arithmetic. To the best of our knowledge, we propose the first such construction whose security is formally proven in the glitch+transition-robust probing model.

A Survey of Polynomial Multiplications for Lattice-Based Cryptosystems

Vincent Hwang

Published 2024-07-08 Show abstract

PDF

We survey various mathematical tools used in software works multiplying polynomials in \[ \frac{\mathbb{Z}_q[x]}{\left\langle {x^n - \alpha x - \beta} \right\rangle}. \] In particular, we survey implementation works targeting polynomial multiplications in lattice-based cryptosystems Dilithium, Kyber, NTRU, NTRU Prime, and Saber with instruction set architectures/extensions Armv7-M, Armv7E-M, Armv8-A, and AVX2.

There are three emphases in this paper: (i) modular arithmetic, (ii) homomorphisms, and (iii) vectorization. For modular arithmetic, we survey Montgomery, Barrett, and Plantard multiplications. For homomorphisms, we survey (a) various homomorphisms such as Cooley–Tukey FFT, Good–Thomas FFT, Bruun's FFT, Rader's FFT, Karatsuba, and Toom–Cook; (b) various algebraic techniques for adjoining nice properties to the coefficient rings, including localization, Schönhage's FFT, Nussbaumer's FFT, and coefficient ring switching; and (c) various algebraic techniques related to the polynomial moduli, including twisting, composed multiplication, evaluation at $\infty$, truncation, incomplete transformation, striding, and Toeplitz matrix-vector product. For vectorization, we survey the relations between homomorphisms and vector arithmetic.

We then go through several case studies: We compare the implementations of modular multiplications used in Dilithium and Kyber, explain how the matrix-to-vector structure was exploited in Saber, and review the design choices of transformations for NTRU and NTRU Prime with vectorization. Finally, we outline several interesting implementation projects.

Survey: Recovering cryptographic keys from partial information, by example

Gabrielle De Micheli, Nadia Heninger

Published 2024-04-09 Show abstract

PDF

Side-channel attacks targeting cryptography may leak only partial or indirect information about the secret keys. There are a variety of techniques in the literature for recovering secret keys from partial information. In this work, we survey several of the main families of partial key recovery algorithms for RSA, (EC)DSA, and (elliptic curve) Diffie-Hellman, the classical public-key cryptosystems in common use today. We categorize the known techniques by the structure of the information that is learned by the attacker, and give simplified examples for each technique to illustrate the underlying ideas.

Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs

Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López

Published 2024-04-09 Show abstract

PDF

Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by $1.54$–$3.07\times$, $1.08$–$1.33\times$, $1.11$–$1.50\times$ and $1.20$–$1.98\times$, respectively, over the previous state-of-the-art.

A provably masked implementation of BIKE Key Encapsulation Mechanism

Loïc Demange, Mélissa Rossi

Published 2024-04-09 Show abstract

PDF

BIKE is a post-quantum key encapsulation mechanism (KEM) selected for the 4th round of the NIST's standardization campaign. It relies on the hardness of the syndrome decoding problem for quasi-cyclic codes and on the indistinguishability of the public key from a random element, and provides the most competitive performance among round 4 candidates, which makes it relevant for future real-world use cases. Analyzing its side-channel resistance has been highly encouraged by the community and several works have already outlined various side-channel weaknesses and proposed ad-hoc countermeasures. However, in contrast to the well-documented research line on masking lattice-based algorithms, the possibility of generically protecting code-based algorithms by masking has only been marginally studied in a 2016 paper by Chen et al. in SAC 2015. At this stage of the standardization campaign, it is important to assess the possibility of fully masking BIKE scheme and the resulting cost in terms of performances.

In this work, we provide the first high-order masked implementation of a code-based algorithm. We had to tackle many issues such as finding proper ways to handle large sparse polynomials, masking the key-generation algorithm or keeping the benefit of the bitslicing. In this paper, we present all the gadgets necessary to provide a fully masked implementation of BIKE, we discuss our different implementation choices and we propose a full proof of masking in the Ishai Sahai and Wagner (Crypto 2003) model.

More practically, we also provide an open C-code masked implementation of the key-generation, encapsulation and decapsulation algorithms with extensive benchmarks. While the obtained performance is slower than existing masked lattice-based algorithms, we show that masking at order 1, 2, 3, 4 and 5 implies a performance penalty of x5.8, x14.2, x24.4, x38 and x55.6 compared to order 0 (unmasked and unoptimized BIKE). This scaling is encouraging and no Boolean to Arithmetic conversion has been used.

Verifiable FHE via Lattice-based SNARKs

Shahla Atapoor, Karim Baghery, Hilder V. L. Pereira, Jannik Spiessens

Published 2024-04-09 Show abstract

PDF

Fully Homomorphic Encryption (FHE) is a prevalent cryptographic primitive that allows for computation on encrypted data. In various cryptographic protocols, this enables outsourcing computation to a third party while retaining the privacy of the inputs to the computation. However, these schemes make an honest-but-curious assumption about the adversary. Previous work has tried to remove this assumption by combining FHE with Verifiable Computation (VC). Recent work has increased the flexibility of this approach by introducing integrity checks for homomorphic computations over rings. However, efficient FHE for circuits of large multiplicative depth also requires non-ring computations called maintenance operations, i.e. modswitching and keyswitching, which cannot be efficiently verified by existing constructions. We propose the first efficiently verifiable FHE scheme that allows for arbitrary depth homomorphic circuits by utilizing the double-CRT representation in which FHE schemes are typically computed, and using lattice-based SNARKs to prove components of this computation separately, including the maintenance operations. Therefore, our construction can theoretically handle bootstrapping operations. We also present the first implementation of a verifiable computation on encrypted data for a computation that contains multiple ciphertext-ciphertext multiplications. Concretely, we verify the homomorphic computation of an approximate neural network containing three layers and >100 ciphertexts in less than 1 second while maintaining reasonable prover costs.

12 results sorted by publication date