Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation

Laurent-Stéphane Didier; Nadia El Mrabet; Léa Glandus; Jean-Marc Robert

doi:10.62056/a3txl86bm

Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation

Authors

Laurent-Stéphane Didier, Nadia El Mrabet, Léa Glandus, Jean-Marc Robert

Laurent-Stéphane Didier

Toulon, France
laurent-stephane dot didier at univ-tln dot fr

Nadia El Mrabet

Saint-Etienne, France
nadia dot el-mrabet at emse dot fr

Léa Glandus

Toulon, France
lea dot glandus at univ-tln dot fr

Jean-Marc Robert

Toulon, France
jean-marc dot robert at univ-tln dot fr

Keywords: Montgomery Modular multiplication Modular Exponentiation Batch Computation AVX512 SIMD RSA

Abstract

This paper presents software implementations of batch computations, dealing with multi-precision integer operations. In this work, we use the Single Instruction Multiple Data (SIMD) AVX512 instruction set of the x86-64 processors, in particular the vectorized fused multiplier-adder VPMADD52. We focus on batch multiplications, squarings, modular multiplications, modular squarings and constant time modular exponentiations of 8 values using a word-slicing storage. We explore the use of Schoolbook and Karatsuba approaches with operands up to 4108 and 4154 bits respectively. We also introduce a truncated multiplication that speeds up the computation of the Montgomery modular reduction in the context of software implementation. Our Truncated Montgomery modular multiplication improvement offers speed gains of almost 20 % over the conventional non-truncated versions. Compared to the state-of-the-art GMP and OpenSSL libraries, our speedup modular operations are more than 4 times faster. Compared to OpenSSL BN_mod_exp_mont_consttimex2 using AVX512 and madd52* (madd52hi or madd52lo) in 256-bit registers, in fixed-window exponentiations of sizes $1024$ and $2048$, our 512-bit implementation provides speedups of respectively 1.75 and 1.38, while the 256-bit version speedups are 1.51 and 1.05 for $1024$ and $2048$-bit sizes (batch of 4 values in this case).

References

[ABS10]

S. Antão, J.-C. Bajard, and L. Sousa. Elliptic Curve Point Multiplication on GPUs. In ASAP 2010-21st IEEE International Conference on Application-specific Systems, Architectures and Processors, pages 192–199. 2010. IEEE. DOI: 10.1109/ASAP.2010.5541000

Google Scholar ePrint

[Arc]

Architecture specification. Intel Advanced Vector extensions 10. https://www.intel.com/content/www/us/en/content-details/784267/intel-advanced-vector-extensions-10-intel-avx10-architecture-specification.html.

Google Scholar ePrint

[ARM]

ARM. Neon Intrinsics Reference. https://developer.arm.com/architectures/instruction-sets/intrinsics/.

Google Scholar ePrint

[Bar86]

P. Barrett. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. Advances in Cryptology — CRYPTO' 86, 263:311–323, 1986. DOI: 10.1007/3-540-47721-7_24

Google Scholar ePrint

[BCC⁺09]

D. J. Bernstein, H.-C. Chen, M.-S. Chen, C.-M. Cheng, C.-H. Hsiao, T. Lange, Z.-C. Lin, and B.-Y. Yang. The Billion-Mulmod-Per-Second PC. In Workshop record of SHARCS, volume 9, pages 131–144. 2009.

Google Scholar ePrint

[BGH22]

B. Buhrow, B. Gilbert, and C. Haider. Parallel Modular Multiplication using 512-bit Advanced Vector Instructions: RSA Fault-Injection Countermeasure via Interleaved Parallel Multiplication. Journal of Cryptographic Engineering, 12(1):95–105, 2022. DOI: 10.1007/s13389-021-00256-9

Google Scholar ePrint

[BI21]

C. Bouvier and L. Imbert. An Alternative Approach for SIDH Arithmetic. In Juan A. Garay, editor, Public-Key Cryptography – PKC 2021, pages 27–44, Cham. 2021. Springer International Publishing. DOI: 10.1007/978-3-030-75245-3_2

Google Scholar ePrint

[BKP21]

J. W. Bos, T. Kleinjung, and D. Page. Efficient Modular Multiplication, chapter 8. Cambridge University Press 2021.

Google Scholar ePrint

[BMSZ14]

J. W. Bos, P. L. Montgomery, D. Shumow, and G. M. Zaverucha. Montgomery Multiplication Using Vector Instructions. In Selected Areas in Cryptography – SAC 2013, pages 471–489, Berlin, Heidelberg. 2014. Springer Berlin Heidelberg. DOI: 10.1007/978-3-662-43414-7_24

Google Scholar ePrint

[Bos12]

J. W. Bos. Low-latency Elliptic Curve Scalar Multiplication. International Journal of Parallel Programming, 40:532–550, 2012. DOI: 10.1007/s10766-012-0198-5

Google Scholar ePrint

[BSS99]

I. F. Blake, G. Seroussi, and N. Smart. Elliptic Curves in Cryptography. Cambridge University Press 1999. DOI: 10.1017/CBO9781107360211

Google Scholar ePrint

[CFG⁺21]

H. Cheng, G. Fotiadis, J. Groszschädl, P. Y.A. Ryan, and P. Roenne. Batching CSIDH Group Actions using AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2021(4):618–649, 2021. DOI: 10.46586/tches.v2021.i4.618-649

Google Scholar ePrint

[CFGR22]

H. Cheng, G. Fotiadis, J. Groszschädl, and P. Y.A. Ryan. Highly Vectorized SIKE for AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2022(2):41–68, 2022. DOI: 10.46586/tches.v2022.i2.41-68

Google Scholar ePrint

[CMNT11]

J.-S. Coron, A. Mandal, D. Naccache, and M. Tibouchi. Fully Homomorphic Encryption over the Integers with Shorter Public Keys. In Phillip Rogaway, editor, Advances in Cryptology – CRYPTO 2011, pages 487–504, Berlin, Heidelberg. 2011. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-22792-9_28

Google Scholar ePrint

[DDX19]

J. Dyer, M. Dyer, and J. Xu. Practical homomorphic encryption over the integers for secure computation in the cloud. International Journal of Information Security, 18(5):549-579, October 2019. DOI: 10.1007/s10207-019-00427-0

Google Scholar ePrint

[DG19]

N. Drucker and S. Gueron. Fast Modular Squaring with AVX512IFMA. In 16th International Conference on Information Technology-New Generations (ITNG 2019), pages 3–8, Cham. 2019. Springer International Publishing. DOI: 10.1007/978-3-030-14070-0_1

Google Scholar ePrint

[DGK18]

N. Drucker, S. Gueron, and V. Krasnov. Fast Multiplication of Binary Polynomials with the Forthcoming Vectorized VPCLMULQDQ Instruction. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 115-119. 2018. DOI: 10.1109/ARITH.2018.8464777

Google Scholar ePrint

[DL06]

S. Duquesne and T. Lange. Pairing-based cryptography, chapter 24, pages 573–590. Chapman and Hall/CRC Press 2006. DOI: 10.1201/9781420034981.ch24

Google Scholar ePrint

[DL18]

J. Ding and S. Li. A Modular Multiplier Implemented With Truncated Multiplication. IEEE Transactions on Circuits and Systems II: Express Briefs, 65(11):1713-1717, 2018. DOI: 10.1109/TCSII.2017.2771239

Google Scholar ePrint

[DL20]

J. Ding and S. Li. A Low-Latency and Low-Cost Montgomery Modular Multiplier Based on NLP Multiplication. IEEE Transactions on Circuits and Systems II: Express Briefs, 67(7):1319-1323, 2020. DOI: 10.1109/TCSII.2019.2932328

Google Scholar ePrint

[ELWW16]

N. Emmart, J. Luitjens, C. Weems, and C. Woolley. Optimizing Modular Multiplication for Nvidia's Maxwell GPUs. In 2016 IEEE 23nd symposium on computer arithmetic (ARITH), pages 47–54. 2016. IEEE. DOI: 10.1109/ARITH.2016.21

Google Scholar ePrint

[ET20]

T. Edamatsu and D. Takahashi. Accelerating Large Integer Multiplication Using Intel AVX-512IFMA. In Algorithms and Architectures for Parallel Processing, pages 60–74, Cham. 2020. Springer International Publishing. DOI: 10.1007/978-3-030-38991-8_5

Google Scholar ePrint

[EZW18]

N. Emmart, F. Zheng, and C. Weems. Faster Modular Exponentiation using Double Precision Floating Point Arithmetic on the GPU. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 130–137. 2018. IEEE. DOI: 10.1109/ARITH.2018.8464792

Google Scholar ePrint

[FKL⁺20]

L. De Feo, D. Kohel, A. Leroux, C. Petit, and B. Wesolowski. SQISign: compact post-quantum signatures from quaternions and isogenies. Cryptology ePrint Archive, Paper 2020/1240. 2020.

Google Scholar ePrint

[Ga]

T. Granlund and al.. GNU Multiple Precision Arithmetic Library 6.1.2. https://gmplib.org/.

Google Scholar ePrint

[GGP08]

P. Grabher, J. Groszschaedl, and D. Page. On Software Parallel Implementation of Cryptographic Pairings. Cryptology ePrint Archive, Paper 2008/205. 2008.

Google Scholar ePrint

[GK12]

S. Gueron and V. Krasnov. Software Implementation of Modular Exponentiation, using Advanced Vector Instructions Architectures. In Arithmetic of Finite Fields: 4th International Workshop, WAIFI 2012, Bochum, Germany, July 16-19, 2012. Proceedings 4, pages 119–135. 2012. Springer. DOI: 10.1007/978-3-642-31662-3_9

Google Scholar ePrint

[GK16]

S. Gueron and V. Krasnov. Accelerating Big Integer Arithmetic using Intel IFMA Extensions. In 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), pages 32–38. 2016. IEEE. DOI: 10.1109/ARITH.2016.22

Google Scholar ePrint

[Har05]

L. Hars. Fast Truncated Multiplication for Cryptographic Applications. In Cryptographic Hardware and Embedded Systems – CHES 2005, pages 211–225. 2005. Springer Berlin Heidelberg. DOI: 10.1007/11545262_16

Google Scholar ePrint

[Har06]

L. Hars. Applications of Fast Truncated Multiplication in Cryptography. EURASIP Journal on Embedded Systems, 2007(1):061721, December 2006. DOI: 10.1155/2007/61721

Google Scholar ePrint

[Int]

Intel. Intel Intrinsics Guide. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html.

Google Scholar ePrint

[KKAK96]

C. Kaya Koc, T. Acar, and B.S. Kaliski. Analyzing and Comparing Montgomery Multiplication Algorithms. IEEE Micro, 16(3):26-33, 1996. DOI: 10.1109/40.502403

Google Scholar ePrint

[Kra17]

V. Krasnov. On the dangers of intel’s frequency scaling. 2017.

Google Scholar ePrint

[MC14]

E. M Mahé and J.-M. Chauvet. Fast GPGPU-Based Elliptic Curve Scalar Multiplication. Cryptology ePrint Archive, 2014.

Google Scholar ePrint

[MJ17]

N.E. Mrabet and M. Joye. Guide to Pairing-Based Cryptography. Chapman and Hall/CRC Cryptography and Network Security Series. CRC Press 2017.

Google Scholar ePrint

[Mon85]

P. L. Montgomery. Modular Multiplication Without Trial Division. Mathematics of Computation, 44(170):519–521, 1985. DOI: 10.2307/2007970

Google Scholar ePrint

[Pro]

The OpenSSL Project. OpenSSL. https://www.openssl.org/.

Google Scholar ePrint

[RSA78]

R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM, 21(2):120–126, 1978. DOI: 10.1145/359340.359342

Google Scholar ePrint

[RV22]

J. -M. Robert and P. Véron. Faster Multiplication over ${\mathbb {F}}_2[X]$ using AVX512 Instruction Set and VPCLMULQDQ Instruction. Journal of Cryptographic Engineering, January 2022. DOI: 10.1007/s13389-021-00278-3

Google Scholar ePrint

[Sch96]

B. Schneier. Applied Cryptography - Protocols, Algorithms, and Source Code in C, 2nd Edition. Wiley 1996.

Google Scholar ePrint

[Tak20]

D. Takahashi. Fast Multiple Montgomery Multiplications Using Intel AVX-512IFMA Instructions. In Computational Science and Its Applications – ICCSA 2020, pages 655–663, Cham. 2020. Springer International Publishing. DOI: 10.1007/978-3-030-58814-4_52

Google Scholar ePrint

[Tre13]

Wilke Trei. Efficient Modular Arithmetic for SIMD Devices. Cryptology ePrint Archive, Report 2013/652. 2013.

Google Scholar ePrint

PDF Open access

DOI: https://doi.org/10.62056/a3txl86bm

History

Submitted: 2024-07-02
Accepted: 2024-09-02
Published: 2024-10-07

How to cite

Laurent-Stéphane Didier, Nadia El Mrabet, Léa Glandus, and Jean-Marc Robert, Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation. IACR Communications in Cryptology, vol. 1, no. 3, Oct 07, 2024, doi: 10.62056/a3txl86bm.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.

@article{10.62056/a3txl86bm,
         author={Laurent-Stéphane Didier and Nadia El Mrabet and Léa Glandus and Jean-Marc Robert},
         title={Truncated multiplication and batch software {SIMD} {AVX}512 implementation for faster Montgomery multiplications and modular exponentiation},
         volume={1},
         number={3},
         year={2024},
         date={2024-10-07},
         issn={3006-5496},
         doi={10.62056/a3txl86bm},
         journal={{IACR} Communications in Cryptology},
         publisher={International Association for Cryptologic Research}
}

TY  - JOUR
AU  - Laurent-Stéphane Didier
AU  - Nadia El Mrabet
AU  - Léa Glandus
AU  - Jean-Marc Robert
PY  - 2024
TI  - Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation
JF  - IACR Communications in Cryptology
JA  - CIC
VL  - 1
IS  - 3
DO  - 10.62056/a3txl86bm
UR  - https://doi.org/10.62056/a3txl86bm
AB  - <p>This paper presents software implementations of batch computations, dealing with multi-precision integer operations. In this work, we use the Single Instruction Multiple Data (SIMD) AVX512 instruction set of the x86-64 processors, in particular the vectorized fused multiplier-adder VPMADD52. We focus on batch multiplications, squarings, modular multiplications, modular squarings and constant time modular exponentiations of 8 values using a word-slicing storage. We explore the use of Schoolbook and Karatsuba approaches with operands up to 4108 and 4154 bits respectively. We also introduce a truncated multiplication that speeds up the computation of the Montgomery modular reduction in the context of software implementation. Our Truncated Montgomery modular multiplication improvement offers speed gains of almost 20 % over the conventional non-truncated versions. Compared to the state-of-the-art GMP and OpenSSL libraries, our speedup modular operations are more than 4 times faster.  Compared to OpenSSL BN_mod_exp_mont_consttimex2 using AVX512 and madd52* (madd52hi or madd52lo) in 256-bit registers, in fixed-window exponentiations of sizes $1024$ and $2048$, our 512-bit implementation provides speedups of respectively 1.75 and 1.38, while the 256-bit version speedups are 1.51 and 1.05 for $1024$ and $2048$-bit sizes (batch of 4 values in this case). </p>
ER  -

Laurent-Stéphane Didier, Nadia El Mrabet, Léa Glandus, and Jean-Marc Robert, Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation. IACR Communications in Cryptology, vol. 1, no. 3, Oct 07, 2024, doi: 10.62056/a3txl86bm.