Communications in Cryptology IACR CiC

Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation

Authors

Laurent-Stéphane Didier, Nadia El Mrabet, Léa Glandus, Jean-Marc Robert
Laurent-Stéphane Didier ORCID
Toulon, France
laurent-stephane dot didier at univ-tln dot fr
Nadia El Mrabet ORCID
Saint-Etienne, France
nadia dot el-mrabet at emse dot fr
Léa Glandus ORCID
Toulon, France
lea dot glandus at univ-tln dot fr
Jean-Marc Robert ORCID
Toulon, France
jean-marc dot robert at univ-tln dot fr

Abstract

This paper presents software implementations of batch computations, dealing with multi-precision integer operations. In this work, we use the Single Instruction Multiple Data (SIMD) AVX512 instruction set of the x86-64 processors, in particular the vectorized fused multiplier-adder VPMADD52. We focus on batch multiplications, squarings, modular multiplications, modular squarings and constant time modular exponentiations of 8 values using a word-slicing storage. We explore the use of Schoolbook and Karatsuba approaches with operands up to 4108 and 4154 bits respectively. We also introduce a truncated multiplication that speeds up the computation of the Montgomery modular reduction in the context of software implementation. Our Truncated Montgomery modular multiplication improvement offers speed gains of almost 20 % over the conventional non-truncated versions. Compared to the state-of-the-art GMP and OpenSSL libraries, our speedup modular operations are more than 4 times faster. Compared to OpenSSL BN_mod_exp_mont_consttimex2 using AVX512 and madd52* (madd52hi or madd52lo) in 256-bit registers, in fixed-window exponentiations of sizes $1024$ and $2048$, our 512-bit implementation provides speedups of respectively 1.75 and 1.38, while the 256-bit version speedups are 1.51 and 1.05 for $1024$ and $2048$-bit sizes (batch of 4 values in this case).

References

[ABS10]
S. Antão, J.-C. Bajard, and L. Sousa. Elliptic Curve Point Multiplication on GPUs. In ASAP 2010-21st IEEE International Conference on Application-specific Systems, Architectures and Processors, pages 192–199. 2010. IEEE. DOI: 10.1109/ASAP.2010.5541000
[Arc]
Architecture specification. Intel Advanced Vector extensions 10. https://www.intel.com/content/www/us/en/content-details/784267/intel-advanced-vector-extensions-10-intel-avx10-architecture-specification.html.
[Bar86]
P. Barrett. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. Advances in Cryptology — CRYPTO' 86, 263:311–323, 1986. DOI: 10.1007/3-540-47721-7_24
[BCC+09]
D. J. Bernstein, H.-C. Chen, M.-S. Chen, C.-M. Cheng, C.-H. Hsiao, T. Lange, Z.-C. Lin, and B.-Y. Yang. The Billion-Mulmod-Per-Second PC. In Workshop record of SHARCS, volume 9, pages 131–144. 2009.
[BGH22]
B. Buhrow, B. Gilbert, and C. Haider. Parallel Modular Multiplication using 512-bit Advanced Vector Instructions: RSA Fault-Injection Countermeasure via Interleaved Parallel Multiplication. Journal of Cryptographic Engineering, 12(1):95–105, 2022. DOI: 10.1007/s13389-021-00256-9
[BI21]
C. Bouvier and L. Imbert. An Alternative Approach for SIDH Arithmetic. In Juan A. Garay, editor, Public-Key Cryptography – PKC 2021, pages 27–44, Cham. 2021. Springer International Publishing. DOI: 10.1007/978-3-030-75245-3_2
[BKP21]
J. W. Bos, T. Kleinjung, and D. Page. Efficient Modular Multiplication, chapter 8. Cambridge University Press 2021.
[BMSZ14]
J. W. Bos, P. L. Montgomery, D. Shumow, and G. M. Zaverucha. Montgomery Multiplication Using Vector Instructions. In Selected Areas in Cryptography – SAC 2013, pages 471–489, Berlin, Heidelberg. 2014. Springer Berlin Heidelberg. DOI: 10.1007/978-3-662-43414-7_24
[Bos12]
J. W. Bos. Low-latency Elliptic Curve Scalar Multiplication. International Journal of Parallel Programming, 40:532–550, 2012. DOI: 10.1007/s10766-012-0198-5
[BSS99]
I. F. Blake, G. Seroussi, and N. Smart. Elliptic Curves in Cryptography. Cambridge University Press 1999. DOI: 10.1017/CBO9781107360211
[CFG+21]
H. Cheng, G. Fotiadis, J. Groszschädl, P. Y.A. Ryan, and P. Roenne. Batching CSIDH Group Actions using AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2021(4):618–649, 2021. DOI: 10.46586/tches.v2021.i4.618-649
[CFGR22]
H. Cheng, G. Fotiadis, J. Groszschädl, and P. Y.A. Ryan. Highly Vectorized SIKE for AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES), 2022(2):41–68, 2022. DOI: 10.46586/tches.v2022.i2.41-68
[CMNT11]
J.-S. Coron, A. Mandal, D. Naccache, and M. Tibouchi. Fully Homomorphic Encryption over the Integers with Shorter Public Keys. In Phillip Rogaway, editor, Advances in Cryptology – CRYPTO 2011, pages 487–504, Berlin, Heidelberg. 2011. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-22792-9_28
[DDX19]
J. Dyer, M. Dyer, and J. Xu. Practical homomorphic encryption over the integers for secure computation in the cloud. International Journal of Information Security, 18(5):549-579, October 2019. DOI: 10.1007/s10207-019-00427-0
[DG19]
N. Drucker and S. Gueron. Fast Modular Squaring with AVX512IFMA. In 16th International Conference on Information Technology-New Generations (ITNG 2019), pages 3–8, Cham. 2019. Springer International Publishing. DOI: 10.1007/978-3-030-14070-0_1
[DGK18]
N. Drucker, S. Gueron, and V. Krasnov. Fast Multiplication of Binary Polynomials with the Forthcoming Vectorized VPCLMULQDQ Instruction. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 115-119. 2018. DOI: 10.1109/ARITH.2018.8464777
[DL06]
S. Duquesne and T. Lange. Pairing-based cryptography, chapter 24, pages 573–590. Chapman and Hall/CRC Press 2006. DOI: 10.1201/9781420034981.ch24
[DL18]
J. Ding and S. Li. A Modular Multiplier Implemented With Truncated Multiplication. IEEE Transactions on Circuits and Systems II: Express Briefs, 65(11):1713-1717, 2018. DOI: 10.1109/TCSII.2017.2771239
[DL20]
J. Ding and S. Li. A Low-Latency and Low-Cost Montgomery Modular Multiplier Based on NLP Multiplication. IEEE Transactions on Circuits and Systems II: Express Briefs, 67(7):1319-1323, 2020. DOI: 10.1109/TCSII.2019.2932328
[ELWW16]
N. Emmart, J. Luitjens, C. Weems, and C. Woolley. Optimizing Modular Multiplication for Nvidia's Maxwell GPUs. In 2016 IEEE 23nd symposium on computer arithmetic (ARITH), pages 47–54. 2016. IEEE. DOI: 10.1109/ARITH.2016.21
[ET20]
T. Edamatsu and D. Takahashi. Accelerating Large Integer Multiplication Using Intel AVX-512IFMA. In Algorithms and Architectures for Parallel Processing, pages 60–74, Cham. 2020. Springer International Publishing. DOI: 10.1007/978-3-030-38991-8_5
[EZW18]
N. Emmart, F. Zheng, and C. Weems. Faster Modular Exponentiation using Double Precision Floating Point Arithmetic on the GPU. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 130–137. 2018. IEEE. DOI: 10.1109/ARITH.2018.8464792
[FKL+20]
L. De Feo, D. Kohel, A. Leroux, C. Petit, and B. Wesolowski. SQISign: compact post-quantum signatures from quaternions and isogenies. Cryptology ePrint Archive, Paper 2020/1240. 2020.
[Ga]
T. Granlund and al.. GNU Multiple Precision Arithmetic Library 6.1.2. https://gmplib.org/.
[GGP08]
P. Grabher, J. Groszschaedl, and D. Page. On Software Parallel Implementation of Cryptographic Pairings. Cryptology ePrint Archive, Paper 2008/205. 2008.
[GK12]
S. Gueron and V. Krasnov. Software Implementation of Modular Exponentiation, using Advanced Vector Instructions Architectures. In Arithmetic of Finite Fields: 4th International Workshop, WAIFI 2012, Bochum, Germany, July 16-19, 2012. Proceedings 4, pages 119–135. 2012. Springer. DOI: 10.1007/978-3-642-31662-3_9
[GK16]
S. Gueron and V. Krasnov. Accelerating Big Integer Arithmetic using Intel IFMA Extensions. In 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH), pages 32–38. 2016. IEEE. DOI: 10.1109/ARITH.2016.22
[Har05]
L. Hars. Fast Truncated Multiplication for Cryptographic Applications. In Cryptographic Hardware and Embedded Systems – CHES 2005, pages 211–225. 2005. Springer Berlin Heidelberg. DOI: 10.1007/11545262_16
[Har06]
L. Hars. Applications of Fast Truncated Multiplication in Cryptography. EURASIP Journal on Embedded Systems, 2007(1):061721, December 2006. DOI: 10.1155/2007/61721
[KKAK96]
C. Kaya Koc, T. Acar, and B.S. Kaliski. Analyzing and Comparing Montgomery Multiplication Algorithms. IEEE Micro, 16(3):26-33, 1996. DOI: 10.1109/40.502403
[MC14]
E. M Mahé and J.-M. Chauvet. Fast GPGPU-Based Elliptic Curve Scalar Multiplication. Cryptology ePrint Archive, 2014.
[MJ17]
N.E. Mrabet and M. Joye. Guide to Pairing-Based Cryptography. Chapman and Hall/CRC Cryptography and Network Security Series. CRC Press 2017.
[Mon85]
P. L. Montgomery. Modular Multiplication Without Trial Division. Mathematics of Computation, 44(170):519–521, 1985. DOI: 10.2307/2007970
[Pro]
The OpenSSL Project. OpenSSL. https://www.openssl.org/.
[RSA78]
R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM, 21(2):120–126, 1978. DOI: 10.1145/359340.359342
[RV22]
J. -M. Robert and P. Véron. Faster Multiplication over ${\mathbb {F}}_2[X]$ using AVX512 Instruction Set and VPCLMULQDQ Instruction. Journal of Cryptographic Engineering, January 2022. DOI: 10.1007/s13389-021-00278-3
[Tak20]
D. Takahashi. Fast Multiple Montgomery Multiplications Using Intel AVX-512IFMA Instructions. In Computational Science and Its Applications – ICCSA 2020, pages 655–663, Cham. 2020. Springer International Publishing. DOI: 10.1007/978-3-030-58814-4_52
[Tre13]
Wilke Trei. Efficient Modular Arithmetic for SIMD Devices. Cryptology ePrint Archive, Report 2013/652. 2013.

PDFPDF Open access

History
Submitted: 2024-07-02
Accepted: 2024-09-02
Published: 2024-10-07
How to cite

Laurent-Stéphane Didier, Nadia El Mrabet, Léa Glandus, and Jean-Marc Robert, Truncated multiplication and batch software SIMD AVX512 implementation for faster Montgomery multiplications and modular exponentiation. IACR Communications in Cryptology, vol. 1, no. 3, Oct 07, 2024, doi: 10.62056/a3txl86bm.

License

Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.