Communications in Cryptology IACR CiC

Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs


Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López
Décio Luiz Gazzoni Filho ORCID
Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil
State University of Londrina, Londrina, Brazil
decio dot gazzoni at ic dot unicamp dot br
Guilherme Brandão ORCID
Independent Researcher, Londrina, Brazil
brandaogbs at gmail dot com
Julio López ORCID
Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil
jlopez at ic dot unicamp dot br


Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by $1.54$–$3.07\times$, $1.08$–$1.33\times$, $1.11$–$1.50\times$ and $1.20$–$1.98\times$, respectively, over the previous state-of-the-art.


D. G Altman and J M. Bland. Statistics notes: absence of evidence is not evidence of absence. BMJ, 311(7003):485–485, August 1995.
American National Standards Institute. Lattice-based polynomial public key establishment algorithm for the financial services industry. 2017.
Apple Inc. Accelerate: make large-scale mathematical computations and image calculations, optimized for high performance and low energy consumption. 2023.
Joppe W. Bos, Craig Costello, Léo Ducas, Ilya Mironov, Michael Naehrig, Valeria Nikolaenko, Ananth Raghunathan, and Douglas Stebila. Frodo: take off the ring! Practical, quantum-secure key exchange from LWE. In Edgar R. Weippl, Stefan Katzenbeisser, Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, ACM CCS 2016: 23rd Conference on Computer and Communications Security, 1006–1018. October 2016. ACM Press.
Bundesamt für Sicherheit in der Informationstechnik. Migration to post quantum cryptography: recommendations for action by the BSI. 2021.
Peter Cawley. Apple AMX instruction set. 2024.
Han-Ting Chen, Yi-Hua Chung, Vincent Hwang, and Bo-Yin Yang. Algorithmic views of vectorized polynomial multipliers – NTRU. 2023.
Cong Chen, Oussama Danba, Jeffrey Hoffstein, Andreas Hulsing, Joost Rijneveld, John M. Schanck, Peter Schwabe, William Whyte, Zhenfei Zhang, Tsunekazu Saito, Takashi Yamakawa, and Keita Xagawa. NTRU. Technical Report, National Institute of Standards and Technology, 2020.
Boru Chen, Yingchen Wang, Pradyumna Shome, Christopher W. Fletcher, David Kohlbrenner, Riccardo Paccagnella, and Daniel Genkin. GoFetch: breaking constant-time cryptographic implementations using data memory-dependent prefetchers. In USENIX Security. 2024.
Jan-Pieter D'Anvers, Angshuman Karmakar, Sujoy Sinha Roy, Frederik Vercauteren, Jose Maria Bermudo Mera, Michiel Van Beirendonck, and Andrea Basso. SABER. Technical Report, National Institute of Standards and Technology, 2020.
Maynard Handley. AArch64-Explore: exploration of Apple CPUs – volume 3: SoC. 2023.
Dennis Hofheinz, Kathrin Hövelmanns, and Eike Kiltz. A modular analysis of the Fujisaki-Okamoto transformation. In Yael Kalai and Leonid Reyzin, editors, TCC 2017: 15th Theory of Cryptography Conference, Part I, volume 10677 of Lecture Notes in Computer Science, 341–371. November 2017. Springer, Heidelberg.
John L. Hennessy and David A. Patterson. A new golden age for computer architecture. Communications of the ACM, 62(2):48–60, January 2019.
Jeffrey Hoffstein, Jill Pipher, and Joseph H. Silverman. NTRU: a new high speed public key cryptosystem. 1996.
Jeffrey Hoffstein, Jill Pipher, and Joseph H. Silverman. NTRU: a ring-based public key cryptosystem. In Joe P. Buhler, editor, Algorithmic Number Theory, volume 1423 of Lecture Notes in Computer Science, 267–288. Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.
Andreas Hülsing, Joost Rijneveld, John M. Schanck, and Peter Schwabe. High-speed key encapsulation from NTRU. In Wieland Fischer and Naofumi Homma, editors, Cryptographic Hardware and Embedded Systems – CHES 2017, volume 10529 of Lecture Notes in Computer Science, 232–252. September 2017. Springer, Heidelberg.
Institute of Electrical and Electronics Engineers. IEEE standard specification for public key cryptographic techniques based on hard problems over lattices. 2009.
Intel Corporation. Intel\textsuperscript ® architecture instruction set extensions and future features: programming reference (revision 047). December 2022.
Intel Corporation. Data operand independent timing instructions. 2023.
International Organization for Standardization. FrodoKEM: learning with errors key encapsulation preliminary draft standard. 2023.
Dougall Johnson. Apple M1 microarchitecture research. 2022.
Dougall Johnson. IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX. 2022.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, 1–12. New York, NY, USA, 2017. Association for Computing Machinery.
Daniël Lakens. Equivalence tests: a practical primer for $t$ tests, correlations, and meta-analyses. Social Psychological and Personality Science, 8(4):355–362, May 2017.
Thomas Lumley, Paula Diehr, Scott Emerson, and Lu Chen. The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23(1):151–169, May 2002.
Vadim Lyubashevsky, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Peter Schwabe, Gregor Seiler, Damien Stehlé, and Shi Bai. CRYSTALS-DILITHIUM. Technical Report, National Institute of Standards and Technology, 2022.
C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Softw., 5(3):308–323, September 1979.
Wai-Kong Lee, Hwajeong Seo, Seong Oun Hwang, Ramachandra Achar, Angshuman Karmakar, and Jose Maria Bermudo Mera. DPCrypto: acceleration of post-quantum cryptography using dot-product instructions on GPUs. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(9):3591–3604, 2022.
Wai-Kong Lee, Hwajeong Seo, Zhenfei Zhang, and Seong Oun Hwang. TensorCrypto: high throughput acceleration of lattice-based cryptography using tensor core on GPU. IEEE Access, 10():20616–20632, 2022.
José E. Moreira, Kit Barton, Steven Battle, Peter Bergner, Ramon Bertran, Puneeth Bhat, Pedro Caldeira, David Edelsohn, Gordon C. Fossum, Brad Frey, Nemanja Ivanovic, Chip Kerchner, Vincent Lim, Shakti Kapoor, Tulio Machado Filho, Silvia Melitta Mueller, Brett Olsson, Satish Sadasivam, Baptiste Saleil, Bill Schmidt, Rajalakshmi Srinivasaraghavan, Shricharan Srivatsan, Brian W. Thompto, Andreas Wagner, and Nelson Wu. A matrix math facility for Power ISA(TM) processors. 2021.
S. Markidis, S. Chien, E. Laure, I. Peng, and J. S. Vetter. NVIDIA tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 522–531. Los Alamitos, CA, USA, May 2018. IEEE Computer Society.
Michael Naehrig, Erdem Alkim, Joppe Bos, Léo Ducas, Karen Easterbrook, Brian LaMacchia, Patrick Longa, Ilya Mironov, Valeria Nikolaenko, Christopher Peikert, Ananth Raghunathan, and Douglas Stebila. FrodoKEM. Technical Report, National Institute of Standards and Technology, 2020.
National Institute of Standards and Technology. Post-quantum cryptography standardization: call for proposals announcement. 2017.
Duc Tri Nguyen and Kris Gaj. Fast NEON-based multiplication for lattice-based NIST post-quantum cryptography finalists. In Jung Hee Cheon and Jean-Pierre Tillich, editors, Post-Quantum Cryptography - 12th International Workshop, PQCrypto 2021, 234–254. July 2021. Springer, Heidelberg.
Andres Rodriguez. Deep Learning Systems: Algorithms, Compilers, and Processors for Large-Scale Production. Synthesis Lectures on Computer Architecture. Springer Cham, October 2020. ISBN 978-3-031-00641-8.
Peter Schwabe, Roberto Avanzi, Joppe Bos, Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, John M. Schanck, Gregor Seiler, and Damien Stehlé. CRYSTALS-KYBER. Technical Report, National Institute of Standards and Technology, 2020.
Ali Sazegari, Eric Bainville, Jeffry E. Gonion, Gerard R. Williams III, and Andrew J. Beaumont-Smith. Outer product engine. 2016.
Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput., 26(5):1484–1509, October 1997.
John M. Schanck, Andreas Hulsing, Joost Rijneveld, and Peter Schwabe. NTRU-HRSS-KEM. Technical Report, National Institute of Standards and Technology, 2017.
Finn Wilkinson and Simon McIntosh-Smith. An initial evaluation of Arm’s Scalable Matrix Extension. In 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 135–140. 2022.
Lipeng Wan, Fangyu Zheng, Guang Fan, Rong Wei, Lili Gao, Yuewu Wang, Jingqiang Lin, and Jiankuo Dong. A novel high-performance implementation of CRYSTALS-Kyber with AI accelerator. In Vijayalakshmi Atluri, Roberto Di Pietro, Christian Damsgaard Jensen, and Weizhi Meng, editors, ESORICS 2022: 27th European Symposium on Research in Computer Security, Part III, volume 13556 of Lecture Notes in Computer Science, 514–534. September 2022. Springer, Heidelberg.
Zhenfei Zhang, Cong Chen, Jeffrey Hoffstein, William Whyte, John M. Schanck, Andreas Hulsing, Joost Rijneveld, Peter Schwabe, and Oussama Danba. NTRUEncrypt. Technical Report, National Institute of Standards and Technology, 2019.

PDFPDF Open access

Submitted: 2024-01-05
Accepted: 2024-03-05
Published: 2024-04-09
How to cite

Décio Luiz Gazzoni Filho, Guilherme Brandão, and Julio López, "Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs," IACR Communications in Cryptology, vol. 1, no. 1, Apr 09, 2024, doi: 10.62056/a3txommol.


Copyright is held by the author(s)

This work is licensed under a Creative Commons Attribution (CC BY) license.