Optimizations and Practicality of High-Security CSIDH

In this work, we assess the real-world practicality of CSIDH, an isogeny-based non-interactive key exchange. We provide the first thorough assessment of the practicality of CSIDH in higher parameter sizes for conservative estimates of quantum security, and with protection against physical attacks. This requires a three-fold analysis of CSIDH. First, we describe two approaches to efficient high-security CSIDH implementations, based on SQALE and CTIDH. Second, we optimize such high-security implementations, on a high level by improving several subroutines, and on a low level by improving the finite field arithmetic. Third, we benchmark the performance of high-security CSIDH. As a stand-alone primitive, our implementations outperform previous results by a factor of up to 2.53×. As a real-world use case considering network protocols, we use CSIDH in TLS variants that allow early authentication through a NIKE. Although our instantiations of CSIDH have smaller communication requirements than post-quantum KEM and signature schemes, even our highly-optimized implementations result in too-large handshake latency (tens of seconds), showing that CSIDH is only practical in niche cases.


Introduction
The commutative isogeny-based key exchange protocol (CSIDH) was proposed by Castryck, Lange, Martindale, Panny, and Renes [22] at Asiacrypt 2018. Although it was proposed too late to be included as a candidate in the NIST post-quantum standardization effort [53], it has since received significant attention from the post-quantum-crypto research community.
From a crypto-engineering point of view, this attention can be explained by two unique features of CSIDH. Firstly, with the originally proposed parameters, CSIDH has remarkably small bandwidth requirements. Specifically, CSIDH-512, the parameter set targeting security equivalent to AES-128, needs to transmit only 64 bytes each way, less than a tenth of what Kyber-512, the KEM chosen for standardization by NIST, requires. Secondly, and more importantly, CSIDH is so far the only realistic option for post-quantum non-interactive key exchange (NIKE), meaning it can be used as a post-quantum drop-in replacement for Diffie-Hellman (DH) key exchange in protocols that combine ephemeral and static DH key shares non-interactively. Such protocols include the Signal X3DH handshake [49] and early proposals for TLS 1.3 [38], known as OPTLS. The OPTLS authentication mechanism is still under consideration as an extension [58]. CSIDH is the only post-quantum NIKE that might enable these use cases, except for the recently-proposed Swoosh algorithm [30] (which has too-large public keys for use in TLS).
Unfortunately, quite soon after CSIDH was proposed, several security analyses called into question the claimed concrete security against quantum attacks achieved by the proposed parameters [12,24,55]. The gist of these analyses is troubling; Peikert [55] states that "the cost of CSIDH-512 key recovery is only about 2^16 quantum evaluations using 2^40 bits of quantumly accessible classical memory (plus relatively small other resources)". Similarly, Bonnetain and Schrottenloher [12] claim a cost of 2^19 quantum evaluations for attacking the same instance, and propose a quantum circuit requiring only 2^52.6 T-gates per evaluation, which means the security would still be insufficient. An exploration of the quantum cost of attacking larger instances (ignoring the cost per CSIDH quantum evaluation) suggests that keys of 2048 to 4096 bits are required to achieve the security level originally claimed for CSIDH-512 [24].
Interestingly, although some of these concerns were raised as early as May 2018 (i.e., at a time when [22] was available only as a preprint), most research on efficient implementations [5,23,50,54], side-channel attacks [17], and fault attacks against CSIDH [7,16,44] continued to work with the original parameters. This can probably partly be explained by the fact that the software implementation referenced in [22] implements only the smaller two of the three original parameter sets, i.e., CSIDH-512 and CSIDH-1024. However, another reason is that the concerns about the quantum security of CSIDH were (and to some extent still are) the subject of debate. Most notably, Bernstein, Lange, Martindale, and Panny [9] point out that one issue with quantum attacks against CSIDH is the rather steep cost of implementing CSIDH on a quantum computer in the first place. They conclude that the cost of each query pushes the total attack cost above 2^80.
In this paper, we do not take any position in this ongoing debate but rather set out to answer the question of what it means for CSIDH performance and applicability if we choose more conservative parameters. This includes protection against physical attacks, which is often required for real-world applications. We call such instantiations high-security CSIDH.
Contributions of this paper. The core contribution of this paper is an in-depth assessment of the real-world practicality of CSIDH. On a high level, this assessment is divided into three parts. First, we instantiate CSIDH at high(er) security levels, suitable for real-world applications, and with protection against physical attacks; second, we optimize the efficiency of high-security CSIDH; third, we test the practicality of high-security CSIDH.
1. Efficient CSIDH instantiations, following two different approaches to implementing high-security CSIDH.
(a) The first approach aims at protection against physical attacks and is based on SQALE [24]. In this approach, we eliminate randomness requirements and the use of dummy operations in CSIDH by restricting the keyspace to {−1, 1}^n, as proposed by Cervantes-Vázquez, Chenu, Chi-Domínguez, De Feo, Rodríguez-Henríquez, and Smith [23]. We refer to this deterministic version of CSIDH as dCSIDH.
(b) The second approach optimizes purely for performance and uses the CTIDH batching techniques introduced in [5]. We refer to this variant of CSIDH as CTIDH. In particular, we extend the implementation from [5] to larger parameter sets.
2. Optimized implementation of dCSIDH and CTIDH.
(a) On a high level, we present faster key validation for large parameters, and add a small number of bits to public keys to improve shared-key generation in dCSIDH.
(b) On a low level, we improve the finite field arithmetic. Our implementations use curves over large prime fields F_p, where p ranges from 2048 to 9216 bits. We optimize arithmetic in these fields for 64-bit Intel processors, specifically the Skylake microarchitecture, using three different options for the underlying field arithmetic.
3. Practicality assessment of high-security CSIDH.
(a) As a standalone primitive, we benchmark our optimized C/assembly implementations. Our dCSIDH implementation outperforms previous implementations by a factor of up to 2.53×. Our CTIDH implementation is the first using large parameters and, dropping determinism, is three times as fast as dCSIDH.
(b) As a real-world use case, we benchmark both dCSIDH and CTIDH in real-world network protocols. We extend the Rustls library [11] to support OPTLS [38]. OPTLS is a variant of the TLS 1.3 handshake that heavily relies on a NIKE for authentication and avoids handshake signatures (which are especially large (Dilithium [47]) or hard to implement (Falcon [56]) in the post-quantum setting). We compare the performance of the resulting post-quantum OPTLS to post-quantum KEMTLS [61], an OPTLS-inspired protocol that uses KEMs for authentication to avoid handshake signatures (but requires significant changes to the handshake protocol). Our results show that dCSIDH and CTIDH are too slow for general-purpose use, as a fully CSIDH-instantiated handshake protocol, though smaller in bandwidth requirements, is orders of magnitude slower than an equivalent based on signatures or KEMs. This implies that current NIKE-based protocols will require changes to transition to post-quantum security if they are sensitive to latency.

Related work.
The impact of CSIDH on the community can be gauged from the many papers that have appeared around this protocol. Since Castryck, Lange, Martindale, Panny, and Renes [22] left open the problem of implementing CSIDH in constant time, several papers were published proposing different strategies for achieving this property. The first constant-time implementation of CSIDH was reported by Bernstein, Lange, Martindale, and Panny [9]. Their analysis focused on assessing the quantum security level provided by CSIDH. For this purpose, they strove to produce not only a constant-time CSIDH instantiation but also a randomness-free implementation of it. Meyer, Campos, and Reith [50] (see also [51]) presented a more efficient constant-time instantiation of CSIDH for practical purposes. They introduced several algorithmic tricks, including the SIMBA technique, and sampling secret keys from varying intervals. This was further improved by Onuki, Aikawa, Yamazaki, and Takagi [54], who proposed to keep track of two points to evaluate the action of an ideal: one in E(F_p), and one in E(F_{p^2}) with its x-coordinate in F_p. Moreover, Moriya, Onuki, and Takagi [52], and Cervantes-Vázquez et al. [23], performed more efficient CSIDH isogeny computations using the twisted Edwards model of elliptic curves. The authors of [23] proposed a more computationally demanding dummy-free variant of CSIDH, which, in exchange, is arguably better suited to resist physical attacks from stronger adversaries, such as fault attacks.
In a second wave of studies around CSIDH, several crucial building blocks were improved. [26,34] presented a framework that permits adapting the optimal strategies of SIDH/SIKE to the context of CSIDH. The computation of large-degree isogenies using an improved version of Vélu's formulas, known as √élu [8], was exploited in [1,5]. Variants of CSIDH were reported in [19,20,25].
A breakthrough in the performance of constant-time CSIDH was achieved by Banegas, Bernstein, Campos, Chou, Lange, Meyer, Smith, and Sotáková [5], resulting in an almost twofold speedup. They introduce a variant, named CTIDH, using a new key space and accompanying constant-time algorithms that exploit the idea of batching isogeny degrees. However, the performance evaluation of [5] is restricted to primes of 512 and 1024 bits. The authors of [24] presented SQALE, the first CSIDH implementation at higher security levels, with primes ranging from roughly 2000 up to 9000 bits. The software we present here starts from the analysis and parameter sizes proposed in [24] to reach NIST security levels 1 (equivalent to AES-128) and 3 (equivalent to AES-192) under different assumptions about the efficiency of quantum attacks. However, our results go much further than [24] with regard to optimizing parameters and implementation techniques.
CSIDH is not the only attempt at building a post-quantum NIKE. Although the SIDH protocol [28,35] was known to be insecure in the static-static scenario [31], it was suggested in [3] that a NIKE could still be obtained at the cost of many parallel executions of SIDH. However, SIDH was completely broken by the attacks in [18,48,59]. The only post-quantum NIKE that is not based on isogenies is based on (R/M)LWE and, according to Lyubashevsky, goes back to "folklore" [46]. Such a NIKE was first analyzed in [37]. A more concrete instantiation of this approach is the recently proposed Swoosh [30]. We discuss differences between CSIDH and Swoosh in a bit more detail in Section 7.
Availability of software. We place our CSIDH software into the public domain (CC0). All software described in this paper and all measurement data from the TLS experiments are available at https://github.com/kemtls-secsidh/code.
Organization of this paper. Section 2 presents the necessary background on isogeny-based cryptography and introduces CSIDH and its CTIDH instantiation. Section 3 explains how we instantiate dCSIDH and CTIDH and choose parameters for our optimized implementations. Section 4 introduces algorithmic optimizations that apply to our instantiations of dCSIDH and CTIDH. Section 5 details our optimization techniques for finite field arithmetic, in particular the efficient Karatsuba-based field arithmetic, and presents benchmarking results for the group action evaluation for dCSIDH and CTIDH. Section 6 describes our integration of dCSIDH and CTIDH into OPTLS and presents handshake performance results. Finally, Section 7 concludes the paper and sketches directions for future work.

NIKEs vs. KEMs
We briefly recall the definitions of non-interactive key exchange (NIKE) and key-encapsulation mechanism (KEM). Both NIKEs and KEMs can be used for key exchange, but the non-interactive nature of a NIKE makes it more flexible than a KEM. In the context of their use in protocols, there are three different scenarios:
1. Some scenarios naturally use a KEM. Those scenarios can alternatively also use a NIKE, but they do not benefit in any way from the non-interactive nature of a NIKE. An example of this scenario is the ephemeral key exchange in TLS 1.3, which currently uses (EC)DH, but will easily migrate to post-quantum KEMs [13,14,41,42,64].
2. Some protocols, most notably the X3DH protocol in Signal [49], have to use a NIKE and cannot replace this NIKE by a KEM. The reason is that such a protocol cannot assume communication partners to be online at the same time and critically relies on the non-interactive nature of a NIKE.
3. Some protocols are somewhat in between: they can be designed from KEMs only, but this comes at the cost of more communication rounds. This has been discussed in some detail in the design of post-quantum Noise [2] and also in the context of the NIKE-based OPTLS [38] vs. the KEM-based KEMTLS [61]. We revisit the comparison of these two protocols in a post-quantum context in more detail in Section 6.
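To make the interface difference concrete, the following minimal Python sketch contrasts the two APIs. It is instantiated with classical finite-field Diffie-Hellman purely as a stand-in (not post-quantum, toy group size, illustration only): a NIKE derives the same key from two static key pairs without interaction, while a KEM requires the encapsulator to send a ciphertext.

```python
import hashlib
import secrets

# Toy DH group: a Mersenne prime, far too small for real use.
p = 2**127 - 1
g = 3

def keygen():
    sk = secrets.randbelow(p - 2) + 1
    return sk, pow(g, sk, p)

def nike_shared_key(my_sk, their_pk):
    # NIKE: both sides derive the same key from static material alone.
    return hashlib.sha256(pow(their_pk, my_sk, p).to_bytes(16, "big")).digest()

def kem_encaps(their_pk):
    # KEM: an ephemeral key pair acts as the ciphertext.
    e, eph_pk = keygen()
    ss = hashlib.sha256(pow(their_pk, e, p).to_bytes(16, "big")).digest()
    return eph_pk, ss

def kem_decaps(my_sk, ct):
    return hashlib.sha256(pow(ct, my_sk, p).to_bytes(16, "big")).digest()
```

With static key pairs for Alice and Bob, `nike_shared_key(a_sk, b_pk)` equals `nike_shared_key(b_sk, a_pk)` with no message flow, which is exactly the property scenario 2 relies on; the KEM, by contrast, needs the ciphertext delivered before decapsulation can happen.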

The CSIDH NIKE
Background. Let F_p be a finite field of prime order p, such that p = f · g · ℓ_1 · · · ℓ_n − 1, where each ℓ_i is a small odd prime, f ≥ 4 is a power of two, and g is a cofactor guaranteeing that p is prime. Now consider the set of supersingular elliptic curves over F_p, i.e., the elliptic curves with p + 1 F_p-rational points. We will represent these curves in the Montgomery model, i.e., through an equation of the form

E_A : y^2 = x^3 + A·x^2 + x, with A ∈ F_p. (1)

This is possible since the group order (p + 1) is a multiple of 4. In the context of CSIDH we are interested in isogeny graphs of degree N, denoted G_N(F_p). The vertices of such graphs are precisely the supersingular curves over F_p; the edges are F_p-rational isogenies of degree N. CSIDH relies on the following property: for each small odd prime ℓ_i dividing p + 1, a supersingular curve E_A has only two (supersingular) neighbors in the isogeny graph G_{ℓ_i}(F_p) (i.e., isogenies over F_p of degree ℓ_i). We can uniquely describe these isogenies by their kernels: the unique cyclic subgroup of order ℓ_i of E_A(F_p) defines the isogeny from E_A to one of these neighbors E_A′. This cyclic subgroup can be described by any of its generators, which in this case means that finding a point in E_A(F_p) of order ℓ_i is enough to describe an isogeny of degree ℓ_i. As E_A′ is again supersingular, E_A′(F_p) has order p + 1 as well and hence a unique cyclic subgroup of order ℓ_i, which gives an isogeny to the unique neighbor that is not E_A. The action of moving in this direction in the graph G_{ℓ_i}(F_p) using the unique subgroup of order ℓ_i is denoted by l_i, and the curve E_A′ that is reached from E_A by this action is denoted l_i ∗ E_A. In short, l_i represents one step in the isogeny graph G_{ℓ_i}(F_p), and each small odd prime ℓ_i dividing p + 1 gives us such an l_i. Steps in G_{ℓ_i}(F_p), represented by l_i, are commutative, so that applying l_i to l_j ∗ E_A is the same as applying l_j to l_i ∗ E_A for different degrees ℓ_i and ℓ_j. We can also compute steps in the other direction, denoted by l_i^{-1} ∗ E_A. The subgroup of points of order ℓ_i with x-coordinate in F_p and y-coordinate in F_{p^2} \ F_p uniquely defines the corresponding isogeny kernels. Applying both l_i and l_i^{-1} effectively cancels out, i.e., we have l_i ∗ (l_i^{-1} ∗ E_A) = E_A.

The CSIDH scheme. The CSIDH scheme [22] unrolls naturally from the action described above: The secret key is a vector of n integers (e_1, . . . , e_n) defining the product a = l_1^{e_1} · · · l_n^{e_n}. In the original proposal the integers e_i are chosen from {−m, . . . , m} for some m ∈ N, which results in a key space of size (2m + 1)^n. The public key is the supersingular curve E_A = a ∗ E_0, which corresponds to the secret key a applied to a publicly known starting curve E_0. This public key E_A can be encoded by the single value A ∈ F_p (see Equation (1)). Shared-key computation is the same as public-key computation, except that instead of the public parameter E_0 it uses a public key E_A as input curve. That is, Alice and Bob compute their shared secret as a ∗ (b ∗ E_0) = b ∗ (a ∗ E_0) = (ab) ∗ E_0, which both parties reach thanks to the commutativity of the action.

Computing the group action a ∗ E. Straightforward high-level pseudocode for the computation of the group action a ∗ E is given in Algorithm 1. The dominating cost is the construction and evaluation of the ℓ_i-isogenies corresponding to the action of the l_i (Lines 5 and 7), which in turn decompose into a sequence of operations in F_p. However, the high-level view also illustrates an additional complication for secure implementations of CSIDH, namely that the number of iterations of the inner loop (Line 3) and the direction of the isogenies corresponding to the action of l_i (Line 4) depend on the secrets e_i; naive implementations thus leak secret information through timing.
Algorithm 1 High-level view of the CSIDH group action computation.
Input: I ∈ F_p defining a curve E_I
Input: secret key (e_1, . . . , e_n)
Output: the curve a ∗ E_I
 1: R ← E_I
 2: for i from 1 to n do
 3:     for j from 1 to |e_i| do
 4:         if e_i > 0 then
 5:             R ← l_i ∗ R
 6:         else
 7:             R ← l_i^{-1} ∗ R
 8:         end if
 9:     end for
10: end for
11: return R

For constant-time behavior, we need to be careful not to leak this information on e_i. Current implementations of CSIDH hide e_i by computing m isogenies per degree ℓ_i while effectively performing only |e_i| of them, e.g., by using dummy computations or computations that cancel each other out, such as l_i ∗ l_i^{-1} ∗ E. For the sake of simplicity, Algorithm 1 omits the description of several underlying building blocks. For example, the computation of an isogeny of degree ℓ_i requires as input a point of order ℓ_i. Points of a prescribed order can be obtained probabilistically by sampling random points on the current curve. Any randomly sampled point T can generate exactly one isogeny for each of the degrees ℓ_i that divide the order of T, by pushing T through such isogenies to obtain a corresponding point T′ on the codomain curve. When a point T can be used for multiple ℓ_i-isogenies, the order in which we perform them influences the performance. Hence, different strategies, i.e., orderings of ℓ_i-isogenies, point evaluations, and point multiplications, can affect performance. Several efficient strategies are described in, e.g., [22,26]. We describe our choices for the CSIDH group action computation in more detail in Section 3 and Section 4.
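The shape of the scheme and of Algorithm 1 can be illustrated with a toy commutative group action. This is an assumption-laden stand-in: the "curve" is an element of (Z/qZ)*, and each l_i acts by exponentiation with a small prime ℓ_i (invertible modulo q − 1), which commutes exactly like the ideal actions do; the real scheme walks isogeny graphs instead, and this toy offers no security whatsoever.

```python
import secrets

q = 10007                      # safe prime (q = 2*5003 + 1), toy size only
ells = [3, 5, 7, 11, 13]       # stand-ins for the isogeny degrees
E0 = 5                         # public starting "curve"

def act(e, x):
    # a * x with a = prod l_i^{e_i}: here l_i acts as x -> x^{ell_i} mod q.
    # Negative e_i use the inverse exponent mod q - 1 (needs Python >= 3.8;
    # each ell_i is coprime to q - 1, so the inverse exists).
    a = 1
    for e_i, ell in zip(e, ells):
        a = a * pow(ell, e_i, q - 1) % (q - 1)
    return pow(x, a, q)

def keygen():
    e = [secrets.choice((-1, 1)) for _ in ells]   # dCSIDH-style exponents
    return e, act(e, E0)

# Non-interactive shared key via commutativity: a * (b * E0) = b * (a * E0).
e_a, pk_a = keygen()
e_b, pk_b = keygen()
assert act(e_a, pk_b) == act(e_b, pk_a)
```

The final assertion is the whole NIKE: each party applies its own secret action to the other's public key and lands on the same "curve", just as in the CSIDH diagram.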

Computing a single isogeny

An isogeny of odd prime degree ℓ can be computed in multiple ways: Traditionally, the formulas introduced by Vélu [63] are used, at a cost of approximately 6ℓ field multiplications for an isogeny of degree ℓ. In 2020, [8] presented new formulas for constructing and evaluating isogenies of degree ℓ at a combined cost of just Õ(√ℓ) field multiplications, denoted as √élu. With respect to CSIDH, [1] reports that the √élu formulas of [8] improve on the traditional formulas for isogenies of degree ℓ ≥ 89, and concludes that constant-time CSIDH implementations using 511- and 1023-bit primes are moderately improved by the √élu formulas. The authors of [19] presented a variant of CSIDH named CSURF, which essentially proposes using 2-isogenies computed via radicals (i.e., by performing an exponentiation with a fixed exponent along with a field inversion). [21] extended the radical approach to compute isogenies of odd degrees less than 13. Both works report modest savings in the running time of CSIDH; essentially, CSURF can be considered CSIDH with radical isogenies of degree 2. On the one hand, the authors of [25] improved the formulas from [19,21] by presenting an inversion-free method to compute such radical isogenies at the cost of a single exponentiation. On the other hand, the recent work [20] provided some interesting improvements (in terms of field multiplications) to the results from [21]; these still require one exponentiation by a fixed exponent and at least one field inversion, which are the bottleneck. Nevertheless, [25] additionally showed that radical isogenies become too costly for large CSIDH parameters. On that basis, we do not make use of radical isogenies, as the analysis from [25] shows that they are unfavorable when the base field F_p is larger than 1024 bits.

CTIDH
Banegas, Bernstein, Campos, Chou, Lange, Meyer, Smith, and Sotáková [5] proposed a new approach for constant-time CSIDH, named CTIDH. The main novelties are a different way of specifying the key spaces, and some algorithmic adaptations in order to obtain a constant-time algorithm.
Instead of directly sampling the key elements e_{i,j} from some interval [−m, m] as in CSIDH, CTIDH only limits the 1-norm of each key batch. That is, for the i-th batch (ℓ_{i,1}, . . . , ℓ_{i,N_i}), we fix a bound m_i and sample corresponding key elements e_{i,j} such that ∑_{j=1}^{N_i} |e_{i,j}| ≤ m_i. This means that for each isogeny we compute for the i-th batch, its degree could be any of ℓ_{i,1}, . . . , ℓ_{i,N_i}. This adds a combinatorial advantage, in the sense that the same number of isogenies as in CSIDH leads to a much larger key space in CTIDH. In other words, CTIDH requires a smaller number of isogenies to reach the same key space size. For example, the fastest previous constant-time implementation of CSIDH-512 with key space size 2^256 required the computation of 438 isogenies, while the CTIDH parameters of [5] require only 208 isogenies for the same key space size. For details, we refer to [5]. We note that, as defined above, CSIDH is a special case of CTIDH using n batches of size 1.
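The combinatorial advantage can be checked by counting keys directly: the number of integer vectors e with ∑|e_j| ≤ m over a batch of N degrees is ∑_k 2^k C(N,k) C(m,k) (choose the k nonzero positions, their signs, and how the norm splits across them). The sketch below uses hypothetical batch configurations for illustration; the actual batches and bounds of [5] differ.

```python
from math import comb, log2

def l1_ball(N, m):
    # |{ e in Z^N : sum |e_j| <= m }|: pick k nonzero entries (C(N,k)),
    # their signs (2^k), and a positive composition of the norm (C(m,k)).
    return sum(2**k * comb(N, k) * comb(m, k) for k in range(min(N, m) + 1))

# CSIDH-512 baseline: n = 74 degrees, e_i in {-5,...,5} -> (2m+1)^n keys.
print(round(log2(11**74)))                # ~256-bit key space

# Batching: one batch of 4 degrees with total norm <= 5 yields far more
# keys than splitting the same norm budget over 4 singleton batches.
print(l1_ball(4, 5))                      # 681 keys from at most 5 isogenies
print(l1_ball(1, 2) * l1_ball(1, 1)**3)   # 135 keys from the same budget
```

Scaling this count over all batches is how one compares candidate CTIDH configurations against a target key space size such as 2^256.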

CTIDH algorithm.
The main problem for constant-time implementations with this adapted key space lies in the fact that we must hide the degree of each isogeny from side channels. Given that the computational effort for an isogeny directly depends on its degree, a straightforward implementation of CTIDH would leak the degree of each isogeny. In other words, an attacker must not be able to observe to which degree out of {ℓ_{i,1}, . . . , ℓ_{i,N_i}} each isogeny for the i-th batch corresponds. [5] achieves this by using an observation from [9]: the usual isogeny formulas [8,63] have a Matryoshka-doll structure. That is, if ℓ_i < ℓ_j, then an ℓ_j-isogeny performs exactly the computations that an ℓ_i-isogeny would require, plus some extra operations. Therefore, we can easily compute an ℓ_i-isogeny at the cost of an ℓ_j-isogeny by performing dummy operations for the extra steps. In CTIDH, we use this idea to compute each isogeny for the i-th batch (ℓ_{i,1}, . . . , ℓ_{i,N_i}) at the cost of the most expensive degree, i.e., an ℓ_{i,N_i}-isogeny. In this way, the isogeny degrees do not leak via timing channels.
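A structural sketch of the Matryoshka padding, under a hypothetical cost model in which an ℓ-isogeny needs (ℓ − 1)/2 kernel iterations (as in Vélu-style formulas). The degree-dependent selection is written as a plain conditional for readability; a real implementation performs it with constant-time masking in C/assembly.

```python
def kernel_iterations(ell):
    # Velu-style formulas iterate over half of the nonzero kernel points.
    return (ell - 1) // 2

def matryoshka_eval(ell, batch, step, state):
    # Compute an ell-isogeny at the cost of the largest degree in its batch:
    # the first kernel_iterations(ell) iterations do real work, the rest
    # are dummies, so the iteration count never depends on the secret ell.
    assert ell in batch
    for j in range(kernel_iterations(max(batch))):
        candidate = step(state)               # always computed
        if j < kernel_iterations(ell):        # masked select in real code
            state = candidate
    return state
```

With a batch (3, 5, 7), every isogeny runs 3 iterations whether its true degree is 3, 5, or 7, which is precisely why an observer counting iterations learns only the batch, not the degree.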
There are several other operations that require adjustments in CTIDH in order to obtain a constant-time implementation. For instance, this includes scalar multiplications that produce points of suitable order, or point rejections, which must occur independently of the required isogeny degree. For details on how these issues are resolved, we refer to [5].
Even though these algorithmic adjustments induce some computational overhead, CTIDH is almost twice as fast as its CSIDH counterpart for the CSIDH-512 and CSIDH-1024 parameter sets from [22] (see [5]).

Quantum security
While classical security imposes a restriction on the minimum key space size, quantum security usually poses more restrictive requirements. However, it is argued in [24] that for reasonable key spaces (that is, spaces large enough to achieve classical security), the quantum security of CSIDH relies only on the size of the prime p, regardless of the size of the actual key space being used. This is due to the fact that the most efficient quantum attack, Kuperberg's algorithm [40], requires working over a set with a group structure. Since the entire group representing all possible isogenies is of size roughly √p, this attack needs to search a space much larger than the key space itself, which only depends on n and the exponent bound m. For example, in the case of CSIDH-512, the element l_3 alone generates the entire group of size roughly 2^257 [10]. It is expected that a handful of l_i generate the entire group also for larger instances. In a nutshell, classical security is determined by the size of the key space, whereas quantum security is determined by the size of p, as long as the key space is not chosen particularly badly, e.g., as a small subgroup of the full class group.

Proposed instantiations of CSIDH
In this section, we describe how to instantiate and choose parameters for large-parameter CSIDH. We describe two different approaches to selecting parameters: dCSIDH targets a deterministic and dummy-operation-free implementation, whereas CTIDH optimizes for the batching strategies proposed in [5]. This reflects the two extreme choices one can make to either prioritize security against physical attacks or speed. We note that there are several choices in the middle ground, trading off physical security for speed. For comparability, both approaches share the choice of underlying finite fields F_p, which we detail in Section 3.1.

The choice of p
In this work, we take the conservative parameter suggestions from [24] at face value. In particular, we consider primes of 2048 and 4096 bits to target NIST security level 1, 5120 and 6144 bits to target NIST security level 2, and 8192 and 9216 bits to target NIST security level 3. Each pair of bitsizes represents a choice between more "aggressive" assumptions (with attacker circuit depth bounded by 2^60) or more "conservative" assumptions (attacker circuit depth bounded by 2^80). As stressed in [24], this choice of parameters does not take into account the cost of calls to the CSIDH evaluation oracle on a quantum computer and is likely to underestimate security. However, as discussed in Section 1, we merely aim at giving performance results for conservative parameters. All our implementations use primes of the form p = f · ℓ_1 · · · ℓ_n − 1, where the ℓ_i are distinct odd primes, f is a large power of 2, and n denotes the number of such ℓ_i dividing p + 1. For these sizes of p, it becomes natural to pick secret-key exponents e_i ∈ {−1, +1}, as n can be chosen large enough to reach the desired keyspace size [23,24]. In particular, to achieve a keyspace of b bits in CSIDH, we then need at least n = b such ℓ_i.
Finally, we restrict to cofactors f for which the power of 2 is a multiple of 64, since the arithmetic optimizations discussed in Section 5 require this shape. Hence, to find optimal primes for our implementation, we let ℓ_1, . . . , ℓ_b be the b smallest odd primes and then compute the cofactor f as the largest power of 2^64 that fits in the leftover bitlength. This still leaves us with a bitlength slightly smaller than the target, and hence the leftover bits can be used to search for additional factors ℓ_{b+1}, . . . , ℓ_n. These extra factors go unused for dCSIDH, where they are viewed as part of the cofactor, but are exploited by the batching strategies of CTIDH to increase performance. We set a minimum requirement of 5 additional ℓ_i factors (that is, n ≥ b + 5), decreasing f by a single factor of 2^64 when not enough bits were left over. The results of this search are shown in Table 1.
Table 1: Parameters for reconstructing each prime p = f · ℓ_1 · · · ℓ_n − 1. In each case the ℓ_i are assumed to be the first n odd primes, excluding some primes and including larger primes ℓ_i to ensure that p is prime. These are given in the Excluded and Included columns.
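The construction p = f · ℓ_1 · · · ℓ_n − 1 is easy to reproduce. The sketch below builds such a prime and Miller-Rabin-tests it; as a self-contained sanity check it reconstructs the classic CSIDH-512 prime (the 73 smallest odd primes together with 587, and f = 4). Note that f = 4 is not a multiple of 2^64; our large parameters use bigger cofactors and the excluded/included adjustments of Table 1, which are omitted here.

```python
import math
import random

def is_prime(n, rounds=40):
    # Miller-Rabin probabilistic primality test.
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def first_odd_primes(k):
    # First k odd primes by trial division (fast enough at this size).
    out, c = [], 3
    while len(out) < k:
        if all(c % q for q in out if q * q <= c):
            out.append(c)
        c += 2
    return out

# CSIDH-512: p = 4 * (l_1 ... l_73) * 587 - 1, l_i the smallest odd primes.
ells = first_odd_primes(73) + [587]
p = 4 * math.prod(ells) - 1
assert is_prime(p)
```

The search described above additionally varies f over powers of 2^64 and swaps individual ℓ_i in and out until the product minus one is prime at the target bitlength.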

Parameters for dummy-free, deterministic dCSIDH
The restriction of exponents to {−1, +1} makes it easier to make dCSIDH deterministic and dummy-free [23,24], as we always perform only one isogeny of each degree, with the only variable being the "direction" of each isogeny. Since isogenies in either direction require exactly the same operations, it is easy to obtain a constant-time implementation without using dummy operations. Randomness appears in the traditional CSIDH implementation because performing isogenies of degree ℓ_i requires a point of order ℓ_i as input, and such a point is obtained by sampling random points on the current curve. Any random point can either be used for "positive" steps l_i or "negative" steps l_i^{-1}. Hence, a point of order ℓ_i can be used only once and only for a specific orientation. Doing more than one isogeny of each degree therefore requires us to sample new points midway. However, by restricting e_i to {−1, +1}, we have to compute only one isogeny per degree ℓ_i. This allows us to avoid random sampling by providing a pair of points T+, T− beforehand whose orders are divisible by all ℓ_i, where T+ can be used for the positive steps l_i with e_i = 1, and T− for the negative steps l_i^{-1} with e_i = −1. We refer to such points as full-torsion points, as they allow us to perform an isogeny of every degree ℓ_i by multiplying them by the right scalar. That is, to perform an ℓ_i-isogeny in the "plus" direction, we can use the point [(p + 1)/ℓ_i] T+ of order ℓ_i. Note that the probability for the order of a random point to contain the factor ℓ_i is given by (ℓ_i − 1)/ℓ_i. Thus, sampling a pair of full-torsion points can be expensive when small factors ℓ_i are used, as they dominate the probability ∏_i (ℓ_i − 1)/ℓ_i of sampling a full-torsion point. Since the primes we use always have additional ℓ_i factors that are unused in dCSIDH (see Section 3.1), we make point sampling more efficient by always discarding the smallest primes rather than the largest ones, increasing the odds of sampling a full-torsion point. For example, the prime p4096 has 262 ℓ_i factors but only needs a keyspace of 2^256, hence we can discard 6 primes. By discarding the 6 smallest ones, the probability to sample a full-torsion point goes up to ∏_i (ℓ_i − 1)/ℓ_i ≈ 0.418, making it more than 2.7 times as easy to sample full-torsion points T+ and T−. Such a shift in primes causes a trade-off in the rest of the protocol, as higher-degree isogenies are more expensive. However, due to the improvements in [8], the extra cost of using ℓ_257, . . . , ℓ_262 instead of ℓ_1, . . . , ℓ_6 is relatively small in comparison to the total cost of a group action computation. Thus, discarding the smallest ℓ_i is preferable, as it significantly decreases the cost of sampling full-torsion points and only increases the cost of computing a ∗ E by a marginal amount.
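The effect of discarding the smallest degrees is easy to reproduce numerically. The sketch below approximates the p4096 setting with the first 262 odd primes; the exact set from Table 1 differs slightly because of the excluded/included primes, so the numbers land near, not exactly on, the values quoted above.

```python
def first_odd_primes(k):
    # First k odd primes by trial division (fast enough at this size).
    out, c = [], 3
    while len(out) < k:
        if all(c % q for q in out if q * q <= c):
            out.append(c)
        c += 2
    return out

def full_torsion_prob(ells):
    # A random point has order divisible by every ell with probability
    # prod (ell - 1) / ell over the required degrees.
    prob = 1.0
    for ell in ells:
        prob *= (ell - 1) / ell
    return prob

ells = first_odd_primes(262)                  # proxy for p4096's 262 factors
keep_smallest = full_torsion_prob(ells[:256]) # use the 256 smallest degrees
drop_smallest = full_torsion_prob(ells[6:])   # discard the 6 smallest instead
print(keep_smallest, drop_smallest, drop_smallest / keep_smallest)
```

The ratio comes almost entirely from the six smallest primes 3, 5, 7, 11, 13, 17, whose factors (ℓ − 1)/ℓ multiply to about 0.36; the six large primes gained at the top of the range cost almost nothing in sampling probability.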
The points T+, T− on the starting curve E_0 can be precomputed and considered public parameters, but for the public-key curves they must be computed in real time. We include the computation of these points in the key generation, and include them in the public key, which makes the shared-secret derivation completely constant-time and deterministic. The key generation is then the only part that does not run in strictly constant wall-clock time (yet is implemented following the constant-time paradigm), but it is still made deterministic by sampling points in a pre-defined order. As we describe in Section 4, these points can be represented in a very compact form, which increases public-key sizes by only a few bits. We further emphasize that, in order to avoid active attacks, the shared-key computation must validate that these transmitted points are full-torsion points.
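The validation requirement can be phrased precisely: a transmitted point T is full-torsion iff [(p + 1)/ℓ_i] T ≠ O for every used degree ℓ_i. A toy Python model follows, under the simplifying assumption that points are elements of the additive cyclic group Z/(p + 1); this captures only the order bookkeeping, not actual curve arithmetic.

```python
import math
import secrets

ells = [3, 5, 7, 11, 13]     # toy degree set
N = 4 * math.prod(ells)      # stands in for the group order p + 1

def is_full_torsion(T):
    # T is full-torsion iff [(p+1)/ell] T is nonzero for every ell,
    # i.e., every ell divides the order of T.
    return all((N // ell) * T % N != 0 for ell in ells)

def sample_full_torsion():
    # dCSIDH samples candidates in a pre-defined order for determinism;
    # this sketch simply retries with fresh randomness.
    while True:
        T = secrets.randbelow(N - 1) + 1
        if is_full_torsion(T):
            return T
```

A point such as N // 3 has order 3 and fails the check for every other degree, which is exactly the kind of malicious input the shared-key computation must reject.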
Following the SQALE implementation [24], we use the optimal strategy approach from [26] to efficiently evaluate the class group action.

Parameters for CTIDH
As mentioned above, the instantiations of dCSIDH that we use are designed as dummy-free and deterministic algorithms, in order to avoid potential issues with randomness and dummy operations. However, these choices induce significant computational overhead. Therefore, we additionally give performance results for CTIDH [5], the fastest available constant-time implementation of CSIDH (allowing randomness and dummy operations), at the same security levels, so that we can compare performance. Note that [5] only reports performance results for 512-bit and 1024-bit primes.
For the parameter sizes considered in this work, we thus use the same primes as in the dCSIDH case (see Table 1). This allows for a simple comparison of the two approaches, since both implementations use the same finite field arithmetic (see Section 5). On the other hand, it is unclear which parameters are optimal for CTIDH with the given prime sizes. A larger number of small prime factors ℓ_i in the factorization of p + 1 can be beneficial, since the combinatorial advantage of CTIDH batching increases with the number of available prime degrees. On the other hand, this would mean that we have to include larger ℓ_i, and therefore compute more expensive large-degree isogenies. Furthermore, the choice of CTIDH parameters, i.e., batches and norm bounds, becomes more challenging at larger prime sizes. We thus leave the exploration of optimal CTIDH parameters for large primes as future work.
For the given primes, we use the greedy algorithm from [5] to determine these additional parameters, adapted to the case of cofactors f > 4. On input the primes ℓ_i and a fixed number of batches, the algorithm searches for a locally optimal way of batching the primes, and corresponding norm bounds, such that the expected number of field multiplications per group-action evaluation is minimized. However, for the parameter sizes in this work, the greedy search becomes increasingly inefficient. We could thus only run searches for a small set of potential batch numbers, especially for the larger parameters. We obtained these candidate inputs by extrapolating from the data for smaller parameter sizes from [5] and slightly beyond. For concrete parameter choices, we refer to our software. Note that a different number of batches could improve the results, but an exhaustive search using the greedy algorithm seems out of reach.
Apart from the parameters and batching setup, our CTIDH implementation uses the algorithms and strategies from [5]. We remark that CTIDH could in theory also be implemented in a dummy-free or deterministic way. [5] presents an algorithm that avoids dummy isogenies, but points out that the Matryoshka isogenies require dummy operations by design. Thus, current techniques do not allow for a dummy-free implementation of CTIDH. Further, the design of a deterministic variant of CTIDH requires some adaptations, such as computing multiple isogenies per batch in a single round. We leave the design and analysis of such an implementation for future work.

Optimizing dCSIDH and CTIDH
Given the parameter choices from Section 3, we describe the high-level optimizations we apply to dCSIDH and CTIDH. Note that, apart from the improved public-key validation, we use the standard CTIDH implementation from [5], extended to the parameter sizes from Section 3.3. For dCSIDH, we present several improvements in Section 4.2.

Supersingularity verification
For the prime choices from Section 3.1, we need to adapt the supersingularity verification from [22]. In particular, given primes with cofactor log f > (log p)/2, both algorithms discussed in [22, Alg. 1 and Alg. 3] to test supersingularity of a public key E_A do not work.
Note that these supersingularity tests verify whether #E_A(F_p) = p + 1 by exhibiting a point P of large enough order N | p + 1. Both algorithms start by sampling a random point P, followed by a multiplication by the cofactor, P ← [f]P, and then check whether the resulting point has ℓ_i-torsion. This is done by verifying that [∏_{j≠i} ℓ_j]P ≠ O while [∏_j ℓ_j]P = O. If the random point P has ℓ_i-torsion for enough ℓ_i such that their product satisfies ∏ ℓ_i ≥ 4√p, then p + 1 is the only possible multiple of its order ord(P) in the Hasse interval p + 1 − 2√p ≤ #E_A(F_p) ≤ p + 1 + 2√p. This implies that #E_A(F_p) = p + 1.
Unfortunately, this approach cannot be applied to our setting: for primes where log f > (log p)/2, even a point with ℓ_i-torsion for all i does not reach the threshold 4√p, as log(∏ ℓ_i) = log p − log f < (log p)/2. We conclude that, due to the large cofactors of the primes targeted in this work, [22, Alg. 1 and Alg. 3] cannot perform a sound supersingularity test within our setting.
Luckily, for primes as above, where f = 2^k, we can improve this algorithm to verify supersingularity: instead of verifying that the order of a random point P has enough ℓ_i-torsion, we verify that P has 2^k-torsion. When log f = k > (log p)/2, verifying that P has 2^k-torsion implies that E_A must be supersingular by the same logic as above. Furthermore, for Montgomery curves E_A, we can sample P directly from E_A(F_p) \ [2]E_A by picking a point with rational non-square x-coordinate [27]. This ensures we always sample P with maximal 2^k-torsion. Using x-only arithmetic, we only have to keep track of x_P. We name this approach to verify supersingularity VeriFast, as described in Algorithm 2. VeriFast can be performed deterministically or probabilistically: given a point with rational non-square x-coordinate, the algorithm always returns v_2 = k in case of supersingularity. Otherwise, any random point is likely to have v_2 close to k, and hence still verifies supersingularity if the cofactor is a few bits larger than 4√p. For the probabilistic approach, we pick x_P = 2 ∈ F_p, hence P = (2, −), for all supersingularity checks. This has the advantage that multiplication by 2 can be performed as a simple addition; hence x_P = 2 optimizes the arithmetic in the computation of x_P ← [(p+1)/f] x_P. Furthermore, the bound 4√p can be improved to 2√p, as this still implies that p + 1 is the only multiple of ord(P) in the Hasse interval. VeriFast is faster than any of the algorithms analyzed in [6], with a cost of O(log p). More specifically, it requires a scalar multiplication by a scalar of log p − k bits and (at most) k point doublings, where f = 2^k is the cofactor. In comparison to Doliskani's test [6,29], also of complexity O(log p), we have the advantage that we can stay over F_p. The condition log f > 1 + (log p)/2 holds for our primes p5120 and beyond. More importantly, even with the probabilistic approach, for these primes the probability to sample a point that does not have large enough 2^k-torsion is lower than 2^−256. For the primes where log f ≤ 1 + (log p)/2, we can still use the 2^k-torsion as in VeriFast, but we are required to also verify some ℓ_i-torsion to cross the bound 2√p. A comparison of performance between VeriFast and previous methods is given in Table 2, showing that VeriFast is 28 to 38 times as fast for large primes.
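To illustrate the mechanics on a toy example (this is not the paper's implementation), the sketch below runs the VeriFast doubling count on the classic toy prime p = 4·3·5·7 − 1 = 419 with E_0 : y² = x³ + x, using plain affine arithmetic instead of x-only formulas for readability. For this toy prime the cofactor is far below the 2√p threshold, so only the doubling-count mechanics carry over, not the security argument.

```python
# Toy VeriFast sketch: clear the odd cofactor, then count doublings to infinity.
p, A, k = 419, 0, 2                 # p + 1 = 2^k * 3 * 5 * 7; E0: y^2 = x^3 + x

def inv(a):
    return pow(a, p - 2, p)

def dbl(P):
    """Double a point on y^2 = x^3 + A*x^2 + x; None is the point at infinity."""
    if P is None:
        return None
    x, y = P
    if y == 0:                      # 2-torsion point doubles to infinity
        return None
    lam = (3 * x * x + 2 * A * x + 1) * inv(2 * y) % p
    x3 = (lam * lam - A - 2 * x) % p
    return (x3, (lam * (x - x3) - y) % p)

def add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        return dbl(P) if (y1 + y2) % p != 0 else None
    lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - A - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def mul(n, P):
    """Double-and-add scalar multiplication."""
    R = None
    while n:
        if n & 1:
            R = add(R, P)
        P = dbl(P)
        n >>= 1
    return R

def two_adic_valuation_of_order(P):
    """VeriFast core: Q = [(p+1)/2^k] P, then count doublings until infinity."""
    Q = mul((p + 1) >> k, P)
    v2 = 0
    while Q is not None:
        Q, v2 = dbl(Q), v2 + 1
    return v2

# Sample P with non-square x-coordinate: such a point lies outside [2]E
# and therefore has full 2^k-torsion on a supersingular curve.
x = next(x for x in range(2, p)
         if pow(x, (p - 1) // 2, p) == p - 1                # x non-square
         and pow(x * x * x + x, (p - 1) // 2, p) == 1)      # (x, y) lies on E0
y = pow((x * x * x + x) % p, (p + 1) // 4, p)               # sqrt, as p = 3 mod 4
P = (x, y)
print(two_adic_valuation_of_order(P))  # → 2, i.e. v2 = k
```

Since E_0 is supersingular, #E_0(F_419) = 420 and the doubling count always returns exactly k = 2 for a point sampled this way.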

Optimized dCSIDH public keys
As described in Section 3.2, dCSIDH is dummy-free and deterministic by using secret key exponents e_i ∈ {−1, 1} and public keys of the form (A, T_+, T_−). Recall that T_+ and T_− are full-torsion points that can be used to perform positive steps l_i^{+1} and negative steps l_i^{−1}, respectively. For sampling suitable points T_+ and T_− for public keys during key generation, we use the Elligator map (A, u) → (T_+, T_−) from [23], with Montgomery parameter A ∈ F_p and an Elligator seed u ∈ F_p. The output of Elligator is exactly such a pair of points T′_+ and T′_−, although they might not be full-torsion, that is, their respective orders might not be divisible by all ℓ_i. Let P be either T_+ or T_−. To efficiently determine whether P is a full-torsion point, we follow the usual product-tree approach that was also applied for public-key validation in [22]. This requires computing [(p+1)/ℓ_i]P for each ℓ_i and checking that these points are not equal to the point at infinity. In order to obtain a deterministic algorithm, we try Elligator seeds from a pre-defined sequence (u_1, u_2, …) until we find full-torsion points T_+ and T_−. To determine which of the two output points is T_+ resp. T_−, Elligator requires a Legendre symbol computation. In the case of our proposed dCSIDH configuration with public inputs A and u, we can use a fast non-constant-time algorithm for the Legendre symbol computation, such as the one presented by Hamburg [33].
Thus, a dCSIDH public key consists of an affine Montgomery coefficient A ∈ F_p and an Elligator seed u ∈ F_p such that elligator(A, u) returns two full-torsion points T_+ and T_− on E_A. We choose the fixed potential values for u to be small, to obtain a public key (A, u) of only log_2(p) + ε bits for small ε > 0.
Finally, a user has to verify such a public key (A, u). For A, we verify that E_A is supersingular as described in Section 4.1. For u, we verify that it generates two full-torsion points T_+ and T_−, by ensuring during the computation of each step l_i^{±1} ∗ E that the correct multiples of both T_+ and T_− are not the point at infinity (i.e., both have order divisible by ℓ_i), regardless of which point we use to compute the step.
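The full-torsion test itself is just an order check. As a minimal sketch (not the paper's code), the following uses the multiplicative group of F_421, which has smooth order 420 = 4·3·5·7, as a stand-in for E(F_p) of order p + 1; on the curve, the exponentiation g^(N/ℓ) corresponds to computing [(p+1)/ℓ_i]P via a product tree and comparing against the point at infinity.

```python
q = 421                     # toy prime: q - 1 = 4 * 3 * 5 * 7
N = q - 1                   # group order of F_q^*, stand-in for #E(F_p) = p + 1
ells = [3, 5, 7]            # odd prime degrees to check

def is_full_torsion(g):
    """g is full-torsion iff its order is divisible by every ell,
    i.e. g^(N/ell) != 1 (the stand-in for [(p+1)/ell]P != O)."""
    return all(pow(g, N // ell, q) != 1 for ell in ells)

# Deterministic search, mirroring the pre-defined Elligator seed sequence:
# try candidates in a fixed order until one passes the check.
g = next(g for g in range(2, q) if is_full_torsion(g))
print(g)
```

The same fixed search order makes the procedure deterministic, as required for dCSIDH key generation.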
Remark 1. An alternative to finding and including an Elligator seed u ∈ F_p in the public key is to find and include small x-coordinates x_+ and x_− that define full-torsion points T_+ = (x_+, −) and T_− = (x_−, −). Information-theoretically, u and the pair (x_+, x_−) have similar probabilities of generating full-torsion points, and hence their bitlengths should be comparably small. One advantage of x_+ and x_− is that they can be found individually, which should speed up their search. We choose, however, the more succinct approach using u and Elligator.

Implementation
In this section, we describe the optimization steps at the level of field arithmetic that speed up both variants of CSIDH we consider. First and foremost, to enable a fair comparison, we implement a common code base for dCSIDH and CTIDH. Besides sharing the same field arithmetic, both instantiations of CSIDH share all the underlying functions required for computing the group action. However, some required parameters and the strategy within the group action differ strongly between dCSIDH and CTIDH. In the case of dCSIDH, the group-action strategy and all required parameters are based on the implementation provided by [24]. In the case of CTIDH, we generate the batching and other parameters using the methods provided by [5].

Low-level approaches for the field arithmetic layer
For the underlying field arithmetic, we implement three different approaches. They all share the representation of integers in radix 2^64 and use Montgomery arithmetic for efficient reduction modulo p.
1. To establish a performance baseline, our first method uses the low-level functions for cryptography (mpn_sec_) of the GNU Multiple Precision Arithmetic Library (GMP). Modular multiplication uses a combination of mpn_sec_mul and mpn_add_n to implement Montgomery multiplication, i.e., interleaving multiplication with reduction. We refer to this first approach as GMP.
2. The second approach extends the optimized arithmetic from [22], using the MULX instruction, from 512-bit and 1024-bit integers to the larger sizes we consider in this paper. Here, we also interleave multiplication with reduction; we generate code for all field sizes from a Python script. We refer to this second approach as OpScan.
3. Our third strategy uses Karatsuba multiplication [36] together with the MULX optimizations used in our second approach. We describe this strategy, and in particular an optimized reduction for primes of 5120 bits and above, in more detail in Section 5.2. We refer to this third approach as Karatsuba.
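The interleaving of multiplication and reduction used by the first two approaches can be sketched as a word-by-word (CIOS-style) Montgomery multiplication; the Python below is illustrative only, standing in for the hand-optimized limb arithmetic.

```python
W = 64  # bits per limb, matching the radix-2^64 representation

def mont_mul(a, b, p, n):
    """Interleaved Montgomery multiplication: returns a * b * 2^(-64*n) mod p
    for odd p < 2^(64*n) and 0 <= a, b < p. Each loop iteration adds one
    limb-product of a*b and immediately performs one reduction step."""
    n0 = -pow(p, -1, 1 << W) % (1 << W)    # -p^(-1) mod 2^64
    t = 0
    for i in range(n):
        ai = (a >> (W * i)) & ((1 << W) - 1)
        t += ai * b                        # multiplication step (one limb of a)
        m = (t * n0) & ((1 << W) - 1)      # Montgomery quotient for this step
        t = (t + m * p) >> W               # reduction step: divide by 2^64
    return t - p if t >= p else t          # final conditional subtraction
```

For example, with p = 2^255 − 19 (n = 4 limbs), mont_mul(a, b, p, 4) returns a·b·2^(−256) mod p, matching the Montgomery-domain convention used throughout.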
We follow the earlier optimization efforts for CSIDH from [5,22,24] and focus on optimizing our code primarily for Intel's Skylake microarchitecture. More specifically, we perform all benchmarks on one core of an Intel Xeon E3-1260L (Skylake) CPU with hyperthreading and TurboBoost disabled. An overview of (modular) multiplication performance of the three approaches for the different field sizes is given in Table 3. In the following, we focus on describing the fastest of the three strategies, i.e., Karatsuba, in more detail.

Optimized field arithmetic using MULX and Karatsuba
We present scripts to generate optimized code using the Karatsuba approach, based on the OpScan approach. More precisely, compared to the OpScan approach, we achieve speedups for multiplication, squaring, and reduction.

Table 3: Benchmarking results for multiplication and reduction. Numbers are median clock cycles of 100000 runs on a Skylake CPU. Note that for the OpScan and the GMP approach, we can only provide clock cycles for multiplication including reduction, due to the interleaved Montgomery reduction.

Multiplication. The implementation of Karatsuba follows careful considerations to optimize performance. To improve efficiency, we select a breakout level into a MULX-based schoolbook multiplication with a maximum of 9 × 9 limbs. By choosing this threshold, the implementation aims to strike a balance between utilizing the benefits of Karatsuba's divide-and-conquer strategy and minimizing the overhead of stack operations. This leads to the following numbers of Karatsuba layers: 2, 3, 4, 4, 4, and 4 for the cases p2048, p4096, p5120, p6144, p8192, and p9216, respectively. To further enhance the speed of the implementation, the assembly code avoids function calls. By generating the assembly code dynamically, the implementation can adapt to different prime sizes and adjust the multiplication algorithm accordingly.
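A language-independent sketch of this recursion (the actual code is generated x86-64 assembly) with the 9-limb breakout, plus the resulting layer counts for the limb sizes used here:

```python
W = 64          # bits per limb
BREAKOUT = 9    # switch to schoolbook multiplication at <= 9 limbs

def karatsuba(a, b, limbs):
    """Multiply two limbs-limb integers with Karatsuba recursion."""
    if limbs <= BREAKOUT:
        return a * b                 # stands in for the MULX schoolbook kernel
    half = limbs // 2
    mask = (1 << (W * half)) - 1
    a0, a1 = a & mask, a >> (W * half)
    b0, b1 = b & mask, b >> (W * half)
    z0 = karatsuba(a0, b0, half)
    z2 = karatsuba(a1, b1, limbs - half)
    z1 = karatsuba(a0 + a1, b0 + b1, limbs - half + 1) - z0 - z2
    return z0 + (z1 << (W * half)) + (z2 << (2 * W * half))

def layers(limbs):
    """Number of Karatsuba layers before the schoolbook breakout."""
    return 0 if limbs <= BREAKOUT else 1 + layers(limbs - limbs // 2)

# Limb counts for p2048 ... p9216 reproduce the layer counts quoted above.
print([layers(n) for n in (32, 64, 80, 96, 128, 144)])  # → [2, 3, 4, 4, 4, 4]
```

With this threshold, for instance, a p4096 multiplication (64 limbs) recurses 64 → 32 → 16 → 8 before hitting the schoolbook kernel, giving the 3 layers listed above.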
Squaring. For squaring, we take advantage of the fact that certain partial products (a_i a_j with i ≠ j) only need to be calculated once and can then be accumulated/used twice. On the lowest level of Karatsuba, where the schoolbook multiplication takes place, we implement a squaring function with the corresponding savings based on the lazy-doubling method [43], adapting the assembly code of the squaring function of the GMP library. For a given n, the implemented method achieves the lower bound of (n² − n)/2 + n required word multiplications. Furthermore, we save additions on the higher levels of Karatsuba by reusing calculated values. However, as shown in Table 4, due to the chosen breakout into schoolbook multiplication and the number of available registers, the effort for dealing with the carry chains only leads to a maximum speedup of 17%. Adding a layer of Karatsuba to reduce the number of limbs for the schoolbook multiplication leads to a speedup at this level; overall, however, extra layers negate the speedups gained from reducing limbs.

Montgomery reduction. For the cases p ∈ {p5120, p6144, p8192, p9216}, the reduction is calculated according to the intermediate Montgomery reduction [4]. For this, we use Montgomery-friendly primes of the form p = f · ∏_{i=1}^{n} ℓ_i − 1 with cofactor f = 2^{e_2}, where e_2 ≥ log_2(p)/2. Table 1 shows the respective values of f, and accordingly e_2, for all chosen primes.
As shown in Algorithm 3, the basic idea of this reduction is to perform two Montgomery-reduction steps modulo 2^{e_2} instead of n steps modulo 2^w as in the standard Montgomery reduction. Based on this reduction approach, we can further apply the available Karatsuba-based multiplication when calculating q_0 × α and q_1 × α (see Lines 2 and 4 in Algorithm 3), leading to further speedups.
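The two-step structure can be sketched as follows, using a toy modulus for illustration (the actual implementation works on 64-bit limbs in assembly). Since p ≡ −1 (mod 2^{e_2}), the Montgomery quotient of each step is simply the low e_2 bits of the accumulator, and each step maps c to (c >> e_2) + (c mod 2^{e_2}) · α, which is congruent to c · 2^{−e_2} (mod p):

```python
alpha = 3 * 5 * 7 * 11 * 13        # toy odd part; the real alpha is the product of the ell_i
e2 = 16                            # toy exponent satisfying e2 >= log2(p) / 2
p = (alpha << e2) - 1              # Montgomery-friendly: p = 2^e2 * alpha - 1

def redc(c):
    """Return c * 2^(-2*e2) mod p for 0 <= c < p^2, using two reduction
    steps modulo 2^e2. The multiplications by alpha are where the
    Karatsuba routine is reused in the real implementation."""
    for _ in range(2):
        m = c & ((1 << e2) - 1)    # Montgomery quotient: -p^(-1) = 1 mod 2^e2
        c = (c >> e2) + m * alpha  # equals (c + m*p) / 2^e2
    while c >= p:                  # final conditional subtractions
        c -= p
    return c

# Montgomery multiplication with R = 2^(2*e2): redc(a*b) = a*b*R^(-1) mod p.
a, b = 123456789 % p, 987654321 % p
print(redc(a * b) == a * b * pow(1 << (2 * e2), -1, p) % p)  # → True
```

The design choice is that p ≡ −1 (mod 2^{e_2}) makes the per-step quotient computation free, so the whole reduction reduces to two shifted multiplications by α.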
For the cases p ∈ {p2048, p4096}, the respective primes cannot fulfill the described requirements. Hence, we implement the word version of the Montgomery reduction from [4]; this approach reduces the number of limbs to be multiplied depending on the value of e_2. We show the results for the corresponding reduction in Table 3.

Performance results
We demonstrate the performance increase due to the high-level improvements from Section 4 and the low-level improvements from Section 5.2 for dCSIDH and CTIDH in Table 5. We compare our results to [24], the only other available implementation of CSIDH for similar parameters that lists performance numbers. For parameter sizes above p5120, our implementation of dCSIDH is between 55% and 60% faster than SQALE (dummy-free), and CTIDH consistently achieves a speed-up of almost 75% compared to SQALE (OAYT).
In [45], the authors proposed a novel approach for computing sums of products over large prime fields, achieving a significant performance impact. However, since the primes in our work support very fast reduction, applying the approach from [45] would not gain a significant advantage. Further, a direct performance comparison is unfortunately difficult due to the different underlying fields.

Non-Interactive Key Exchange in Protocols
Diffie-Hellman (DH) key exchange is probably the most well-known example of a NIKE protocol, even if it is often used as a "simple" interactive key exchange.
One such example is TLS, where ephemeral DH key exchange is authenticated via a signature.This key exchange can be replaced with a KEM, as shown in [13].
Experiments by Google and Cloudflare [14,41,42] used the same approach. However, in two scenarios the inherently interactive character of a KEM creates issues for protocol designers. When used with long-term keys (and a suitable PKI), a NIKE allows a user Alice to send an authenticated ciphertext to an offline user Bob. Signal's X3DH handshake [49] is a notable example using this feature of NIKEs. Indeed, [15] shows that a naive replacement of the DH operations by KEMs does not work.
In the early stages of the development of TLS 1.3, Krawczyk and Wee proposed OPTLS [38], a variant that uses DH key exchange not only for ephemeral key exchange, but also for authentication. Many elements of this proposal made it into the eventual RFC 8446 [57]. Though the standard reverted to handshake signatures, the idea lives on in an Internet-Draft [58].
As Kuhnen pointed out, OPTLS does use the non-interactive property of DH [39]. As part of the ephemeral key exchange, the client sends their ephemeral DH public key. For authentication, the server takes this ephemeral key share and combines it with their long-term DH key. The resulting shared secret is used to compute a MAC, which is used in place of the signature in the CertificateVerify message. This computation proves the server's possession of the long-term secret key corresponding to the public key in the certificate. The client can compute the same shared secret by combining its ephemeral secret DH key with the certified public key, and thus verify the MAC.
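As a minimal sketch of this flow, with classic finite-field DH standing in for CSIDH (all group parameters, labels, and key-derivation details here are illustrative, not taken from the OPTLS specification):

```python
import hashlib
import hmac
import secrets

q = 2**127 - 1          # toy DH group modulus; illustrative only
g = 5                   # toy generator

def keygen():
    sk = secrets.randbelow(q - 2) + 2
    return sk, pow(g, sk, q)

def shared(sk, pk):
    """Toy DH shared secret; CSIDH's group action plays this role in OPTLS."""
    return pow(pk, sk, q).to_bytes(16, "big")

s_sk, s_pk = keygen()   # server long-term key (certified)
c_sk, c_pk = keygen()   # client ephemeral key, sent in ClientHello

# Server: combine the client's ephemeral share with its long-term key and
# MAC the handshake transcript in place of a CertificateVerify signature.
transcript = hashlib.sha256(b"ClientHello..ServerHello").digest()
mac_key = hashlib.sha256(shared(s_sk, c_pk)).digest()
server_mac = hmac.new(mac_key, transcript, hashlib.sha256).digest()

# Client: recompute the same secret from the certified public key and verify,
# thereby authenticating the server without any online signature.
mac_key_c = hashlib.sha256(shared(c_sk, s_pk)).digest()
ok = hmac.compare_digest(
    server_mac, hmac.new(mac_key_c, transcript, hashlib.sha256).digest())
print(ok)  # → True
```

The key point is that only the client's share is ephemeral: the server's contribution is its static, certified key, which is exactly the non-interactive usage a KEM cannot provide directly.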

Post-Quantum TLS without signatures
In a naive instantiation of an OPTLS-like protocol with KEMs, we require an additional round-trip. To compute the authentication message, the server needs to first receive the ciphertext that was encapsulated against the long-term public key held in its certificate, which the client cannot send before having received it from the server. The KEMTLS proposal by Schwabe, Stebila, and Wiggers avoids this issue partially by letting the client transmit data immediately after computing and sending the ciphertext to the server [61]. This relies on the fact that any keys derived from the shared secret encapsulated to the server's long-term key are implicitly authenticated. KEMTLS has the advantage of not having to compute any typically expensive and/or large post-quantum signatures during the handshake protocol. Only the variant that assumes the client already has the server's public key, for example through caching, can achieve a protocol flow similar to OPTLS and TLS 1.3 [60]. In that flow, the server can send authenticated data immediately in its first response to the client.
However, as CSIDH does provide a post-quantum NIKE, we can use it to instantiate post-quantum OPTLS and avoid any online post-quantum signatures. Because OPTLS immediately confirms the server's authenticity, its handshake has the same number of message transmissions as TLS 1.3, and there is no need to rely on implicit authentication.
Integrating our implementations in OPTLS gives us an understanding of how CSIDH affects the performance of real-world network protocols, which will typically feature similar cryptographic operations and transmissions.

Benchmarking set-up
Integration into Rustls. To investigate the performance of OPTLS with CSIDH, we integrate our optimized implementations into the implementation and measurement framework of the authors of KEMTLS. As a side effect of this work, we also provide a Rust wrapper around our C implementations. We add OPTLS to the same modified version of Rustls [11] used to implement KEMTLS. This allows us to straightforwardly compare to KEMTLS and TLS 1.3 handshakes instantiated with post-quantum primitives.
Group operations and caching ephemeral key generation. Each OPTLS handshake requires a large number of group operations, namely:
1. generation of the ephemeral key of the client;
2. generation of the ephemeral key of the server;
3. the server's computation of the ephemeral shared secret;
4. the client's computation of the ephemeral shared secret;
5. the server's computation of the authentication shared secret; and
6. the client's computation of the authentication shared secret.
Unfortunately, due to the order of the handshake messages and the requirements for handshake encryption, most of these computations need to be done in order and cannot really be parallelized. However, we can avoid the cost of CSIDH key generation by caching ephemeral keys. This reduces forward secrecy, but emulates a best-case scenario for CSIDH-based OPTLS in which the keys are generated "offline", outside the handshake context. We exclude all first TLS handshakes from clients and servers from our measurements, to exclude this key-generation time: in the pregen OPTLS instances, all subsequent handshakes use the same public key material. In the ephemeral OPTLS instances, we generate ephemeral keys in each handshake.
Note that because OPTLS combines the ephemeral and static keys, all keys need to use the same algorithm, and we cannot use a faster KEM for the ephemeral key exchange.
Measurement setup. We run all TLS handshake experiments on a server with two Intel Xeon Gold 6230 CPUs, each featuring 20 physical cores. This gives us 80 hyperthreaded cores in total. For these experiments, we do not disable hyperthreading or frequency scaling, as these features would also be enabled in production scenarios. We run 80 servers and clients in parallel, as each pair of client and server roughly interleaves its execution. We collect 8000 measurements per experiment. Every 11 handshakes, we restart the client and server, so that we measure many ephemeral keys even in the scenarios that use ephemeral-key caching. We exclude the first handshake from the measurements to allow for cache warm-up and ephemeral-key generation in the caching scenario.
As in the KEMTLS papers [60,61], we measure the performance of the different TLS handshakes over two network environments: a low-latency connection with 30.9 ms round-trip time (RTT) and 1000 Mbps bandwidth, and a high-latency connection with 195.5 ms RTT and 10 Mbps bandwidth. The former represents a continental, high-bandwidth connection, while the latter represents a transatlantic connection.

Benchmarking results
In Table 6, we compare OPTLS with dCSIDH and CTIDH with the performance of instantiations of KEMTLS and TLS 1.3.
Comparing the sizes of the handshakes, OPTLS requires fewer bytes on the wire, as it only needs to transmit two ephemeral public keys and one static public key (and the CA signature).KEMTLS requires an additional ciphertext, and TLS an additional signature.
In OPTLS, as in TLS 1.3, the client receives the server's handshake completion message ServerFinished (SFIN) first and then sends the ClientFinished (CFIN) message (and its request) immediately after. In KEMTLS, SFIN and full server authentication are received a full round-trip after CFIN is sent. However, it is clear that the runtime requirements of dCSIDH are almost insurmountable, even for the smallest parameters (p2048). Even CTIDH, which is much more efficient, is orders of magnitude slower than the KEMTLS and OPTLS instances. If the more conservative p4096 prime is required for NIST level 1 security, even CTIDH handshakes do not complete in under 30 seconds. Due to a better reduction algorithm, the p5120 prime performs roughly on par with p4096, while providing NIST level 2 security in the aggressive analysis. As discussed for (KEM)TLS in [32], in constrained environments, such as 46 kbps IoT networks, the transmission size can in certain scenarios become the dominant factor instead of computation time. However, given the results shown here, we expect the environments in which CSIDH-based OPTLS instances are competitive to be very niche. To overcome 7 seconds of computational latency, the network needs to take more than 7 seconds to transmit the additional data required for, e.g., TLS 1.3 with Dilithium. This suggests link speeds of less than 1 kilobyte per second. Additionally, such environments often rely on microcontrollers that are much less performant than the Intel CPUs on which we run our implementations.
Interestingly, the CSIDH experiments run on the high-latency, low-bandwidth network show slightly lower latencies than those on the high-bandwidth, low-latency network. We suspect that this is due to an interaction with the TCP congestion control algorithm's transmission windows.

Conclusion and future work
In this paper, we presented low-level and high-level optimizations for CSIDH at larger parameter sets, focusing on deterministic and dummy-free behavior in dCSIDH, and on speed in CTIDH. These optimizations achieve impressive results on their own: dCSIDH is almost twice as fast as the state of the art, and CTIDH, dropping determinism, is again three times as fast as dCSIDH. Further optimizations of the field arithmetic, e.g., by utilizing the vector-processing capabilities of modern processors, might lead to additional speed-ups.
Nevertheless, when integrated into the latency-sensitive TLS variant OPTLS, both implementations still have too-large handshake latency in comparison to TLS or KEMTLS using lattice-based KEMs.We conclude that the reduced number of roundtrips, through the non-interactive nature of CSIDH, does not make up for the performance hit.
However, for truly non-interactive, latency-insensitive settings that cannot replace NIKEs by KEMs, the performance of CSIDH may be sufficient even at high security levels. This includes, for example, using CSIDH in X3DH [49] for post-quantum Signal, as it would incur a delay of only seconds when sending the first message to another user (who might be offline, thus ruling out KEM-based interactive approaches).
Unless significant performance improvements occur for CSIDH at large parameter sets, or the quantum-security debate shifts in favor of 512- to 1024-bit parameter sets, we conclude that CSIDH is unlikely to be practical in real-world applications, outside of those that specifically require NIKEs.
It will be interesting to investigate how CSIDH and Swoosh, the only two current proposals for a post-quantum NIKE, compare in a protocol context. There is no full implementation of Swoosh yet; the cycle counts reported in [30] are for the passively secure core component only. Based on the available figures, it seems likely that Swoosh computationally outperforms CSIDH at the large parameters we consider in this paper, but that key sizes are much smaller for CSIDH.

Table 2: Benchmarking results for supersingularity verification using VeriFast for primes with cofactor log f > (log p)/2. Results of [24] added for comparison. Numbers are median clock cycles (in gigacycles) of 1024 runs on a Skylake CPU.

Table 4: Benchmarking results for multiplication and squaring for the Karatsuba approach. Numbers are median clock cycles of 100000 runs on a Skylake CPU.
for these cases. The complexity of Algorithm 4 is dominated by the multiplications by α in Line 4, compared to the standard Montgomery reduction.
Algorithm 3: Intermediate Montgomery reduction for p = 2^{e_2} · α − 1 with e_2 ≥ log_2(p)/2

Table 5: Benchmarking results for performing a group action for dCSIDH and CTIDH, excluding key validation. Results for the dummy-free and OAYT versions of [24] added for comparison. Numbers are median clock cycles (in gigacycles) of 1024 executions on a Skylake CPU.

Table 6: Public-key cryptography transmission sizes in bytes, and time in seconds until the client receives and sends the Finished messages, for OPTLS, TLS 1.3 and KEMTLS.

All instantiations use Falcon-512 for the certificate authority; the CA public key is not transmitted. The bytes necessary for authentication include 666 bytes for the Falcon-512 CA signature on the server's certificate.