| Wire Sysio Wire Sysion 1.0.0
    | 
This document explains the modular inverse implementation in the src/modinv*.h files. It is based on the paper "Fast constant-time gcd computation and modular inversion" by Daniel J. Bernstein and Bo-Yin Yang. The references below are for the Date: 2019.04.13 version.
The actual implementation is in C of course, but for demonstration purposes Python3 is used here. Most implementation aspects and optimizations are explained, except those that depend on the specific number representation used in the C code.
The algorithm from the paper (section 11), at a very high level, is this:
It computes the greatest common divisor of an odd integer f and any integer g. Its inner loop keeps rewriting the variables f and g alongside a state variable δ that starts at 1, until g=0 is reached. At that point, *|f|* gives the GCD. Each of the transitions in the loop is called a "division step" (referred to as divstep in what follows).
For example, gcd(21, 14) would be computed as:
Why it works:
Compared to more traditional GCD algorithms, this one has the property of only ever looking at the low-order bits of the variables to decide the next steps, and being easy to make constant-time (in more low-level languages than Python). The δ parameter is necessary to guide the algorithm towards shrinking the numbers' magnitudes without explicitly needing to look at high order bits.
Properties that will become important later:
We want an algorithm to compute the inverse a of x modulo M, i.e. the number a such that a x=1 mod M. This inverse only exists if the GCD of x and M is 1, but that is always the case if M is prime and 0 < x < M. In what follows, assume that the modular inverse exists. It turns out this inverse can be computed as a side effect of computing the GCD by keeping track of how the internal variables can be written as linear combinations of the inputs at every step (see the extended Euclidean algorithm). Since the GCD is 1, such an algorithm will compute numbers a and b such that a x + b M = 1*. Taking that expression mod M gives a x mod M = 1, and we see that a is the modular inverse of x mod M.
A similar approach can be used to calculate modular inverses using the divsteps-based GCD algorithm shown above, if the modulus M is odd. To do so, compute gcd(f=M,g=x), while keeping track of extra variables d and e, for which at every step d = f/x (mod M) and e = g/x (mod M). f/x here means the number which multiplied with x gives f mod M. As f and g are initialized to M and x respectively, d and e just start off being 0 (M/x mod M = 0/x mod M = 0) and 1 (x/x mod M = 1).
Also note that this approach to track d and e throughout the computation to determine the inverse is different from the paper. There (see paragraph 12.1 in the paper) a transition matrix for the entire computation is determined (see section 3 below) and the inverse is computed from that. The approach here avoids the need for 2x2 matrix multiplications of various sizes, and appears to be faster at the level of optimization we're able to do in C.
Every divstep can be expressed as a matrix multiplication, applying a transition matrix (1/2 t) to both vectors [f, g] and [d, e] (see paragraph 8.1 in the paper):
where (u, v, q, r) is (0, 2, -1, 1), (2, 0, 1, 1), or (2, 0, 0, 1), depending on which branch is taken. As above, the resulting f and g are always integers.
Performing multiple divsteps corresponds to a multiplication with the product of all the individual divsteps' transition matrices. As each transition matrix consists of integers divided by 2, the product of these matrices will consist of integers divided by *2N* (see also theorem 9.2 in the paper). These divisions are expensive when updating d and e, so we delay them: we compute the integer coefficients of the combined transition matrix scaled by *2N*, and do one division by *2N* as a final step:
As the branches in the divsteps are completely determined by the bottom N bits of f and g, this function to compute the transition matrix only needs to see those bottom bits. Furthermore all intermediate results and outputs fit in (N+1)-bit numbers (unsigned for f and g; signed for u, v, q, and r) (see also paragraph 8.3 in the paper). This means that an implementation using 64-bit integers could set N=62 and compute the full transition matrix for 62 steps at once without any big integer arithmetic at all. This is the reason why this algorithm is efficient: it only needs to update the full-size f, g, d, and e numbers once every N steps.
We still need functions to compute:
Because the divsteps transformation only ever divides even numbers by two, the result of t [f,g] is always even. When t is a composition of N divsteps, it follows that the resulting f and g will be multiple of *2N*, and division by *2N* is simply shifting them down:
The same is not true for d and e, and we need an equivalent of the div2 function for division by *2N mod M*. This is easy if we have precomputed *1/M mod 2N* (which always exists for odd M):
With all of those, we can write a version of modinv that performs N divsteps at once:
This means that in practice we'll always perform a multiple of N divsteps. This is not a problem because once g=0, further divsteps do not affect f, g, d, or e anymore (only δ keeps increasing). For variable time code such excess iterations will be mostly optimized away in later sections.
So far, there are two places where we compute a remainder of big numbers modulo M: at the end of div2n in every update_de, and at the very end of modinv after potentially negating d due to the sign of f. These are relatively expensive operations when done generically.
To deal with the modulus operation in div2n, we simply stop requiring d and e to be in range [0,M) all the time. Let's start by inlining div2n into update_de, and dropping the modulus operation at the end:
Let's look at bounds on the ranges of these numbers. It can be shown that *|u|+|v|* and *|q|+|r|* never exceed *2N* (see paragraph 8.3 in the paper), and thus a multiplication with t will have outputs whose absolute values are at most *2N* times the maximum absolute input value. In case the inputs d and e are in (-M,M), which is certainly true for the initial values d=0 and e=1 assuming M > 1, the multiplication results in numbers in range *(-2NM,2NM)*. Subtracting less than *2N* times M to cancel out N bits brings that up to *(-2N+1M,2NM)*, and dividing by *2N* at the end takes it to (-2M,M). Another application of update_de would take that to (-3M,2M), and so forth. This progressive expansion of the variables' ranges can be counteracted by incrementing d and e by M whenever they're negative:
With inputs in (-2M,M), they will first be shifted into range (-M,M), which means that the output will again be in (-2M,M), and this remains the case regardless of how many update_de invocations there are. In what follows, we will try to make this more efficient.
Note that increasing d by M is equal to incrementing cd by u M and ce by q M. Similarly, increasing e by M is equal to incrementing cd by v M and ce by r M. So we could instead write:
Now note that we have two steps of corrections to cd and ce that add multiples of M: this increment, and the decrement that cancels out bottom bits. The second one depends on the first one, but they can still be efficiently combined by only computing the bottom bits of cd and ce at first, and using that to compute the final md, me values:
One last optimization: we can avoid the md M and me M multiplications in the bottom bits of cd and ce by moving them to the md and me correction:
The resulting function takes d and e in range (-2M,M) as inputs, and outputs values in the same range. That also means that the d value at the end of modinv will be in that range, while we want a result in [0,M). To do that, we need a normalization function. It's easy to integrate the conditional negation of d (based on the sign of f) into it as well:
And calling it in modinv is simply:
The primary selling point of the algorithm is fast constant-time operation. What code flow still depends on the input data so far?
modinvdivsteps_n_matrixupdate_denormalizeTo make the while loop in modinv constant time it can be replaced with a constant number of iterations. The paper proves (Theorem 11.2) that 741 divsteps are sufficient for any 256-bit inputs, and safegcd-bounds shows that the slightly better bound 724 is sufficient even. Given that every loop iteration performs N divsteps, it will run a total of ⌈724/N⌉ times.
To deal with the branches in divsteps_n_matrix we will replace them with constant-time bitwise operations (and hope the C compiler isn't smart enough to turn them back into branches; see valgrind_ctime_test.c for automated tests that this isn't the case). To do so, observe that a divstep can be written instead as (compare to the inner loop of gcd in section 1).
To convert the above to bitwise operations, we rely on a trick to negate conditionally: per the definition of negative numbers in two's complement, (-v == ~v + 1) holds for every number v. As -1 in two's complement is all 1 bits, bitflipping can be expressed as xor with -1. It follows that -v == (v ^ -1) - (-1). Thus, if we have a variable c that takes on values 0 or -1, then (v ^ c) - c is v if c=0 and -v if c=-1.
Using this we can write:
in constant-time form as:
To use that trick, we need a helper mask variable c1 that resolves the condition δ>0 to -1 (if true) or 0 (if false). We compute c1 using right shifting, which is equivalent to dividing by the specified power of 2 and rounding down (in Python, and also in C under the assumption of a typical two's complement system; see assumptions.h for tests that this is the case). Right shifting by 63 thus maps all numbers in range *[-263,0)* to -1, and numbers in range *[0,263)* to 0.
Using the facts that x&0=0 and x&(-1)=x (on two's complement systems again), we can write:
as:
Using the conditional negation trick again we can write:
as:
Finally:
becomes:
It turns out that this can be implemented more efficiently by applying the substitution η=-δ. In this representation, negating δ corresponds to negating η, and incrementing δ corresponds to decrementing η. This allows us to remove the negation in the c1 computation:
A variant of divsteps with better worst-case performance can be used instead: starting δ at 1/2 instead of 1. This reduces the worst case number of iterations to 590 for 256-bit inputs (which can be shown using convex hull analysis). In this case, the substitution ζ=-(δ+1/2) is used instead to keep the variable integral. Incrementing δ by 1 still translates to decrementing ζ by 1, but negating δ now corresponds to going from ζ to -(ζ+1), or *~ζ. Doing that conditionally based on *c3 is simply:
By replacing the loop in divsteps_n_matrix with a variant of the divstep code above (extended to also apply all f operations to u, v and all g operations to q, r), a constant-time version of divsteps_n_matrix is obtained. The full code will be in section 7.
These bit fiddling tricks can also be used to make the conditional negations and additions in update_de and normalize constant-time.
In section 5, we modified the divsteps_n_matrix function (and a few others) to be constant time. Constant time operations are only necessary when computing modular inverses of secret data. In other cases, it slows down calculations unnecessarily. In this section, we will construct a faster non-constant time divsteps_n_matrix function.
To do so, first consider yet another way of writing the inner loop of divstep operations in gcd from section 1. This decomposition is also explained in the paper in section 8.2. We use the original version with initial δ=1 and η=-δ here.
Whenever g is even, the loop only shifts g down and decreases η. When g ends in multiple zero bits, these iterations can be consolidated into one step. This requires counting the bottom zero bits efficiently, which is possible on most platforms; it is abstracted here as the function count_trailing_zeros.
We can now remove multiple bottom 0 bits from g at once, but still need a full iteration whenever there is a bottom 1 bit. In what follows, we will get rid of multiple 1 bits simultaneously as well.
Observe that as long as η ≥ 0, the loop does not modify f. Instead, it cancels out bottom bits of g and shifts them out, and decreases η and i accordingly - interrupting only when η becomes negative, or when i reaches 0. Combined, this is equivalent to adding a multiple of f to g to cancel out multiple bottom bits, and then shifting them out.
It is easy to find what that multiple is: we want a number w such that g+w f has a few bottom zero bits. If that number of bits is L, we want *g+w f mod 2L = 0*, or *w = -g/f mod 2L. Since *f is odd, such a w exists for any L. L cannot be more than i steps (as we'd finish the loop before doing more) or more than η+1 steps (as we'd run eta, f, g = -eta, g, -f at that point), but apart from that, we're only limited by the complexity of computing w.
This code demonstrates how to cancel up to 4 bits per step:
By using a bigger table more bits can be cancelled at once. The table can also be implemented as a formula. Several formulas are known for computing modular inverses modulo powers of two; some can be found in Hacker's Delight second edition by Henry S. Warren, Jr. pages 245-247. Here we need the negated modular inverse, which is a simple transformation of those:
This loop, again extended to also handle u, v, q, and r alongside f and g, placed in divsteps_n_matrix, gives a significantly faster, but non-constant time version.
All together we need the following functions:
divsteps_n_matrix function from section 2, but with its loop replaced by a variant of the constant-time divstep from section 5, extended to handle u, v, q, r:update_de from section 5:normalize function from section 4, made constant time as well:modinv function too, adapted to use ζ instead of δ, and using the fixed iteration count from section 5:divsteps_n_matrix function with one that uses the divsteps loop from section 5, and a modinv version that calls it without the fixed iteration count: