From msuinfo!agate!howland.reston.ans.net!pipex!uunet!mnemosyne.cs.du.edu!nyx10!colin Thu Mar 10 10:22:37 1994
Newsgroups: sci.crypt,sci.math
Path: msuinfo!agate!howland.reston.ans.net!pipex!uunet!mnemosyne.cs.du.edu!nyx10!colin
From: colin@nyx10.cs.du.edu (Colin Plumb)
Subject: Computing inverses modulo n
Message-ID: <1994Mar8.082037.22303@mnemosyne.cs.du.edu>
Followup-To: sci.crypt
X-Disclaimer: Nyx is a public access Unix system run by the University of
        Denver for the Denver community.  The University has neither
        control over nor responsibility for the opinions of users.
Sender: usenet@mnemosyne.cs.du.edu (netnews admin account)
Organization: Nyx, Public Access Unix at U. of Denver Math/CS dept.
Date: Tue, 8 Mar 94 08:20:37 GMT
Lines: 426
Xref: msuinfo sci.crypt:24505 sci.math:66598

I was doing some thinking about Montgomery multiplication the other day
and was trying to understand the code in Arjen Lenstra's lip.c large
integer package.  I wasn't sure why the code was computing inverses the
way it was, since I had come up with simpler ways to achieve the same
end when computing inverses modulo 65537 for IDEA key inversion.

So I sat down and wrote up a full derivation of my techniques so I could
be sure I had it straight.  It was fundamentally for my own notes, but
other people might be interested, so here it is for whatever use people
care to make of it.

Recall the warning of Knuth: beware of bugs in the following code;
I have only proved it correct, not tried it.

Anyway, the following will hopefully be somewhat enlightening...

How to find multiplicative inverses.

This starts with the Extended Euclidean Algorithm.  The Euclidean
Algorithm is a technique for finding the greatest common divisor of two
numbers a and b.  It is famous for being the first algorithm to be
described as such, a general technique, rather than being presented as
a series of examples with the generalization being left to the reader.
The algorithm works with two lists of numbers, a[] and b[], with
a[0] = a and b[0] = b.  At each step, a[i+1] = b[i] and
b[i+1] = a[i] % b[i], the remainder upon division.  When you reach n
such that b[n] == 0, then a[n] is the GCD of a and b.  The numbers are
integers, and each step b[i+1] is less than b[i], so the algorithm is
guaranteed to terminate.  (In fact, it is guaranteed to take a number
of steps which is logarithmic in the sizes of the inputs, but that is a
much more complex proof.)

It is usually assumed that a[i] > b[i].  If a[0] < b[0], the first step
just swaps the two and the relationship is established for the rest of
the algorithm.

If two numbers a and b are relatively prime, their GCD is 1.  It is
then possible to find a multiplicative inverse B of b modulo a such
that b*B == 1 (mod a).  This can be done with the Extended Euclidean
Algorithm, which, in addition to the series of numbers a[] and b[],
keeps track of s[] and t[], so that at each step,
a[i] = s[i] * a[0] + t[i] * b[0].

First comes the problem of deriving s[] and t[] so that
a[i] = s[i] * a[0] + t[i] * b[0] holds.  Obviously, s[0] = 1 and
t[0] = 0.  Each step, a[i+1] = b[i] and b[i+1] = a[i] % b[i].  It
should be obvious that s[1] = 0 and t[1] = 1, and in general
b[i] = a[i+1] = s[i+1] * a[0] + t[i+1] * b[0].

Start by rewriting b[i+1] = a[i] % b[i].  Let q[i] = a[i] / b[i], so
b[i+1] = a[i] - q[i] * b[i].  Since, after the ith step,
a[i] = s[i] * a[0] + t[i] * b[0] and
b[i] = a[i+1] = s[i+1] * a[0] + t[i+1] * b[0], these can be substituted
into the above equation for b[i+1] to produce:

b[i+1] = s[i] * a[0] + t[i] * b[0] - q[i] * (s[i+1] * a[0] + t[i+1] * b[0])
       = (s[i] - q[i] * s[i+1]) * a[0] + (t[i] - q[i] * t[i+1]) * b[0]

Also, by definition, b[i+1] = a[i+2] = s[i+2] * a[0] + t[i+2] * b[0].
Equating the coefficients of a[0] and b[0] in these two equations
produces:

        s[i+2] = s[i] - q[i] * s[i+1]
        t[i+2] = t[i] - q[i] * t[i+1]

If a[0] and b[0] are relatively prime, you can continue this until
a[n] = 1, at which point you can take
a[n] = s[n] * a[0] + t[n] * b[0] modulo a[0] and conclude that
1 == t[n] * b[0] (mod a[0]).  In other words, t[n] is the inverse of
b[0], modulo a[0].

To compute this, there is no need to keep track of s[] at all, so here
is a first pass at a loop.  The only optimization performed (much more
later!) is that q[i] is used only briefly, so it is replaced by the
single variable q.

        t[0] = 0;
        t[1] = 1;
        i = 0;
        for (;;) {
                /* Loop invariant: a[i], b[i], and t[i+1] are valid */
                if (b[i] == 1)          /* Thus, a[i+1] == 1 */
                        return t[i+1];  /* The inverse */
                q = a[i] / b[i];
                b[i+1] = a[i] % b[i];
                a[i+1] = b[i];
                t[i+2] = t[i] - q * t[i+1];
                i++;
        }

Now to begin optimization.  Start by unrolling the loop twice:

        t[0] = 0;
        t[1] = 1;
        i = 0;
        for (;;) {
                /* New loop invariant: i is even */
                /* Loop invariant: a[i], b[i], and t[i+1] are valid */
                if (b[i] == 1)          /* Thus, a[i+1] == 1 */
                        return t[i+1];  /* The inverse */
                q = a[i] / b[i];
                b[i+1] = a[i] % b[i];
                a[i+1] = b[i];
                t[i+2] = t[i] - q * t[i+1];
                /* Loop invariant: a[i+1], b[i+1], and t[i+2] are valid */
                if (b[i+1] == 1)        /* Thus, a[i+2] == 1 */
                        return t[i+2];  /* The inverse */
                q = a[i+1] / b[i+1];
                b[i+2] = a[i+1] % b[i+1];
                a[i+2] = b[i+1];
                t[i+3] = t[i+1] - q * t[i+2];
                i += 2;
        }

Then the arrays can be replaced by variables.  Call them t_even, t_odd,
a_even_b_odd and b_even_a_odd.  The latter two are initialized to a[0]
and b[0].
        t_even = 0;
        t_odd = 1;
        i = 0;
        for (;;) {
                /* New loop invariant: i is even */
                /* Loop invariant: a[i], b[i], and t[i+1] are valid */
                /*
                 * Loop invariants:
                 *      a_even_b_odd = a[i]
                 *      b_even_a_odd = b[i]
                 *      t_even = t[i]
                 *      t_odd = t[i+1]
                 */
                if (b_even_a_odd == 1)  /* Thus, a[i+1] == 1 */
                        return t_odd;   /* The inverse */
                q = a_even_b_odd / b_even_a_odd;
                a_even_b_odd = a_even_b_odd % b_even_a_odd;
                b_even_a_odd = b_even_a_odd;
                t_even = t_even - q * t_odd;
                /* Loop invariant: a[i+1], b[i+1], and t[i+2] are valid */
                /*
                 * Loop invariants:
                 *      a_even_b_odd = b[i+1]
                 *      b_even_a_odd = a[i+1]
                 *      t_even = t[i+2]
                 *      t_odd = t[i+1]
                 */
                if (a_even_b_odd == 1)  /* Thus, a[i+2] == 1 */
                        return t_even;  /* The inverse */
                q = b_even_a_odd / a_even_b_odd;
                b_even_a_odd = b_even_a_odd % a_even_b_odd;
                a_even_b_odd = a_even_b_odd;
                t_odd = t_odd - q * t_even;
                i += 2;
        }

The next step is to replace the cumbersome variables a_even_b_odd and
b_even_a_odd with a and b, respectively, remove the redundant b = b and
a = a copies, and use two-operand operations where possible:

        t_even = 0;
        t_odd = 1;
        i = 0;
        for (;;) {
                /* New loop invariant: i is even */
                /* Loop invariant: a[i], b[i], and t[i+1] are valid */
                /*
                 * Loop invariants:
                 *      a = a[i]
                 *      b = b[i]
                 *      t_even = t[i]
                 *      t_odd = t[i+1]
                 */
                if (b == 1)             /* Thus, a[i+1] == 1 */
                        return t_odd;   /* The inverse */
                q = a / b;
                a %= b;
                t_even -= q * t_odd;
                /* Loop invariant: a[i+1], b[i+1], and t[i+2] are valid */
                /*
                 * Loop invariants:
                 *      a = b[i+1]
                 *      b = a[i+1]
                 *      t_even = t[i+2]
                 *      t_odd = t[i+1]
                 */
                if (a == 1)             /* Thus, a[i+2] == 1 */
                        return t_even;  /* The inverse */
                q = b / a;
                b %= a;
                t_odd -= q * t_even;
                i += 2;
        }

Then, replace t_even and t_odd with t0 = -t_even and t1 = t_odd.  Thus,
the lines that modify t_even and t_odd become:

        t_even = t_even - q * t_odd     t_odd = t_odd - q * t_even
  =>    -t0 = -t0 - q * t1              t1 = t1 - q * -t0
  =>    t0 = t0 + q * t1                t1 = t1 + q * t0
  =>    t0 += q * t1                    t1 += q * t0

This avoids the need to deal with negative numbers.
The initial value of t0 has to be the negative of the initial value of
t_even, but that was 0, so there is no problem there.  After these
adjustments, the code looks like this:

        t0 = 0;
        t1 = 1;
        i = 0;
        for (;;) {
                /* New loop invariant: i is even */
                /* Loop invariant: a[i], b[i], and t[i+1] are valid */
                /*
                 * Loop invariants:
                 *      a = a[i]
                 *      b = b[i]
                 *      t0 = -t[i]
                 *      t1 = t[i+1]
                 */
                if (b == 1)             /* Thus, a[i+1] == 1 */
                        return t1;      /* The inverse */
                q = a / b;
                a %= b;
                t0 += q * t1;
                /* Loop invariant: a[i+1], b[i+1], and t[i+2] are valid */
                /*
                 * Loop invariants:
                 *      a = b[i+1]
                 *      b = a[i+1]
                 *      t0 = -t[i+2]
                 *      t1 = t[i+1]
                 */
                if (a == 1)             /* Thus, a[i+2] == 1 */
                        return -t0;     /* The inverse */
                q = b / a;
                b %= a;
                t1 += q * t0;
                i += 2;
        }

To further reduce the need to fiddle with negative numbers, -t0 can be
replaced with a[0]-t0.  It turns out that this does not introduce
negative numbers, because t0 and t1 are bounded between 0 and a[0].
The following is missing a crucial step to become a proof, but should
be a sufficiently convincing argument.

The termination test is a[n] = 1.  Now a[n] = b[n-1], and
b[n] = a[n-1] % b[n-1]; since b[n-1] = a[n] = 1, this gives
b[n] = a[n-1] % 1 = 0.  As mentioned at the beginning, the termination
test is usually b[n] = 0, with a[n] being the GCD of a[0] and b[0], but
this application assumes we know that GCD(a[0],b[0]) = 1, so we can
test for a[n] = b[n-1] = 1 and avoid the last division.  Thus,
a[n+1] = b[n] = 0.

Well, a[n+1] = s[n+1]*a[0] + t[n+1]*b[0].  If a[0] and b[0] are
relatively prime, and s[n+1] and t[n+1] are not both zero, then s[n+1]
must be a multiple of b[0] and t[n+1] must be a multiple of a[0].
Thus, this reduces to 0 = k*b[0]*a[0] - k*a[0]*b[0], where
s[n+1] = k * b[0] and t[n+1] = -k * a[0], for some non-zero integer k.
It turns out that the Euclidean algorithm is reasonably efficient, and
k is no larger than it has to be, so it is +/-1.  This is the unproved
point in this argument, but it is an equality, so it is easy to see
experimentally and should not be too hard to prove.
Now |t[i]|, the absolute value of t[i], is nondecreasing; in other
words |t[i]| <= |t[i+1]|.  It should not be hard to see this by looking
at how t0 and t1 are computed in the above code.  So
|t[n]| <= |t[n+1]| = a[0], and since the inverse of b[0] mod a[0] is
not a[0] (which would be zero), |t[n]| is not equal to a[0], so it must
be strictly less than a[0].  Thus:

        0 = t[0] < 1 = t[1] <= |t[2]| <= |t[3]| <= ... <= |t[n]| < |t[n+1]| = a[0].

Since the modified algorithm deals with |t[i]| everywhere, it is now
known to suffice if the t0 and t1 variables can hold numbers up to
a[0]-1 in size.  This is very useful, for example, in finding the
coefficients needed for Montgomery multiplication, where a[0] is
usually one more than the largest number that can be conveniently dealt
with, i.e. the word size of the computer (usually 2^32).  That requires
some special-case work for the first division, but after that,
a[i] < a[0], so word-sized variables will suffice.

Given that t[0] and t[1] are such simple numbers, it makes sense to
special-case the first iteration of the loop.
t[2] = t[0] - q[0] * t[1], which reduces to t[2] = 0 - q[0] * 1 =
-q[0].  In the code we have, t0 += q * t1 reduces to t0 = q the first
time through the loop.  Also taking out the unnecessary i variable, the
loop-invariant comments, and keeping the original a around for negating
t0 on return, we have the final version of the code:

datatype
mult_inverse(datatype a, datatype b)
{
        datatype t0, t1, c, q;

        t1 = 1;
        if (b == 1)
                return t1;
        t0 = a / b;
        c = a % b;
        while (c != 1) {
                q = b / c;
                b %= c;
                t1 += q * t0;
                if (b == 1)
                        return t1;
                q = c / b;
                c %= b;
                t0 += q * t1;
        }
        return a - t0;
}

You can usually compute x / y and x % y in the same operation, and a
smart enough compiler will do it for you.  The nice thing about the
above code is that every variable is strictly greater than zero and
less than a.  If you are using a multi-precision math package, there is
no need to test signs, and you know the necessary ranges beforehand.
If a is the word size of the computer, then normal words will suffice
for all the variables.  Suppose a is one larger than UINT_MAX, and the
computer has no larger data type.  (Or even if it does, but division
with it is inefficient.)  Finding a / b and a % b is a bit tricky.
Here is how to do it.

The b == 1 case is tested for and discarded before the division.  Since
b is relatively prime to a, this means that a % b cannot be zero.
Thus, dividing (a-1) by b produces a remainder one smaller than a % b;
it cannot wrap around.  Also, the quotient of that division is the
desired quotient a / b.  So compute

        q = (a-1) / b;
        c = (a-1) % b + 1;

and you will be all set.  Hardcoding a to UINT_MAX+1, a version of the
code to find the inverse of b is:

unsigned
mult_inverse(unsigned b)
{
        unsigned t0, t1, c, q;

        t1 = 1;
        if (b == 1)
                return t1;
        t0 = UINT_MAX / b;
        c = UINT_MAX % b + 1;
        while (c != 1) {
                q = b / c;
                b %= c;
                t1 += q * t0;
                if (b == 1)
                        return t1;
                q = c / b;
                c %= b;
                t0 += q * t1;
        }
        return -t0;
}
-- 
-Colin