More On Squaring and Multiplying Large Integers


Dan Zuras


Hewlett-Packard Co.


ABSTRACT


Methods of squaring and multiplying large integers are discussed. The obvious O(n2) method turns out to be best for small numbers. Existing O(nlog(3) / log(2)) = O(n1.585) methods become better as the numbers get bigger. New methods that are O(nlog(5) / log(3)) = O(n1.465), O(nlog(7) / log(4)) = O(n1.404), and O(nlog(9) / log(5)) = O(n1.365) are presented. In actual experiments, all of these methods turn out to be faster than FFT multiplies for numbers that can be quite large (> 37,000,000 bits). Squaring seems to be fundamentally faster than multiplying but it is shown that Tmultiply < 2Tsquare + O(n).
Contents Tables & Figures
Introduction Table 1: Common Operations
The Problem Figure 1: The 2-way Method, O(w1.585)
Finding the Best Figure 2: T2(w) / T1(w)
Method 1: The w2 Basis Figure 3: The 3-way Method, O(w1.465)
The 2-Way Method Figure 4: The 4-way Method, O(w1.404)
The General Idea Figure 5: The 5-way Method, O(w1.365)
The 3-Way Method Figure 6: The FFT-2 Method
Running Time of Toom-Cook Style Methods Figure 7: The Running Time of Squaring Methods
The 4-Way Method and Beyond Figure 8: The Running Time of Multiplication Methods
The FFT Multiply Figure 9: Ti(w) / T2(w)
Running Time of FFT-k Methods Figure 10: The Exponent of the Dominant Term
Some Actual Results Figure 11: The First Overhead Term
Is Squaring Faster than Multiplying? Figure 12: Multiplication Compared to Squaring
Conclusions, Speculations, Etc.  
References  

Introduction

This is an expansion of the paper On Squaring and Multiplying Large Integers presented at the IEEE Symposium on Computer Arithmetic (ARITH-11) in July 1993. This paper presents more of the ideas behind these methods; some new results; and some speculations for areas of future work.

Much of the work presented here is based on methods that appear in Don Knuth's classic work The Art of Computer Programming as well as the work of A. L. Toom.

In actual experiments, some of the simpler methods of squaring and multiplying are shown to be best in a surprisingly large number of cases. Some new methods are also discussed which are useful out to quite large numbers. In light of these results it seems that methods such as the Schönhage and Strassen FFT multiply, while of theoretical interest, may never be best for any reasonably sized numbers.

The Problem

The problem to be discussed here is how to find the best (fastest) ways to square and multiply large numbers in software. The methods presented here were implemented in C and assembly language on an HP-9000/720 but the general observations and conclusions should be true in wider areas of application.

The approach used was to write a collection of routines for squaring and multiplying w-word numbers producing 2w-word results for various specific values of w. These were general purpose routines in the sense that w was a parameter.

In the course of writing these routines, it became clear that some common operations on large integers stored as arrays of unsigned 32-bit words would be needed. These operations were written in assembly language and performed the functions detailed in Table 1.

Table 1: These arrays were oriented "big-endian". That is, the most significant word of a[0,w-1] was stored in a[0] and the least significant word was stored in a[w-1]. While the author preferred "little-endian", both C and the HP-9000 PA-RISC architecture disagreed. They won, in the end. I will not, however, inflict the big-endian notation on the reader. In this paper, I will use the somewhat more conventional notation of mathematics. That is, little-endian. So there.
Given operands a[0,w-1], b[0,w-1], producing result r[0,w-1]
vzconst(r,w,c) r[0,w-1] = ccc...c (w c's)
vzcopy(r,w,a) r[0,w-1] = a[0,w-1]
vzneg(r,w,a) r[0,w-1] = -a[0,w-1]
vzshlM(r,w,a,c) r[0,w-1] = a[0,w-1]|c << M for M in [1,2,4,8,16,31]
vzshr(r,w,a,M) r[0,w-1] = a[0,w-1] >> M for M in [0:31]
vzadd(r,w,a,b) r[0,w-1] = a[0,w-1] + b[0,w-1]
vzsub(r,w,a,b) r[0,w-1] = a[0,w-1] - b[0,w-1]
Cout = vzaddwco(r,w,a,b) r[0,w-1] + Cout = a[0,w-1] + b[0,w-1]
vzshMadd(r,w,a,b,M) r[0,w-1] = a[0,w-1] << M + b[0,w-1]
vzshMsub(r,w,a,b,M) r[0,w-1] = a[0,w-1] << M - b[0,w-1]
t = vzge(b,w,a) t = (b[0,w-1] >= a[0,w-1])

A collection of routines for squaring and multiplying "small" integers were also written in assembly. These provided a basis upon which the larger routines could be built.

All other routines were written in C and compiler optimized.

Running time was measured for w in the range 1 to 1,175,553 words (37,617,696 bits) by counting instruction cycles on a 50MHz HP-9000/720.

Finding the Best

We will define TSi(w) as the time required for method i to square a number of length w and TMi(w) as the time required for method i to multiply a number of length w. (We will use the expression Ti(w) when discussing either method or method i in general.)

Method i is considered best for some length w if

Ti(w) <= Tj(w') for all j and for all w' >= w.

That is, a method is best for a given length if there is no faster way of squaring or multiplying numbers at least as large as this one.

Method 1: The w2 Basis

To start the ball rolling, basis routines for operating on small integers were written using the obvious method.

Roughly speaking, if the algorithm for a multiply is

FOR i = w-1 to 0 :
FOR j = w-1 to 0 :
(r[i+j,i+j+1] and CarryOut) += a[i] × b[i],

then squaring can be made strictly faster than multiply by accumulating the upper half off-diagonal elements separately and combining them with the diagonals according to the formula

Result = 2 × Offdiagonals + Diagonals.

Obviously, the running time, T1(w), increases quadratically. T1(w), in cycles, is well approximated on the HP-9000/720 by the least squares formulas

TS1(w) = 4.53w2 + 10w + 30.71 and
TM1(w) = 9.82w2 - 9.06w + 90.26.

These routines turned out to be best for all w up to 33 for square (23 for multiply).

The 2-Way Method

It seems that the first discussion of a method of multiplication (squaring, actually) that was better than w2 in the width of the operand was a short paper written by Karatsuba and Ofman that appeared in the Soviet journal Doklady in 1963. The method involves splitting the operand into 2 parts: the high order bits and the low order bits, based on the polynomial

(A1x + A0)2 = A12(x2 - x) + (A1 + A0)2x + A02(1 - x)

where x = 2n.

Knuth presented a minor improvement on this based on the slightly different polynomial

(A1x + A0)2 = A12(x2 + x) - (A1 - A0)2x + A02(x + 1).

Both of these may be regarded as special cases of solving for the Ci in the polynomial

(A1x + A0)2 = C2x2 + C1x + C0.

Karatsuba and Ofman solve this equation by letting x take on the values {Infinity, 1, 0} where x = Infinity corresponds to

lim
x --> Infinity
(A1x + A0)2/x2 =
lim
x --> Infinity
(C2x2 + C1x + C0)/x2

in this context.

Rather than express this system in the usual polynomial form, I will express it as a linear transformation of the operand in the following way







V2
V1
V0




=






A1
A1 + A0
A0




=






1 0
1 1
0 1










A1
A0




(Please forgive the fairly primitive look of these matrix expressions. It is the best I have been able to do under the limitations of html.)

After squaring each element of the vector, the solution to the resulting system







1 0 0
1 1 1
0 0 1










C2
C1
C0




=






S2
S1
S0




=






A12
(A1 + A0)2
A02




may be expressed as







C2
C1
C0




=






1 0 0
-1 1 -1
0 0 1










S2
S1
S0




This method has a small problem in that the term A1 + A0 requires one more bit than the other two. Karatsuba and Ofman address this problem by showing how you can square an n+1-bit number using an n-bit algorithm together with a couple of shifts and adds.

Knuth solves the equation by letting x take on the values {Infinity, -1, 0}. A similar transformation







V2
V1
V0




=






A1
A1 - A0
A0




=






1 0
1 -1
0 1










A1
A0




results, after squaring, in a similar system







1 0 0
1 -1 1
0 0 1










C2
C1
C0




=






S2
S1
S0




=






A12
(A1 - A0)2
A02




which has the solution







C2
C1
C0




=






1 0 0
1 -1 1
0 0 1










S2
S1
S0




Knuth presented this method as Karatsuba and Ofman's but Knuth's variation leads to a real improvement. By rearranging the algebra a bit, he was able to replace the A1 + A0 term with A1 - A0. Since the only use of the term is to square it, its sign is unimportant. Therefore, one can always subtract the lesser from the greater and save that extra bit.

The test and branch involved results in much less overhead than Karatsuba and Ofman's method.

This method can be turned into a multiply method by replacing the Si with the Pi in the formula







1 0 0
1 -1 1
0 0 1










C2
C1
C0




=






P2
P1
P0




=






A1B1
(A1 - A0)(B1 - B0)
A0B0




and doing the implied extra work.

These methods are diagrammed in Figure 1.

Figure 1: The 2-way Method, O(w1.585)
  Square   Multiply
Initial Transform: V1 = |A1 - A0|   V1 = |A1 - A0|
W1 = |B1 - B0|
Squaring/Multiplication: C2 = A12
S1 = V12
C0 = A02
  C2 = A1B1
P1 = V1W1
C0 = A0B0
Final Transform: t1 = C2 + C0
C1 = t1 - S1
  t1 = C2 + C0
C1 = t1 +/- P1

Assuming that the time for an add (or subtract) of length w is O(w) , the time to square a number (or multiply two numbers) of length 2w would be

T2(2w) = 3Tbest(w) + O(w).

The overhead involved, one add before the squaring step (two adds before the multiplying step) and two after, is strictly larger than the overhead for a w2 method. Thus, although a w2 squaring (or multiply) of length 2w would take

T1(2w) = 4Tbest(w) + O(w),

there is a definite crossover point as w increases where T1(w) is roughly the same as T2(w). This point occurs where the time involved in the extra overhead of Knuth's method (which I will call the 2-way method from this point on) is roughly equivalent to the time for the 4th multiply in the w2 method.

Below this point the w2 method wins because of its simplicity and above this point the 2-way method wins because of its better asymptotic behavior.

Figure 2 illustrates this by plotting the ratio T2(w) / T1(w) for some 2-way routines. (T1(w) is actually the least squares approximation.)

Ratio of T2(w), the running time of Knuth's 2-way multiply to T1(w), the running time of the w2 basis.

In the limit, the running time of the 2-way method triples each time the size of the problem doubles. Therefore, its time increases as wlog(3) / log(2) = w1.585.

The running time of the 2-way routines (in cycles) is approximated by the least squares formulas

TS2(w) = 26.64wlog(3) / log(2) - 54.63w + 227 and
TM2(w) = 46.2wlog(3) / log(2) - 84.11w + 979.

Building on a basis set of w2 routines up to 33 words for square (23 words for multiply), the 2-way method takes over at 34 words (24 words) and generally dominates in these experiments up to 768 words (384 words).

Mark Shand has pointed out that this crossover point is a strong function of the ratio of the multiply time to add time on a given machine. The HP-9000/720 performs an unsigned 32 × 32 --> 64 multiply in 2 cycles (5 to 7 cycles if you include load/store overhead). On a machine with a relatively slower multiply the crossover point can be much lower.

The General Idea

We now have enough background to show the general idea behind these squaring (and multiplying) methods.

In each squaring case, the operand is split into k parts Aj. These parts are transformed via a Qij (derived from the original polynomial) into m parts Vi = QijAj. Then, each of the Vi is squared with the best method available, producing Si = Vi2.

In each multiply case, the operands are split into k parts Aj and Bj. These parts are transformed via Qij into m parts Vi = QijAj and Wi = QijBj. Then, each of the Vi is multiplied by Wi with the best method available, producing Pi = ViWi.

Now, the solution to the system MijCj = Si ( MijCj = Pi) (derived from the square of the original polynomial) can be expressed as Ci = Dik-1RkjSj ( Ci = Dik-1RkjPj) with all the coefficients in the integers. (Up to now, Dik-1, which expresses the integer divides, has not been needed as it was the identity matrix.) Finally, the Ci are shifted and added up to produce the result.

The 3-Way Method

Toom showed that circuits could be constructed for squaring integers where the size of the circuit was bounded by O(ncSqrt(log(n))) and the delay was bounded by O(cSqrt(log(n))).

Cook showed that an algorithm for squaring integers could be realized which had a running time of O(n25Sqrt(log(n))).

This algorithm, which is actually a collection of algorithms, has come to be known as the Toom-Cook method.

The first method in this collection is equivalent to Karatsuba's method.

The second method divides a number into three parts and involves solving for the Ci in the polynomial

(A2x2 + A1x + A0)2 = C4x4 + C3x3 + C2x2 + C1x + C0.

Cook solves this equation by letting x take on the values {4, 3, 2, 1, 0} in the polynomial, leading to the transformation







V4
V3
V2
V1
V0




=






16 4 1
9 3 1
4 2 1
1 1 1
0 0 1










A2
A1
A0




which, after squaring, yields the system







256 64 16 4 1
81 27 9 3 1
16 8 4 2 1
1 1 1 1 1
0 0 0 0 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






(16A2 + 4A1 + A0)2
(9A2 + 3A1 + A0)2
(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
A02




the solution of which I will express as







C4
C3
C2
C1
C0




=





-1

24 0 0 0 0
0 12 0 0 0
0 0 24 0 0
0 0 0 12 0
0 0 0 0 1










1 -4 6 -4 1
-3 14 -24 18 -5
11 -56 114 -104 35
-3 16 -36 48 -25
0 0 0 0 1










S4
S3
S2
S1
S0




The diagonal terms in the first matrix are the denominators in divides of intermediate results. In practice, there turn out to be four divides by 3 since the rest amounts to shifts of 2 or 3 bits.

Knuth solves this with x taken from {2, 1, 0, -1, -2} leading to the transformation







V4
V3
V2
V1
V0




=






4 2 1
1 1 1
0 0 1
1 -1 1
4 -2 1










A2
A1
A0




and the system







16 8 4 2 1
1 1 1 1 1
0 0 0 0 1
1 -1 1 -1 1
16 -8 4 -2 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
A02
(A2 - A1 + A0)2
(4A2 - 2A1 + A0)2




which has the somewhat more symmetrical solution







C4
C3
C2
C1
C0




=





-1

24 0 0 0 0
0 12 0 0 0
0 0 24 0 0
0 0 0 12 0
0 0 0 0 1










1 -4 6 -4 1
1 -2 0 2 -1
-1 16 -30 16 -1
-1 8 0 -8 1
0 0 1 0 0










S4
S3
S2
S1
S0




Although both of these solutions look similar, Knuth's is somewhat better than the previous one. The sign symmetry simplifies the system and greatly reduces the overhead (in the form of add, subtract, shift, and the like).

If, however, one uses reciprocal symmetry as well as sign symmetry in the choices of x, as in the set {Infinity, 2, 1, 1/2, 0}, then the greater symmetry in the system results in even greater simplification and greater reduction in overhead. (In general, x = p/q corresponds to q2n(An(p/q)n + ... + A0)2 = q2n(C2n(p/q)2n + ... + C0).)

This system







V4
V3
V2
V1
V0




=






1 0 0
4 2 1
1 1 1
1 2 4
0 0 1










A2
A1
A0




and







1 0 0 0 0
16 8 4 2 1
1 1 1 1 1
1 2 4 8 16
0 0 0 0 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






A22
(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
(A2 + 2A1 + 4A0)2
A02




has the solution







C4
C3
C2
C1
C0




=





-1

1 0 0 0 0
0 6 0 0 0
0 0 2 0 0
0 0 0 6 0
0 0 0 0 1










1 0 0 0 0
-21 2 -12 1 -6
7 -1 10 -1 7
-6 1 -12 2 -21
0 0 0 0 1










S4
S3
S2
S1
S0




(Oddly enough, it appears that reciprocal symmetry is more important than sign symmetry in the sense that, if one must break one or the other, the system seems simpler if sign symmetry is broken rather than reciprocal symmetry.)

This solution has only 2 divides by 3 and 3 shifts as well as much less overhead in the computation of the intermediate results.

The computation of S2 requires 2 extra bits and S1 and S3 each require 3.

The squaring method implied by this solution is shown in Figure 3.

Figure 3: The 3-way Method, O(w1.465)
                   
Initial Transform:     t02 = 2A2 + A1
V3 = 2t02 + A0
  t01 = A2 + A0
V2 = t01 + A1
  t00 = A1 + 2A0
V1 = A2 + 2t00
   
Squaring Stage: C4 = A22   S3 = V32   S2 = V22   S1 = V12   C0 = A02
Final Transform: t34 = 4C4 - C0   t33 = 4S2 - S3
t42 = t33 + t34
t52 = 2t34 + t42
-6C3 = 2t52 + t50
  t32 = C4 + C0
t41 = S2 - t32

t60 = t52 + t50
2C2 = 2t41 + t60
  t31 = 4S2 - S1
t40 = t31 + t30
t50 = 2t30 + t40
-6C1 = 2t50 + t52
  t30 = 4C0 - C4

Notice that it comes out looking like a complicated version of the 2-way method in that there is an initial transformation stage, a squaring stage, and a final transformation stage.

All of the operations shown can be made O(w). Shifts can be done separately or as part of the combined adds and subtracts.

The time to square (multiply) a number of length 3w becomes

T3(3w) = 5Tbest(w) + O(w).

This suggests that the asymptotic running time of the 3-way method increases as wlog(5) / log(3) = w1.465.

The improvement is not nearly so dramatic as the improvement in the 2-way method over the w2 method. The advantages of more complex routines will be even less dramatic for much more work.

The running time of the 3-way method is approximated by the formulas

TS3(w) = 64.59wlog(5) / log(3) - 243.51w + 33913 and
TM3(w) = 90.15wlog(5) / log(3) + 75.91w - 44016.

As before, to turn this method into a multiply, one duplicates those operations before the squaring stage and multiplies the corresponding terms instead of squaring them.

Running Time of Toom-Cook Style Methods

The running time of a Toom-Cook, Knuth, and k-way methods will all have three components: m = 2k - 1 squarings of elements of size w/k; some O(w) overhead; and some fixed overhead.

Assuming that one starts with a basis routine of length w0 that takes t0 time, we have

Tk(w0) = t0
Tk(kr+1w0) = mTk(krw0) + c1krw0 + c0
Tk(kw0) = mt0 + c1w0 + c0
Tk(k2w0) = m2t0 + c1w0(m + k) + c0(m + 1)
Tk(k3w0) = m3t0 + c1w0(m2 + mk + k2) + c0(m2 + m + 1)
...
Tk(krw0) = mrt0 + c1w0(mr - kr) / (m - k) + c0(mr - 1) / (m - 1)
.

Rearranging terms in this last equation gives

Tk(krw0) = (t0 + c1w0 / (m - k) + c0 / (m - 1))mr - c1w0kr / (m - k) - c0 / (m - 1)

or, in terms of w = krw0 becomes

Tk(w) = (t0 + c1w0 / (m - k) + c0 / (m - 1))(w/w0)log(m) / log(k) - c1w / (m - k) - c0 / (m - 1).

This formula has the expected form: a dominant (w/w0)log(m) / log(k) term, an O(w) term, and a constant term, both of which may be neglected for large w.

But, and this is important, the dominant term contains terms proportional to the overhead. Thus, while the overhead terms may be neglected for large w, the overhead itself may not as it directly contributes to the magnitude of the dominant term.

The 4-Way Method and Beyond

Extension to a k-way method involves solving for the Ci in the polynomial

(Akxk + ... + A0)2 = C2kx2k + ... + C0.

Cook would solve this polynomial by letting x take on the values {0, ..., 2k}.

Knuth would solve this polynomial by letting x take on the values {-k, ..., k}.

But if one chooses x from the reciprocally symmetric set {1, Infinity, 0, 2, 1/2, -2, -1/2, 3, 1/3, -3, -1/3, 3/2, 2/3, ...}, that is, the set of small rational numbers, +/- p/q, such that the GCD(p,q) = 1, the resulting system of equations is very symmetrical and the algorithm is simpler.

I will illustrate the various methods for k = 4.

Cook:







V6
V5
V4
V3
V2
V1
V0




=






216 36 6 1
125 25 5 1
64 16 4 1
27 9 3 1
8 4 2 1
1 1 1 1
0 0 0 1










A3
A2
A1
A0










46656 7776 1296 216 36 6 1
15625 3125 625 125 25 5 1
4096 1024 256 64 16 4 1
729 243 81 27 9 3 1
64 32 16 8 4 2 1
1 1 1 1 1 1 1
0 0 0 0 0 0 1










C6
C5
C4
C3
C2
C1
C0




=






S6
S5
S4
S3
S2
S1
S0




=






(256A3 + 36A2 + 6A1 + A0)2
(125A3 + 25A2 + 5A1 + A0)2
(64A3 + 16A2 + 4A1 + A0)2
(27A3 + 9A2 + 3A1 + A0)2
(8A3 + 4A2 + 2A1 + A0)2
(A3 + A2 + A1 + A0)2
A02










C6
C5
C4
C3
C2
C1
C0




=





-1

720 0 0 0 0 0 0
0 240 0 0 0 0 0
0 0 144 0 0 0 0
0 0 0 48 0 0 0
0 0 0 0 360 0 0
0 0 0 0 0 60 0
0 0 0 0 0 0 1










1 -6 15 -20 15 -6 1
-5 32 -85 120 -95 40 -7
17 -114 321 -484 411 -186 35
-15 104 -307 496 -461 232 -49
137 -972 2970 -5080 5265 -3132 812
-10 72 -225 400 -450 360 -147
0 0 0 0 0 0 1










S6
S5
S4
S3
S2
S1
S0




Knuth:







V6
V5
V4
V3
V2
V1
V0




=






27 9 3 1
8 4 2 1
1 1 1 1
0 0 0 1
-1 1 -1 1
-8 4 -2 1
-27 9 -3 1










A3
A2
A1
A0










729 243 81 27 9 3 1
64 32 16 8 4 2 1
1 1 1 1 1 1 1
0 0 0 0 0 0 1
1 -1 1 -1 1 -1 1
64 -32 16 -8 4 -2 1
729 -243 81 -27 9 -3 1










C6
C5
C4
C3
C2
C1
C0




=






S6
S5
S4
S3
S2
S1
S0




=






(27A3 + 9A2 + 3A1 + A0)2
(8A3 + 4A2 + 2A1 + A0)2
(A3 + A2 + A1 + A0)2
A02
(-A3 + A2 - A1 + A0)2
(-8A3 + 4A2 - 2A1 + A0)2
(-27A3 + 9A2 - 3A1 + A0)2










C6
C5
C4
C3
C2
C1
C0




=





-1

720 0 0 0 0 0 0
0 240 0 0 0 0 0