More On Squaring and Multiplying Large Integers


Dan Zuras


Hewlett-Packard Co.


ABSTRACT


Methods of squaring and multiplying large integers are discussed. The obvious O(n2) method turns out to be best for small numbers. Existing O(nlog(3) / log(2)) = O(n1.585) methods become better as the numbers get bigger. New methods that are O(nlog(5) / log(3)) = O(n1.465), O(nlog(7) / log(4)) = O(n1.404), and O(nlog(9) / log(5)) = O(n1.365) are presented. In actual experiments, all of these methods turn out to be faster than FFT multiplies for numbers that can be quite large (> 37,000,000 bits). Squaring seems to be fundamentally faster than multiplying but it is shown that Tmultiply < 2Tsquare + O(n).
Contents Tables & Figures
Introduction Table 1: Common Operations
The Problem Figure 1: The 2-way Method, O(w1.585)
Finding the Best Figure 2: T2(w) / T1(w)
Method 1: The w2 Basis Figure 3: The 3-way Method, O(w1.465)
The 2-Way Method Figure 4: The 4-way Method, O(w1.404)
The General Idea Figure 5: The 5-way Method, O(w1.365)
The 3-Way Method Figure 6: The FFT-2 Method
Running Time of Toom-Cook Style Methods Figure 7: The Running Time of Squaring Methods
The 4-Way Method and Beyond Figure 8: The Running Time of Multiplication Methods
The FFT Multiply Figure 9: Ti(w) / T2(w)
Running Time of FFT-k Methods Figure 10: The Exponent of the Dominant Term
Some Actual Results Figure 11: The First Overhead Term
Is Squaring Faster than Multiplying? Figure 12: Multiplication Compared to Squaring
Conclusions, Speculations, Etc.  
References  

Introduction

This is an expansion of the paper On Squaring and Multiplying Large Integers presented at the IEEE Symposium on Computer Arithmetic (ARITH-11) in July 1993. This paper presents more of the ideas behind these methods; some new results; and some speculations for areas of future work.

Much of the work presented here is based on methods that appear in Don Knuth's classic work The Art of Computer Programming as well as the work of A. L. Toom.

In actual experiments, some of the simpler methods of squaring and multiplying are shown to be best in a surprisingly large number of cases. Some new methods are also discussed which are useful out to quite large numbers. In light of these results it seems that methods such as the Schönhage and Strassen FFT multiply, while of theoretical interest, may never be best for any reasonably sized numbers.

The Problem

The problem to be discussed here is how to find the best (fastest) ways to square and multiply large numbers in software. The methods presented here were implemented in C and assembly language on an HP-9000/720 but the general observations and conclusions should be true in wider areas of application.

The approach used was to write a collection of routines for squaring and multiplying w-word numbers producing 2w-word results for various specific values of w. These were general purpose routines in the sense that w was a parameter.

In the course of writing these routines, it became clear that some common operations on large integers stored as arrays of unsigned 32-bit words would be needed. These operations were written in assembly language and performed the functions detailed in Table 1.

Table 1: These arrays were oriented "big-endian". That is, the most significant word of a[0,w-1] was stored in a[0] and the least significant word was stored in a[w-1]. While the author preferred "little-endian", both C and the HP-9000 PA-RISC architecture disagreed. They won, in the end. I will not, however, inflict the big-endian notation on the reader. In this paper, I will use the somewhat more conventional notation of mathematics. That is, little-endian. So there.
Given operands a[0,w-1], b[0,w-1], producing result r[0,w-1]
vzconst(r,w,c) r[0,w-1] = ccc...c (w c's)
vzcopy(r,w,a) r[0,w-1] = a[0,w-1]
vzneg(r,w,a) r[0,w-1] = -a[0,w-1]
vzshlM(r,w,a,c) r[0,w-1] = a[0,w-1]|c << M for M in [1,2,4,8,16,31]
vzshr(r,w,a,M) r[0,w-1] = a[0,w-1] >> M for M in [0:31]
vzadd(r,w,a,b) r[0,w-1] = a[0,w-1] + b[0,w-1]
vzsub(r,w,a,b) r[0,w-1] = a[0,w-1] - b[0,w-1]
Cout = vzaddwco(r,w,a,b) r[0,w-1] + Cout = a[0,w-1] + b[0,w-1]
vzshMadd(r,w,a,b,M) r[0,w-1] = a[0,w-1] << M + b[0,w-1]
vzshMsub(r,w,a,b,M) r[0,w-1] = a[0,w-1] << M - b[0,w-1]
t = vzge(b,w,a) t = (b[0,w-1] >= a[0,w-1])

A collection of routines for squaring and multiplying "small" integers were also written in assembly. These provided a basis upon which the larger routines could be built.

All other routines were written in C and compiler optimized.

Running time was measured for w in the range 1 to 1,175,553 words (37,617,696 bits) by counting instruction cycles on a 50MHz HP-9000/720.

Finding the Best

We will define TSi(w) as the time required for method i to square a number of length w and TMi(w) as the time required for method i to multiply a number of length w. (We will use the expression Ti(w) when discussing either method or method i in general.)

Method i is considered best for some length w if

Ti(w) <= Tj(w') for all j and for all w' >= w.

That is, a method is best for a given length if there is no faster way of squaring or multiplying numbers at least as large as this one.

Method 1: The w2 Basis

To start the ball rolling, basis routines for operating on small integers were written using the obvious method.

Roughly speaking, if the algorithm for a multiply is

FOR i = w-1 to 0 :
FOR j = w-1 to 0 :
(r[i+j,i+j+1] and CarryOut) += a[i] × b[i],

then squaring can be made strictly faster than multiply by accumulating the upper half off-diagonal elements separately and combining them with the diagonals according to the formula

Result = 2 × Offdiagonals + Diagonals.

Obviously, the running time, T1(w), increases quadratically. T1(w), in cycles, is well approximated on the HP-9000/720 by the least squares formulas

TS1(w) = 4.53w2 + 10w + 30.71 and
TM1(w) = 9.82w2 - 9.06w + 90.26.

These routines turned out to be best for all w up to 33 for square (23 for multiply).

The 2-Way Method

It seems that the first discussion of a method of multiplication (squaring, actually) that was better than w2 in the width of the operand was a short paper written by Karatsuba and Ofman that appeared in the Soviet journal Doklady in 1963. The method involves splitting the operand into 2 parts: the high order bits and the low order bits, based on the polynomial

(A1x + A0)2 = A12(x2 - x) + (A1 + A0)2x + A02(1 - x)

where x = 2n.

Knuth presented a minor improvement on this based on the slightly different polynomial

(A1x + A0)2 = A12(x2 + x) - (A1 - A0)2x + A02(x + 1).

Both of these may be regarded as special cases of solving for the Ci in the polynomial

(A1x + A0)2 = C2x2 + C1x + C0.

Karatsuba and Ofman solve this equation by letting x take on the values {Infinity, 1, 0} where x = Infinity corresponds to

lim
x --> Infinity
(A1x + A0)2/x2 =
lim
x --> Infinity
(C2x2 + C1x + C0)/x2

in this context.

Rather than express this system in the usual polynomial form, I will express it as a linear transformation of the operand in the following way







V2
V1
V0




=






A1
A1 + A0
A0




=






1 0
1 1
0 1










A1
A0




(Please forgive the fairly primitive look of these matrix expressions. It is the best I have been able to do under the limitations of html.)

After squaring each element of the vector, the solution to the resulting system







1 0 0
1 1 1
0 0 1










C2
C1
C0




=






S2
S1
S0




=






A12
(A1 + A0)2
A02




may be expressed as







C2
C1
C0




=






1 0 0
-1 1 -1
0 0 1










S2
S1
S0




This method has a small problem in that the term A1 + A0 requires one more bit than the other two. Karatsuba and Ofman address this problem by showing how you can square an n+1-bit number using an n-bit algorithm together with a couple of shifts and adds.

Knuth solves the equation by letting x take on the values {Infinity, -1, 0}. A similar transformation







V2
V1
V0




=






A1
A1 - A0
A0




=






1 0
1 -1
0 1










A1
A0




results, after squaring, in a similar system







1 0 0
1 -1 1
0 0 1










C2
C1
C0




=






S2
S1
S0




=






A12
(A1 - A0)2
A02




which has the solution







C2
C1
C0




=






1 0 0
1 -1 1
0 0 1










S2
S1
S0




Knuth presented this method as Karatsuba and Ofman's but Knuth's variation leads to a real improvement. By rearranging the algebra a bit, he was able to replace the A1 + A0 term with A1 - A0. Since the only use of the term is to square it, its sign is unimportant. Therefore, one can always subtract the lesser from the greater and save that extra bit.

The test and branch involved results in much less overhead than Karatsuba and Ofman's method.

This method can be turned into a multiply method by replacing the Si with the Pi in the formula







1 0 0
1 -1 1
0 0 1










C2
C1
C0




=






P2
P1
P0




=






A1B1
(A1 - A0)(B1 - B0)
A0B0




and doing the implied extra work.

These methods are diagrammed in Figure 1.

Figure 1: The 2-way Method, O(w1.585)
  Square   Multiply
Initial Transform: V1 = |A1 - A0|   V1 = |A1 - A0|
W1 = |B1 - B0|
Squaring/Multiplication: C2 = A12
S1 = V12
C0 = A02
  C2 = A1B1
P1 = V1W1
C0 = A0B0
Final Transform: t1 = C2 + C0
C1 = t1 - S1
  t1 = C2 + C0
C1 = t1 +/- P1

Assuming that the time for an add (or subtract) of length w is O(w) , the time to square a number (or multiply two numbers) of length 2w would be

T2(2w) = 3Tbest(w) + O(w).

The overhead involved, one add before the squaring step (two adds before the multiplying step) and two after, is strictly larger than the overhead for a w2 method. Thus, although a w2 squaring (or multiply) of length 2w would take

T1(2w) = 4Tbest(w) + O(w),

there is a definite crossover point as w increases where T1(w) is roughly the same as T2(w). This point occurs where the time involved in the extra overhead of Knuth's method (which I will call the 2-way method from this point on) is roughly equivalent to the time for the 4th multiply in the w2 method.

Below this point the w2 method wins because of its simplicity and above this point the 2-way method wins because of its better asymptotic behavior.

Figure 2 illustrates this by plotting the ratio T2(w) / T1(w) for some 2-way routines. (T1(w) is actually the least squares approximation.)

Ratio of T2(w), the running time of Knuth's 2-way multiply to T1(w), the running time of the w2 basis.

In the limit, the running time of the 2-way method triples each time the size of the problem doubles. Therefore, its time increases as wlog(3) / log(2) = w1.585.

The running time of the 2-way routines (in cycles) is approximated by the least squares formulas

TS2(w) = 26.64wlog(3) / log(2) - 54.63w + 227 and
TM2(w) = 46.2wlog(3) / log(2) - 84.11w + 979.

Building on a basis set of w2 routines up to 33 words for square (23 words for multiply), the 2-way method takes over at 34 words (24 words) and generally dominates in these experiments up to 768 words (384 words).

Mark Shand has pointed out that this crossover point is a strong function of the ratio of the multiply time to add time on a given machine. The HP-9000/720 performs an unsigned 32 × 32 --> 64 multiply in 2 cycles (5 to 7 cycles if you include load/store overhead). On a machine with a relatively slower multiply the crossover point can be much lower.

The General Idea

We now have enough background to show the general idea behind these squaring (and multiplying) methods.

In each squaring case, the operand is split into k parts Aj. These parts are transformed via a Qij (derived from the original polynomial) into m parts Vi = QijAj. Then, each of the Vi is squared with the best method available, producing Si = Vi2.

In each multiply case, the operands are split into k parts Aj and Bj. These parts are transformed via Qij into m parts Vi = QijAj and Wi = QijBj. Then, each of the Vi is multiplied by Wi with the best method available, producing Pi = ViWi.

Now, the solution to the system MijCj = Si ( MijCj = Pi) (derived from the square of the original polynomial) can be expressed as Ci = Dik-1RkjSj ( Ci = Dik-1RkjPj) with all the coefficients in the integers. (Up to now, Dik-1, which expresses the integer divides, has not been needed as it was the identity matrix.) Finally, the Ci are shifted and added up to produce the result.

The 3-Way Method

Toom showed that circuits could be constructed for squaring integers where the size of the circuit was bounded by O(ncSqrt(log(n))) and the delay was bounded by O(cSqrt(log(n))).

Cook showed that an algorithm for squaring integers could be realized which had a running time of O(n25Sqrt(log(n))).

This algorithm, which is actually a collection of algorithms, has come to be known as the Toom-Cook method.

The first method in this collection is equivalent to Karatsuba's method.

The second method divides a number into three parts and involves solving for the Ci in the polynomial

(A2x2 + A1x + A0)2 = C4x4 + C3x3 + C2x2 + C1x + C0.

Cook solves this equation by letting x take on the values {4, 3, 2, 1, 0} in the polynomial, leading to the transformation







V4
V3
V2
V1
V0




=






16 4 1
9 3 1
4 2 1
1 1 1
0 0 1










A2
A1
A0




which, after squaring, yields the system







256 64 16 4 1
81 27 9 3 1
16 8 4 2 1
1 1 1 1 1
0 0 0 0 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






(16A2 + 4A1 + A0)2
(9A2 + 3A1 + A0)2
(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
A02




the solution of which I will express as







C4
C3
C2
C1
C0




=





-1

24 0 0 0 0
0 12 0 0 0
0 0 24 0 0
0 0 0 12 0
0 0 0 0 1










1 -4 6 -4 1
-3 14 -24 18 -5
11 -56 114 -104 35
-3 16 -36 48 -25
0 0 0 0 1










S4
S3
S2
S1
S0




The diagonal terms in the first matrix are the denominators in divides of intermediate results. In practice, there turn out to be four divides by 3 since the rest amounts to shifts of 2 or 3 bits.

Knuth solves this with x taken from {2, 1, 0, -1, -2} leading to the transformation







V4
V3
V2
V1
V0




=






4 2 1
1 1 1
0 0 1
1 -1 1
4 -2 1










A2
A1
A0




and the system







16 8 4 2 1
1 1 1 1 1
0 0 0 0 1
1 -1 1 -1 1
16 -8 4 -2 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
A02
(A2 - A1 + A0)2
(4A2 - 2A1 + A0)2




which has the somewhat more symmetrical solution







C4
C3
C2
C1
C0




=





-1

24 0 0 0 0
0 12 0 0 0
0 0 24 0 0
0 0 0 12 0
0 0 0 0 1










1 -4 6 -4 1
1 -2 0 2 -1
-1 16 -30 16 -1
-1 8 0 -8 1
0 0 1 0 0










S4
S3
S2
S1
S0




Although both of these solutions look similar, Knuth's is somewhat better than the previous one. The sign symmetry simplifies the system and greatly reduces the overhead (in the form of add, subtract, shift, and the like).

If, however, one uses reciprocal symmetry as well as sign symmetry in the choices of x, as in the set {Infinity, 2, 1, 1/2, 0}, then the greater symmetry in the system results in even greater simplification and greater reduction in overhead. (In general, x = p/q corresponds to q2n(An(p/q)n + ... + A0)2 = q2n(C2n(p/q)2n + ... + C0).)

This system







V4
V3
V2
V1
V0




=






1 0 0
4 2 1
1 1 1
1 2 4
0 0 1










A2
A1
A0




and







1 0 0 0 0
16 8 4 2 1
1 1 1 1 1
1 2 4 8 16
0 0 0 0 1










C4
C3
C2
C1
C0




=






S4
S3
S2
S1
S0




=






A22
(4A2 + 2A1 + A0)2
(A2 + A1 + A0)2
(A2 + 2A1 + 4A0)2
A02




has the solution







C4
C3
C2
C1
C0




=





-1

1 0 0 0 0
0 6 0 0 0
0 0 2 0 0
0 0 0 6 0
0 0 0 0 1










1 0 0 0 0
-21 2 -12 1 -6
7 -1 10 -1 7
-6 1 -12 2 -21
0 0 0 0 1










S4
S3
S2
S1
S0




(Oddly enough, it appears that reciprocal symmetry is more important than sign symmetry in the sense that, if one must break one or the other, the system seems simpler if sign symmetry is broken rather than reciprocal symmetry.)

This solution has only 2 divides by 3 and 3 shifts as well as much less overhead in the computation of the intermediate results.

The computation of S2 requires 2 extra bits and S1 and S3 each require 3.

The squaring method implied by this solution is shown in Figure 3.

Figure 3: The 3-way Method, O(w1.465)
                   
Initial Transform:     t02 = 2A2 + A1
V3 = 2t02 + A0
  t01 = A2 + A0
V2 = t01 + A1
  t00 = A1 + 2A0
V1 = A2 + 2t00
   
Squaring Stage: C4 = A22   S3 = V32   S2 = V22   S1 = V12   C0 = A02
Final Transform: t34 = 4C4 - C0   t33 = 4S2 - S3
t42 = t33 + t34
t52 = 2t34 + t42
-6C3 = 2t52 + t50
  t32 = C4 + C0
t41 = S2 - t32

t60 = t52 + t50
2C2 = 2t41 + t60
  t31 = 4S2 - S1
t40 = t31 + t30
t50 = 2t30 + t40
-6C1 = 2t50 + t52
  t30 = 4C0 - C4

Notice that it comes out looking like a complicated version of the 2-way method in that there is an initial transformation stage, a squaring stage, and a final transformation stage.

All of the operations shown can be made O(w). Shifts can be done separately or as part of the combined adds and subtracts.

The time to square (multiply) a number of length 3w becomes

T3(3w) = 5Tbest(w) + O(w).

This suggests that the asymptotic running time of the 3-way method increases as wlog(5) / log(3) = w1.465.

The improvement is not nearly so dramatic as the improvement in the 2-way method over the w2 method. The advantages of more complex routines will be even less dramatic for much more work.

The running time of the 3-way method is approximated by the formulas

TS3(w) = 64.59wlog(5) / log(3) - 243.51w + 33913 and
TM3(w) = 90.15wlog(5) / log(3) + 75.91w - 44016.

As before, to turn this method into a multiply, one duplicates those operations before the squaring stage and multiplies the corresponding terms instead of squaring them.

Running Time of Toom-Cook Style Methods

The running time of a Toom-Cook, Knuth, and k-way methods will all have three components: m = 2k - 1 squarings of elements of size w/k; some O(w) overhead; and some fixed overhead.

Assuming that one starts with a basis routine of length w0 that takes t0 time, we have

Tk(w0) = t0
Tk(kr+1w0) = mTk(krw0) + c1krw0 + c0
Tk(kw0) = mt0 + c1w0 + c0
Tk(k2w0) = m2t0 + c1w0(m + k) + c0(m + 1)
Tk(k3w0) = m3t0 + c1w0(m2 + mk + k2) + c0(m2 + m + 1)
...
Tk(krw0) = mrt0 + c1w0(mr - kr) / (m - k) + c0(mr - 1) / (m - 1)
.

Rearranging terms in this last equation gives

Tk(krw0) = (t0 + c1w0 / (m - k) + c0 / (m - 1))mr - c1w0kr / (m - k) - c0 / (m - 1)

or, in terms of w = krw0 becomes

Tk(w) = (t0 + c1w0 / (m - k) + c0 / (m - 1))(w/w0)log(m) / log(k) - c1w / (m - k) - c0 / (m - 1).

This formula has the expected form: a dominant (w/w0)log(m) / log(k) term, an O(w) term, and a constant term, both of which may be neglected for large w.

But, and this is important, the dominant term contains terms proportional to the overhead. Thus, while the overhead terms may be neglected for large w, the overhead itself may not as it directly contributes to the magnitude of the dominant term.

The 4-Way Method and Beyond

Extension to a k-way method involves solving for the Ci in the polynomial

(Akxk + ... + A0)2 = C2kx2k + ... + C0.

Cook would solve this polynomial by letting x take on the values {0, ..., 2k}.

Knuth would solve this polynomial by letting x take on the values {-k, ..., k}.

But if one chooses x from the reciprocally symmetric set {1, Infinity, 0, 2, 1/2, -2, -1/2, 3, 1/3, -3, -1/3, 3/2, 2/3, ...}, that is, the set of small rational numbers, +/- p/q, such that the GCD(p,q) = 1, the resulting system of equations is very symmetrical and the algorithm is simpler.

I will illustrate the various methods for k = 4.

Cook:







V6
V5
V4
V3
V2
V1
V0




=






216 36 6 1
125 25 5 1
64 16 4 1
27 9 3 1
8 4 2 1
1 1 1 1
0 0 0 1










A3
A2
A1
A0










46656 7776 1296 216 36 6 1
15625 3125 625 125 25 5 1
4096 1024 256 64 16 4 1
729 243 81 27 9 3 1
64 32 16 8 4 2 1
1 1 1 1 1 1 1
0 0 0 0 0 0 1










C6
C5
C4
C3
C2
C1
C0




=






S6
S5
S4
S3
S2
S1
S0




=






(256A3 + 36A2 + 6A1 + A0)2
(125A3 + 25A2 + 5A1 + A0)2
(64A3 + 16A2 + 4A1 + A0)2
(27A3 + 9A2 + 3A1 + A0)2
(8A3 + 4A2 + 2A1 + A0)2
(A3 + A2 + A1 + A0)2
A02










C6
C5
C4
C3
C2
C1
C0




=





-1

720 0 0 0 0 0 0
0 240 0 0 0 0 0
0 0 144 0 0 0 0
0 0 0 48 0 0 0
0 0 0 0 360 0 0
0 0 0 0 0 60 0
0 0 0 0 0 0 1










1 -6 15 -20 15 -6 1
-5 32 -85 120 -95 40 -7
17 -114 321 -484 411 -186 35
-15 104 -307 496 -461 232 -49
137 -972 2970 -5080 5265 -3132 812
-10 72 -225 400 -450 360 -147
0 0 0 0 0 0 1










S6
S5
S4
S3
S2
S1
S0




Knuth:







V6
V5
V4
V3
V2
V1
V0




=






27 9 3 1
8 4 2 1
1 1 1 1
0 0 0 1
-1 1 -1 1
-8 4 -2 1
-27 9 -3 1










A3
A2
A1
A0










729 243 81 27 9 3 1
64 32 16 8 4 2 1
1 1 1 1 1 1 1
0 0 0 0 0 0 1
1 -1 1 -1 1 -1 1
64 -32 16 -8 4 -2 1
729 -243 81 -27 9 -3 1










C6
C5
C4
C3
C2
C1
C0




=






S6
S5
S4
S3
S2
S1
S0




=






(27A3 + 9A2 + 3A1 + A0)2
(8A3 + 4A2 + 2A1 + A0)2
(A3 + A2 + A1 + A0)2
A02
(-A3 + A2 - A1 + A0)2
(-8A3 + 4A2 - 2A1 + A0)2
(-27A3 + 9A2 - 3A1 + A0)2










C6
C5
C4
C3
C2
C1
C0




=





-1

720 0 0 0 0 0 0
0 240 0 0 0 0 0
0 0 144 0 0 0 0
0 0 0 48 0 0 0
0 0 0 0 360 0 0
0 0 0 0 0 60 0
0 0 0 0 0 0 1










1 -6 15 -20 15 -6 1
1 -4 5 0 -5 4 -1
-1 12 -39 56 -39 12 -1
-1 8 -13 0 13 -8 1
2 -27 270 -490 270 -27 2
1 -9 45 0 -45 9 -1
0 0 0 1 0 0 0










S6
S5
S4
S3
S2
S1
S0




4-way:







V6
V5
V4
V3
V2
V1
V0




=






1 0 0 0
8 4 2 1
-8 4 -2 1
1 1 1 1
1 -2 4 -8
1 2 4 8
0 0 0 1










A3
A2
A1
A0










1 0 0 0 0 0 0
64 32 16 8 4 2 1
64 -32 16 -8 4 -2 1
1 1 1 1 1 1 1
1 -2 4 -8 16 -32 64
1 2 4 8 16 32 64
0 0 0 0 0 0 1










C6
C5
C4
C3
C2
C1
C0




=






S6
S5
S4
S3
S2
S1
S0




=






A32
(8A3 + 4A2 + 2A1 + A0)2
(-8A3 + 4A2 - 2A1 + A0)2
(A3 + A2 + A1 + A0)2
(A3 - 2A2 + 4A1 - 8A0)2
(A3 + 2A2 + 4A1 + 8A0)2
A02




This last solution was implemented as shown in the rather busy Figure 4.

Figure 4: The 4-way Method, O(w1.404)
               
Initial:   t04 = 4A2 + A0
V5 = 2t05 + t04
t03 = A3 + A2
V4 = |2t05 - t04|
t05 = 4A3 + A1
t00 = 4A0 + A2
V3 = t03 + t02
t02 = A1 + A0
V2 = |2t00 - t01|
t01 = 4A1 + A3
V1 = 2t00 + t01
 
Square: C6 = A32 S5 = V52 S4 = V42 S3 = V32 S2 = V22 S1 = V12 C0 = A02
Final: t39 = 4t25 + C6
t38 = 4S5 - t24
t45 = 32t39
t53 = t45 - 2t39
t25 = 4C6 - C0
t37 = 2S5 + t24
t44 = 2t38 + t32
t61 = t52 - 2t22
180C5 = t44 - 2t51
t24 = S5 + S4
t36 = 4t24 - t21
t43 = t37 + t32
t52 = 8t43 + t43
120C4 = t36 - t53
t23 = C6 + C0
t22 = S4 + S2
t35 = 2t23
t34 = 2t23 + S3
360C3 = t60 - t61
t21 = S2 + S1
t33 = 4t21 - t24
t42 = 8t34 + t35
t51 = 4t42 + t42
120C2 = t33 - t50
t20 = 4C0 - C6
t32 = 2S1 + t21
t41 = 2t31 + t37
t60 = 16t51 + t51
180C1 = t41 - 2t51
t30 = 4t20 + C0
t31 = 4S1 - t21
t40 = 32t30
t50 = t41 - 2t30

Except for its complexity, it is much the same as the other methods.

The computation of the S3 term requires 2 extra bits and S1, S2, S4, and S5 each require 4.

The 4-way method is asymptotically wlog(7) / log(4) = w1.404 and is approximated on the HP-9000/720 by the formula

TS4(w) = 120.79wlog(7) / log(4) - 858.7w + 976623.

Figure 5 shows the equations necessary to implement a 5-way method. This method would be asymptotically wlog(9) / log(5) = w1.365 but has not yet been implemented. Out of a sense of mercy to the reader, I won't even attempt a graph of the solution.

Figure 5: The 5-way Method, O(w1.365)
 
With x chosen from {1, Infinity, 0, 2, 1/2, -2, -1/2, 3, 1/3} we have
Vi = QijAj; Si = Vi2; MijCj = Si; and Ci = Dik-1RkjSj where
Qij =






1 0 0 0 0
16 8 4 2 1
16 -8 4 -2 1
81 27 9 3 1
1 1 1 1 1
1 3 9 27 81
1 -2 4 -8 16
1 2 4 8 16
0 0 0 0 1




Mij =






1 0 0 0 0 0 0 0 0
256 128 64 32 16 8 4 2 1
256 -128 64 -32 16 -8 4 -2 1
6561 2187 729 243 81 27 9 3 1
1 1 1 1 1 1 1 1 1
1 3 9 27 81 243 729 2187 6561
1 -2 4 -8 16 -32 64 -128 256
1 2 4 8 16 32 64 128 256
0 0 0 0 0 0 0 0 1




Dik =






1 0 0 0 0 0 0 0 0
0 2100 0 0 0 0 0 0 0
0 0 12600 0 0 0 0 0 0
0 0 0 8400 0 0 0 0 0
0 0 0 0 12600 0 0 0 0
0 0 0 0 0 8400 0 0 0
0 0 0 0 0 0 12600 0 0
0 0 0 0 0 0 0 2100 0
0 0 0 0 0 0 0 0 1




Rkj =






1 0 0 0 0 0 0 0 0
-2100 -42 -2 3 700 1 1 -21 -9100
54600 588 76 -24 -14000 -24 -29 483 1050
-700 770 -102 -47 -9100 -5 10 154 146300
-219450 -2079 97 102 59500 102 97 -2079 -219450
146300 154 10 -5 -9100 -47 -102 770 -700
1050 483 -29 -24 -14000 -24 76 588 54600
-9100 -21 1 1 700 3 -2 -42 -2100
0 0 0 0 0 0 0 0 1




The FFT Multiply

The best known method of squaring or multiplying integers is the FFT multiply discovered by Schönhage and Strassen. That is, this is the method with the best known asymptotic behavior of O(wlog(w)log(log(w))).

I shall not present this method here in any great detail but I recommend the excellent description in Chapter 7 of Aho, Hopcroft, and Ullman.

In brief, the FFT multiply of order k splits a number into 2k-1 parts for increasingly large values of k, performs an order k FFT on it, squares (or multiplies) those 2k elements, and performs an inverse FFT on the result. All arithmetic is performed in a ring with a solution to the equation z2k-1 = -1.

If the ring chosen is the integers mod some base, b, the equation becomes z2k-1 = -1 mod b and b must be larger than the result coefficients.

It is common to choose b = 2m2k-1 + 1 with m2k-1 > 2w(log(w) / log(2)) so that z = 2m, although other choices are possible. (This transform is also known as a Number Theoretic Transform or NTT.)

The FFT multiply has the advantages that: it is easily extensible to any value of k; both the initial and final transformations can be made to take place in stages that have identical connectivity; and, since all the divides are by powers of 2, they can be done as shifts.

But, it has a major disadvantage in that, since the ring must be large enough to represent the result coefficients, all of the basic operations are twice as large as in a corresponding Toom-Cook style method.

Therefore, the overhead involved can be large compared to the k-way methods.

For example, with z = 2m and b = 22m + 1, we have z2 = -1 mod b and the transformation for an FFT-2 would look like







V3
V2
V1
V0




=






A0 + A1
A0 - A1
A0 + A1z
A0 - A1z




=






1 0
1 -1
1 z
1 -z










A1
A0




and







z -1 -z 1
-z -1 z 1
-1 1 -1 1
1 1 1 1










C3
C2
C1
C0




=






S3
S2
S1
S0




=






V32
V22
V12
V02




with the solution







C0
C1
C2
C3




=





-1

4 0 0 0
0 0 0 4
0 0 4 0
0 4 0 0










1 1 1 1
1 0 -1 + z -z
1 1 -1 -1
1 0 -1 - z z










S3
S2
S1
S0




This method is shown in Figure 6.

Figure 6: The FFT-2 Method
               
Initial Transform: V3 = A0 - A1w   V3 = A0 + A1w   V3 = A0 - A1   V3 = A0 + A1
Squaring Stage: S3 = V32   S2 = V22   S1 = V12   S0 = V02
Final Transform: t03 = S2 - S3
4C1 = t02 - t03w
  t02 = S0 - S1
4C2 = t00 - t01
  t01 = S2 + S3
4C3 = t02 + t03w
  t00 = S0 + S1
4C0 = t00 + t01

Running Time of FFT-k Methods

Since b must be large enough to contain the result coefficients, each of the 4 squares in an FFT-2 method must be done with a method which is slightly larger than the entire operand! Therefore, this method is not useful for constructing larger squares.

The first method that is useful in that sense is the FFT-3. The operand is divided into 4 parts and 8 squares are performed in a ring that is slightly larger than half the operand. Therefore, this method is asymptotically O(wlog(8) / log(2)) = O(w3)!

In general, an FFT-k will have 2k squares in a ring slightly larger than 2k-2 times smaller than the operand. The asymptotic behavior is therefore O(wk / (k-2)) = O(w1 + 2 / (k-2)).

A timing formula for an FFT-k may be derived in the same way as the previous methods as follows

Tk(w0) = t0
Tk(2(r+1)(k-2)w0) = 2kTk(2r(k-2)w0) + c1k2rkw0 + c0
Tk(2k-2w0) = 2kt0 + c1kw0 + c0
Tk(22(k-2)w0) = 22kt0 + 2c1kw02k + (2k + 1)c0
Tk(23(k-2)w0) = 23kt0 + 3c1kw022k + (22k + 2k + 1)c0
...
Tk(2r(k-2)w0) = 2rkt0 + c1kw0r2rk + ((2rk - 1) / (2k - 1))c0
.

Rearranging terms gives

Tk(2r(k-2)w0) = (t0 + 2-kc1kw0r + c0 / (2k - 1))2rk - c0 / (2k - 1)

or, with w = 2r(k-2)w0, becomes

Tk(w) = (t0 + c1kw0log(w/w0) / (2k(k-1)log(2)) + c0 / (2k - 1))(w/w0)k / (k-2) - c0 / (2k - 1).

While this timing formula is much like that for the Toom-Cook style methods, there is a sense in which the exponent of the dominant term shrinks more slowly with increasing k relative to the growth in overhead.

Some Actual Results

Figure 7 compares the squaring methods discussed here and Figure 8 compares the multiplies. (The FFT's only appear in the squaring results since, as it will soon become clear, they do not show very much promise.) Time is in cycles at 50 MHz on the HP-9000/720 and size is in 32-bit words. For scale, this shows that one can square a 360,000-bit number in one second on this machine.

The running time (in cycles) of all the squaring methods described here.

The running time (in cycles) of the Toom-Cook style multiplication methods described here.

As a log-log plot, Figure 7 diminishes the roughly 3-to-1 differences between the various methods. We can illustrate those differences better by rescaling the data by the least squares approximation to T2(w).

Figure 9 shows a log-linear plot of Ti(w) / T2(w). We can now see that the k-way methods are faster as a class than the FFT-k methods for w < 1.1×106 words (37×106 bits) but we still don't have a clear idea of the asymptotic behavior.

Ratio of Ti(w) to T2(w). T2(w) (actually TS2(w)) is approximated by 26.64wlog(3) / log(2) - 54.63w + 227.

For that, the log of the ratio of the time to the basis time over log(k), which is

log(Tk(w) / Tk(w/k)) / log(k)

for the k-way methods and

log(Tk(w) / Tk(w / 2k-2)) / ((k-2)log(2))

for the FFT-k methods, gives a good approximation to the exponent in the dominant term and will illustrate how quickly each approaches its asymptote. Figure 10 shows this.

As w --> Infinity, this expression approaches the value of the exponent of the dominant term in the equation for the running time of each of the methods described here.

Another graph of interest is Figure 11. This shows a term which is asymptotically equal to the first overhead term.

As w --> Infinity, this expression should approach the value of the first overhead term in the running time equations. In reality, it seems to approach that value and then diverge away again. This would appear to be due to the influence of the working set size of the algorithm in question versus the size of the cache.

Besides showing that the overhead for the k-way methods is much less than for the FFT-k's, Figure 11 seems to show the beginnings of cache effects and suggests that the k-way methods also have slightly better locality of reference than the FFT-k's.

None of the FFT-k methods tried have any chance of being best. An FFT-6, asymptotically O(wlog(64) / log(16)) = O(w1.5), will also be too slow. An FFT-7, O(wlog(128) / log(32)) = O(w1.4), is asymptotically faster than the 4-way method but the 5-way method, O(wlog(9) / log(5)) = O(w1.365), very likely has much less overhead and will win first.

In spite of this behavior for small orders, it is possible that the FFT-k methods will eventually win for some w >> 1,175,000 words.

The overhead in an FFT-k is known to grow as something like O(k2k). A Toom-Cook style kTC-way method is asymptotically comparable to an FFT-kSS method when kTC = 2(kSS-2) / 2. If the overhead of the k-way methods, dominated by a matrix inversion, grows as fast as O(kTC3) = O(23(kSS-2) / 2), the FFT-k's may eventually win.

Is Squaring Faster than Multiplying?

Squaring is a special case of multiplying. Therefore, we have trivially that

Tsquare <= Tmultiply.

But, in all the methods presented here, squaring involves strictly less work than multiply. Further, most of this savings is in the overhead and, in the limit of large numbers, virtually all of the time spent in a multiply is in the overhead.

Therefore, we are led to ask the question: Is it possible that there are squaring methods that are of an order faster than any multiply methods?

The answer is, unfortunately: no.

While it is possible that we may discover some method of squaring that is strictly faster than any existing multiply method, any squaring method can be used to construct a multiply method that is no more than a constant slower.

A simple proof is

XY = ((X + Y)2 - (X2 + Y2)) / 2

which, assuming that add and shift are no worse than O(n), shows that

Tmultiply <= 3Tsquare + O(n).

Karatsuba presented a better method

XY = ((X + Y)2 - (X - Y)2) / 4

which gives a multiply in 2 squares and O(n). Therefore,

Tmultiply <= 2Tsquare + O(n).

Figure 12 compares the best squaring times and the best multiply times.

Multiplication seems to range from about 1.4× to 1.7× slower than the corresponding squaring. (The squaring times are included and are not identically one in that the scaling formula breaks down near the ends of the interval.)

The data in Figure 12 are scaled by the formula

Tbest = 200.377w1.40059 - (6.51172 - 3.74336 / ln(w)) / ln(w)

which roughly approximates the best squaring times.

Conclusions, Speculations, Etc.

It would appear that many of the simpler methods of multiplying are best all the way out to quite large numbers. Certainly, into the tens of millions of bits. Possibly, much farther.

In spite of the fact that squaring is fundamentally faster than multiply, it can be no better than a constant faster in the limit of large numbers.

It is still possible that the Schönhage and Strassen method will win in the end in spite of a slight asymptotic disadvantage. This would be a useful area for further work.

A closely related area is that of what is the minimum amount of overhead possible in Toom-Cook style methods. Different assumptions about the cost of overhead might lead to different trade-offs.

For example, in the approach taken in this paper, what are the best choices of x = +/- p/q so as to minimize add, shift, and divide overhead?

References

  1. Aho, A.V., Hopcroft J.E., Ullman J.D., The Design and Analysis of Computer Algorithms, Addison Wesley, 1974, Chapter 7.
  2. Cook, S.A. On the Minimum Computation Time of Functions, Thesis, Harvard University, May 1966, pages 51-77.
  3. Karatsuba, A. and Ofman, Yu. Multiplication of Multidigit Numbers on Automata, Soviet Physics - Doklady, Vol. 7, #7, January 1963, pages 595-596.
  4. Knuth, D.E. The Art of Computer Programming, Vol. 2., Second Edition, Addison-Wesley, Reading Mass., 1981, Chapter 4, Section 3.3, pages 278-301.
  5. Schönhage, A. and Strassen, V. Computing 7 (1971), 281-292 (in German).
  6. Shand, M., Digital Equipment Corp., Paris Research Laboratory, personal communication, July 1993.
  7. Toom, A.L. The Complexity of a Scheme of Functional Elements Realizing the Multiplication of Integers, Soviet Mathematics, Vol. 3, 1963, pages 714-716.
  8. Zuras, D. On Squaring and Multiplying Large Integers, 11th IEEE Symposium on Computer Arithmetic, pages 260-271.