More On Squaring and Multiplying Large Integers
Dan Zuras
Hewlett-Packard Co.
ABSTRACT
Methods of squaring and multiplying large integers are discussed. The
obvious O(n2) method turns out to be
best for small numbers. Existing O(nlog(3) /
log(2)) = O(n1.585) methods become better as
the numbers get bigger. New methods that are
O(nlog(5) / log(3)) = O(n1.465), O(nlog(7) /
log(4)) = O(n1.404), and
O(nlog(9) / log(5)) = O(n1.365) are
presented. In actual experiments, all of these methods turn out to be
faster than FFT multiplies for numbers that can be quite large
(> 37,000,000 bits). Squaring seems to be fundamentally faster than
multiplying but it is shown that Tmultiply
< 2Tsquare + O(n).
Introduction
This is an expansion of the paper On Squaring
and Multiplying Large Integers presented at the IEEE Symposium on
Computer Arithmetic (ARITH-11) in July 1993. This paper presents more
of the ideas behind these methods; some new results; and some
speculations for areas of future work.
Much of the work presented here is based on methods that appear in
Don Knuth's classic work The Art of Computer
Programming as well as the work of
A. L. Toom.
In actual experiments, some of the simpler methods of squaring and
multiplying are shown to be best in a surprisingly large number of
cases. Some new methods are also discussed which are useful out to
quite large numbers. In light of these results it seems that methods
such as the Schönhage and Strassen FFT
multiply, while of theoretical interest, may never be best for any
reasonably sized numbers.
The Problem
The problem to be discussed here is how to find the best (fastest)
ways to square and multiply large numbers in software. The methods
presented here were implemented in C and assembly language on an
HP-9000/720 but the general observations and conclusions should be true
in wider areas of application.
The approach used was to write a collection of routines for squaring
and multiplying w-word numbers producing 2w-word results for various specific values of w. These were general purpose routines in the sense
that w was a parameter.
In the course of writing these routines, it became clear that some
common operations on large integers stored as arrays of unsigned 32-bit
words would be needed. These operations were written in assembly
language and performed the functions detailed in Table
1.
Table 1: These arrays were oriented "big-endian".
That is, the most significant word of a[0,w-1] was
stored in a[0] and the least significant word was
stored in a[w-1]. While the author preferred
"little-endian", both C and the HP-9000 PA-RISC architecture
disagreed. They won, in the end. I will not,
however, inflict the big-endian notation on the reader. In this
paper, I will use the somewhat more conventional notation of
mathematics. That is, little-endian. So there.
Given operands a[0,w-1], b[0,w-1], producing result r[0,w-1]
|
vzconst(r,w,c) |
r[0,w-1] = ccc...c (w c's) |
vzcopy(r,w,a) |
r[0,w-1] = a[0,w-1] |
vzneg(r,w,a) |
r[0,w-1] = -a[0,w-1] |
vzshlM(r,w,a,c) |
r[0,w-1] = a[0,w-1]|c << M for M in [1,2,4,8,16,31]
|
vzshr(r,w,a,M) |
r[0,w-1] = a[0,w-1] >> M for M in [0:31] |
vzadd(r,w,a,b) |
r[0,w-1] = a[0,w-1] + b[0,w-1] |
vzsub(r,w,a,b) |
r[0,w-1] = a[0,w-1] - b[0,w-1] |
Cout = vzaddwco(r,w,a,b) |
r[0,w-1] + Cout = a[0,w-1] + b[0,w-1] |
vzshMadd(r,w,a,b,M) |
r[0,w-1] = a[0,w-1] << M + b[0,w-1] |
vzshMsub(r,w,a,b,M) |
r[0,w-1] = a[0,w-1] << M - b[0,w-1] |
t = vzge(b,w,a) |
t = (b[0,w-1] >= a[0,w-1]) |
A collection of routines for squaring and multiplying "small"
integers were also written in assembly. These provided a basis upon
which the larger routines could be built.
All other routines were written in C and compiler optimized.
Running time was measured for w in the
range 1 to 1,175,553 words (37,617,696 bits) by counting instruction
cycles on a 50MHz HP-9000/720.
Finding the Best
We will define TSi(w) as the
time required for method i to square a number
of length w and
TMi(w) as the time required for method i to multiply a number of length
w. (We will use the expression Ti(w)
when discussing either method or method i
in general.)
Method i is considered best for
some length w if
Ti(w) <= Tj(w')
for all j and for all w'
>= w.
That is, a method is best for a given length if there is no faster
way of squaring or multiplying numbers at least as large as this
one.
Method 1: The w2
Basis
To start the ball rolling, basis routines for operating on small
integers were written using the obvious method.
Roughly speaking, if the algorithm for a multiply is
FOR i = w-1 to
0 :
FOR j = w-1 to
0 :
(r[i+j,i+j+1] and
CarryOut) += a[i] × b[i],
then squaring can be made strictly faster than multiply by
accumulating the upper half off-diagonal elements separately and
combining them with the diagonals according to the formula
Result = 2 ×
Offdiagonals + Diagonals.
Obviously, the running time,
T1(w), increases quadratically.
T1(w), in cycles, is well approximated on the
HP-9000/720 by the least squares formulas
TS1(w) = 4.53w2 + 10w + 30.71
and
TM1(w) =
9.82w2 - 9.06w + 90.26.
These routines turned out to be best for all w
up to 33 for square (23 for multiply).
The 2-Way Method
It seems that the first discussion of a method of multiplication
(squaring, actually) that was better than w2
in the width of the operand was a short paper written by Karatsuba and Ofman that appeared in the Soviet
journal Doklady in 1963. The method involves splitting the operand into
2 parts: the high order bits and the low order bits, based on the
polynomial
(A1x + A0)2 =
A12(x2 - x) + (A1 +
A0)2x + A02(1 - x)
where x = 2n.
Knuth presented a minor improvement on this based on the slightly
different polynomial
(A1x + A0)2 =
A12(x2 + x) - (A1 -
A0)2x + A02(x +
1).
Both of these may be regarded as special cases of solving for the
Ci in the polynomial
(A1x + A0)2 =
C2x2 + C1x + C0.
Karatsuba and Ofman solve this equation by letting
x take on the values {Infinity, 1, 0}
where x = Infinity corresponds to
|
|
(A1x +
A0)2/x2 =
|
|
(C2x2 + C1x +
C0)/x2
|
in this context.
Rather than express this system in the usual polynomial form, I will
express it as a linear transformation of the operand in the following
way
(Please forgive the fairly primitive look of these matrix
expressions. It is the best I have been able to do under the
limitations of html.)
After squaring each element of the vector, the solution to the
resulting system
may be expressed as
This method has a small problem in that the term
A1 + A0 requires one more bit than the
other two. Karatsuba and Ofman address this problem by showing how you
can square an n+1-bit number using an n-bit algorithm together with a couple of shifts and
adds.
Knuth solves the equation by letting x take
on the values {Infinity, -1, 0}. A similar
transformation
results, after squaring, in a similar system
which has the solution
Knuth presented this method as Karatsuba and Ofman's but Knuth's
variation leads to a real improvement. By rearranging the algebra a
bit, he was able to replace the A1 +
A0 term with A1 -
A0. Since the only use of the term is to square it,
its sign is unimportant. Therefore, one can always subtract the lesser
from the greater and save that extra bit.
The test and branch involved results in much less overhead than
Karatsuba and Ofman's method.
This method can be turned into a multiply method by replacing the
Si with the
Pi in the formula
|
|
|
= |
|
= |
|
|
|
|
|
|
|
A1B1
|
|
(A1 - A0)(B1 -
B0)
|
|
A0B0
|
|
|
|
|
|
|
and doing the implied extra work.
These methods are diagrammed in Figure 1.
Figure 1: The 2-way Method, O(w1.585)
| |
Square |
|
Multiply |
| Initial Transform: |
V1 = |A1 - A0|
|
|
V1 = |A1 - A0|
W1 = |B1 - B0|
|
| Squaring/Multiplication: |
C2 = A12
S1 = V12
C0 = A02
|
|
C2 = A1B1
P1 = V1W1
C0 = A0B0
|
| Final Transform: |
t1 = C2 + C0
C1 = t1 - S1
|
|
t1 = C2 + C0
C1 = t1 +/- P1
|
|
Assuming that the time for an add (or subtract) of length w is O(w) , the time to
square a number (or multiply two numbers) of length 2w
would be
T2(2w) = 3Tbest(w) + O(w).
The overhead involved, one add before the squaring step (two adds
before the multiplying step) and two after, is strictly larger than the
overhead for a w2 method. Thus,
although a w2 squaring (or multiply)
of length 2w would take
T1(2w) = 4Tbest(w) + O(w),
there is a definite crossover point as w
increases where T1(w) is roughly the
same as T2(w). This point occurs
where the time involved in the extra overhead of Knuth's method (which I
will call the 2-way method from this point on)
is roughly equivalent to the time for the 4th multiply in the
w2 method.
Below this point the w2 method
wins because of its simplicity and above this point the
2-way method wins because of its better asymptotic behavior.
Figure 2 illustrates this by plotting the
ratio T2(w) / T1(w) for
some 2-way routines. (T1(w) is actually the least squares
approximation.)
Ratio of T2(w), the running
time of Knuth's 2-way multiply to T1(w), the running time of the
w2 basis.
|
In the limit, the running time of the 2-way
method triples each time the size of the problem doubles. Therefore,
its time increases as wlog(3) / log(2) =
w1.585.
The running time of the 2-way routines (in
cycles) is approximated by the least squares formulas
TS2(w) = 26.64wlog(3) / log(2)
- 54.63w + 227 and
TM2(w) =
46.2wlog(3) / log(2) - 84.11w + 979.
Building on a basis set of w2
routines up to 33 words for square (23 words for multiply), the 2-way method takes over at 34 words (24 words) and
generally dominates in these experiments up to 768 words (384
words).
Mark Shand has pointed out that this
crossover point is a strong function of the ratio of the multiply time
to add time on a given machine. The HP-9000/720 performs an unsigned
32 × 32 --> 64 multiply in 2 cycles (5 to
7 cycles if you include load/store overhead). On a machine with a
relatively slower multiply the crossover point can be much lower.
The General Idea
We now have enough background to show the general idea behind these
squaring (and multiplying) methods.
In each squaring case, the operand is split into k
parts Aj. These parts are
transformed via a Qij (derived from
the original polynomial) into m parts Vi = QijAj. Then,
each of the Vi is squared with the
best method available, producing Si =
Vi2.
In each multiply case, the operands are split into
k parts Aj and Bj. These parts are transformed via Qij into m parts
Vi = QijAj and
Wi = QijBj.
Then, each of the Vi is multiplied
by Wi with the best method
available, producing Pi =
ViWi.
Now, the solution to the system
MijCj = Si (
MijCj = Pi) (derived from the
square of the original polynomial) can be expressed as
Ci = Dik-1RkjSj
( Ci =
Dik-1RkjPj) with all
the coefficients in the integers. (Up to now,
Dik-1, which expresses the integer divides,
has not been needed as it was the identity matrix.) Finally, the Ci are shifted and added up to produce the
result.
The 3-Way Method
Toom showed that circuits could be
constructed for squaring integers where the size of the circuit was
bounded by O(ncSqrt(log(n))) and the
delay was bounded by
O(cSqrt(log(n))).
Cook showed that an algorithm for squaring
integers could be realized which had a running time of
O(n25Sqrt(log(n))).
This algorithm, which is actually a collection of algorithms, has
come to be known as the Toom-Cook method.
The first method in this collection is equivalent to Karatsuba's
method.
The second method divides a number into three parts and involves
solving for the Ci in the
polynomial
(A2x2 + A1x +
A0)2 = C4x4 +
C3x3 + C2x2 +
C1x + C0.
Cook solves this equation by letting x take
on the values {4, 3, 2, 1, 0} in the
polynomial, leading to the transformation
|
|
= |
|
|
|
|
|
|
| 16 |
4 |
1 |
| 9 |
3 |
1 |
| 4 |
2 |
1 |
| 1 |
1 |
1 |
| 0 |
0 |
1 |
|
|
|
|
|
|
|
which, after squaring, yields the system
|
|
|
|
|
|
| 256 |
64 |
16 |
4 |
1 |
| 81 |
27 |
9 |
3 |
1 |
| 16 |
8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
(16A2 + 4A1 +
A0)2
|
|
(9A2 + 3A1 +
A0)2
|
|
(4A2 + 2A1 +
A0)2
|
|
(A2 + A1 +
A0)2
|
|
A02
|
|
|
|
|
|
|
the solution of which I will express as
|
|
= |
|
|
|
|
|
-1 |
|
| 24 |
0 |
0 |
0 |
0 |
| 0 |
12 |
0 |
0 |
0 |
| 0 |
0 |
24 |
0 |
0 |
| 0 |
0 |
0 |
12 |
0 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
-4 |
6 |
-4 |
1 |
| -3 |
14 |
-24 |
18 |
-5 |
| 11 |
-56 |
114 |
-104 |
35 |
| -3 |
16 |
-36 |
48 |
-25 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
The diagonal terms in the first matrix are the denominators in
divides of intermediate results. In practice, there turn out to be four
divides by 3 since the rest amounts to shifts of 2 or 3 bits.
Knuth solves this with x taken from {2, 1, 0, -1, -2} leading to the transformation
|
|
= |
|
|
|
|
|
|
| 4 |
2 |
1 |
| 1 |
1 |
1 |
| 0 |
0 |
1 |
| 1 |
-1 |
1 |
| 4 |
-2 |
1 |
|
|
|
|
|
|
|
and the system
|
|
|
|
|
|
| 16 |
8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
0 |
1 |
| 1 |
-1 |
1 |
-1 |
1 |
| 16 |
-8 |
4 |
-2 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
(4A2 + 2A1 +
A0)2
|
|
(A2 + A1 +
A0)2
|
|
A02
|
|
(A2 - A1 +
A0)2
|
|
(4A2 - 2A1 +
A0)2
|
|
|
|
|
|
|
which has the somewhat more symmetrical solution
|
|
= |
|
|
|
|
|
-1 |
|
| 24 |
0 |
0 |
0 |
0 |
| 0 |
12 |
0 |
0 |
0 |
| 0 |
0 |
24 |
0 |
0 |
| 0 |
0 |
0 |
12 |
0 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
-4 |
6 |
-4 |
1 |
| 1 |
-2 |
0 |
2 |
-1 |
| -1 |
16 |
-30 |
16 |
-1 |
| -1 |
8 |
0 |
-8 |
1 |
| 0 |
0 |
1 |
0 |
0 |
|
|
|
|
|
|
|
Although both of these solutions look similar, Knuth's is somewhat
better than the previous one. The sign symmetry simplifies the system
and greatly reduces the overhead (in the form of add, subtract, shift,
and the like).
If, however, one uses reciprocal symmetry as well as sign symmetry
in the choices of x, as in the set {Infinity, 2, 1, 1/2, 0}, then the greater symmetry in
the system results in even greater simplification and greater reduction
in overhead. (In general, x = p/q corresponds
to q2n(An(p/q)n + ...
+ A0)2 =
q2n(C2n(p/q)2n + ... +
C0).)
This system
|
|
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
| 4 |
2 |
1 |
| 1 |
1 |
1 |
| 1 |
2 |
4 |
| 0 |
0 |
1 |
|
|
|
|
|
|
|
and
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
| 16 |
8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
| 1 |
2 |
4 |
8 |
16 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
A22
|
|
(4A2 + 2A1 +
A0)2
|
|
(A2 + A1 +
A0)2
|
|
(A2 + 2A1 +
4A0)2
|
|
A02
|
|
|
|
|
|
|
has the solution
|
|
= |
|
|
|
|
|
-1 |
|
| 1 |
0 |
0 |
0 |
0 |
| 0 |
6 |
0 |
0 |
0 |
| 0 |
0 |
2 |
0 |
0 |
| 0 |
0 |
0 |
6 |
0 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
| -21 |
2 |
-12 |
1 |
-6 |
| 7 |
-1 |
10 |
-1 |
7 |
| -6 |
1 |
-12 |
2 |
-21 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
(Oddly enough, it appears that reciprocal symmetry is more important
than sign symmetry in the sense that, if one must break one or the
other, the system seems simpler if sign symmetry is broken rather than
reciprocal symmetry.)
This solution has only 2 divides by 3 and 3 shifts as well as much
less overhead in the computation of the intermediate results.
The computation of S2 requires 2
extra bits and S1 and S3 each require 3.
The squaring method implied by this solution is shown in Figure 3.
Figure 3: The 3-way Method, O(w1.465)
| | | | |
| | | |
| |
| Initial Transform: |
|
|
t02 = 2A2 + A1
V3 = 2t02 + A0
|
|
t01 = A2 + A0
V2 = t01 + A1
|
|
t00 = A1 + 2A0
V1 = A2 + 2t00
|
|
|
| Squaring Stage: |
C4 = A22
|
|
S3 = V32
|
|
S2 = V22
|
|
S1 = V12
|
|
C0 = A02
|
| Final Transform: |
t34 = 4C4 - C0
|
|
t33 = 4S2 - S3
t42 = t33 + t34
t52 = 2t34 + t42
-6C3 = 2t52 + t50
|
|
t32 = C4 + C0
t41 = S2 - t32
t60 = t52 + t50
2C2 = 2t41 + t60
|
|
t31 = 4S2 - S1
t40 = t31 + t30
t50 = 2t30 + t40
-6C1 = 2t50 + t52
|
|
t30 = 4C0 - C4
|
|
Notice that it comes out looking like a complicated version of the
2-way method in that there is an initial
transformation stage, a squaring stage, and a final transformation
stage.
All of the operations shown can be made
O(w). Shifts can be done separately or as part of the combined
adds and subtracts.
The time to square (multiply) a number of length 3w
becomes
T3(3w) = 5Tbest(w) + O(w).
This suggests that the asymptotic running time of the 3-way method increases as
wlog(5) / log(3) = w1.465.
The improvement is not nearly so dramatic as the improvement in the
2-way method over the
w2 method. The advantages of more complex routines
will be even less dramatic for much more work.
The running time of the 3-way method is
approximated by the formulas
TS3(w) = 64.59wlog(5) / log(3)
- 243.51w + 33913 and
TM3(w) = 90.15wlog(5) / log(3) + 75.91w -
44016.
As before, to turn this method into a multiply, one duplicates those
operations before the squaring stage and multiplies the corresponding
terms instead of squaring them.
Running Time of Toom-Cook Style Methods
The running time of a Toom-Cook, Knuth, and
k-way methods will all have three components:
m = 2k - 1 squarings of elements of size
w/k; some O(w) overhead; and some fixed
overhead.
Assuming that one starts with a basis routine of length w0 that takes
t0 time, we have
Tk(w0) = t0
Tk(kr+1w0) =
mTk(krw0) +
c1krw0 + c0
Tk(kw0) = mt0 +
c1w0 + c0
Tk(k2w0) =
m2t0 + c1w0(m + k) +
c0(m + 1)
Tk(k3w0)
= m3t0 +
c1w0(m2 + mk + k2) +
c0(m2 + m + 1)
...
Tk(krw0) =
mrt0 +
c1w0(mr - kr) / (m - k) +
c0(mr - 1) / (m - 1).
Rearranging terms in this last equation gives
Tk(krw0) = (t0 +
c1w0 / (m - k) +
c0 / (m - 1))mr -
c1w0kr / (m - k) -
c0 / (m - 1)
or, in terms of w = krw0
becomes
Tk(w) = (t0 +
c1w0 / (m - k) + c0 / (m -
1))(w/w0)log(m) / log(k) - c1w / (m -
k) - c0 / (m - 1).
This formula has the expected form: a dominant
(w/w0)log(m) / log(k) term, an O(w) term, and a constant term, both of which may be
neglected for large w.
But, and this is important, the dominant term contains terms
proportional to the overhead. Thus, while the overhead terms may
be neglected for large w, the overhead
itself may not as it directly contributes to the
magnitude of the dominant term.
The 4-Way Method and
Beyond
Extension to a k-way method involves solving
for the Ci in the polynomial
(Akxk + ... +
A0)2 = C2kx2k + ... +
C0.
Cook would solve this polynomial by letting x
take on the values {0, ..., 2k}.
Knuth would solve this polynomial by letting x
take on the values {-k, ..., k}.
But if one chooses x from the reciprocally
symmetric set {1, Infinity, 0, 2, 1/2, -2, -1/2, 3,
1/3, -3, -1/3, 3/2, 2/3, ...}, that is, the set of small rational
numbers, +/- p/q, such that the
GCD(p,q) = 1, the resulting system of equations is very
symmetrical and the algorithm is simpler.
I will illustrate the various methods for k =
4.
Cook:
|
|
= |
|
|
|
|
|
|
| 216 |
36 |
6 |
1 |
| 125 |
25 |
5 |
1 |
| 64 |
16 |
4 |
1 |
| 27 |
9 |
3 |
1 |
| 8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 46656 |
7776 |
1296 |
216 |
36 |
6 |
1 |
| 15625 |
3125 |
625 |
125 |
25 |
5 |
1 |
| 4096 |
1024 |
256 |
64 |
16 |
4 |
1 |
| 729 |
243 |
81 |
27 |
9 |
3 |
1 |
| 64 |
32 |
16 |
8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
(256A3 + 36A2 + 6A1 +
A0)2
|
|
(125A3 + 25A2 + 5A1 +
A0)2
|
|
(64A3 + 16A2 + 4A1 +
A0)2
|
|
(27A3 + 9A2 + 3A1 +
A0)2
|
|
(8A3 + 4A2 + 2A1 +
A0)2
|
|
(A3 + A2 + A1 +
A0)2
|
|
A02
|
|
|
|
|
|
|
|
|
= |
|
|
|
|
|
-1 |
|
| 720 |
0 |
0 |
0 |
0 |
0 |
0 |
| 0 |
240 |
0 |
0 |
0 |
0 |
0 |
| 0 |
0 |
144 |
0 |
0 |
0 |
0 |
| 0 |
0 |
0 |
48 |
0 |
0 |
0 |
| 0 |
0 |
0 |
0 |
360 |
0 |
0 |
| 0 |
0 |
0 |
0 |
0 |
60 |
0 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
-6 |
15 |
-20 |
15 |
-6 |
1 |
| -5 |
32 |
-85 |
120 |
-95 |
40 |
-7 |
| 17 |
-114 |
321 |
-484 |
411 |
-186 |
35 |
| -15 |
104 |
-307 |
496 |
-461 |
232 |
-49 |
| 137 |
-972 |
2970 |
-5080 |
5265 |
-3132 |
812 |
| -10 |
72 |
-225 |
400 |
-450 |
360 |
-147 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
Knuth:
|
|
= |
|
|
|
|
|
|
| 27 |
9 |
3 |
1 |
| 8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
1 |
| -1 |
1 |
-1 |
1 |
| -8 |
4 |
-2 |
1 |
| -27 |
9 |
-3 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 729 |
243 |
81 |
27 |
9 |
3 |
1 |
| 64 |
32 |
16 |
8 |
4 |
2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
1 |
1 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
| 1 |
-1 |
1 |
-1 |
1 |
-1 |
1 |
| 64 |
-32 |
16 |
-8 |
4 |
-2 |
1 |
| 729 |
-243 |
81 |
-27 |
9 |
-3 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
(27A3 + 9A2 + 3A1 +
A0)2
|
|
(8A3 + 4A2 + 2A1 +
A0)2
|
|
(A3 + A2 + A1 +
A0)2
|
|
A02
|
|
(-A3 + A2 - A1 +
A0)2
|
|
(-8A3 + 4A2 - 2A1 +
A0)2
|
|
(-27A3 + 9A2 - 3A1 +
A0)2
|
|
|
|
|
|
|
|
|
= |
|
|
|
|
|
-1 |
|
| 720 |
0 |
0 |
0 |
0 |
0 |
0 |
| 0 |
240 |
0 |
0 |
0 |
0 |
0 |
| 0 |
0 |
144 |
0 |
0 |
0 |
0 |
| 0 |
0 |
0 |
48 |
0 |
0 |
0 |
| 0 |
0 |
0 |
0 |
360 |
0 |
0 |
| 0 |
0 |
0 |
0 |
0 |
60 |
0 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
-6 |
15 |
-20 |
15 |
-6 |
1 |
| 1 |
-4 |
5 |
0 |
-5 |
4 |
-1 |
| -1 |
12 |
-39 |
56 |
-39 |
12 |
-1 |
| -1 |
8 |
-13 |
0 |
13 |
-8 |
1 |
| 2 |
-27 |
270 |
-490 |
270 |
-27 |
2 |
| 1 |
-9 |
45 |
0 |
-45 |
9 |
-1 |
| 0 |
0 |
0 |
1 |
0 |
0 |
0 |
|
|
|
|
|
|
|
4-way:
|
|
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
| 8 |
4 |
2 |
1 |
| -8 |
4 |
-2 |
1 |
| 1 |
1 |
1 |
1 |
| 1 |
-2 |
4 |
-8 |
| 1 |
2 |
4 |
8 |
| 0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
0 |
0 |
| 64 |
32 |
16 |
8 |
4 |
2 |
1 |
| 64 |
-32 |
16 |
-8 |
4 |
-2 |
1 |
| 1 |
1 |
1 |
1 |
1 |
1 |
1 |
| 1 |
-2 |
4 |
-8 |
16 |
-32 |
64 |
| 1 |
2 |
4 |
8 |
16 |
32 |
64 |
| 0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
|
|
|
|
|
|
A32
|
|
(8A3 + 4A2 + 2A1 +
A0)2
|
|
(-8A3 + 4A2 - 2A1 +
A0)2
|
|
(A3 + A2 + A1 +
A0)2
|
|
(A3 - 2A2 + 4A1 -
8A0)2
|
|
(A3 + 2A2 + 4A1 +
8A0)2
|
|
A02
|
|
|
|
|
|
|
This last solution was implemented as shown in the rather busy Figure 4.
Figure 4: The 4-way Method, O(w1.404)
| | | | |
| | | |
| Initial: |
|
t04 = 4A2 + A0
V5 = 2t05 + t04
|
t03 = A3 + A2
V4 = |2t05 - t04|
|
t05 = 4A3 + A1
t00 = 4A0 + A2
V3 = t03 + t02
|
t02 = A1 + A0
V2 = |2t00 - t01|
|
t01 = 4A1 + A3
V1 = 2t00 + t01
|
|
| Square: |
C6 = A32
|
S5 = V52
|
S4 = V42
|
S3 = V32
|
S2 = V22
|
S1 = V12
|
C0 = A02
|
| Final: |
t39 = 4t25 + C6
t38 = 4S5 - t24
t45 = 32t39
t53 = t45 - 2t39
|
t25 = 4C6 - C0
t37 = 2S5 + t24
t44 = 2t38 + t32
t61 = t52 - 2t22
180C5 = t44 - 2t51
|
t24 = S5 + S4
t36 = 4t24 - t21
t43 = t37 + t32
t52 = 8t43 + t43
120C4 = t36 - t53
|
t23 = C6 + C0
t22 = S4 + S2
t35 = 2t23
t34 = 2t23 + S3
360C3 = t60 - t61
|
t21 = S2 + S1
t33 = 4t21 - t24
t42 = 8t34 + t35
t51 = 4t42 + t42
120C2 = t33 - t50
|
t20 = 4C0 - C6
t32 = 2S1 + t21
t41 = 2t31 + t37
t60 = 16t51 + t51
180C1 = t41 - 2t51
|
t30 = 4t20 + C0
t31 = 4S1 - t21
t40 = 32t30
t50 = t41 - 2t30
|
|
Except for its complexity, it is much the same as the other
methods.
The computation of the S3 term
requires 2 extra bits and S1, S2, S4,
and S5 each require 4.
The 4-way method is asymptotically wlog(7) / log(4) = w1.404 and is
approximated on the HP-9000/720 by the formula
TS4(w) = 120.79wlog(7) / log(4)
- 858.7w + 976623.
Figure 5 shows the equations necessary to
implement a 5-way method. This method would be
asymptotically wlog(9) / log(5) =
w1.365 but has not yet been implemented. Out of a
sense of mercy to the reader, I won't even attempt a graph of
the solution.
Figure 5: The 5-way Method, O(w1.365)
| |
|
With x chosen from
{1, Infinity, 0, 2, 1/2, -2, -1/2, 3, 1/3} we have
|
|
Vi =
QijAj;
Si = Vi2; MijCj =
Si; and Ci =
Dik-1RkjSj
where
|
| Qij |
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
| 16 |
8 |
4 |
2 |
1 |
| 16 |
-8 |
4 |
-2 |
1 |
| 81 |
27 |
9 |
3 |
1 |
| 1 |
1 |
1 |
1 |
1 |
| 1 |
3 |
9 |
27 |
81 |
| 1 |
-2 |
4 |
-8 |
16 |
| 1 |
2 |
4 |
8 |
16 |
| 0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
| Mij |
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
| 256 |
128 |
64 |
32 |
16 |
8 |
4 |
2 |
1 |
| 256 |
-128 |
64 |
-32 |
16 |
-8 |
4 |
-2 |
1 |
| 6561 |
2187 |
729 |
243 |
81 |
27 |
9 |
3 |
1 |
| 1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
| 1 |
3 |
9 |
27 |
81 |
243 |
729 |
2187 |
6561 |
| 1 |
-2 |
4 |
-8 |
16 |
-32 |
64 |
-128 |
256 |
| 1 |
2 |
4 |
8 |
16 |
32 |
64 |
128 |
256 |
| 0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
| Dik |
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
| 0 |
2100 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
| 0 |
0 |
12600 |
0 |
0 |
0 |
0 |
0 |
0 |
| 0 |
0 |
0 |
8400 |
0 |
0 |
0 |
0 |
0 |
| 0 |
0 |
0 |
0 |
12600 |
0 |
0 |
0 |
0 |
| 0 |
0 |
0 |
0 |
0 |
8400 |
0 |
0 |
0 |
| 0 |
0 |
0 |
0 |
0 |
0 |
12600 |
0 |
0 |
| 0 |
0 |
0 |
0 |
0 |
0 |
0 |
2100 |
0 |
| 0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
| Rkj |
= |
|
|
|
|
|
|
| 1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
| -2100 |
-42 |
-2 |
3 |
700 |
1 |
1 |
-21 |
-9100 |
| 54600 |
588 |
76 |
-24 |
-14000 |
-24 |
-29 |
483 |
1050 |
| -700 |
770 |
-102 |
-47 |
-9100 |
-5 |
10 |
154 |
146300 |
| -219450 |
-2079 |
97 |
102 |
59500 |
102 |
97 |
-2079 |
-219450 |
| 146300 |
154 |
10 |
-5 |
-9100 |
-47 |
-102 |
770 |
-700 |
| 1050 |
483 |
-29 |
-24 |
-14000 |
-24 |
76 |
588 |
54600 |
| -9100 |
-21 |
1 |
1 |
700 |
3 |
-2 |
-42 |
-2100 |
| 0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
|
|
|
|
|
|
|
|
The FFT Multiply
The best known method of squaring or multiplying integers is the FFT
multiply discovered by Schönhage and
Strassen. That is, this is the method with the best known
asymptotic behavior of
O(wlog(w)log(log(w))).
I shall not present this method here in any great detail but I
recommend the excellent description in Chapter 7 of
Aho, Hopcroft, and Ullman.
In brief, the FFT multiply of order k
splits a number into 2k-1 parts for
increasingly large values of k, performs an
order k FFT on it, squares (or multiplies)
those 2k elements, and performs an
inverse FFT on the result. All arithmetic is performed in a ring with a
solution to the equation z2k-1 =
-1.
If the ring chosen is the integers mod some base,
b, the equation becomes
z2k-1 = -1 mod b and b
must be larger than the result coefficients.
It is common to choose b =
2m2k-1 + 1 with
m2k-1 > 2w(log(w) / log(2)) so that
z = 2m, although other choices are possible. (This
transform is also known as a Number Theoretic Transform or NTT.)
The FFT multiply has the advantages that: it is easily extensible
to any value of k; both the initial and final
transformations can be made to take place in stages that have identical
connectivity; and, since all the divides are by powers of 2, they can be
done as shifts.
But, it has a major disadvantage in that, since the ring must be
large enough to represent the result coefficients, all of the basic
operations are twice as large as in a corresponding Toom-Cook
style method.
Therefore, the overhead involved can be large compared to the k-way methods.
For example, with z = 2m and
b = 22m + 1, we have
z2 = -1 mod b and the transformation for an FFT-2 would look like
|
|
= |
|
|
|
|
|
|
|
A0 + A1
|
|
A0 - A1
|
|
A0 + A1z
|
|
A0 - A1z
|
|
|
|
|
|
|
= |
|
|
and
|
|
|
|
|
|
| z |
-1 |
-z |
1 |
| -z |
-1 |
z |
1 |
| -1 |
1 |
-1 |
1 |
| 1 |
1 |
1 |
1 |
|
|
|
|
|
|
|
= |
|
= |
|
with the solution
|
|
= |
|
|
|
|
|
-1 |
|
| 4 |
0 |
0 |
0 |
| 0 |
0 |
0 |
4 |
| 0 |
0 |
4 |
0 |
| 0 |
4 |
0 |
0 |
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
1 |
1 |
1 |
| 1 |
0 |
-1 + z |
-z |
| 1 |
1 |
-1 |
-1 |
| 1 |
0 |
-1 - z |
z |
|
|
|
|
|
|
|
This method is shown in Figure 6.
Figure 6: The FFT-2 Method
| | | | |
| | | |
| Initial Transform: |
V3 = A0 - A1w
|
|
V3 = A0 + A1w
|
|
V3 = A0 - A1
|
|
V3 = A0 + A1
|
| Squaring Stage: |
S3 = V32
|
|
S2 = V22
|
|
S1 = V12
|
|
S0 = V02
|
| Final Transform: |
t03 = S2 - S3
4C1 = t02 - t03w
|
|
t02 = S0 - S1
4C2 = t00 - t01
|
|
t01 = S2 + S3
4C3 = t02 + t03w
|
|
t00 = S0 + S1
4C0 = t00 + t01
|
|
Running Time of FFT-k
Methods
Since b must be large enough to contain the
result coefficients, each of the 4 squares in an FFT-2
method must be done with a method which is slightly larger
than the entire operand! Therefore, this method is not useful for
constructing larger squares.
The first method that is useful in that sense is the FFT-3. The operand is divided into 4 parts and 8 squares
are performed in a ring that is slightly larger than half the operand.
Therefore, this method is asymptotically
O(wlog(8) / log(2)) = O(w3)!
In general, an FFT-k will have 2k squares in a ring slightly larger than
2k-2 times smaller than the operand.
The asymptotic behavior is therefore
O(wk / (k-2)) = O(w1 + 2 / (k-2)).
A timing formula for an FFT-k may be derived
in the same way as the previous methods as follows
Tk(w0) = t0
Tk(2(r+1)(k-2)w0) =
2kTk(2r(k-2)w0) +
c1k2rkw0 + c0
Tk(2k-2w0) =
2kt0 + c1kw0 +
c0
Tk(22(k-2)w0) =
22kt0 +
2c1kw02k + (2k +
1)c0
Tk(23(k-2)w0)
= 23kt0 +
3c1kw022k + (22k +
2k + 1)c0
...
Tk(2r(k-2)w0) =
2rkt0 +
c1kw0r2rk + ((2rk -
1) / (2k - 1))c0.
Rearranging terms gives
Tk(2r(k-2)w0) = (t0
+ 2-kc1kw0r + c0 /
(2k - 1))2rk - c0 /
(2k - 1)
or, with w =
2r(k-2)w0, becomes
Tk(w) = (t0 +
c1kw0log(w/w0) /
(2k(k-1)log(2)) + c0 / (2k -
1))(w/w0)k / (k-2) - c0 /
(2k - 1).
While this timing formula is much like that for the Toom-Cook style
methods, there is a sense in which the exponent of the dominant term
shrinks more slowly with increasing k relative
to the growth in overhead.
Some Actual Results
Figure 7 compares the squaring methods
discussed here and Figure 8 compares the
multiplies. (The FFT's only appear in the squaring results since, as it
will soon become clear, they do not show very much promise.) Time is in
cycles at 50 MHz on the HP-9000/720 and size is in 32-bit words. For
scale, this shows that one can square a 360,000-bit number in one second
on this machine.
The running time (in cycles) of all the squaring methods
described here.
|
The running time (in cycles) of the Toom-Cook style
multiplication methods described here.
|
As a log-log plot, Figure 7 diminishes the
roughly 3-to-1 differences between the various methods. We can
illustrate those differences better by rescaling the data by the least
squares approximation to T2(w).
Figure 9 shows a log-linear plot of Ti(w) / T2(w). We can now see
that the k-way methods are faster as a class
than the FFT-k methods for w
< 1.1×106 words (37×106 bits) but we still don't have
a clear idea of the asymptotic behavior.
Ratio of Ti(w) to T2(w).
T2(w) (actually
TS2(w)) is approximated by
26.64wlog(3) / log(2) - 54.63w + 227.
|
For that, the log of the ratio of the time to the basis time over
log(k), which is
log(Tk(w) / Tk(w/k)) / log(k)
for the k-way methods and
log(Tk(w) / Tk(w / 2k-2)) /
((k-2)log(2))
for the FFT-k methods, gives a good
approximation to the exponent in the dominant term and will illustrate
how quickly each approaches its asymptote. Figure
10 shows this.
As w --> Infinity, this expression
approaches the value of the exponent of the dominant term in the
equation for the running time of each of the methods described
here.
|
Another graph of interest is Figure 11.
This shows a term which is asymptotically equal to the first overhead
term.
As w --> Infinity, this expression
should approach the value of the first overhead term in
the running time equations. In reality, it seems to approach
that value and then diverge away again. This would appear to be
due to the influence of the working set size of the algorithm in
question versus the size of the cache.
|
Besides showing that the overhead for the
k-way methods is much less than for the FFT-k's, Figure 11 seems to show
the beginnings of cache effects and suggests that the
k-way methods also have slightly better locality of reference
than the FFT-k's.
None of the FFT-k methods tried have any
chance of being best. An FFT-6, asymptotically
O(wlog(64) / log(16)) =
O(w1.5), will also be too slow. An FFT-7, O(wlog(128) / log(32)) =
O(w1.4), is asymptotically faster than the 4-way method but the 5-way
method, O(wlog(9) / log(5)) =
O(w1.365), very likely has much less overhead and will
win first.
In spite of this behavior for small orders, it is possible that the
FFT-k methods will eventually win for some w >> 1,175,000 words.
The overhead in an FFT-k is known to grow as
something like O(k2k). A Toom-Cook
style kTC-way method is
asymptotically comparable to an FFT-kSS
method when kTC =
2(kSS-2) / 2. If the overhead of the k-way methods, dominated by a matrix inversion, grows
as fast as O(kTC3) =
O(23(kSS-2) / 2), the FFT-k's may eventually win.
Is Squaring Faster than Multiplying?
Squaring is a special case of multiplying. Therefore, we have
trivially that
Tsquare <= Tmultiply.
But, in all the methods presented here, squaring involves
strictly less work than multiply. Further, most of this savings
is in the overhead and, in the limit of large numbers, virtually all of
the time spent in a multiply is in the overhead.
Therefore, we are led to ask the question: Is it possible that
there are squaring methods that are of an order faster than
any multiply methods?
The answer is, unfortunately: no.
While it is possible that we may discover some method of squaring
that is strictly faster than any existing multiply method,
any squaring method can be used to construct a multiply method that is
no more than a constant slower.
A simple proof is
XY = ((X + Y)2 - (X2 + Y2)) / 2
which, assuming that add and shift are no worse than O(n), shows that
Tmultiply <= 3Tsquare +
O(n).
Karatsuba presented a better method
XY = ((X + Y)2 - (X - Y)2) / 4
which gives a multiply in 2 squares and
O(n). Therefore,
Tmultiply <= 2Tsquare +
O(n).
Figure 12 compares the best squaring times
and the best multiply times.
Multiplication seems to range from about
1.4× to 1.7× slower
than the corresponding squaring. (The squaring times are
included and are not identically one in that the scaling formula
breaks down near the ends of the interval.)
|
The data in Figure 12 are scaled by the
formula
Tbest = 200.377w1.40059 - (6.51172 -
3.74336 / ln(w)) / ln(w)
which roughly approximates the best squaring times.
Conclusions, Speculations, Etc.
It would appear that many of the simpler methods of multiplying are
best all the way out to quite large numbers. Certainly, into the tens
of millions of bits. Possibly, much farther.
In spite of the fact that squaring is fundamentally faster than
multiply, it can be no better than a constant faster in the limit of
large numbers.
It is still possible that the Schönhage and Strassen method
will win in the end in spite of a slight asymptotic disadvantage. This
would be a useful area for further work.
A closely related area is that of what is the minimum amount of
overhead possible in Toom-Cook style methods. Different assumptions
about the cost of overhead might lead to different trade-offs.
For example, in the approach taken in this paper, what are the best
choices of x = +/- p/q so as to minimize add,
shift, and divide overhead?
References
- Aho, A.V., Hopcroft J.E., Ullman J.D.,
The Design and Analysis of Computer Algorithms, Addison
Wesley, 1974, Chapter 7.
- Cook, S.A. On the Minimum Computation
Time of Functions, Thesis, Harvard University, May 1966,
pages 51-77.
- Karatsuba, A. and Ofman, Yu.
Multiplication of Multidigit Numbers on Automata, Soviet
Physics - Doklady, Vol. 7, #7, January 1963, pages 595-596.
- Knuth, D.E. The Art of Computer
Programming, Vol. 2., Second Edition, Addison-Wesley,
Reading Mass., 1981, Chapter 4, Section 3.3, pages 278-301.
- Schönhage, A. and Strassen, V.
Computing 7 (1971), 281-292 (in German).
- Shand, M., Digital Equipment Corp., Paris
Research Laboratory, personal communication, July 1993.
- Toom, A.L. The Complexity of a Scheme
of Functional Elements Realizing the Multiplication of Integers,
Soviet Mathematics, Vol. 3, 1963, pages 714-716.
- Zuras, D. On Squaring and Multiplying
Large Integers, 11th IEEE Symposium on Computer
Arithmetic, pages 260-271.