I have just returned from the Second SIAM Conference on
Parallel Processing for Scientific Computing in Norfolk,
Virginia. There I heard about 1,000-processor systems,
4,000-processor systems, and even a proposed
1,000,000-processor system. Since I wonder whether such
systems are the best way to do general-purpose scientific
computing, I am making the following offer.

    I will pay $100 to the first person to demonstrate a
    speed-up of at least 200 on a general-purpose MIMD
    computer used for scientific computing. This offer
    will be withdrawn at 11:59 PM on 31 December 1995.

Some definitions are in order.

Speed-up: The time taken to run an application on one
    processor using the best sequential algorithm divided
    by the time to run the same application on N
    processors using the best parallel algorithm.
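
    As a concrete illustration of this definition (the
    timings below are hypothetical, not part of the offer):

        # Hypothetical numbers illustrating the speed-up definition.
        t_one = 50000.0     # seconds on 1 processor, best sequential algorithm
        t_n   =   240.0     # seconds on N processors, best parallel algorithm
        print(t_one / t_n)  # 208.3..., which would meet the 200 threshold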

General purpose: The parallel system should be able to run
    a wide variety of applications. For the purposes of
    this test, the machine must run three different programs
    that demonstrate its ability to solve different
    problems. I suggest a large dynamic-structures
    calculation, a transonic fluid flow past a complex
    barrier, and a large problem in econometric modelling.
    These are only suggestions; I will be quite flexible
    in the choice of the problems. Several people have
    volunteered to adjudicate disputes in the selection of
    the applications.

Application: The problems run for this test must be
    complete applications; no computational kernels are
    allowed. They must include all input, all data transfers
    from host to parallel processors, and all output.
    The problems chosen should be the kind of job that a
    working scientist or engineer would submit as a batch
    job to a large supercomputer. In addition, I am
    arbitrarily disqualifying all problems that Cleve
    Moler calls "embarrassingly parallel". These include
    signal processing with multiple independent data
    streams, the computation of the Mandelbrot set, etc.

There are some rather obvious ground rules.

    Simulations of the hardware are not permitted.  I am
    looking for a demonstration on a running piece of
    hardware.

    The same problem should be run on the sequential and
    parallel processors.

    It is not fair to cripple the sequential processor.
    For example, if your operating system consumes 99% of
    one processor, the single-processor run will spend
    nearly all its time in the operating system. Super-linear
    speed-up as the number of processors is increased is
    evidence of this problem.

    It is not fair to memory-starve the single-processor
    run. If you have 100 KW of memory on each of 1,000
    processors and you run a 10 MW problem, it is not
    fair for the sequential run to be made on a processor
    with only 100 KW of memory. After all, we are not
    interested in seeing the impact of additional memory;
    we want to see how much the extra CPUs help.
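
    A sketch of the plausibility check these last two rules
    imply (the threshold logic is illustrative only, not part
    of the rules):

        # Flag timings that suggest a crippled or memory-starved
        # single-processor baseline.
        def check_speedup(t_one, t_n, n):
            s = t_one / t_n
            if s > n:
                # A gain of more than n from n processors points to an
                # unfair baseline, not to the extra CPUs.
                print("suspect super-linear speed-up: %.0fx on %d processors"
                      % (s, n))
            return s

        check_speedup(t_one=9000.0, t_n=5.0, n=1000)    # 1800x: suspect
        check_speedup(t_one=9000.0, t_n=45.0, n=1000)   # 200x: plausible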

It may not be possible to follow all these additional rules.
For example, you can't be expected to build a single processor
with lots of extra memory or write a new operating system just
for this test. However, you should do your best to make the
demonstration fair. A third party has agreed to adjudicate any
scaling disputes.

Anyone claiming this prize will be expected to present his or
her results at a suitable conference. At the end of the
presentation I will have the audience vote on whether the
demonstration was successful. Remember, the purpose of the
bet is to force developers to demonstrate the utility of
massively parallel systems, not to show they can find holes
in the rules.

Alan Karp
na.karp@su-score.arpa
-------

From GOLUB@SU-SCORE.ARPA Tue Dec 17 18:41:39 1985
Date: Mon, 16 Dec 85 11:32:23 EST
From: "John Rice" <jrr@Purdue.EDU>
To: golub@su-navajo.ARPA
Subject: Reply to Karp
Cc: bajaj@Purdue.EDU, jrr@Purdue.EDU
Resent-Date: Tue 17 Dec 85 16:28:32-PST
Resent-From: Gene Golub (415/497-3124) <GOLUB@SU-SCORE.ARPA>

   Dear Gene,                          December 16, 1985
       I attach below a reply to the recent "prize" announcement
   forwarded from Alan Karp.  He has touched on an interesting
   problem but I think the "contest" should be made more precise
   in its goals and procedures.

	Thanks for considering this reply for wider distribution.
					John Rice

----------------------------------------------------------------------

  Alan Karp has offered a prize of $100 for a demonstration of a speed-up
  of 200 or more for a complete application run on a general purpose
  MIMD machine.  This prize reflects the high interest in this important
  scientific question; however, the context of the prize needs to be made
  more precise.  The following points come to mind on a first reading
  of the rules.

  1.  BEST ALGORITHM.  If no restriction is placed on the algorithm class,
  then the best algorithm is invariably to "print out" the answer.  The
  question would then revolve around parallelism in output devices, which
  is probably not Karp's intent.  If reasonable restrictions are placed
  on the algorithm class, then the best algorithm (either sequential or
  parallel) is unknown for almost any interesting application.  If "best"
  is replaced by "commonly used", then one allows some terrible algorithms.
  Note
  that given algorithm X, it is highly probable that there is an "expert"
  with "impeccable credentials" who will testify that X is best (another
  can be found who will testify that X is worst).  Karp probably intends
  to accept algorithms which are "widely accepted as good"; in any case,
  this term needs to be clarified.

  Note that best (or even good) algorithms depend on the application size,
  the number N of MIMD processors, and the architecture, among other things.

  2.  APPLICATION.  It seems that an application is expected to be a
  large, complex computation.  The cost to create a "good", "production
  quality" program of 10,000-100,000 lines is on the order of $1 million.
  Few people or agencies can afford to create two such programs just to
  prove the point of this contest.

  3.  EMBARRASSINGLY PARALLEL.  A working definition of an embarrassingly
  parallel application is one which can be sped up by a factor of 199
  or more by an MIMD machine.  Then the scientific question of interest
  is how many applications are embarrassingly parallel, and this contest
  is irrelevant.  A reasonable definition of the excluded problems should
  be given.  Is the method of lines excluded?  If not, consider a 3D
  parabolic problem with a 100x100x100 spatial grid and a reasonably
  complex (say, 500-1000 operations to evaluate)
  f(x,y,z,u,ux,uy,uz,uxx,uyy,uzz) term.
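
  For concreteness, a schematic sketch of what one method-of-lines time
  step on such a grid looks like (modern NumPy notation, with an
  illustrative f term; every grid point's evaluation is independent
  within a step, which is what makes the exclusion question bite):

      import numpy as np

      n = 100
      h = 1.0 / (n - 1)               # spatial grid spacing
      u = np.random.rand(n, n, n)     # current solution values

      def rhs(u):
          # Central differences approximate uxx, uyy, uzz at interior
          # points; a real f would combine u and these derivatives in
          # 500-1000 operations per point.
          c = u[1:-1, 1:-1, 1:-1]
          uxx = (u[2:, 1:-1, 1:-1] - 2*c + u[:-2, 1:-1, 1:-1]) / h**2
          uyy = (u[1:-1, 2:, 1:-1] - 2*c + u[1:-1, :-2, 1:-1]) / h**2
          uzz = (u[1:-1, 1:-1, 2:] - 2*c + u[1:-1, 1:-1, :-2]) / h**2
          return uxx + uyy + uzz      # stand-in for the complex f term

      dt = 1e-6
      u[1:-1, 1:-1, 1:-1] += dt * rhs(u)   # one explicit Euler step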

  4.  WHAT IS AN ANSWER?  For simple linear algebra problems one would
  insist on both programs producing identical numbers (modulo round-off?).
  For real applications, the term "answer" is not yet well defined.  For
  example, I can "solve" a PDE by producing

      A.  A table of values on a 500x500 grid that are accurate to .001.
      B.  A table of values on a 13x11 grid that are accurate to .000001.
      C.  The 2500 coefficients of a power series that is accurate to
	  .0002.
      D.  A plot of the solution that is accurate to within the eye's
	  ability to distinguish solutions.

  Are these equally acceptable?  Are they equally acceptable at all
  accuracies?  Would all be acceptable at an accuracy of .01?
  It is almost certain that the best sequential and parallel algorithms
  for solving a particular PDE (or any application involving one) will
  not produce identical numbers -- or numbers that should be identical.
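
  One way to make "same answer" operational is to map both results onto
  common points and test agreement against a stated tolerance.  A 1-D
  sketch (grids, test function, and tolerance are illustrative only):

      import numpy as np

      def same_answer(x1, u1, x2, u2, tol):
          # Interpolate the second table onto the first table's points
          # and accept if the two agree to within tol everywhere.
          u2_on_x1 = np.interp(x1, x2, u2)
          return np.max(np.abs(u1 - u2_on_x1)) <= tol

      x_a = np.linspace(0.0, 1.0, 500)   # dense grid, like answer A
      x_b = np.linspace(0.0, 1.0, 13)    # sparse grid, like answer B
      u_a = np.sin(np.pi * x_a)
      u_b = np.sin(np.pi * x_b)
      print(same_answer(x_a, u_a, x_b, u_b, tol=0.01))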

  5.  WHAT IS "CRIPPLING"?  The "proper" balance between CPU power,
  memory, and I/O capacity depends significantly on the architecture;
  there will be large differences in the balance between a uniprocessor
  and a 1000-processor machine.  In practice, however, we do not care
  about the proper balance; we care about the actual balance.  Does the
  contest involve "properly" balanced machines or actual machines?  A
  1000-node NCUBE machine has 128 megabytes of memory; is it fair to
  compare it to a Cray with 8 megawords of memory?  And, if it is
  unfair, to which is it unfair?
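
  The arithmetic behind that comparison, taking a Cray word as 8 bytes
  (the tension between per-processor and total memory is the point):

      ncube_bytes = 1000 * 128 * 1024    # 1000 nodes x 128 KB = ~128 MB
      cray_bytes  = 8 * 2**20 * 8        # 8 MW x 8 bytes/word = 64 MB
      print(ncube_bytes / cray_bytes)    # ~1.95: about twice the total
                                         # memory, but only 128 KB per node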

				       C. Bajaj and J. Rice
				       bajaj@purdue  jrr@purdue