Guy Robinson
This book describes a set of application and systems software research projects undertaken by the Caltech Concurrent Computation Program (C P) from 1983 to 1990. The material is organized so that applications with similar algorithmic and software challenges are grouped together. Thus, one can not only learn that parallel computing is effective on a broad range of problems, but also why it works, what algorithms are needed, and what features the software should support. The description of the software has been updated through 1993 to reflect the current interests of Geoffrey Fox, now at Syracuse University but still working with many C P collaborators through the auspices of the NSF Center for Research in Parallel Computation (CRPC).
Many C P members wrote sections of this book. John Apostolakis wrote Section 7.4; Clive Baillie, Sections 4.3, 4.4, 7.2 and 12.6; Vas Bala, Section 13.2; Ted Barnes, Section 7.3; Roberto Battitti, Sections 6.5, 6.7, 6.8 and 9.9; Rob Clayton, Section 18.2; Dave Curkendall, Section 18.3; Hong Ding, Sections 6.3 and 6.4; David Edelsohn, Section 12.8; Jon Flower, Sections 5.2, 5.3, 5.4 and 13.5; Tom Gottschalk, Sections 9.8 and 18.4; Gary Gutt, Section 4.5; Wojtek Furmanski, Chapter 17; Mark Johnson, Section 14.2; Jeff Koller, Sections 13.4 and 15.2; Aron Kuppermann, Section 8.2; Paulette Liewer, Section 9.3; Vince McKoy, Section 8.3; Paul Messina, Chapter 2; Steve Otto, Sections 6.6, 11.4, 12.7, 13.6 and 14.3; Jean Patterson, Section 9.4; Francois Pepin, Section 12.5; Peter Reiher, Section 15.3; John Salmon, Section 12.4; Tony Skjellum, Sections 9.5, 9.6 and Chapter 16; Michael Speight, Section 7.6; Eric Van de Velde, Section 9.7; David Walker, Sections 6.2 and 8.1; Brad Werner, Section 9.2; Roy Williams, Sections 11.1, 12.2, 12.3 and Chapter 10. Geoffrey Fox wrote the remaining text. Appendix B describes many of the key C P contributors, with brief biographies.
C P's research depended on the support of many sponsors; central support for major projects was given by the Department of Energy and the Electronic Systems Division of the USAF. Other federal sponsors were the Joint Tactical Fusion Office, NASA, NSF, and the National Security Agency. C P's start-up was only possible due to two private donations from the Parsons and System Development Foundations. Generous corporate support came from ALCOA, Digital Equipment, General Dynamics, General Motors, Hitachi, Hughes, IBM, INTEL, Lockheed, McDonnell Douglas, MOTOROLA, Nippon Steel, nCUBE, Sandia National Laboratories, and Shell.
Production of this book would have been impossible without the dedicated help of Richard Alonso, Lisa Deyo, Keri Arnold, Blaise Canzian and especially Terri Canzian.
This book describes the activities of the Caltech Concurrent Computation Program (C P). This was a seven-year project (1983-1990), focussed on the question, ``Can parallel computers be used effectively for large scale scientific computation?'' The title of the book, ``Parallel Computing Works,'' reveals our belief that we answered the question in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high-performance computing facility based exclusively on parallel computers. While the initial focus of C P was the hypercube architecture developed by C. Seitz at Caltech, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Of course, C P was only one of many projects contributing to this field, and so the contents of this book are only representative of the important activities in parallel computing during the last ten years. However, we believe that the project did address a wide range of issues and application areas. Thus, a book focussed on C P has some general interest. We do, of course, cite other activities, but surely not completely. Other general references which the reader will find valuable are [ Almasi:89a ], [ Andrews:91a ], [ Arbib:90a ], [ Blelloch:90a ], [ Brawer:89a ], [ Doyle:91a ], [ Duncan:90a ], [ Fox:88a ], [ Golub:89a ], [ Hayes:89a ], [ Hennessy:91a ], [ Hillis:85a ], [ Hockney:81a ], [ Hord:90a ], [ Hwang:89a ], [ IEEE:91a ], [ Laksh:90a ], [ Lazou:87a ], [Messina:87a;91d], [ Schneck:87a ], [ Skerrett:92a ], [ Stone:91a ], [ Trew:91a ], [ Zima:91a ].
C P was both a technical and a social experiment. It involved a wide range of disciplines working together to understand the hardware, software, and algorithmic (applications) issues in parallel computing. Such multidisciplinary activities are generally considered of growing relevance to many new academic and research activities, including the federal high-performance computing and communications initiative. Many of the participants of C P are no longer at Caltech, and this carries both positive and negative messages. C P was not set up in a traditional academic fashion since its core interdisciplinary field, computational science, is not well understood or implemented either nationally or in specific universities. This is explored further in Chapter 20 . C P has led to flourishing follow-on projects at Caltech, Syracuse University, and elsewhere. These differ from C P just as parallel computing has changed from an exploratory field to one that is in a transitional stage into development, production, and exploitation.
The technological driving force behind parallel computing is VLSI, or very large scale integration, the same technology that created the personal computer and workstation market over the last decade. In 1980, the Intel 8086 used 50,000 transistors; in 1992, the latest Digital alpha RISC chip contains 1,680,000 transistors, a factor of 30 increase. The dramatic improvement in chip density comes together with an increase in clock speed and improved design, so that the alpha performs better on scientific problems by a factor of over one thousand than the 8086-8087 chip pair of the early 1980s.
The increasing density of transistors on a chip follows directly from a decreasing feature size, which is now under a micron for the alpha. Feature size will continue to decrease and, by the year 2000, chips with 50 million transistors are expected to be available. What can we do with all these transistors?
With around a million transistors on a chip, designers were able to fit full mainframe functionality into a fraction of a single chip. This enabled the personal computing and workstation revolutions. The next factors-of-ten increases in transistor density must go into some form of parallelism, obtained by replicating several CPUs on a single chip.
By the year 2000, parallelism will thus be present in all computers, from your children's video games to personal computers, workstations, and supercomputers. Today we see it in the larger machines as we replicate many chips and printed circuit boards to build systems as arrays of nodes, each of which is some variant of the microprocessor. This is illustrated in Figure 1.1 (Color Plate), which shows an nCUBE parallel supercomputer with 64 identical nodes on each board; each node is a single-chip CPU with additional memory chips. To be useful, these nodes must be linked in some way, and this is still a matter of much research and experimentation. Further, we can argue as to the most appropriate node to replicate: is it a ``small'' node as in the nCUBE of Figure 1.1 (Color Plate), or a more powerful ``fat'' node such as those offered in the CM-5 and Intel Touchstone shown in Figures 1.2 and 1.3 (Color Plates), where each node is a sophisticated multichip printed circuit board? However, these details should not obscure the basic point: Parallelism allows one to build the world's fastest and most cost-effective supercomputers.
Figure 1.1 :
The nCUBE-2 node and its integration into
a board. Up to 128 of these boards can be combined into a single
supercomputer.
Figure 1.2 :
The CM-5 produced by Thinking Machines.
Figure 1.3 :
The DELTA Touchstone parallel
supercomputer produced by Intel and installed at Caltech.
Parallelism may only be critical today for supercomputer vendors and users. By the year 2000, all computers will have to address the hardware, algorithmic, and software issues implied by parallelism. The reward will be amazing performance and the opening up of new fields; the price will be a major rethinking and reimplementation of software, algorithms, and applications.
This vision and its consequent issues are now well understood and generally agreed. They provided the motivation in 1981 when C P's first roots were formed. In those days, the vision was blurred and controversial. Many believed that parallel computing would not work.
President Bush instituted, in 1992, the five-year federal High Performance Computing and Communications (HPCC) Program. This will spur the development of the technology described above and is focussed on the solution of grand challenges shown in Figure 1.4 (Color Plate). These are fundamental problems in science and engineering, with broad economic and scientific impact, whose solution could be advanced by applying high-performance computing techniques and resources.
Figure 1.4:
Grand Challenge Applications. Some major applications which will be enabled by parallel supercomputers. The computer performance numbers are given in more detail in Figure 2.1 (Color Plate).
The activities of several federal agencies have been coordinated in this program. The Advanced Research Projects Agency (ARPA) is developing the basic technologies, which are applied to the grand challenges by the Department of Energy (DOE), the National Aeronautics and Space Administration (NASA), the National Science Foundation (NSF), the National Institutes of Health (NIH), the Environmental Protection Agency (EPA), and the National Oceanic and Atmospheric Administration (NOAA). Selected activities include the mapping of the human genome in DOE, climate modelling in DOE and NOAA, and coupled structural and airflow simulations of advanced powered lift and a high-speed civil transport in NASA.
More generally, it is clear that parallel computing can only realize its full potential and become commercially successful if it is accepted in the real world of industry and government applications. The clear U.S. leadership over Europe and Japan in high-performance computing offers the rest of U.S. industry the opportunity to gain a global competitive advantage.
Some of these industrial opportunities are discussed in Chapter 19 .
C P did not address such large-scale problems. Rather, we concentrated on major academic applications. This fit the experience of the Caltech faculty who led most of the C P teams; further, academic applications are smaller and cleaner than large-scale industrial problems. One important large-scale C P application was a military simulation, described in Chapter 18 and produced by Caltech's Jet Propulsion Laboratory. These academic applications were the right computations, and arguably the only practical ones, on which C P could cut its parallel computing teeth. In spite of the focus on different applications, there are many similarities between the vision and structure of C P and today's national effort. It may even be that today's grand challenge teams can learn from C P's experience.
C P's origins date to an early collaboration between the physics and computer science departments at Caltech in bringing up UNIX on the physics department's VAX 11/780. As an aside, we note this was motivated by the development of the Symbolic Manipulation Program (SMP) by Wolfram and collaborators; this project has now grown into the well-known system Mathematica. Carver Mead from computer science urged the physicists to get back to him if we had insoluble large-scale computational needs. This comment was reinforced in May, 1981, when Mead gave a physics colloquium on VLSI, Very Large Scale Integration, and the opportunities it opened up. Fox, in the audience, realized that quantum chromodynamics (QCD, Section 4.3 ), then using up all free cycles on the VAX 11/780, was naturally parallelizable and could take advantage of the parallel machines promised by VLSI. Thus, a seemingly modest interdisciplinary interaction, a computer scientist lecturing to physicists, gave birth to a large interdisciplinary project, C P. Further, our interest in QCD stemmed from the installation of the VAX 11/780 to replace our previous batch computing on a remote CDC 7600. This more attractive computing environment, UNIX on a VAX 11/780, encouraged theoretical physics graduate students to explore computation.
During the summer of 1981, Fox's research group, especially Eugene Brooks and Steve Otto, showed that effective concurrent algorithms could be developed, and we presented our conclusions to the Caltech computer scientists. This presentation led to the plans, described in a national context in Chapter 2 , to produce the first hypercube, with Chuck Seitz and his student Erik DeBenedictis developing the hardware, and Fox's group the QCD applications and systems software. The physics group did not understand what a hypercube was at that stage, but agreed with the computer scientists because the planned six-dimensional hypercube was isomorphic to a three-dimensional mesh, a topology whose relevance a physicist could appreciate. With the generous help of the computer scientists, we gradually came to understand the hypercube topology, with its general advantage (the maximum distance between nodes is only log2N) and its specific feature of including a rich variety of mesh topologies. Here N is the total number of nodes in the concurrent computer. We should emphasize that this understanding of the relevance of concurrency to QCD was not particularly novel; it followed from ideas already known from earlier concurrent machines such as the Illiac IV. We were, however, fortunate to investigate the issues at a time when microprocessor technology (in particular, the Intel 8086/8087) allowed one to build large (in terms of number of nodes), cost-effective concurrent computers with interesting performance levels. The QCD problem was also important in helping ensure that the initial Cosmic Cube was built with sensible design choices; we were fortunate that in choosing parameters, such as memory size, appropriate for QCD, we also realized a machine of general capability.
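These properties of the hypercube are easy to make concrete. The sketch below is illustrative modern Python, not C P-era code: nodes carry d-bit binary labels, neighbors differ in exactly one bit, so the maximum distance between any two of the N = 2^d nodes is d = log2 N, and a binary-reflected Gray code embeds a ring (or each axis of a mesh) in the cube with adjacency preserved.

```python
def hypercube_neighbors(node, d):
    """The d neighbors of `node` in a d-dimensional hypercube differ
    from it in exactly one bit of their binary label."""
    return [node ^ (1 << bit) for bit in range(d)]

def hops(a, b):
    """Distance between nodes a and b = number of differing label bits."""
    return bin(a ^ b).count("1")

def gray(i):
    """Binary-reflected Gray code; consecutive values differ in one bit."""
    return i ^ (i >> 1)

d = 6                 # a six-dimensional, 64-node cube, as in the Cosmic Cube
N = 2 ** d
assert all(hops(0, nbr) == 1 for nbr in hypercube_neighbors(0, d))
assert max(hops(0, node) for node in range(N)) == d   # diameter is log2 N
# a 64-node ring embeds in the cube with every ring neighbor one hop apart
assert all(hops(gray(i), gray(i + 1)) == 1 for i in range(N - 1))
```

Applying the Gray-code trick independently along each axis is what lets a mesh, such as the three-dimensional mesh the physicists cared about, live inside the hypercube with nearest neighbors kept adjacent.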
While the 64-node Cosmic Cube was under construction, Fox wandered around Caltech and the associated Jet Propulsion Laboratory explaining parallel computing and, in particular, the Cosmic Cube to scientists in various disciplines who were using ``large'' (by the standards of 1981) scale computers. To his surprise, all the problems being tackled on conventional machines by these scientists seemed to be implementable on the Cosmic Cube. This was the origin of C P, which identified the Caltech-JPL applications team, whose initial participants are noted in Table 4.2 . Fox, Seitz, and these scientists prepared the initial proposals which established and funded C P in the summer of 1983. Major support was obtained from the Department of Energy and the Parsons and System Development Foundation. Intel made key initial contributions of chips to the Cosmic Cube and follow-on machines. The Department of Energy remained the central funding support for C P throughout its seven years, 1983 to 1990.
The initial C P proposals were focussed on the question:
``Is the hypercube an effective computer for large-scale scientific computing?''
Our approach was simple: Build or acquire interesting hardware and provide the intellectual and physical infrastructure to allow leading application scientists and engineers to both develop parallel algorithms and codes, and use them to address important problems. Often we liked to say that C P
``used real hardware and real software to solve real problems.''
Our project showed initial success, with the approximately ten applications of Table 4.2 developed in the first year. We both showed good performance on the hypercube and developed a performance model, which is elaborated in Chapter 3 . A major activity at this time was the design and development of the necessary support software, termed CrOS and later developed into the commercial software Express described in Chapter 5 . Not only was the initial hardware applicable to a wide range of problems, but our software model proved surprisingly useful. CrOS was originally designed by Brooks as the ``mailbox communication system,'' and we initially targeted the regular synchronous problems typified in Chapter 4 . Only later did we realize that it supported quite efficiently the irregular and non-local communication needs of more general problems. This generalization is represented by the evolutionary track of Express in Chapter 5 , and by a new communication system, Zipcode, developed from scratch for general asynchronous irregular problems and described in Section 16.1 .
Although successful, we found many challenges and intriguing questions opened up by C P's initial investigation into parallel computing. Further, around 1985, the DOE and later the NSF made substantial conventional supercomputer (Cray, Cyber, ETA) time available to applications scientists. Our Cosmic Cube and the follow-on Mark II machines were very competitive with the VAX 11/780, but not with the performance offered by the CRAY X-MP. Thus, internal curiosity and external pressures moved C P in the direction of computer science: still developing real software, but addressing new parallel algorithms and load-balancing techniques rather than production applications. This phase of C P is summarized in [ Angus:90a ], [Fox:88a;88b].
Around 1988, we returned to our original goal with a focus on parallel supercomputers. We no longer concentrated on the hypercube, but rather asked such questions as [ Fox:88v ],
``What are the cost, technology, and software trade-offs that will drive the design of future parallel supercomputers?''
and as a crucial (and harder) question:
``What is the appropriate software environment for parallel machines and how should one develop it?''
We evolved from the initial 8086/8087, 80286/80287 machines to the internal JPL Mark IIIfp and commercial nCUBE-1 and CM-2 described in Chapter 2 . These were still ``rough, difficult to use machines'' but had performance capabilities competitive with conventional supercomputers.
This book largely describes work in the last three years of C P, when we developed a series of large-scale applications on these parallel supercomputers. Further, as described in Chapters 15 , 16 , and 17 , we developed prototypes and ideas for higher level software environments which could accelerate and ease the production of parallel software. This period also saw an explosion of interest in the use of parallel computing outside Caltech. Much of this research used commercial hypercubes, whose development was partly motivated by our initial discoveries and successes on the in-house machines at Caltech. This rapid technology transfer was in one sense gratifying, but it also put pressure on C P, which was better placed to blaze new trails than to perform the more programmatic research which was now appropriate.
An important and unexpected discovery in C P concerned education and the academic support for interdisciplinary research. Many of the researchers in C P, especially graduate students, evolved into ``computational scientists'': not traditional physicists, chemists, or computer scientists, but rather something in between. We believe that this interdisciplinary education and expertise was critical for C P's success and, as discussed in Chapter 20 , it should be encouraged in more universities [Fox:91f;92d].
Further information about C P can be found in our annual reports and two reviews.
[Fox:84j;84k;85c;85e;85i;86f;87c;87d;88v;88oo;89i;89n;89cc;89dd;90o]
C P's research showed that:

- PCW: Parallel Computers Work in a large class of scientific and engineering computations.

The book quantifies and exemplifies this assertion.
In Chapter 2 , we provide the national overview of parallel computing activities during the last decade. Chapter 3 is somewhat speculative as it attempts to provide a framework to quantify the previous PCW statement.
We will show that, more precisely, parallel computing only works in a ``scaling'' fashion in a special class of problems which we call synchronous and loosely synchronous.
By scaling, we mean that the parallel implementation will efficiently extend to systems with large numbers of nodes without levelling off of the speedup obtained. These concepts are quantified in Chapter 3 with a simple performance model described in detail in [ Fox:88a ].
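A toy version of such a model makes ``scaling'' concrete. The following Python sketch is an illustration of the grain-size argument elaborated in Chapter 3, not the book's actual model; the constant c and the communication-to-calculation time ratio are made-up illustrative parameters. Communication overhead per node falls as the grain size n (data points per node) grows, so holding n fixed while adding nodes keeps efficiency constant and the speedup grows linearly.

```python
def efficiency(n, d, ratio=1.0, c=1.0):
    """Parallel efficiency for grain size n (data points per node) in a
    d-dimensional domain decomposition: the surface-to-volume
    communication overhead f_c falls off as n**(-1/d)."""
    f_c = c * ratio * n ** (-1.0 / d)
    return 1.0 / (1.0 + f_c)

def speedup(N, n, d, **kw):
    """Speedup on N nodes = N * efficiency; independent of N at fixed n."""
    return N * efficiency(n, d, **kw)

# doubling the machine at fixed grain size doubles the speedup ("scaling"),
s64, s128 = speedup(64, 4096, 3), speedup(128, 4096, 3)
assert abs(s128 / s64 - 2.0) < 1e-12
# while a larger grain size always improves the efficiency of each node
assert efficiency(4096, 3) > efficiency(512, 3)
```

By contrast, fixing the total problem size while adding nodes shrinks n, drives f_c up, and eventually levels off the speedup; this is the distinction the text draws.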
The book is organized with applications and software issues growing in complexity in later chapters. Chapter 4 describes the cleanest regular synchronous applications, which included many of our initial successes. However, we already see the essential points:

- DD: Domain decomposition (or data parallelism) is a universal source of scalable parallelism.

- MP: The software model was simple explicit message passing, with each node of a parallel processor running conventional sequential code and communicating with other nodes via subroutine calls.

CrOS and its follow-on Express, described in Chapter 5 , support this software paradigm. Explicit message passing is still an important software model and, in many cases, the only viable approach to high-performance parallel implementations on MIMD machines.
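DD and MP together can be pictured with a small example. The fragment below is a hypothetical Python sketch, not CrOS code: a one-dimensional array is decomposed across ``nodes,'' and one step of a nearest-neighbor averaging update is performed, with boundary values exchanged as explicit messages modeled here by per-node mailboxes.

```python
def run_step(subdomains):
    """One update step: each node sends its edge cells to its neighbors
    (MP), then updates its own subdomain using the received ghost
    values (DD)."""
    P = len(subdomains)
    mailboxes = [{} for _ in range(P)]
    # communication phase: explicit messages carrying boundary values
    for p, local in enumerate(subdomains):
        if p > 0:
            mailboxes[p - 1]["from_right"] = local[0]
        if p < P - 1:
            mailboxes[p + 1]["from_left"] = local[-1]
    # calculation phase: every node updates its cells independently
    new = []
    for p, local in enumerate(subdomains):
        left = mailboxes[p].get("from_left", local[0])    # physical edge
        right = mailboxes[p].get("from_right", local[-1])
        padded = [left] + local + [right]
        new.append([(padded[i - 1] + padded[i + 1]) / 2.0
                    for i in range(1, len(local) + 1)])
    return new

# eight cells decomposed over two "nodes"
data = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
result = run_step(data)
# the decomposed result matches the same update run on a single "node"
assert run_step([sum(data, [])])[0] == sum(result, [])
```

The final assertion is the essential point of DD: the decomposed computation gives exactly the sequential answer, while each node touches only its own subdomain plus the ghost values its neighbors sent.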
Chapters 6 through 9 confirm these lessons with an extension to more irregular problems. Loosely synchronous problems are harder to parallelize, but still use the basic principles DD and MP. Chapter 7 describes a special class of applications, embarrassingly parallel, where scaling parallelism is guaranteed by the independence of the separate components of the problem.
Chapters 10 and 11 describe parallel computing tools developed within C P. DIME supports parallel mesh generation and adaptation, and is used in general finite element codes. Initially, we thought load balancing would be a major stumbling block for parallel computing because, formally, it is an NP-complete (intractable) optimization problem. However, effective heuristic methods were developed which avoid the exponential time complexity of NP-complete problems by searching for good, but not exact, minima.
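To illustrate why a heuristic suffices, here is a deliberately simple made-up example, not one of C P's actual load balancers (which included techniques such as recursive bisection and simulated annealing): a greedy longest-task-first rule runs in polynomial time and, while it carries no guarantee of the exact optimum, it typically produces a well-balanced assignment of tasks to nodes.

```python
import heapq

def greedy_balance(task_costs, n_nodes):
    """Assign tasks to nodes, longest first, always to the currently
    least-loaded node.  Polynomial time; a good (not provably optimal)
    minimum of the maximum node load."""
    loads = [(0.0, node) for node in range(n_nodes)]
    heapq.heapify(loads)
    assignment = [None] * len(task_costs)
    for t in sorted(range(len(task_costs)),
                    key=lambda i: -task_costs[i]):
        load, node = heapq.heappop(loads)      # least-loaded node so far
        assignment[t] = node
        heapq.heappush(loads, (load + task_costs[t], node))
    return assignment

costs = [7.0, 5.0, 4.0, 3.0, 3.0, 2.0]
a = greedy_balance(costs, 2)
per_node = [sum(c for c, n in zip(costs, a) if n == node) for node in (0, 1)]
```

On this toy input the greedy rule happens to find a perfect 12/12 split of the 24 units of work; in general it only promises a balance close to optimal, which is exactly the trade-off the text describes.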
In Chapter 12 , we describe the most complex irregular loosely synchronous problems which include some of the hardest problems tackled in C P.
As described earlier, we implemented essentially all the applications described in the book using explicit user-generated message passing. In Chapter 13 , we describe our initial efforts to produce a higher level data-parallel Fortran environment, which should be able to provide a more attractive software environment for the user. High Performance Fortran has been adopted as an informal industry standard for this language.
In Chapter 14 , we describe the very difficult asynchronous problem class, for which scaling parallel algorithms and the correct software model are less clear. Chapters 15 , 16 , and 17 describe four software models, Zipcode, MOOSE, Time Warp, and MOVIE, which tackle asynchronous problems and the mixtures of asynchronous and loosely synchronous problems one finds in the complex system simulations and analyses typical of many real-world problems. Applications of this class are described in Chapter 18 , with the application of Section 18.3 being an event-driven simulation, an important class of difficult-to-parallelize asynchronous applications.
In Chapter 19 we look to the future and describe some possibilities for the use of parallel computers in industry. Here we note that C P, and much of the national enterprise, concentrated on scientific and engineering computations. The examples and ``proof'' that parallel computing works are focussed in this book on such problems. However, these will not be the dominant industrial use of parallel computers; information processing will be most important. Parallel computers will be used for decision support in the military and large corporations, and to supply video, information, and simulation ``on demand'' for homes, schools, and other institutions. Such applications have recently been termed national challenges, to distinguish them from the large-scale grand challenges which underpinned the initial HPCC initiative [ FCCSET:94a ]. The lessons C P and others have learnt from scientific computations will have general applicability across this wide range of industrial problems.
Chapter 20 includes a discussion of education in computational science-an unexpected byproduct of C P-and other retrospective remarks. The appendices list the C P reports including those not cited directly in this book. Some information is available electronically by mailing citlib@caltech.edu.
This chapter surveys activities related to parallel computing that took place around the time that C P was an active project, primarily during the 1980s. The major areas that are covered are hardware, software, research projects, and production uses of parallel computers. In each case, there is no attempt to present a comprehensive list or survey of all the work that was done in that area. Rather, the attempt is to identify some of the major events during the period.
There are two major motivations for creating and using parallel computer architectures. The first is that, as surveyed in Section 1.2 , parallelism is the only avenue to vastly higher speeds than are possible now from a single processor. This was the primary motivation for initiating C P. Table 2.1 demonstrates dramatically the rather slow increase in speed of single-processor systems for one particular brand of supercomputer, the CRAY, the most popular supercomputer line in the world. Figure 2.1 (Color Plate) shows a more comprehensive sample of computer performance, measured in operations per second, from the 1940s extrapolated through the year 2000.
Figure 2.1:
Historical trends of peak computer
performance. In some cases, we have scaled up parallel performance to
correspond to a configuration that would cost approximately $20 million.
A second motivation for the use of parallel architectures is that they should be considerably cheaper than sequential machines for systems of moderate speed; that is, for a given performance level, a parallel minicomputer or mini-supercomputer should be cheaper to produce than the equivalent sequential system.
At the beginning of the 1980s, the goals of research in parallel computer architectures were to achieve much higher speeds than could be obtained from sequential architectures and to get much better price performance through the use of parallelism than would be possible from sequential machines.
There were parallel computers before 1980, but they did not have a widespread impact on scientific computing. The activities of the 1980s had a much more dramatic effect. Still, a few systems stand out as having made significant contributions that were taken advantage of in the 1980s. The first is the Illiac IV [ Hockney:81b ]. It did not seem like a significant advance to many people at the time, perhaps because its performance was only moderate, it was difficult to program, and it had low reliability. The best performance achieved was two to six times the speed of a CDC 7600; this was obtained on various computational fluid dynamics codes. For many other programs, however, the performance was lower than that of a CDC 7600, which was the supercomputer of choice during the early and mid-1970s. The Illiac was a research project, not a commercial product, and it was reputed to be so expensive that it was not realistic for others to replicate it. While the Illiac IV did not inspire the masses to become interested in parallel computing, hundreds of people were involved in its use and in projects related to providing better software tools and better programming languages for it. They first learned how to do parallel computing on the Illiac IV, and many of them went on to make significant contributions to parallel computing in the 1980s.
The Illiac was an SIMD (single-instruction, multiple-data) computer. It had 64 processing elements, each of which was a processor with its own local memory; the processors were connected in a two-dimensional array with wraparound. High-level languages such as Glypnir and Fortran were available for the Illiac IV. Glypnir was an ALGOL-like language with extensions for parallel and array processing.
The ICL Distributed Array Processor (DAP) [ DAP:79a ] was a commercial product; a handful of machines were sold, mainly in England, where it was designed and built. It had either 1K or 4K one-bit processors arranged in a square plane, each connected to its nearest neighbors in a rectangular grid. Like the Illiac IV, it was an SIMD system. It required an ICL mainframe as a front end. The ICL DAP was used for many university science applications. The University of Edinburgh, in particular, used it for a number of real computations in physics, chemistry, and other fields [Wallace:84a;87a]. The ICL DAP had a substantial impact on scientific computing culture, primarily in Europe. ICL did try to market it in the United States, but was never effective in doing so; the requirement for an expensive ICL mainframe as a host was a substantial negative factor.
A third important commercial parallel computer in the 1970s was the Goodyear Massively Parallel Processor (MPP) [ Batcher:85a ], [Karplus:87a, pp. 157-166]. Goodyear started building SIMD computers in 1969, but all except the MPP were sold to the military and to the Federal Aviation Administration for air traffic control. In the late 1970s, Goodyear produced the MPP, which was installed at Goddard Space Flight Center, a NASA research center, and used for a variety of scientific applications. The MPP attracted attention because it did achieve high speeds on a few applications, speeds that, in the late 1970s and early 1980s, were remarkable: hundreds of MFLOPS in a few cases. The MPP had 16K one-bit processors, each with local memory, and was programmed in Pascal and Assembler.
In summary, the three significant scientific parallel computers of the 1970s were the Illiac IV, the ICL DAP, and the Goodyear MPP. All were SIMD computers. The DAP and the MPP were fine-grain systems based on single-bit processors, whereas the Illiac IV was a large-grain SIMD system. The other truly significant high-performance (but not parallel) computer of the 1970s was the CRAY 1, which was introduced in 1976. The CRAY 1 was a single-processor vector computer and, as such, can also be classified as a special kind of SIMD computer because it had vector instructions: a single vector instruction causes up to 64 data pairs to be operated on.
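The effect of a vector instruction can be pictured with a small strip-mined loop. The sketch below is illustrative Python, obviously not CRAY hardware: each strip of up to 64 element pairs stands in for what a single vector instruction would do in one issue.

```python
VL = 64   # vector register length, as on the CRAY 1

def vector_add(a, b):
    """Add two equal-length sequences in strips of up to VL pairs;
    each strip models one vector instruction acting on a whole
    vector register."""
    assert len(a) == len(b)
    out = []
    for start in range(0, len(a), VL):          # one "instruction" per strip
        strip_a = a[start:start + VL]
        strip_b = b[start:start + VL]
        out.extend(x + y for x, y in zip(strip_a, strip_b))
    return out

# 100 pairs need two vector instructions: one of 64 pairs, one of 36
assert vector_add(list(range(100)), [1] * 100) == list(range(1, 101))
```

This is why a single-processor vector machine can be viewed as a special kind of SIMD computer: one instruction, many data pairs.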
There were significant and seminal activities in parallel computing in the 1970s both from the standpoint of design and construction of systems and in the actual scientific use of the systems. However, the level of activity of parallel computing in the 1970s was modest compared to what followed in the 1980s.
In contrast to the 1970s, in the early 1980s it was MIMD (multiple instruction, multiple data) computers that dominated the activity in parallel computing. The first of these was the Denelcor Heterogeneous Element Processor (HEP). The HEP attracted widespread attention, despite its terrible cost performance, because of its many interesting hardware features that facilitated programming. The Denelcor HEP was acquired by several institutions, including Los Alamos, Argonne National Laboratory, Ballistic Research Laboratory, and Messerschmidt in Germany. Messerschmidt was the only installation that used it for real applications. The others, however, used it extensively for research on parallel algorithms. The HEP hardware supported both fine-grain and large-grain parallelism. Any one processor had an instruction pipeline that provided parallelism at the single-instruction level. Instructions from separate processes (associated with separate user programs or tasks) were put into hardware queues and scheduled for execution once the required operands had been fetched from memory into registers, again under hardware control. Instructions from up to 128 processes could share the instruction execution pipeline. The pipeline had eight stages; all instructions except floating-point divide took eight machine cycles to execute. Up to 16 processors could be linked to perform large-grain MIMD computations. The HEP had an extremely efficient synchronization mechanism through a full-empty bit associated with every word of memory. The bit was set automatically when a word was written and could be cleared when the word was read, so a process could wait for a memory location to be filled (or emptied) before using it; the value of the full-empty bit could be checked in one machine cycle. Fortran, C, and Assembler could be used to program the HEP. It had a UNIX environment and was front-ended by a minicomputer.
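The full-empty mechanism behaves like a one-word blocking channel. The Python sketch below is a hypothetical model of that semantics, with a condition variable standing in for the HEP's single-cycle hardware check: a write blocks while the word is full, and a read consumes the value and marks the word empty again.

```python
import threading

class FullEmptyWord:
    """Model of one memory word with a full-empty bit."""
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        """Block while the word is full, then store and mark it full."""
        with self._cv:
            while self._full:
                self._cv.wait()
            self._value, self._full = value, True
            self._cv.notify_all()

    def read(self):
        """Block until the word is full, then consume it (mark empty)."""
        with self._cv:
            while not self._full:
                self._cv.wait()
            self._full = False
            self._cv.notify_all()
            return self._value

# a consumer thread blocks on read() until the producer writes
w = FullEmptyWord()
out = []
t = threading.Thread(target=lambda: out.append(w.read()))
t.start()
w.write(42)
t.join()
```

One word thus gives producer-consumer handoff with no extra locks, which is why the hardware version was such a cheap synchronization primitive.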
Because Los Alamos and Argonne made their HEPs available for research purposes to people who were interested in learning how to program parallel machines or who were involved in parallel algorithm research, hundreds of people became familiar with parallel computing through the Denelcor HEP [ Laksh:85a ].
A second computer that was important in the early 1980s, primarily because it exposed a large number of computational scientists to parallelism, was the CRAY X-MP/22, which was introduced in 1982. Certainly, it had limited parallelism, namely only two processors; still, it was a parallel computer. Since it was at the very high end of performance, it exposed the hardcore scientific users to parallelism, although initially mostly in a negative way. There was not enough payoff in speed or cost to compensate for the effort that was required to parallelize a program so that it would use both processors: the maximum speedup would, of course, only be two. Typically, it was less than two, and the charging algorithms of most computer centers generated higher charges for a program when it used both processors than when it used only one. In a way, though, the CRAY X-MP multiprocessor legitimized parallel processing, although restricted to very large grain size and very small numbers of processors. A few years later, the IBM 3090 series had the same effect; the 3090 can have up to six vector and scalar processors in one system. Memory is shared among all processors.
Another MIMD system that was influential during the early 1980s was the New York University Ultracomputer [ Gottlieb:86a ] and a related system, the IBM RP3 [ Brochard:92a ], [ Brochard:92b ], [ Darema:87a ], [ Pfister:85a ]. These systems were serious attempts to design and demonstrate a shared-memory architecture that was scalable to very large numbers of processors. They featured an interconnection network between processors and memories that would avoid hot spots and congestion. The fetch-and-add instruction that was invented by Jacob Schwartz [ Schwartz:80a ] would avoid some of the congestion problems in omega networks. Unfortunately, these systems took a great deal of time to construct and it was the late 1980s before the IBM RP3 existed in a usable fashion. At that time, it had 64 processors but each was so slow that it attracted comparatively little attention. The architecture is certainly still considered to be an interesting one, but far fewer users were exposed to these systems than to other designs that were constructed more quickly and put in places that allowed a large number of users to have at least limited access to the systems for experimentation. Thus, the importance of the Ultracomputer and RP3 projects lay mainly in the concepts.
Perhaps the most significant and influential parallel computer system of the early 1980s was the Caltech Cosmic Cube [ Seitz:85a ], developed by Charles Seitz and Geoffrey Fox. Since it was the inspiration for the C P project, we describe it and its immediate successors in some detail [Fox:87d;88oo], [ Seitz:85a ].
The hypercube work at Caltech originated in May 1981 when, as described in Chapter 1 , Fox attended a seminar by Carver Mead on VLSI and its implications for concurrency. As described in more detail in Sections 4.1 and 4.3 , Fox realized that he could use parallel computers for the lattice gauge computations that were central to his research at the time and that his group was running on a VAX 11/780. During the summer of 1981, he and his students worked out an algorithm that he thought would be parallel and tried it out on his VAX (simulating parallelism). The natural architecture for the problems he wanted to compute was determined to be a three-dimensional grid; at four nodes on a side, this comes to 64 processors (Figure 4.3 ).
In the fall of 1981 Fox approached Chuck Seitz about building a suitable computer. After an exchange of seminars, Seitz showed great interest in doing so and had funds to build a hypercube. Given Fox's problem, a six-dimensional hypercube (64 processors) was set as the target. Memory size of 128K was chosen after some debate; applications people (chiefly Fox) wanted at least that much. A trade-off was made between the number of processors and memory size. A smaller cube would have been built if larger memory had been chosen.
From the outset a key goal was to produce an architecture with interprocessor communications that would scale well to a very large number of processors. The features that led to the choice of the hypercube topology specifically were the moderate growth in the number of channels required as the number of processors increases, and the good match between processor and memory speeds because memory is local.
The Intel 8086 was chosen because it was the only microprocessor available at the time with a floating-point co-processor, the 8087. First, a prototype 4-node system was built with wirewrap boards. It was designed, assembled, and tested during the winter of 1981-82. In the spring of 1982, message-passing software was designed and implemented on the 4-node. Eugene Brooks' proposal of simple send/receive routines was chosen and came to be known as the Crystalline Operating System (CrOS), although it was never really an operating system.
In autumn of 1982, simple lattice problems were implemented on the 4-node by Steve Otto and others. CrOS and the computational algorithm worked satisfactorily. By January 1983, Otto had the lattice gauge applications program running on the 4-node. Table 4.2 details the many projects and publications stemming from this pioneering work.
With the successful experience on the 4-node, Seitz proceeded to have printed circuit boards designed and built. The 64-node Cosmic Cube was assembled over the summer of 1983 and began operation in October 1983. It has been in use ever since, although currently it is lightly used.
The key characteristics of the Cosmic Cube are that it has 64 nodes, each with an 8086/8087 and 128K of memory, and communication channels with 2 Mbits/sec peak speed between nodes (about 0.8 Mbits/sec sustained in one direction). It is five feet long, six cubic feet in volume, and draws 700 watts.
The Cosmic Cube provided a dramatic demonstration that multicomputers could be built quickly, cheaply, and reliably. In terms of reliability, for example, there were two hard failures in the first 560,000 node hours of operation-that is, during the first year of operation. Its performance was low by today's standards, but it was still between five and ten times the performance of a DEC VAX 11/780, which was the system of choice for academic computer departments and research groups in that time period. The manufacturing cost of the system was $80,000, which at that time was about half the cost of a VAX with a modest configuration. Therefore, the price performance was on the order of 10 to 20 times better than a VAX 780. This estimate does not take into account design and software development costs; on the other hand, it was a one-of-a-kind system, so manufacturing costs were higher than for a commercial product. Furthermore, it was clearly a scalable architecture, and that is perhaps the most important feature of that particular project.
In the period from October, 1983 to April, 1984 a 2500-hour run of a QCD problem (Table 4.1 ) was completed; it achieved 95% efficiency and produced new scientific results. This demonstrated that hypercubes are well-suited for QCD (as are other architectures).
As described in Section 1.3 , during the fall of 1982 Fox surveyed many colleagues at Caltech to determine whether they needed large-scale computation in their research and began to examine those applications for suitability to run on parallel computers. Note that this was before the 64-node Cosmic Cube was finished, but after the 4-node gave evidence that the approach was sound. The Caltech Concurrent Computation Program (C P) was formed in Autumn of 1982. A decision was made to develop big, fast hypercubes rather than rely on Crays. By the summer of 1984, the ten applications of Table 4.2 were running on the Cosmic Cube [ Fox:87d ].
Two key shortcomings that were soon noticed were that too much time was spent in communications and that high-speed external I/O was not available. The first was thought to be addressable with custom communication chips.
In the summer of 1983, Fox teamed with Caltech's Jet Propulsion Laboratory (JPL) to build bigger and better hypercubes. The first was the Mark II, still based on the 8086/8087 (no co-processor faster than the 8087 was yet available), but with more memory, faster communications, and twice as many nodes. The first 32 nodes began operating in September, 1984. Four 32-node systems and one 128-node were built. The latter was completed in June, 1985 [ Fox:88oo ].
The Caltech project inspired several industrial companies to build commercial hypercubes. These included Intel, nCUBE [ Karplus:87a ], Ametek [ Seitz:88b ], and Floating Point Systems Corporation. Only two years after the 64-node Caltech Cosmic Cube was operational, there were commercial products on the market and installed at user sites.
With the next Caltech-JPL system, the Mark III, there was a switch to the Motorola family of microprocessors. On each node the Mark III had one Motorola 68020/68881 for computation and another 68020 for communications. The two processors shared the node memory. The first 32-node Mark III was operational in April, 1986. The concept of dedicating a processor to communications has influenced commercial product design, including recently introduced systems.
In the spring of 1986, a decision was made to build a variant of the Mark III, the Mark IIIfp (originally dubbed the Mark IIIe). It was designed to compete head-on with ``real'' supercomputers. The Mark IIIfp has a daughter board at each node carrying the Weitek XL floating-point chip set, which raised each node's peak floating-point speed substantially. By January 1987, an 8-node Mark IIIfp was operational. A 128-node system was built and in the spring of 1989 demonstrated high performance on two applications.
In summary, the hypercube family of computers enjoyed rapid development and was used for scientific applications from the beginning. In the period from 1982 to 1987, three generations of the family were designed, built, and put into use at Caltech. The third generation (the Mark III) even included a switch of microprocessor families. Within the same five years, four commercial vendors produced and delivered computers with hypercube architectures. By 1987, approximately 50 major applications had been completed on Caltech hypercubes. Such rapid development and adoption has few if any parallels. The remaining chapters of this book are largely devoted to lessons from these applications and their followers.
During this period, many new systems were launched by commercial companies, and several were quite successful in terms of sales. The two most successful were the Sequent and the Encore [Karplus:87a, pp. 111-126] products. Both were shared-memory, bus-connected multiprocessors of moderate parallelism. The maximum number of processors on the Encore product was 20; on the Sequent machine initially 16 and later 30. Both provided an extremely stable UNIX environment and were excellent for time-sharing. As such, they could be considered VAX killers since VAXes were the time-sharing system of choice in research groups in those days. The Sequent and the Encore provided perhaps a cost performance better by a factor of 10, as well as considerably higher total performance than could be obtained on a VAX at that time. These systems were particularly useful for smaller jobs, for time-sharing, and for learning to do parallel computing. Perhaps their most impressive aspect was the reliability of both hardware and software. They operated without interruption for months at a time, just as conventional mini-supercomputers did. Their UNIX operating system software was familiar to many users and, as mentioned before, very stable. Unlike most parallel computers whose system software requires years to mature, these systems had very stable and responsive system software from the beginning.
Another important system during this period was the Alliant [Karplus:87a, pp. 35-44]. The initial model featured up to eight vector processors, each of moderate performance. But when used simultaneously, they provided performance equivalent to a sizable fraction of a CRAY processor. A unique feature at the time was a Fortran compiler that was quite good at automatic vectorization and also reasonably good at parallelization. These compiler features, coupled with its shared memory, made this system relatively easy to use and to achieve reasonably good performance. The Alliant also supported the C language, although initially there was no vectorization or parallelization available in C. The operating system was UNIX-based. Because of its reasonably high floating-point performance and ease of use, the Alliant was one of the first parallel computers that was used for real applications. The Alliant was purchased by groups who wanted to do medium-sized computations and even computations they would normally do on CRAYs. This system was also used as a building block of the Cedar architecture project led by D. Kuck [ Kuck:86a ].
Advances in compiling technology made wide-instruction word machines an interesting and, for a few years, commercially viable architecture. The Multiflow and Cydrome systems both had compilers that effectively exploited very fine-grain parallelism and scheduling of floating-point pipelines within the processing units. Both these systems attempted to get parallelism at the instruction level from Fortran programs-the so-called dusty decks that might have convoluted logic and thus be very difficult to vectorize or parallelize in a large-grain sense. The price performance of these systems was their main attraction. On the other hand, because these systems did not scale to very high levels of performance, they were relegated to the super-minicomputer arena. An important contribution they made was to show dramatically how far compiler technology had come in certain areas.
As was mentioned earlier, hypercubes were produced by Intel, nCUBE, Ametek, and Floating Point Systems Corporation in the mid-1980s. Of these, the most significant product was the nCUBE with its high degree of integration and a configuration of up to 1024 nodes [ Palmer:86a ], [ nCUBE:87a ]. It was pivotal in demonstrating that massively parallel MIMD medium-grain computers are practical. The nCUBE featured a complete processor on a single chip, including all channels for connecting to the other nodes so that one chip plus six memory chips constituted an entire node. They were packaged on boards with 64 nodes so that the system was extremely compact, air-cooled, and reliable. Caltech had an early 512-node system, which was used in many C P calculations, and soon afterwards Sandia National Laboratories installed the 1024-node system. A great deal of scientific work was done on those two machines and they are still in use. The 1024-node Sandia machine got the world's attention by demonstrating speedups of 1000 for several applications [ Gustafson:88a ]. This was particularly significant because it was done during a period of active debate as to whether MIMD systems could provide speedups of more than a hundred. Amdahl's law [ Amdahl:67a ] was cited as a reason why it would not be possible to get speedups greater than perhaps a hundred, even if one used 1000 processors.
Towards the end of the mid-1980s, transputer-based systems [ Barron:83a ], [ Hey:88a ], both large and small, began to proliferate, especially in Europe but also in the United States. The T800 transputer was, like the nCUBE processor, a single-chip system with built-in communications channels, and it had respectable floating-point performance for its time, frequently sustaining an appreciable fraction of its peak speed. Transputers provided a convenient building block for parallel systems and were quite cost-effective. Their prevalent use at the time was in boards with four or eight transputers that were attached to IBM PCs, VAXes, or other workstations.
By the late 1980s, truly powerful parallel systems began to appear. The Meiko system at Edinburgh University is an example; by 1989, that computer had 400 T800s [ Wallace:88a ]. The system was being used for a number of traditional scientific computations in physics, chemistry, engineering, and other areas [ Wexler:89a ]. The system software for transputer-based systems had evolved to resemble the message-passing system software available on hypercubes. Although the transputer's two-dimensional mesh connection is in principle less efficient than hypercube connections, for systems of moderate size (only a few hundred processors), the difference is not significant for most applications. Further, any parallel architecture deficiencies were counterbalanced by the transputer's excellent communication channel performance.
Three new SIMD fine-grain systems were introduced in the late 1980s: the CM-2, the MasPar, and a new version of the DAP. The CM-2 is a version of the original Connection Machine [Hillis:85a;87a] that has been enhanced with Weitek floating-point units, one for each 32 single-bit processors, and optional large memory. In its largest configuration, such as is installed at Los Alamos National Laboratory, there are 64K single-bit processors and 2048 64-bit floating-point processors. The CM-2 has been measured running the unlimited Linpack benchmark, solving a linear system of order 26,624, and achieves even higher performance on some applications, e.g., seismic data processing [ Myczkowski:91a ] and QCD [ Brickner:91b ], [ Liu:91a ]. It has attracted widespread attention both because of its extremely high performance and its relative ease of use [ Boghosian:90a ], [Hillis:86a;87b]. For problems that are naturally data parallel, the CM Fortran language and compiler provide a relatively easy way to implement programs and get high performance.
The MasPar and the DAP are smaller systems that are aimed more at excellent price performance than at supercomputer levels of performance. The new DAP is front-ended by Sun workstations or VAXes. This makes it much more affordable and compatible with modern computing environments than when it required an ICL front end. DAPs have been built in ruggedized versions that can be put into vehicles, flown in airplanes, and used on ships, and have found many uses in signal processing and military applications. They are also used for general scientific work. The MasPar is the newest SIMD system. Its architecture constitutes an evolutionary approach of fine-grain SIMD combined with enhanced floating-point performance coming from the use of 4-bit (MasPar MP-1) or 32-bit (MasPar MP-2) basic SIMD units. Standard 64-bit floating-point algorithms implemented on a (SIMD) machine built around an l-bit CPU take time of order 64/l machine instructions. The DAP and CM-1,2 used l=1; the CM-2 and later DAP models achieve floating-point performance with special extra hardware rather than by increasing l .
Two hypercubes became available just as the decade ended: the second-generation nCUBE, popularly known as the nCUBE-2, and the Intel iPSC/860. The nCUBE-2 can be configured with up to 8K nodes, giving a very high peak speed in the full configuration. Each processor is still on a single chip along with all the communications channels, but it is about eight times faster than its predecessor. Communication bandwidth is also a factor of eight higher. The result is a potentially very powerful system. The nCUBE-2 has a custom microprocessor that is instruction-compatible with the first-generation system. The largest system known to have been built to date is a 1024-node system installed at Sandia National Laboratories. This system was measured on the unlimited-size Linpack benchmark, solving a linear system of order 21,376.
The second hypercube introduced in 1989 (and first shipped to a customer, Oak Ridge, in January 1990) was the Intel iPSC/860. While the communication speed between nodes is very low compared to the speed of the i860 processor, high speeds can be achieved for problems that do not require extensive communication or when the data movement is planned carefully. For example, the unlimited-size Linpack benchmark on the largest iPSC/860 configuration, 128 processors, solved a system of order 8,600.
The iPSC/860 uses the Intel i860 microprocessor, which has a peak speed of 60 MFLOPS at full (64-bit) precision and 80 MFLOPS at 32-bit precision. In mid-1991, a follow-on to the Intel iPSC/860, the Intel Touchstone Delta System, reached a very high Linpack speed solving a system of order 25,000. This was done on 512 i860 nodes of the Delta System. The Delta is a one-of-a-kind system built for a consortium of institutions and installed at the California Institute of Technology. Although the C P project is finished at Caltech, many C P applications have very successfully used the Delta. The Delta uses a two-dimensional mesh connection scheme with mesh routing chips instead of a hypercube connection scheme. The Intel Paragon, a commercial product that is the successor to the iPSC/860 and the Touchstone Delta, became available in the fall of 1992. The Paragon has the same connection scheme as the Delta. Its maximum configuration is 4096 nodes, and it uses a second-generation version of the i860 microprocessor.
The BBN TC2000 is another important system introduced in the late 1980s. It provides a shared-memory programming environment supported by hardware. It uses a multistage switch based on crossbars that connect processor-memory pairs to each other [Karplus:87a, pp. 137-146], [ BBN:87a ]. The BBN TC2000 uses Motorola 88000 Series processors. The relative costs of access to data in cache, in a processor's local memory, and in another processor's memory are approximately in the ratio 1:3:7. Therefore, there is a noticeable but not prohibitive penalty for using another processor's memory. The architecture is scalable to over 500 processors, although none was built of that size. Each processor can have a substantial amount of memory, and the operating system environment is considered attractive. This system is one of the few commercial shared-memory MIMD computers that can scale to large numbers of nodes. It is no longer in production; the BBN Corporation terminated its parallel computer activities in 1991.
By the late 1980s, several highly parallel systems were able to achieve high levels of performance-the Connection Machine Model CM-2, the Intel iPSC/860, the nCUBE-2, and, early in the decade of the '90s, the Intel Touchstone Delta System. The peak speeds of these systems are quite high and, at least for some applications, the speeds achieved are also high, exceeding those achieved on vector supercomputers. The fastest CRAY system until 1992 was a CRAY Y-MP with eight processors. In contrast, the Connection Machine Model CM-2 and the Intel Delta have exceeded its measured application performance on some real applications [ Brickner:89b ], [ Messina:92a ], [Mihaly:92a;92b]. There are some new Japanese vector supercomputers with a small number of processors (but a large number of instruction pipelines) that have very high peak speeds.
Finally, the vector computers continued to become faster and to have more processors. For example, the CRAY Y-MP C-90 introduced in 1992 has sixteen processors, twice as many as the Y-MP, and a correspondingly higher peak speed.
By 1992, parallel computers were substantially faster; the Intel Paragon noted above is one example. The CM-5, an MIMD computer introduced by Thinking Machines Corporation in 1992, has a maximum configuration of 16K processors. The largest system at this writing is a 1024-node configuration in use at Los Alamos National Laboratory.
New introductions continued: Fall 1992 saw Fujitsu (Japan) and Meiko (U.K.) introduce distributed-memory parallel machines with a high-performance node featuring a vector unit-in each case, a different VLSI implementation of the node of Fujitsu's high-end vector supercomputer. 1993 saw major Cray and Convex systems built around Digital and HP RISC microprocessor nodes, respectively.
Recently, there has been an interesting new class of architecture: a distributed-memory design supported by special hardware that presents the appearance of shared memory to the user. The goal is to combine the cost effectiveness of distributed memory with the programmability of a shared-memory architecture. There are two major university projects: DASH at Stanford [ Hennessy:93a ], [ Lenoski:89a ] and Alewife [ Agarwal:91a ] at MIT. The first commercial machine, the Kendall Square KSR-1, was delivered to customers in Fall, 1991. A high-performance ring supports the apparent shared memory, which is essentially a distributed dynamic cache. Rings of up to 32 nodes can be joined hierarchically into a full-size, 1024-node system. Burton Smith, the architect of the pioneering Denelcor HEP-1, has formed Tera Computer, whose machine has a virtual shared memory and other innovative features. The direction of parallel computing research could be profoundly affected if this architecture proves successful.
In summary, the 1980s saw an incredible level of activity in parallel computing, much greater than most people would have predicted. Even those projects that in a sense failed-that is, that were not commercially successful or, in the case of research projects, failed to produce an interesting prototype in a timely fashion-were nonetheless useful in that they exposed many people to parallel computing at universities, computer vendors, and (as outlined in Chapter 19 ) commercial companies such as Xerox, DuPont, General Motors, United Technologies, and aerospace and oil companies.
As a gross generalization of the situation in the 1980s, there was good software on low- and medium-performance systems such as the Alliant, Sequent, Encore, and Multiflow systems (uninteresting to those preoccupied with the highest performance levels), while the software on the highest-performance systems was of poor quality. In addition, there is little or no software aimed at managing the system and providing a service to a diverse user community. There is typically no software that provides information on who uses the system and how much-that is, accounting and reporting software. Batch schedulers are typically not available. Controls for limiting the amount of time interactive users can take on the system at any one time are also missing. Ways of managing the on-line disks are non-existent. In short, the system software provided with the high-performance parallel computers is at best suitable for systems used by a single person or a small, tightly knit group of people.
In contrast, in the area of computer languages and compilers for those languages for parallel machines, there has been a significant amount of progress, especially in the late 1980s, for example [ AMT:87a ]. In February of 1984, the Argonne Workshop on Programming the Next Generation of Supercomputers was held in Albuquerque, New Mexico [ Smith:84a ]. It addressed topics such as parallel language constructs, compilers, and standardization.
Many people came to the workshop and showed high levels of interest, including leading computer vendors, but not very much happened in terms of real actions by compiler writers or standards-making groups. By the late 1980s, the situation had changed. Now the Parallel Computing Forum is healthy and well attended by vendor representatives. The Parallel Computing Forum was formed to develop a shared-memory multiprocessor model for parallel processing and to establish language standards for that model beginning with Fortran and C. In addition, the ANSI Standards Committee X3 formed a new technical committee, X3H5, named Parallel Processing Constructs for High Level Programming Languages. This technical committee will work on a model based upon standard practice in shared memory parallel processing. Extensions for message-passing-based parallel processing are outside the scope of the model under consideration at this time. The first meeting of X3H5 was held March 23, 1990.
Finally, there are efforts under way to standardize language issues for parallel computing, at least for certain programming models. In the meantime, there has been progress in compiler technology. The compilers provided with Alliant and Multiflow machines, before those companies went out of business, could be quite good at producing efficient code for each processor and relatively good at automatic parallelization. On the other hand, compilers for the processors used on multicomputers generally produce inefficient code for the floating-point hardware. Generally, these compilers do not perform even the standard optimizations that have nothing to do with fancy instruction scheduling, nor do they do any automatic parallelization for distributed-memory computers. Automatic parallelization for distributed-memory as well as shared-memory systems is a difficult task, and it is clear that it will be a few more years before good compilers exist for it; meanwhile, it is a shame that so little effort is invested in producing efficient code for single processors. There are known compilation techniques that would deliver a much greater percentage of the peak speed on commonly used microprocessors than the existing compilers achieve.
As for languages, despite much work and interest in new languages, in most cases people still use Fortran or C with minor additions or calls to system subroutines. The language known as Connection Machine Fortran or CM-Fortran is, as discussed in Section 13.1 , an important exception. It is, of course, based largely on the array extensions of Fortran 90, but is not identical to that. One might note that CM-Fortran array extensions are also remarkably similar to those defined in the Department of Energy Language Working Group Fortran effort of the early 1980s [ Wetherell:82a ]. Fortran 90 itself was influenced by the LWG Fortran; in the early and mid-1980s, there were regular and frequent interactions between the DOE Language Working Group and the Fortran Standards Committee. A recent variant of Fortran 90 designed for distributed-memory systems is Fortran 90D [ Fox:91e ], which, as described in Chapter 13 , is the basis of an informal industry standard for data-parallel Fortran-High Performance Fortran (HPF) [ Kennedy:93a ]. HPF has attracted a great deal of attention from both users and computer vendors and it is likely to become a de facto standard in one or two years. The time for such a language must have finally come. The Fortran developments are mirrored by those in other languages, with C and, in particular, C++ receiving the most attention. Among many important projects, we select pC++ at Indiana University [ Bodin:91a ], which extends C++ so that it incorporates essentially the HPF parallel constructs. Further, C++ allows one to define more general data structures than the Fortran array; correspondingly pC++ supports general parallel collections.
Other languages that have seen some use include Linda [ Gelertner:89a ], [ Ahuja:86a ], and Strand [ Foster:90a ]. Linda has been particularly successful as a coordination language, allowing one to link the many individual components of what we term metaproblems -a concept developed throughout this book and particularly in Chapters 3 and 18 . A more recent language effort is Program Composition Notation (PCN) that is being developed at the Center for Research on Parallel Computation (an NSF Science and Technology Center) [ Chandy:90a ]. PCN is a parallel programming language in its own right, but additionally has the feature that one can take existing Fortran and C functions and subprograms and use them through PCN as part of a PCN parallel program. PCN is in some ways similar to Strand, which is a dataflow-oriented logic language in the flavor of Prolog. PCN has been extended to CC++ [ Chandy:92a ] (Compositional C++), supporting general functional parallelism. Chandy reports that users familiar with existing languages found the novel syntax of PCN uncomfortable. This motivated his group to embody the PCN ideas in widely used languages with CC++ for C and C++ (sequential) users, and Fortran-M for Fortran users. The combination of CC++ and data parallel pC++ is termed HPC++ and this is an attractive candidate for the software model that could support general metaproblems . The requirements and needs for such software models will become clear from the discussion in this book, and are summarized in Section 18.1.3 .
Substantial efforts have been put into developing tools that facilitate parallel programming, both in shared-memory and distributed-memory systems, e.g., [Clarke:91a], [Sunderam:90a], [Whiteside:88a]. For shared-memory systems, for example, there are SCHEDULE [Hanson:90a], MONMACS, and FORCE. MONMACS and FORCE both provide higher level parallel programming constructs, such as barrier synchronization and DO ALL, that are useful in shared-memory environments. SCHEDULE provides a graphical interface for producing functionally decomposed programs for shared-memory systems. With SCHEDULE, one specifies a tree of calls to subroutines, and SCHEDULE facilitates and partially automates the creation of Fortran or C programs (augmented by appropriate system calls) that implement the call graphs. For distributed-memory environments, there are also several libraries and small operating systems that provide extensions to Fortran and C for programming such architectures. A subset of MONMACS falls into that camp. More widely used systems in this area include the Cosmic Environment Reactive Kernel [Seitz:88a] (see Chapter 16), Express [ParaSoft:88a] (discussed in detail in Chapter 5), and PICL [Sunderam:90a]. These systems provide message-passing routines, in some cases including routines that perform global operations on data, such as broadcast. They may also provide facilities for measuring performance or collecting data about message traffic, CPU utilization, and so on. Some debugging capabilities may also be provided. These are all general-purpose tools and programming environments, and have been used for a wide variety of applications, chiefly scientific and engineering, but also non-numerical ones.
In addition, there are many tools that are domain-specific in some sense. Examples include the Distributed Irregular Mesh Environment (DIME) by Roy Williams [Williams:88a;89b] (described in Chapter 10), and the parallel ELLPACK partial differential equation solver [Houstis:90a] and domain decomposer [Chrisochoides:91b;93a] developed by John Rice and his research group at Purdue. DIME is a programming environment for calculations with irregular meshes; it provides adaptive mesh refinement and dynamic load balancing. There are also some general-purpose tools and programming systems, such as Sisal from Livermore, which provides a dataflow-oriented language capability, and Parti [Saltz:87a;91b], [Berryman:91a], which facilitates, for example, array mappings on distributed-memory machines. Load-balancing tools are described in Chapter 11; although they look very promising, they have yet to be packaged in a robust form for general users.
None of the general-purpose tools has emerged as a clear leader. Perhaps there is still a need for more research and experimentation with such systems.
There was remarkable progress during the 1980s in most areas related to high-performance computing in general and parallel computing in particular. There are now substantial numbers of people who use parallel computers to get real applications work done, in addition to many people who have developed and are developing new algorithms, new operating systems, new languages, and new programming paradigms and software tools for massively parallel and other high-performance computer systems. It was during this decade, especially in the last half, that there was a very quick transition towards identifying high-performance computing strongly with massively parallel computing. In the early part of the decade, only large, vector-oriented systems were used for high-performance computing. By the end of the decade, while most such work was still being done on vector systems, some of the leading-edge work was already being done on parallel systems. This includes work at universities and research laboratories, as well as in industrial applications. By the end of the decade, oil companies, brokerage companies on Wall Street, and database users were all taking advantage of parallelism in addition to the traditional scientific and engineering fields. The C P efforts played an important role in advancing parallel hardware, software, and applications. As this chapter indicates, many other projects contributed to this advance as well.
There is still a frustrating neglect of certain areas in the design of parallel computer systems, including the ratio of internal computational speed to input/output speed, and the speed of communication between the processors in distributed-memory systems. Latency for both I/O and communication is still very high. Compilers are often still crude. Operating systems still lack stability and even the most fundamental system management tools. Nevertheless, much progress was made.
By the end of the 1980s, speeds higher than on any sequential computer had indeed been achieved on parallel computer systems, and this was done for a few real applications. In a few cases, the parallel systems even proved to be cheaper, that is, more cost-effective than sequential computers of equivalent power. This was despite a truly dramatic increase in the performance of sequential microprocessors, especially floating-point units, in the late 1980s. So both key objectives of parallel computing, the highest achievable speed and more cost-effective performance, were achieved and demonstrated in the 1980s.
Computing is a controversial field. In more traditional fields, such as mathematics and physics, there is usually general agreement on the key issues: which ideas and research projects are ``good,'' what the critical questions for the future are, and so on. There is no such agreement in computing, on either the academic or the industrial side. One simple reason is that the field is young, roughly forty years old. However, another important aspect is the multidisciplinary nature of the field. Hardware, software, and applications involve practitioners from very different academic fields, with different training, prejudices, and goals. Answering the question, ``Does and How Does Parallel Computing Work?''
requires ``Solving real problems with real software on real hardware''
and so certainly involves aspects of hardware, software, and applications. Thus, some sort of mix of disciplines seems essential in spite of the difficulties in combining several disciplines.
The Caltech Concurrent Computation Program attempted to cut through some of the controversy by adopting an interdisciplinary, rather than multidisciplinary, methodology. We can picture the latter as separate teams of experts, as shown in Figures 3.1 and 3.2, each of which tackles one component of the total project. In C P, we tried an integrated approach, illustrated in Figure 3.3. This did not supplant the traditional fields but rather augmented them with a group of researchers whose broad range of skills, to a greater or lesser degree, spanned those of the core areas underlying computing. We will return to these discipline issues in Chapter 20, but note here that the current discussion is simplistic and just designed to give context to the following analysis. The assignment of hardware to electrical engineering and software to computer science (with an underlying discipline of mathematics) is particularly idealized. Indeed, in many schools, these components are integrated. However, it is perhaps fair to say that experts in computer hardware have significantly different training and backgrounds from experts in computer software.
Figure 3.1:
The Multi-Disciplinary (Three-Team) Approach to Computing
Figure 3.2:
An Alternative (Four-Team) Multi-Disciplinary Approach to Computing
We believe that much of the success (and perhaps also the failures) of C P can be traced to its interdisciplinary nature. In this spirit, we will provide here a partial integration of the disparate contributions in this volume within a mathematical framework that links hardware, software, and applications. In this chapter, we describe the principles behind this integration, which are then amplified and exemplified in the following chapters; the integration is usually contained in the first, introductory section of each chapter. In Section 3.2, we define a general methodology for computation and propose that it be described as mappings between complex systems. The latter are only loosely defined, but several examples are given in Section 3.3, while relevant properties of complex systems are given in Sections 3.4 through 3.6. Section 3.7 describes mappings between different complex systems and how this allows one to classify software approaches. Section 3.8 uses this formalism to state our results and what we mean by ``parallel computing works.'' In particular, it offers the possibility of a quantitative approach to such questions as,
``Which parallel machine is suitable for which problems?'' and
``What software models are suitable for what problems on what machines?''
There is no agreed-upon process behind computation, that is, behind the use of a computer to solve a particular problem. We have tried to capture this process in Figures 3.1(b), 3.2(b), and 3.3, which show a problem being first numerically formulated and then mapped by software onto a computer.
Figure 3.3:
An Interdisciplinary Approach to Computing with Computational
Science Shown Shaded
Even if we could get agreement on such an ``underlying'' process, the definitions of the parts of the process are not precise, and correspondingly the roles of the ``experts'' are poorly defined. This underlies much of the controversy, and in particular explains why we cannot at present, and probably never will be able to, define ``the best software methodology for parallel computing.''
How can we agree on a solution (What is the appropriate software?) unless we can agree on the task it solves?
``What is computation and how can it be broken up into components?'' In other words, what is the underlying process?
In spite of our pessimism that there is any clean, precise answer to this question, progress can be made with an imperfect process defined for computation. In the earlier figures, it was stressed that software could be viewed as mapping problems onto computers. We can elaborate this as shown in Figure 3.4, with the solution to a question pictured as a sequence of idealizations or simplifications which are finally mapped onto the computer. This process is spelled out for four examples in Figures 3.5 and 3.6. In each case, we have tried to indicate possible labels for components of the process. However, this can only be approximate. We are not aware of an accepted definition of the theoretical or computational parts of the process. Again, which parts are modelling or simulation? Further, there is only an approximate division of responsibility among the various ``experts''; for example, between the theoretical physicist and the computational physicist, or among aerodynamics, applied mathematics, and computer science. We have also not illustrated that, in each case, the numerical algorithm depends on the final computer architecture targeted; in particular, the best parallel algorithm is often different from the best conventional sequential algorithm.
Figure 3.4:
An Idealized Process of Computation
Figure 3.5:
A Process for Computation in Two Examples in Basic Physics
Simulations
Figure 3.6:
A Process for Computation in Two Examples from Large Scale
System Simulations
We can abstract Figures 3.5 and 3.6 into a sequence of maps between complex systems,

$$C_{problem} \longrightarrow C_{software} \longrightarrow C_{computer}. \qquad (3.1)$$
Anticipating Chapter 5, we have broken the software into a high-level component (such as a compiler) and a lower-level one (such as an assembler), which maps a ``virtual computer'' onto the particular machine under consideration. In fact, the software could have more stages, but two is the most common case for simple (sequential) computers.
A complex system, as used here, is defined as a collection of fundamental entities whose static or dynamic properties define a connection scheme between the entities. Complex systems have a structure or architecture. For instance, a binary hypercube parallel computer of dimension d is a complex system with $2^d$ members connected in a hypercube topology. We can focus on a node of the hypercube and expand the node, viewed itself as a complex system, into a collection of memory hierarchies, caches, registers, CPUs, and communication channels. Even here, we find another ill-defined point, with the complex system representing the computer dependent on the resolution or granularity with which one views the system. The importance of the architecture of a computer system has been recognized for many years. We suggest that the architecture or structure of the problem is comparably interesting. Later, we will find that the performance of a particular problem or machine can be studied in terms of the match (similarity) between the architectures of the complex systems $C_{problem}$ and $C_{computer}$ defined in Equation 3.1. We will find that the structure of the appropriate parallel software will depend on the broad features of the (similar) architectures of $C_{problem}$ and $C_{computer}$. This can be expected, as software maps the two complex systems into each other.
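As a small illustration of the hypercube's connection scheme, the following sketch (ours, not from the text; Python is used for all examples here) labels the $2^d$ nodes with d-bit integers, so that two nodes are connected exactly when their labels differ in a single bit:

```python
# Sketch: the nodes of a binary hypercube of dimension d are labelled
# 0 .. 2**d - 1, and two nodes are connected exactly when their labels
# differ in a single bit.

def hypercube_neighbors(node, d):
    """Return the d neighbors of `node` in a d-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(d)]

# A dimension-3 hypercube has 8 nodes, each with exactly 3 neighbors.
d = 3
assert all(len(hypercube_neighbors(n, d)) == d for n in range(2 ** d))
```

Each node thus has exactly d neighbours, which is part of what makes the hypercube topology richer than the topologies of most problems.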
At times, we have talked in terms of problem architecture, but this is ambiguous since it could refer to any of the complex systems in Equation 3.1, which can and usually do have different architectures. Consider the second example of Figure 3.5, the computational fluid dynamics study of airflow: in the language of Equation 3.1, each stage of that process is a different complex system with its own architecture.
In the previous section, we showed how the process of computation could be viewed as mappings between complex systems. As the book progresses, we will quantify this by providing examples that cover a range of problem architectures. In the next three sections, we will set up the general framework and define terms which will be made clearer later on as we see the explicit problems with their different architectures. The concept of complex systems may have very general applicability to a wide range of fields but here we will focus solely on their application to computation. Thus, our discussion of their properties will only cover what we have found useful for the task at hand. These properties are surely more generally applicable, and one can expect that other ideas will be needed in a general discussion. Section 3.3 gives examples and a basic definition of a complex system and its associated space-time structure. Section 3.4 defines temporal properties and, finally, Section 3.5 spatial structures.
We wish to understand the interesting characteristics or structure of a complex system. We first introduce the concept of space-time into a general complex system. As shown in Figure 3.7, we consider a general complex system as a space, or data domain, that evolves deterministically or probabilistically with time. Often, the space-time associated with a given complex system is identical to physical space-time, but sometimes it is not. Let us give some examples.
Figure 3.7:
(a) Synchronous, Loosely Synchronous (Static), and (b)
Asynchronous (Dynamic) Complex Systems with their Space-Time Structure
Consider instead the solution of the elliptic partial differential equation

$$\nabla^2 \phi = -\rho \qquad (3.4)$$

for the electrostatic potential $\phi$ in the presence of a charge density $\rho$. A simple, albeit usually non-optimal, approach to solving Equation 3.4 is Jacobi iteration, which in the special case of two dimensions and $\rho = 0$ involves the iterative procedure

$$\phi^{n+1}(x, y) = \frac{1}{4}\left[\phi^n(x+1, y) + \phi^n(x-1, y) + \phi^n(x, y+1) + \phi^n(x, y-1)\right], \qquad (3.5)$$

where we assume that integer values of the indices x and y label the two-dimensional grid on which Laplace's equation is to be solved. The complex system defined by Equation 3.5 has a spatial domain defined by the grid and a temporal dimension defined by the iteration index n. Indeed, the Jacobi iteration is mathematically related to solving the parabolic partial differential equation

$$\frac{\partial \phi}{\partial t} = \nabla^2 \phi, \qquad (3.6)$$

where one relates the discretized time t to the iteration index n. This equivalence between Equations 3.5 and 3.6 is qualitatively preserved when one compares the solution of Equations 3.3 and 3.5. As long as one views iteration as a temporal structure, Equations 3.3 and 3.4 can be formulated numerically with isomorphic complex systems. This implies that parallelization issues, both hardware and software, are essentially identical for both equations.
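A minimal sketch of the Jacobi update for Laplace's equation, replacing each interior grid value by the average of its four neighbours from the previous iteration, might look as follows (the grid size and boundary values are arbitrary choices of ours):

```python
import numpy as np

# Minimal sketch of the Jacobi update: each interior value is replaced
# by the average of its four neighbours from the previous iteration n,
# while the boundary values are held fixed.

def jacobi_step(phi):
    new = phi.copy()
    new[1:-1, 1:-1] = 0.25 * (phi[2:, 1:-1] + phi[:-2, 1:-1] +
                              phi[1:-1, 2:] + phi[1:-1, :-2])
    return new

# Fixed boundary values, zero interior; iterate toward the steady state.
phi = np.zeros((16, 16))
phi[0, :] = 1.0                      # one "hot" boundary edge
for n in range(500):
    phi = jacobi_step(phi)
```

Every interior point is updated independently from the previous iterate, which is exactly the data parallelism discussed in the text: the spatial domain (the grid) can be split across processors, with only edge values exchanged each iteration.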
The above example illustrates the most important form of parallelism, namely, data parallelism. This is produced by parallel execution of a computational (temporal) algorithm on each member of a space or data domain. Data parallelism is essentially synonymous with either massive parallelism or massive data parallelism. Spatial domains are usually very large; thus, exploiting this data parallelism does lead to massive parallelism.
Parallelization of this is covered fully in [Fox:88a] and Chapter 8 of this book. Gaussian elimination (LU decomposition) for solving Equation 3.7 involves n successive steps where, in the simplest formulation without pivoting, at step k one ``eliminates'' a single variable $x_k$, the index k running from 1 to n. At each step k, one modifies both $A$ and $b$: the new elements $a^{(k+1)}_{ji}$ and $b^{(k+1)}_{j}$ are formed from $a^{(k)}_{ji}$ and $b^{(k)}_{j}$ by subtracting the appropriate multiple of row k,

$$a^{(k+1)}_{ji} = a^{(k)}_{ji} - \frac{a^{(k)}_{jk}}{a^{(k)}_{kk}}\,a^{(k)}_{ki}, \qquad b^{(k+1)}_{j} = b^{(k)}_{j} - \frac{a^{(k)}_{jk}}{a^{(k)}_{kk}}\,b^{(k)}_{k}, \qquad j > k,$$

where one ensures (if no pivoting is employed) that $a^{(k+1)}_{jk} = 0$ when $j > k$. Consider the above procedure as a complex system. The spatial domain is formed by the matrix A with a two-dimensional array of values $a_{ij}$. The time domain is labelled by the index k and so is a discrete space with n (the number of rows or columns of A) members. The space is also discrete, with $n^2$ members.
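The elimination step described above can be sketched as follows (our illustrative code, without pivoting, plus the back-substitution needed to recover x; variable names are ours):

```python
import numpy as np

# Sketch of Gaussian elimination without pivoting: at "time" step k,
# multiples of row k are subtracted from the rows below it, zeroing the
# entries A[j, k] for j > k; b is updated in the same way.

def gaussian_eliminate(A, b):
    A, b = A.astype(float).copy(), b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):               # the temporal index k
        for j in range(k + 1, n):        # rows below the pivot row
            m = A[j, k] / A[k, k]
            A[j, k:] -= m * A[k, k:]
            b[j] -= m * b[k]
    # Back-substitution on the resulting upper-triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
x = gaussian_eliminate(A, b)             # should satisfy A x = b
```

Viewed as a complex system, the two nested loops over rows and columns are the spatial domain, while the outer loop over k is the time-like direction; a parallel version distributes the matrix over nodes and performs each step k across them.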
As shown in Equation 3.1, we will use complex systems to unify a variety of different concepts: nature and an underlying theory such as quantum chromodynamics; the numerical formulation of the theory; the result of expressing this with various software paradigms; and the final computer used in its simulation. Different disciplines have correctly been built up around these different complex systems, and correspondingly different terminology is often used to describe related issues. This is certainly reasonable for both historical and technical reasons. However, we argue that understanding the process of computation and answering questions such as ``Which parallel computers are good for which problems?'', ``What problems parallelize?'', and ``What are productive parallel software paradigms?'' is helped by a terminology which bridges the different complex systems. We can illustrate this with an anecdote. In a recent paper, an illustration of particles in the universe was augmented with a hierarchical set of clusters produced with the algorithm of Section 12.4. These clusters are designed to accurately represent the different length scales and physical clustering of the clouds of particles. This picture was labelled ``data structure,'' but one computer science referee noted that this was not appropriate. Indeed, the referee was in one sense correct: we had not displayed a computer science data structure, such as a Fortran array or C structure defining a linked list of particles. However, from the point of view of the physicist, this picture was precisely showing the structure of the data, and so the caption was correct in one discipline (physics) and false in another (computer science)!
We will now define and discuss some general properties and parameters of complex systems which span the various disciplines involved.
We will first discuss possible temporal structures for a complex system. Here, we draw on the computer science classification of computer architectures. In this context, aspects such as internode topology refer to the spatial structure of the computer viewed as a complex system; the control structure of the computer refers to the temporal behavior of its complex system. In our review of parallel computer hardware, we have already introduced the concepts of SIMD and MIMD, two important temporal classes which carry over to general complex systems. Returning to Figures 3.7(a) and 3.7(b), we see complex systems which are MIMD (or asynchronous, as defined below) in Figure 3.7(b), and either SIMD or a restricted form of MIMD (synchronous or loosely synchronous in the language below) in Figure 3.7(a). In fact, when we consider the temporal structure of problems ($C_{problem}$ in Equation 3.1), software ($C_{software}$), and hardware ($C_{computer}$ in Equation 3.1), we will need to further extend this classification. Here we will briefly define the concepts and give the section number where each is discussed and illustrated more fully.
Society shows many examples of loosely synchronous behavior. Vehicles proceed more or less independently on a city street between loose synchronization points defined by traffic lights. The reader's life is loosely synchronized by such events as meals and bedtime.
When we consider computer hardware and software systems, we will need to consider other temporal classes which can be thought of as further subdivisions of the asynchronous class.
In Figure 3.8, we summarize these temporal classifications for complex systems, indicating a partial ordering with arrows pointing to more general architectures. This will become clearer in Section 3.5 when we discuss software and the relation between problem and computer. Note that although the language is drawn from the point of view of computer architecture, the classifications are important at the problem, software, and hardware levels.
Figure 3.8:
Partial Ordering of Temporal (Control) Architectures for a Complex
System
The hardware (computer) architecture naturally divides into SIMD (synchronous), MIMD (asynchronous), and von Neumann classes. The problem structures are synchronous, loosely synchronous, or asynchronous. One can argue that the shared-memory asynchronous architecture is naturally suggested by software ($C_{software}$) considerations, and in particular by the goal of efficient parallel execution for sequential software models. For this reason it becomes an important computer architecture even though it is not a natural problem ($C_{problem}$) architecture.
Now we switch topics and consider the spatial properties of complex systems.
The size N of the complex system is obviously an important property. Note that we think of a complex system as a set of members with their spatial structure evolving with time. Sometimes, the time domain has a definite ``size,'' but often one can evolve the system indefinitely in time. However, most complex systems have a natural spatial size, with the spatial domain consisting of N members. In the examples of Section 3.3, the seismic example had a definite spatial extent and an unlimited time domain; on the other hand, Gaussian elimination had $n^2$ spatial members evolving for a fixed number n of ``time'' steps. As usual, the value of the spatial size N will depend on the granularity or detail with which one looks at the complex system. One could consider a parallel computer as a complex system constructed as a collection of transistors, with a correspondingly very large value of N, but here we will view the processor node as the fundamental entity, and define the spatial size of a parallel computer, viewed as a complex system, by the number of processing nodes.
Now is a natural time to define the von Neumann complex system spatial structure, which is relevant, of course, for computer architecture. We will formally define this to be a system with no spatial extent, i.e., size $N = 1$. Of course, a von Neumann node can have a sophisticated structure, with multiple functional units, if we look at fine enough resolution. More precisely, perhaps, we can generalize this complex system to one where N is small and will not scale up to large values.
Consider mapping a seismic simulation with N grid points onto a parallel machine with $N_{proc}$ processors. An important parameter is the grain size n of the resultant decomposition. We can introduce the problem grain size $n = N / N_{proc}$, the number of grid points per node, and the computer grain size $n_{mem}$, the memory contained in each node of the parallel computer. Clearly, we must have

$$n \le n_{mem}$$

if we measure memory size in units of seismic grid points. More interestingly, in Equation 3.10 we will relate the performance of the parallel implementation of the seismic simulation to n and other problem and computer characteristics. We find that, in many cases, the parallel performance depends on N and $N_{proc}$ only through the combination $n = N / N_{proc}$, and so grain size is a critical parameter in determining the effectiveness of parallel computers for a particular application.
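The grain-size bookkeeping is simple arithmetic; the numbers below are hypothetical, chosen only to illustrate the constraint:

```python
# Sketch of the grain-size bookkeeping: with N grid points spread over
# N_proc processors, each node holds n = N / N_proc points, and this
# problem grain size must not exceed the per-node memory n_mem
# (measured here, as in the text, in units of grid points).
# All values are invented for illustration.

N = 1_000_000          # grid points in a hypothetical seismic problem
N_proc = 256           # processing nodes
n_mem = 16_384         # per-node memory, in units of grid points

n = N // N_proc        # problem grain size: points per node
assert n <= n_mem      # the decomposition fits in node memory
```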
The next set of parameters describe the topology or structure of the spatial domain associated with the complex system. The simplest parameter of this type is the geometric dimension of the space. As reviewed in Chapter 2, the original hardware and, in fact, software (see Chapter 5) exhibited a clear geometric structure. The binary hypercube of dimension d had this as its geometric dimension. This was an effective architecture because it was richer than the topologies of most problems. Thus, consider mapping a problem of dimension $d_p$ onto a computer of dimension $d_c$, and suppose the software system preserves the spatial structure of the problem. Then, one can show that the parallel computing overhead f has a term $f_C$ due to internode communication that has the form

$$f_C = \mathrm{constant} \times \frac{1}{n^{1/d}} \cdot \frac{t_{comm}}{t_{calc}}, \qquad (3.10)$$
with parallel speedup S given by

$$S = \frac{N_{proc}}{1 + f_C}. \qquad (3.11)$$

The communication overhead $f_C$ depends on the problem grain size n and the computer complex system $C_{computer}$. It also involves two parameters specifying the parallel hardware performance: the typical time $t_{calc}$ for a node to perform a floating-point calculation, and the typical time $t_{comm}$ to communicate a single word between two nodes.
The definitions of $t_{calc}$ and $t_{comm}$ above are imprecise. In particular, $t_{calc}$ depends on the nature of the node and can take on very different values depending on the details of the implementation; floating-point operations are much faster when the operands are taken from registers than from slower parts of the memory hierarchy. On systems built from processors like the Intel i860 chip, these effects can be large; $t_{calc}$ could be $0.02\,\mu s$ from registers (50 MFLOPS) and larger by a factor of ten when the variables a, b are fetched from dynamic RAM. Again, communication speed depends on internode message size (a software characteristic) and on the latency (startup time) and bandwidth of the computer communication subsystem.
Returning to Equation 3.10, we really only need to understand here that the term $t_{comm}/t_{calc}$ indicates that communication overhead depends on the relative performance of the internode communication system and the node (floating-point) processing unit. A real study of parallel computer performance would require a deeper discussion of the exact values of $t_{comm}$ and $t_{calc}$. More interesting here is the dependence on the number of processors $N_{proc}$ and the problem grain size n. As described above, grain size depends on both the problem and the computer. When $d_c \ge d_p$, the dimension d appearing in Equation 3.10 is given by $d = d_p$, independent of computer parameters, while if $d_c < d_p$, the overhead acquires an additional dependence on $N_{proc}$ (Equation 3.13).
The results in Equation 3.13 quantify the penalty, in terms of a value of $f_C$ that increases with $N_{proc}$, for a computer architecture that is less rich than the problem architecture. An attractive feature of the hypercube architecture is that $d_c$ is large, and one is essentially always in the regime governed by the top line in Equation 3.13. Recently, there has been a trend away from rich topologies like the hypercube towards the view that the node interconnect should be considered as a routing network or switch to be implemented in the very best technology. The original MIMD machines from Intel, nCUBE, and AMETEK all used hypercube topologies, as did the SIMD Connection Machines CM-1 and CM-2. The nCUBE-2, introduced in 1990, still uses a hypercube topology, but both it and the second-generation Intel iPSC/2 use hardware routing that ``hides'' the hypercube connectivity. The latest Intel Paragon and Touchstone Delta and the Symult (ex-AMETEK) 2010 use a two-dimensional mesh with wormhole routing. It is not clear how to incorporate these new node interconnects into the above picture, and further research is needed. Presumably, we would need to add new complex system properties and perhaps generalize the definition of dimension, as we will see below is in fact necessary for Equation 3.10 to be valid for problems whose structure is not geometrically based.
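To make the scaling concrete: assuming, as discussed above, that the communication overhead falls as $n^{-1/d}$ times the communication-to-calculation time ratio, and that speedup has the standard form $S = N_{proc}/(1 + f_C)$, a few lines of arithmetic (with invented values for all the constants) show how larger grains reduce the overhead:

```python
# Numeric sketch (all values hypothetical) of the scaling discussed in
# the text: communication overhead falling like n**(-1/d) times the
# ratio of communication to calculation time, with S = N_proc/(1 + f_C).

def overhead(n, d, t_comm_over_t_calc, const=1.0):
    return const * t_comm_over_t_calc / n ** (1.0 / d)

def speedup(N_proc, f_C):
    return N_proc / (1.0 + f_C)

ratio = 4.0                                   # assumed t_comm / t_calc
f_small = overhead(n=64,   d=2, t_comm_over_t_calc=ratio)
f_large = overhead(n=4096, d=2, t_comm_over_t_calc=ratio)
assert f_large < f_small                      # bigger grains, less overhead
S = speedup(N_proc=256, f_C=f_large)          # approaches 256 as f_C -> 0
```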
Returning to Equations 3.10 through 3.12, we note that we have not properly defined the correct dimension, $d_p$ or $d_c$, to use. We have implicitly equated this with the natural geometric dimension, but this is not always correct. This is illustrated by the complex system consisting of a set of particles in three dimensions interacting with a long-range force such as gravity or electrostatic charge. The geometric structure is local, but the complex system structure is quite different: all particles are connected to all others. As described in Chapter 3 of [Fox:88a], this implies a system dimension of one, whatever the underlying geometric structure. We define the system dimension for a general complex system to reflect the system connectivity. Consider Figure 3.9, which shows a general domain D in a complex system. We define the volume V of this domain by the information in it. Mathematically, V is the computational complexity needed to simulate D in isolation. In a geometric system,

$$V \propto L^d, \qquad (3.14)$$

where L is a geometric length scale. The domain D is not in general isolated, and is connected to the rest of the complex system. Information I flows into D and, again in a geometric system, I is a surface effect with

$$I \propto L^{d-1}. \qquad (3.15)$$
Figure 3.9:
The Information Density and Flow in a General Complex System
with Length Scale
L
If we view the complex system as a graph, V is related to the number of links of the graph inside D and I is related to the number of links cut by the surface of D. Equations 3.14 and 3.15 are altered in cases like the long-range force problem, where the complex system connectivity is no longer geometric. We define the system dimension to preserve the surface-versus-volume interpretation of Equation 3.15 compared to Equation 3.14. Thus, generally we define the system dimension d by

$$I \propto V^{(d-1)/d}. \qquad (3.16)$$

With this definition of system dimension, we will find that Equations 3.10 through 3.12 essentially hold in general. In particular, for the long-range force problem, one finds an overhead proportional to $1/n$ (i.e., d = 1 in Equation 3.10), independent of the underlying geometric dimension.
A very important special type of spatial structure is the embarrassingly parallel, or spatially disconnected, complex system. Here there is no connection between the different members in the spatial domain. Applying this to parallel computing, we see that if $C_{problem}$ is spatially disconnected, then it can be parallelized very straightforwardly. In particular, any MIMD machine can be used, whatever the temporal structure of the complex system. SIMD machines can only be used to simulate embarrassingly parallel complex systems whose spatially disconnected members have identical structure.
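A trivial sketch of exploiting such a disconnected domain, with each member updated independently and no communication at all (a thread pool stands in for real processors here; names and values are ours):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of an embarrassingly parallel (spatially disconnected) complex
# system: every member of the data domain is updated independently, so
# the work can be farmed out with no inter-worker communication.
# A real parallel code would use separate processes or nodes; a thread
# pool keeps the sketch self-contained.

def member_update(value):
    return value * value              # stand-in for the independent work

domain = list(range(16))              # the disconnected members
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(member_update, domain))
```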
In Section 13.7, we extend the analysis of this section to cover the performance of hierarchical memory machines. We find that one needs to replace subdomains in space with those in space-time.
In Chapter 11, we describe other interesting spatial properties in terms of a particle analogy. We find system temperature and phase transitions as one heats and cools the complex system.
In Sections 3.4 and 3.5, we discussed basic characteristics of complex systems. In fact, many ``real world'' examples are a mixture of these fundamental architectures. This is illustrated by Figure 3.10, which shows a very conventional computer network with several different architectures. Not only can we regard each individual computer as a complex system; the whole network is, of course, a single complex system as we have defined it. We will term such complex systems compound. Note that one often puts together a network of resources to solve a ``single problem,'' and so an analysis of the structure of compound complex systems is not an academic issue, but of practical importance.
Figure 3.10:
A Heterogeneous Compound Complex System
Corresponding
to a Network of Computers of Disparate Architectures
Figure 3.10 shows a mixture of temporal (synchronous and asynchronous) and spatial (hypercube and von Neumann) architectures. We can look at the architecture of the individual network components or, taking a higher level view, look at the network itself, with member computers, such as the hypercube in Figure 3.10, viewed as ``black boxes.'' In this example, the network is an asynchronous complex system, and this seems quite a common circumstance. Figure 3.11 shows two compound problems coming from the aerospace and battle management fields, respectively. In each case, we link asynchronously problem modules which are themselves synchronous, loosely synchronous, or asynchronous. We have found this very common. In scientific and engineering computations, the basic modules are usually synchronous or loosely synchronous. These modules have large spatial size and naturally support ``massive data parallelism.'' Rarely do we find large asynchronous modules; this is fortunate, as such complex systems are difficult to parallelize. However, in many cases the synchronous or loosely synchronous program modules are hierarchically combined with an asynchronous architecture. This is an important way in which the asynchronous problem architecture is used in large-scale computations. It is explored in detail in Chapter 18. One has come to refer to systems such as those in Figure 3.10 as metacomputers; correspondingly, we designate the applications in Figure 3.11 metaproblems.
Figure 3.11: Two Heterogeneous Complex Systems Corresponding to: a) the Integrated Design of a New Aircraft, and b) the Integrated Battle Management Problem Discussed in Chapter 18
If we combine Figures 3.10 and 3.11 , we can formulate the process of computation in its most general complicated fashion.
``Map a compound heterogeneous problem onto a compound heterogeneous computer.''
Equation 3.1 first stated our approach to computation as a map between different complex systems. We can quantify this by defining a partial order, written A ≤ B (Equation 3.17), on complex systems.
Equation 3.17 states that a complex system A can be mapped to complex system B; that is, that B has a more general architecture than A. This was already seen in Figure 3.8 and is given in more detail in Figure 3.12, where we have represented complex systems in a two-dimensional space labelled by their spatial and temporal properties. In this notation, we require that the problem's complex system can be mapped onto that of the computer.
Figure 3.12:
An Illustration of Problem or Computer Architecture Represented
in a Two-dimensional Space. The spatial structure only gives a few
examples.
The requirement that a particular problem parallelize on a particular computer is shown in Figure 3.13. We have drawn our space labelled by complex system properties so that the partial ordering of Figures 3.8 and 3.12 ``flows'' towards the origin. Roughly, complex systems get more specialized as one moves either upwards or to the right. We view the three key complex systems, those of the problem, the software, and the computer, as points in the space represented in Figures 3.12 and 3.13. Then Figure 3.13 shows that the complex system of the computer lies below and to the left of those representing the problem and the software.
Let us consider an example. Suppose the computer is a hypercube of dimension six (64 nodes) with a MIMD temporal structure. Synchronous, loosely synchronous, or asynchronous problems can be mapped onto this computer as long as the problem's spatial structure is contained in the hypercube topology. Thus, we will successfully map a large two-dimensional mesh. But what about a tiny mesh or a large irregular lattice? A 3 × 3 mesh has only nine (spatial) components and insufficient parallelism to exploit the 64-node computer. The large irregular mesh can be efficiently mapped onto the hypercube, as shown in Chapter 12. However, one could support this with a more general computer architecture where hardware or software routing essentially gives the general node-to-node communication shown in the bottom left corner of Figure 3.12. The hypercube work at Caltech and elsewhere always used this strategy in mapping complex spatial topologies; the crystal-router mechanism in CrOS or Express was a powerful and efficient software strategy. Some of the early work using transputers found difficulties with certain spatial structures, since the language Occam only directly supported process-to-process communication over the physical hardware channels. However, later general Occam subroutine libraries (communication ``harnesses'') callable from FORTRAN or C allowed the general point-to-point (process-to-process) communication model for transputer systems.
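The mesh-to-hypercube embedding described above can be sketched with binary-reflected Gray codes: a 2^r × 2^s mesh maps into an (r+s)-dimensional hypercube so that mesh neighbors land on directly connected hypercube nodes. The following is a minimal modern illustration; the function names are ours, not part of CrOS or Express.

```python
def gray(i):
    """Binary-reflected Gray code of i: successive codes differ in one bit."""
    return i ^ (i >> 1)

def mesh_to_hypercube(x, y, r, s):
    """Map point (x, y) of a 2^r x 2^s mesh to a hypercube node number.

    Concatenating the Gray codes of the two coordinates guarantees that
    mesh neighbors differ in exactly one bit, i.e., they sit on directly
    connected hypercube nodes.
    """
    return (gray(x) << s) | gray(y)

# Embed an 8 x 8 mesh (r = s = 3) into a 64-node hypercube and check
# that every horizontal and vertical neighbor pair is one bit (one
# hypercube link) apart.
r = s = 3
ok = True
for x in range(2**r):
    for y in range(2**s):
        me = mesh_to_hypercube(x, y, r, s)
        if x + 1 < 2**r:
            ok &= bin(me ^ mesh_to_hypercube(x + 1, y, r, s)).count("1") == 1
        if y + 1 < 2**s:
            ok &= bin(me ^ mesh_to_hypercube(x, y + 1, r, s)).count("1") == 1
print(ok)  # True
```

The same trick extends dimension by dimension, which is why regular meshes of any dimensionality embed naturally into a sufficiently large hypercube.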
Figure 3.13: Easy and Hard Mappings of Complex Systems. We show complex systems for problems and computers in a space labelled by their spatial and temporal architectures. Figure 3.12 illustrates the possible ordering of these structures.
The complex system classification introduced in this chapter allows a precise formulation of the lessons of current research.
The majority of large-scale scientific and engineering computations have synchronous or loosely synchronous character. Detailed statistics are given in Section 14.1, but we note that our survey suggests that at most 10% of applications are asynchronous. The microscopic or macroscopic temporal synchronization in the synchronous or loosely synchronous problems ensures natural parallelism without difficult computer hardware or software synchronization. Thus, we can boldly state that

synchronous or loosely synchronous problem ≤ parallel computer (3.20)

for these problems. This quantifies our statement that ``Parallel Computing Works,''
where Equation 3.20 should be interpreted in the sense shown in Figure 3.13 (b).
Roughly, loosely synchronous problems are suitable for MIMD and synchronous problems for SIMD computers. We can expand Equation 3.20 and write

synchronous problem ≤ SIMD computer, loosely synchronous problem ≤ MIMD computer. (3.21)
The results in Equation 3.21 are given with more details in Tables 14.1 and 14.2 . The text of this book is organized so that we begin by studying the simpler synchronous applications, then give examples first of loosely synchronous and finally asynchronous and compound problems.
The bold statements in Equations 3.20 and 3.21 become less clear when one considers software and the associated software complex system. The parallel software systems CrOS and its follow-on Express were used in nearly all our applications. These required explicit user insertion of message passing, which in many cases is tiresome and unfamiliar. One can argue that, as shown in Figure 3.14, we supported a high-level software environment that reflected the target machine and so could be effectively mapped onto it. Thus, we show the software (CrOS) and the computer (MIMD) close together in Figure 3.14. A more familiar and attractive environment for most users would be a traditional sequential language like Fortran77. Unfortunately,
the map Fortran77 ≤ (distributed-memory MIMD or SIMD computer) fails, (3.22)

and so, as shown in Figures 3.13(a) and 3.14, it is highly non-trivial to map existing or new Fortran77 codes effectively onto MIMD or SIMD parallel machines, at least those with distributed memory. We will touch on this issue in Sections 13.1 and 13.2, but it remains an active research area.
We also discuss data-parallel languages, such as High Performance Fortran, in Chapter 13 [ Kennedy:93a ]. This is designed so that it can be mapped effectively onto both SIMD and MIMD machines.
Figure 3.14:
The Dusty Deck Issue in Terms of the Architectures of Problem,
Software, and Computer Complex Systems
We can show this point more graphically if we introduce a quantitative measure M of the difficulty of the mapping A ≤ B. We represent M as the difference in heights h,

M = h(A) − h(B),

where we can only perform the map if M is positive. Negative values of M correspond to difficult cases such as Equation 3.22, while large positive values of M imply a possible but hard map. Figure 3.15 shows how one can now picture the process of computation as moving ``downhill'' in the complex system architecture space.
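The height picture can be made concrete in a small sketch. The ordering of the temporal classes follows the text (synchronous most specialized, asynchronous most general, so computation flows ``downhill'' toward asynchronous), but the numeric heights and the placement of machines are illustrative only and not taken from the book's figures.

```python
# Heights h for points on the temporal axis of the architecture space.
# The numeric values are illustrative; only their ordering matters.
h = {
    "synchronous": 3.0,          # most specialized (e.g., SIMD machines)
    "loosely synchronous": 2.0,
    "asynchronous": 1.0,         # most general (e.g., MIMD machines)
}

def mapping_measure(a, b):
    """M for the map a -> b: the drop in height. The map is possible
    only if M is non-negative, i.e., it goes downhill."""
    return h[a] - h[b]

# A synchronous problem maps onto an asynchronous (MIMD-like) machine...
print(mapping_measure("synchronous", "asynchronous") >= 0)   # True
# ...but an asynchronous problem cannot map onto a synchronous machine.
print(mapping_measure("asynchronous", "synchronous") >= 0)   # False
```

Larger positive M corresponds to a possible but harder map, matching the discussion above.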
Figure 3.15:
Two Problems
and Five Computer Architectures
in the Space-Time Architecture Classification of Complex Systems. An arrow
represents a successful mapping and an ``X'' a mapping that will fail without a
sophisticated compiler.
This formal discussion is illustrated throughout the book by numerous examples, which show that a wide variety of applications parallelize. Most of the applications chapters start with a computational analysis that refers back to the general concepts developed here. This is finally summarized in Chapter 14, which exemplifies the asynchronous applications and starts with an overview of the different temporal problem classes. We build up to this by discussing synchronous problems in Chapters 4 and 6; embarrassingly parallel problems in Chapter 7; and loosely synchronous problems with increasing degrees of irregularity in Chapters 8, 9, and 12. Compound problem classes-an asynchronous mixture of loosely synchronous components-are described in Chapters 17 and 18. The large missile tracking and battle management simulation built at JPL (Figure 3.11 b) and described briefly in Chapter 18 was the major example of a compound problem class within C P. Chapters 17 and 19 indicate that we believe this class is extremely important in many ``real-world'' applications that integrate many diverse functions.
The Caltech Concurrent Computation Program started with QCD, or Quantum Chromodynamics, as its first application. QCD is discussed in more detail in Sections 4.2 and 4.3, but here we put it in historical perspective. This nostalgic approach is developed in [ Fox:87d ] and [ Fox:88oo ], as well as in Chapters 1 and 2 of this book.
We show, in Table 4.1, fourteen QCD simulations, labelled by representative physics publications, performed within C P on parallel machines. This activity started in 1981 with simulations using the first four-node 8086-8087-based prototypes of the Cosmic Cube. These prototypes were quite competitive in performance with the VAX 11/780 on which we had started (in 1980) our computational physics program within high-energy physics at Caltech. The 64-node Cosmic Cube was used more or less continuously from October, 1983 to mid-1984 on what was termed by Caltech a ``mammoth calculation'' in the press release shown in Figure 4.2. This is the modest four-dimensional lattice calculation reported in line 3 of Table 4.1. As trumpeted in Figures 4.1 and 4.2, this was our first major use of parallel machines and a critical success on which we built our program.
Table 4.1:
Quantum Chromodynamic (QCD) Calculations Within C
P
Our 1983-1984 calculations totalled some 2,500 hours on the 64-node Cosmic Cube and successfully competed with 100-hour CDC Cyber 205 computations that were the state of the art at the time [ Barkai:84b ], [ Barkai:84c ], [ Bowler:85a ], [ DeForcrand:85a ]. We used a four-dimensional lattice with 27,648 grid points, with eight gluon field values defined on each of the 110,592 links between grid points. The resultant 884,736 degrees of freedom seem modest today as QCD practitioners contemplate far larger lattices simulated on machines of teraFLOP performance [ Aoki:91a ]. However, our lattice was comparable to those used on vector supercomputers at the time.
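The counting in this paragraph can be checked directly: a periodic four-dimensional lattice has four links per site, and each link carries eight gluon field values.

```python
links = 110_592     # number of links quoted in the text
sites = links // 4  # a periodic 4-D lattice has 4 links per site
dof = 8 * links     # eight gluon field values per link
print(sites, dof)   # 27648 884736
```

The 884,736 degrees of freedom quoted above follow immediately.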
A hallmark of this work was the interdisciplinary team building hardware, software, and the parallel application. Further, from the start we stressed large, supercomputer-level simulations, where parallelism would make the greatest initial impact. It is also worth noting that our use of comparatively high-level software paid off: Otto and Stack were able to code better algorithms [ Parisi:83a ] than the competing vector supercomputer teams. The hypercube could be programmed conveniently, without the use of microcode or the other unproductive environments needed on some of the other high-performance machines of the time.
Our hypercube calculations used an early C plus message-passing programming approach which later evolved into the Express system described in the next chapter. Although not as elegant as data-parallel C and Fortran (discussed in Chapter 13 ), our approach was easier than hand-coded assembly, which was quite common for alternative high-performance systems of the time.
Figures 4.1 and 4.2 show extracts from Caltech and newspaper publicity of the time. The machine was essentially only a collection of 64 IBM PCs. Was that a good thing (as we thought) or an indication of our triviality (as a skeptical observer commenting in Figure 4.1 thought)? 1985 saw the start of a new phase as conventional supercomputers increased in power and availability, and NSF and DOE allocated many tens of thousands of hours on the CRAY X-MP (and later the CRAY-2 and Y-MP) and ETA-10 to QCD simulations. Our final QCD hypercube calculations in 1989 within C P used a 64-node JPL Mark IIIfp. After this work, we switched to using the Connection Machine CM-2, which by 1990 was the commercial standard in the field. C P helped the Los Alamos group of Brickner and Gupta (one of our early graduates!) to develop the first CM-2 QCD codes, which by 1991 ran at high performance on the full-sized CM-2 [ Brickner:91a ], [ Liu:91a ].
Caltech Scientists Develop `Parallel' Computer Model By LEE DEMBART
Times Science Writer

Caltech scientists have developed a working prototype for a new super computer that can perform many tasks at once, making possible the solution of important science and engineering problems that have so far resisted attack. The machine is one of the first to make extensive use of parallel processing, which has been both the dream and the bane of computer designers for years.
Unlike conventional computers, which perform one step at a time while the rest of the machine lies idle, parallel computers can do many things at the same time, holding out the prospect of much greater computing speed than currently available-at much less cost.
If its designers are right, their experimental device, called the Cosmic Cube, will open the way for solving problems in meteorology, aerodynamics, high-energy physics, seismic analysis, astrophysics and oil exploration, to name a few. These problems have been intractable because even the fastest of today's computers are too slow to process the mountains of data in a reasonable amount of time.
One of today's fastest computers is the Cray 1, which can do 20 million to 80 million operations a second. But at $5 million, they are expensive and few scientists have the resources to tie one up for days or weeks to solve a problem.
``Science and engineering are held up by the lack of super computers,'' says one of the Caltech inventors, Geoffrey C. Fox, a theoretical physicist. ``They know how to solve problems that are larger than current computers allow.''
The experimental device, 5 feet long by 8 inches high by 14 inches deep, fits on a desk top in a basement laboratory, but it is already the most powerful computer at Caltech. It cost $80,000 and can do three million operations a second-about one-tenth the power of a Cray 1.
Fox and his colleague, Charles L. Seitz, a computer scientist, say they can expand their device in coming years so that it has 1,000 times the computing power of a Cray.
``Poor old Cray and Cyber (another super computer) don't have much of a chance of getting any significant increase in speed,'' Fox said. ``Our ultimate machines are expected to be at least 1,000 times faster than the current fastest computers.''
``We are getting to the point where we are not going to be talking about these things as fractions of a Cray but as multiples of them,'' Seitz said.
But not everyone in the field is as impressed with Caltech's Cosmic Cube as its inventors are. The machine is nothing more nor less than 64 standard, off-the-shelf microprocessors wired together, not much different than the innards of 64 IBM personal computers working as a unit.
``We are using the same technology used in PCs (personal computers) and Pacmans,'' Seitz said. The technology is an 8086 microprocessor capable of doing 1/20th of a million operations a second with 1/8th of a megabyte of primary storage. Sixty-four of them together will do 3 million operations a second with 8 megabytes of storage.
Currently under development is a single chip that will replace each of the 64 8-inch-by-14-inch boards. When the chip is ready, Seitz and Fox say they will be able to string together 10,000 or even 100,000 of them.
Computer scientists have known how to make such a computer for years but have thought it too pedestrian to bother with.
``It could have been done many years ago,'' said Jack B. Dennis, a computer scientist at the Massachusetts Institute of Technology who is working on a more radical and ambitious approach to parallel processing than Seitz and Fox. He thinks his approach, called ``dataflow,'' will both speed up computers and expand their horizons, particularly in the direction of artificial intelligence.

Computer scientists dream of getting parallel processors to mimic the human brain, which can also do things concurrently.
``There's nothing particularly difficult about putting together 64 of these processors,'' he said. ``But many people don't see that sort of machine as on the path to a profitable result.''
What's more, Dennis says, organizing these machines and writing programs for them have turned out to be sticky problems that have resisted solution and divided the experts.
``There is considerable debate as to exactly how these large parallel machines should be programmed,'' Dennis said by telephone from Cambridge, Mass. ``The 64-processor machine (at Caltech) is, in terms of cost-performance, far superior to what exists in a Cray 1 or a Cyber 205 or whatever. The problem is in the programming.''
Fox responds that he has ``an existence proof'' for his machine and its programs, which is more than Dennis and his colleagues have to show for their efforts.
The Caltech device is a real, working computer, up and running and chewing on a real problem in high-energy physics. The ideas on which it was built may have been around for a while, he agreed, but the Caltech experiment demonstrates that there is something to be gained by implementing them.
For all his hopes, Dennis and his colleagues have not yet built a machine to their specifications. Others who have built parallel computers have done so on a more modest scale than Caltech's 64 processors. A spokesman for IBM said that the giant computer company had built a 16-processor machine, and is continuing to explore parallel processing.
The key insight that made the development of the Caltech computer possible, Fox said, was that many problems in science are computationally difficult because they are big, not because they are necessarily complex.
Because these problems are so large, they can profitably be divided into 64 parts. Each of the processors in the Caltech machine works on 1/64th of the problem.
Scientists studying the evolution of the universe have to deal with 1 million galaxies. Scientists studying aerodynamics get information from thousands of data points in three dimensions.
To hunt for undersea oil, ships tow instruments through the oceans, gathering data in three dimensions that is then analyzed in two dimensions because of computer limitations. The Caltech computer would permit three-dimensional analysis.
``It has to be problems with a lot of concurrency in them,'' Seitz said. That is, the problem has to be split into parts, and all the parts have to be analyzed simultaneously.
So the applications of the Caltech computer for commercial uses such as an airline reservation system would be limited, its inventors agree.
Figure 4.1:
Caltech Scientists Develop ``Parallel'' Computer Model
[
Dembart:84a
]
CALTECH'S COSMIC CUBE
PERFORMING MAMMOTH CALCULATIONS

Large-scale calculations in basic physics have been successfully run on the Cosmic Cube, an experimental computer at Caltech that its developers and users see as the forerunner of supercomputers of the future. The calculations, whose results are now being published in articles in scientific journals, show that such computers can deliver useful computing power at a far lower cost than today's machines.
The first of the calculations was reported in two articles in the June 25 issue of Physical Review Letters. In addition, a second set of calculations related to the first has been submitted for publication.
The June articles were:
-``Pure Gauge SU(3) Lattice Theory on an Array of Computers,'' by Eugene Brooks, Geoffrey Fox, Steve Otto, Paul Stolorz, William Athas, Erik DeBenedictis, Reese Faucette, and Charles Seitz, all of Caltech; and John Stack of the University of Illinois at Urbana-Champaign, and
-``The SU(3) Heavy Quark Potential with High Statistics,'' by Steve Otto and John Stack.
The Cosmic Cube consists of 64 computer elements, called nodes, that operate on parts of a problem concurrently. In contrast, most computers today are so-called von Neumann machines, consisting of a single processor that operates on a problem sequentially, making calculations serially.
The calculation reported in the June articles took 2,500 hours of computation time on the Cosmic Cube. The calculation represents a contribution to the test of a set of theories called the Quantum Field Theories, which are mathematical attempts to explain the physical properties of subatomic particles known as hadrons, which include protons and neutrons.
These basic theories represent in a series of equations the behavior of quarks, the basic constituents of hadrons. Although theorists believe these equations to be valid, they have never been directly tested by comparing their predictions with the known properties of subatomic particles as observed in experiments with particle accelerators.
The calculations to be published probe the properties, such as mass, of the glueballs that are predicted by theory.
``The calculations we are reporting are not earth-shaking,'' said Dr. Fox. ``While they are the best of their type yet done, they represent but a steppingstone to better calculations of this type.'' According to Dr. Fox, the scientists calculated the force that exists between two quarks. This force is carried by gluons, the particles that are theorized to carry the strong force between quarks. The aim of the calculation was to determine how the attractive force between quarks varies with distance. Their results showed that the potential depends linearly on distance.
``These results indicate that it would take an infinite amount of energy to separate two quarks, which shows why free quarks are not seen in nature,'' said Dr. Fox. ``These findings represent a verification of what most people expected.''
The Cosmic Cube has about one-tenth the power of the most widely used supercomputer, the Cray-1, but at one hundredth the cost, about $80,000. It has about eight times the computing power of the widely used minicomputer, the VAX 11/780. Physically, the machine occupies about six cubic feet, making it fit on the average desk, and uses 700 watts of power.
Each of the 64 nodes of the Cosmic Cube has approximately the same power as a typical microcomputer, consisting of 16-bit Intel 8086 and 8087 processors, with 136K bytes of memory storage. For comparison, the IBM Personal Computer uses the same family of chips and typically possesses a similar amount of memory. Each of the Cosmic Cube nodes executes programs concurrently, and each can send messages to six other nodes in a communication network based on a six-dimensional cube, or hypercube. The chips for the Cosmic Cube were donated by Intel Corporation, and Digital Equipment Corporation contributed supporting computer hardware. According to Dr. Fox, a full-scale extension of the Quantum Field Theories to yield the properties of hadrons would require a computer 1,000 times more powerful than the Cosmic Cube, or 100 times more powerful than the Cray-1. Computer projects at Caltech are developing hardware and software for such advanced machines.
Figure 4.2:
Caltech's Cosmic Cube Performing Mammoth Calculations
[
Meredith:84a
]
It is not surprising that our first hypercube calculations in C P did not need the full MIMD structure of the machine. This was also a characteristic of Sandia's pioneering use of the 1024-node nCUBE [ Gustafson:88a ]. Synchronous applications like QCD are computationally important and have a simplicity that made them the natural starting point for our project.
Table 4.2 indicates that 70 percent of our first set of applications were of the class we call synchronous. As remarked above, this could be expected in any such early work, as these are the problems with the cleanest structure: in general the simplest to code and, in particular, the simplest to parallelize. As defined in Section 3.4, synchronous applications are characterized by a basic algorithm consisting of a set of operations applied identically at every point in a data set. The structure of the problem is typically very clear in such applications, and so the parallel implementation is easier than for the irregular loosely synchronous and asynchronous cases. Nevertheless, as we will see, there are many interesting issues in these problems, and they include many very important applications. This is especially true for academic computations that address fundamental science. These are often formulated as studies of fundamental microscopic quantities, such as the quark and gluon fundamental particles seen in the QCD of Section 4.3. Fundamental microscopic entities naturally obey identical evolution laws and so lead to synchronous problem architectures. ``Real-world problems''-perhaps most extremely represented by the battle simulations of Chapter 18 in this book-typically do not involve arrays of identical objects, but rather the irregular dynamics of several different entities. Thus, we will find more loosely synchronous and asynchronous problems as we turn from fundamental science to engineering and industrial or government applications. We will now discuss the structure of QCD in more detail to illustrate some general computational features of synchronous problems.
Table 4.2:
The Ten Pioneer Hypercube Applications within C
P
The applications using the Cosmic Cube were well established by 1984 and Table 4.2 lists the ten projects which were completed in the first year after we started our interdisciplinary project in the summer of 1983. All but one of these projects are more or less described in this book, while the other will be found in [ Fox:88a ]. They covered a reasonable range of application areas and formed the base on which we first started to believe that parallel computing works! Figure 4.3 illustrates the regular lattice used in QCD and its decomposition onto 64 nodes. QCD is a four-dimensional theory and all four dimensions can be decomposed. In our initial 64-node Cosmic Cube calculations, we used the three-dimensional decompositions shown in Figure 4.3 with the fourth dimension, as shown in Figure 4.4 , and internal degrees of freedom stored sequentially in each node. Figure 4.3 also indicates one subtlety needed in the parallelization; namely, one needs a so-called red-black strategy with only half the lattice points updated in each of the two (``red'' and ``black'') phases. Synchronous applications are characterized by such a regular spatial domain as shown in Figure 4.3 and an identical update algorithm for each lattice point. The update makes use of a Monte Carlo procedure described in Section 4.3 and in more detail in Chapter 12 of [ Fox:88a ]. This procedure is not totally synchronous since the ``accept-reject'' mechanism used in the Monte Carlo procedure does not always terminate at the same step. This is no problem on an MIMD machine and even makes the problem ``slightly loosely synchronous.'' However, SIMD machines can also cope with this issue as all systems (DAP, CM-2, Maspar) have a feature that allows processors to either execute the common instruction or ignore it.
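The red-black strategy mentioned above can be sketched in a few lines; this is a minimal modern illustration (not the original Cosmic Cube code) verifying that nearest neighbors always fall in opposite update phases.

```python
from itertools import product

def color(point):
    """Red-black (checkerboard) color of a lattice point: the parity of
    the sum of its coordinates. Nearest neighbors always have opposite
    parity, so all 'red' sites can be updated in parallel, then all
    'black' sites."""
    return sum(point) % 2

# Verify on a small periodic 4-D lattice (even side length) that every
# nearest-neighbor pair has opposite colors, so neither update phase
# ever touches two adjacent sites at once.
L = 4
ok = True
for p in product(range(L), repeat=4):
    for d in range(4):
        q = list(p)
        q[d] = (q[d] + 1) % L  # periodic neighbor in direction d
        ok &= color(p) != color(tuple(q))
print(ok)  # True
```

Note that the parity argument requires an even lattice side with periodic boundaries, which QCD lattices satisfy.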
Figure 4.3: A Problem Lattice Decomposed onto a 64-node Machine Arranged as a Machine Lattice. Points labeled X (``red'') or O (``black'') can be updated at the same time.
Figure 4.4: The 16 time and eight internal gluon degrees of freedom stored at each point shown in Figure 4.3
Figure 4.5 illustrates the nearest-neighbor algorithm used in QCD and in very many problems described by local interactions. We see that some updates require communication and some do not. In the message-passing software model used in our hypercube work, described in Chapter 5, the user is responsible for organizing this communication with an explicit subroutine call. Our later QCD calculations and the spin simulations of Section 4.4 use the data-parallel software model on SIMD machines, where a compiler can generate the necessary messaging. Chapter 13 will mention later projects aimed at producing a uniform data-parallel Fortran or C compiler that will generate the correct message structure for either SIMD or MIMD machines on such regular problems as QCD.
Figure 4.5:
Template of a Local Update Involving No Communication in a) and
the Value to be Communicated in b).
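The explicit communication step that the message-passing model requires can be illustrated with a toy one-dimensional halo exchange: each ``node'' owns a slice of a periodic array and must fetch one boundary value from each neighbor before a purely local nearest-neighbor update. In CrOS or Express the fetch would be an explicit communication call; here we simply copy between lists, and the function names are ours, not the CrOS API.

```python
def halo_exchange(subdomains):
    """Return, for each node, its slice extended by the neighbors' edge
    values: [left ghost] + own data + [right ghost] (periodic)."""
    n = len(subdomains)
    out = []
    for i, s in enumerate(subdomains):
        left = subdomains[(i - 1) % n][-1]   # "message" from left neighbor
        right = subdomains[(i + 1) % n][0]   # "message" from right neighbor
        out.append([left] + s + [right])
    return out

def update(extended):
    """Nearest-neighbor average using the ghost cells; purely local,
    no further communication needed."""
    return [[(e[j - 1] + e[j + 1]) / 2 for j in range(1, len(e) - 1)]
            for e in extended]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 nodes, 2 points each
print(update(halo_exchange(data)))
```

Interior points need no ghost values at all, which is the distinction drawn in Figure 4.5 between updates with and without communication.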
The calculations in Sections 4.3 and 4.4 used a wide variety of machines: in-house and commercial multicomputers, as well as the SIMD DAP and CM-2. The spin calculations in Section 4.4 can have very simple degrees of freedom, including the binary ``spin'' of the Ising model. These are naturally suited to the single-bit arithmetic available on the AMT DAP and CM-2. Some of the latest Monte Carlo algorithms do not use the local algorithms of Figure 4.5, but exploit the irregular domain structure seen in materials near a critical point. These new algorithms are much more efficient, but very much more difficult to parallelize-especially on SIMD machines. They are discussed in Section 12.8. We also see a taste of the ``embarrassingly parallel'' problem structure of Chapter 7 in Section 4.4. For the Potts simulation, we obtained parallelism not from the data domain (the lattice of spins), but from different starting points for the evolution. This approach, described in more detail in Section 7.2, would not be practical for QCD with its many degrees of freedom, as one must have enough memory to store the full lattice in each node of the multicomputer.
Table 4.2 lists the early seismic simulations of the group led by Clayton, whose C P work is reviewed in Section 18.1 . These solved the elastic wave equations using finite difference methods as discussed in Section 3.5 . The equations are iterated with time steps replacing the Monte Carlo iterations used above. This work is described in Chapters 5 and 7 of [ Fox:88a ] and represents methods that can tackle quite practical problems, for example, predicting the response of complicated geological structures such as the Los Angeles basin. The two-dimensional hydrodynamics work of Meier [ Meier:88a ] is computationally similar, using the regular decomposition and local update of Figures 4.3 and 4.5 . These techniques are now very familiar and may seem ``old-hat.'' However, it is worth noting that, as described in Chapter 13 , we are only now in 1992 developing the compiler technology that will automate these methods developed ``by-hand'' by our early users. A much more sophisticated follow-on to these early seismic wave simulations is the ISIS interactive seismic imaging system described in Chapter 18 .
Chapter 9 of [ Fox:88a ] explains the well-known synchronous implementation of long-range particle dynamics. This algorithm was not used directly in any large C P application, as we implemented the much more efficient cluster algorithms described in Sections 12.4, 12.5, and 12.8. The initial investigation of the vortex method of Section 12.5 used this long-range method [ Harstad:87a ]. We also showed, in a parallel database used in Kolawa's thesis, how a semi-analytic approach to QCD could be analyzed identically to the long-range force problem [Kolawa:86b;88a]. As explained in [ Fox:88a ], one can use the long-range force algorithm in any case where the calculation involves a set of N points with observables requiring functions of every pair, of which there are N(N-1)/2. In the language of Chapter 3, this problem has a system dimension of one, whatever its geometrical dimension. This is illustrated in Figures 4.6 and 4.7, which represent the communication overhead in the form of Equations 3.10 and 3.13. We show the overhead for the simple two-dimensional decompositions described for the Clayton and Meier applications of Table 4.2. We increase the range R of the ``interaction'' in Figure 4.7(a),(b) from small (nearest-neighbor) R to infinite R, the long-range force. As shown in Figure 4.7(a),(b), the communication overhead decreases as R increases, with a limiting form, essentially independent of the grain size, in the long-range limit. Noederlinger [ Lorenz:87a ] and Theiler [Theiler:87a;87b] used such a ``long-range'' algorithm for calculating the correlation dimension of a chaotic dynamical system. This measures the essential number of degrees of freedom of a complex system, which in this case was a time series obtained from a plasma. The correlation function involved studying histograms of the distances between pairs of data points in the time series.
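The pair counting and the qualitative fall of communication overhead with interaction range can be illustrated in a few lines. This is a toy model only, not the book's Equations 3.10 and 3.13: communication is taken to grow with the boundary shell of width R around a two-dimensional grain, while calculation grows with the number of in-range neighbors per point, so their ratio shrinks as R grows.

```python
def pair_count(N):
    """Number of distinct pairs among N points: N(N-1)/2."""
    return N * (N - 1) // 2

def overhead(n, R):
    """Toy communication-to-calculation ratio for a square 2-D grain of
    n points and interaction range R (illustrative scaling only)."""
    side = int(n ** 0.5)
    comm = 4 * side * R + 4 * R * R     # points within R of the grain edge
    calc = n * ((2 * R + 1) ** 2 - 1)   # in-range neighbor pairs per point
    return comm / calc

print(pair_count(1000))                     # 499500
print(overhead(256, 1) > overhead(256, 4))  # True: overhead falls with R
```

This is the sense in which the long-range force is, perhaps surprisingly, communication-cheap relative to its O(N^2) arithmetic.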
Figure 4.6: Some Examples of Communication Overhead as a Function of Increasing Range of Interaction R.
Figure 4.7: General Form of Communication Overhead for (a) Increasing and (b) Infinite Range R
Fucito and Solomon [Fucito:85b;85f] studied a simple Coulomb gas, which naturally has a long-range force. However, this was a Monte Carlo calculation, implemented efficiently by an ingenious algorithm that cannot directly use the analysis of the particle-dynamics (time-stepped) case shown in Figure 4.7. Monte Carlo is typically harder to parallelize than time evolution, where all ``particles'' can be evolved in time together. Monte Carlo updates can only proceed simultaneously if they involve disjoint particle sets. This implies the red-black ordering of Figure 4.5 and required a difficult asynchronous algorithm in the irregular melting problem of Section 14.2. Johnson's application was technically the hardest in our list of pioneers in Table 4.2.
Finally, Section 4.5 uses cellular automata ideas that lead to a synchronous architecture for grain dynamics, which, if implemented directly as in Section 9.2, would be naturally loosely synchronous. This illustrates that the problem architecture depends on the particular numerical approach.
Guy Robinson
HPFA Applications and Paradigms
Quantum chromodynamics (QCD) is the proposed theory of the so-called strong interactions that bind quarks and gluons together to form hadrons-the constituents of nuclear matter such as the proton and neutron. It also mediates the forces between hadrons and thus controls the formation of nuclei. The fundamental properties of QCD cannot be directly tested, but a wealth of indirect evidence supports this theory. The problem is that QCD is a nonlinear theory that is not analytically solvable. For the equivalent quantum field theories of weaker forces such as electromagnetism, approximations using perturbation expansions in the interaction strength give very accurate results. However, since the QCD interaction is so strong, perturbative approximations often fail. Consequently, few precise predictions can be made from the theory. This led to the introduction of a non-perturbative approximation based on discretizing four-dimensional space-time onto a lattice of points, giving a theory called lattice QCD, which can be simulated on a computer.
Most of the work on lattice QCD has been directed towards deriving the masses (and other properties) of the large number of hadrons that have been found in experiments using high-energy particle accelerators. This would provide hard evidence for QCD as the theory of the strong force. Other calculations have also been performed; in particular, the properties of QCD at finite (i.e., non-zero) temperature and/or density have been determined. These calculations model the conditions of matter in the early stages of the evolution of the universe, just after the Big Bang. Lattice calculations of other quantum field theories, such as the theory of the weak nuclear force, have also been performed. For example, numerical calculations have given estimates for the mass of the Higgs boson, which is currently the Holy Grail of experimental high energy physics and was one of the motivating factors for the construction of the now-cancelled $10 billion Superconducting Super Collider.
One of the major problems in solving lattice QCD on a computer is that the simulation of the quark interactions requires the computation of a large, highly non-local matrix determinant, which is extremely compute-intensive. We will discuss methods for calculating this determinant later. For the moment, however, we note that, physically, the determinant arises from the dynamics of the quarks. The simplest way to proceed is thus to ignore the quark dynamics and work in the so-called quenched approximation, with only gluonic degrees of freedom. This should be a reasonable approximation, at least for heavy quarks. However, even solving this approximate theory requires enormous computing power. Current state-of-the-art quenched QCD calculations are performed on lattices of size $24^3 \times 48$, which involves the numerical solution of a 21,233,664-dimensional integral. The only way of solving such an integral is by Monte Carlo methods.
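The 21,233,664-dimensional integral quoted above is consistent with a $24^3 \times 48$ lattice, assuming the standard counting of four link matrices per site and eight real parameters per SU(3) matrix; the bookkeeping is easy to check:

```python
# Degrees of freedom in a quenched QCD integral on a 24^3 x 48 lattice:
# 4 link matrices per site, 8 real parameters per SU(3) matrix.
sites = 24 ** 3 * 48
dimension = sites * 4 * 8
print(dimension)  # 21233664
```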
In order to explain the computations for QCD, we use the Feynman path integral formalism [ Feynman:65a ]. For any field theory described by a Lagrangian density $\mathcal{L}(\phi)$, the dynamics of the fields are determined through the action functional
$$ S[\phi] = \int d^4x \, \mathcal{L}(\phi). $$
In this language, the measurement of a physical observable represented by an operator $\Omega$ is given as the expectation value
$$ \langle \Omega \rangle = \frac{1}{Z} \int [d\phi] \, \Omega(\phi) \, e^{-S[\phi]}, $$
where the partition function Z is
$$ Z = \int [d\phi] \, e^{-S[\phi]}. $$
In these expressions, the integral $\int [d\phi]$ indicates a sum over all possible configurations of the field $\phi$. A typical observable would be a product of fields $\phi(x)\phi(y)$, which says how the fluctuations in the field are correlated and, in turn, tells us something about the particles that can propagate from point x to point y. The appropriate correlation functions give us, for example, the masses of the various particles in the theory. Thus, to evaluate almost any quantity in field theories like QCD, one must simply evaluate the corresponding path integral. The catch is that the integrals range over an infinite-dimensional space.
To put the field theory onto a computer, we begin by discretizing space and time into a lattice of points. Then the functional integral is simply defined as the product of the integrals over the fields at every site i of the lattice:
$$ \int [d\phi] = \prod_i \int d\phi_i \, . $$
Restricting space and time to a finite box, we end up with a finite (but very large) number of ordinary integrals, something we might imagine doing directly on a computer. However, the high dimensionality of these integrals renders conventional mesh techniques impractical. Fortunately, the presence of the exponential means that the integrand is sharply peaked in one region of configuration space. Hence, we resort to a statistical treatment and use Monte Carlo type algorithms to sample the important parts of the integration region [ Binder:86a ].
Monte Carlo algorithms typically begin with some initial configuration of fields, and then make pseudorandom changes on the fields such that the ultimate probability P of generating a particular field configuration is proportional to the Boltzmann factor,
$$ P(\phi) \propto e^{-S(\phi)}, $$
where $S(\phi)$ is the action associated with the given configuration. There are several ways to implement such a scheme, but for many theories the simple Metropolis algorithm [ Metropolis:53a ] is effective. In this algorithm, a new configuration is generated by updating a single variable in the old configuration and calculating the change in action (or energy)
$$ \Delta S = S_{\rm new} - S_{\rm old}. $$
If $\Delta S \le 0$, the change is accepted; if $\Delta S > 0$, the change is accepted with probability $e^{-\Delta S}$. In practice, this is done by generating a pseudorandom number r in the interval [0,1] with uniform probability distribution, and accepting the change if $r < e^{-\Delta S}$. This is guaranteed to generate the correct (Boltzmann) distribution of configurations, provided ``detailed balance'' is satisfied. That condition means that the probability of proposing a change $\phi \to \phi'$ is the same as that of proposing the reverse process $\phi' \to \phi$. In practice, this is true if we never simultaneously update two fields which interact directly via the action. Note that this constraint has important ramifications for parallel computers, as we shall see below.
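The accept/reject rule just described can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the Gaussian single-site action stands in for a real lattice action.

```python
import math
import random

def metropolis_sweep(phi, action_change, delta=0.5, rng=random):
    """One Metropolis sweep over a list of field variables.

    action_change(phi, i, new) must return Delta_S for replacing
    phi[i] by new; the accept/reject step is the rule in the text:
    accept if Delta_S <= 0, else with probability exp(-Delta_S)."""
    accepted = 0
    for i in range(len(phi)):
        new = phi[i] + rng.uniform(-delta, delta)   # propose a local change
        dS = action_change(phi, i, new)
        if dS <= 0 or rng.random() < math.exp(-dS):
            phi[i] = new
            accepted += 1
    return accepted

# Toy example: independent Gaussian action S = sum(phi_i^2 / 2)
def gaussian_dS(phi, i, new):
    return 0.5 * (new ** 2 - phi[i] ** 2)

random.seed(1)
phi = [0.0] * 100
for _ in range(200):          # equilibration sweeps
    metropolis_sweep(phi, gaussian_dS)
```

For this toy action the equilibrium distribution of each variable is a unit Gaussian, so after equilibration the sample of $\phi^2$ averages to about one.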
Whichever method one chooses to generate field configurations, one updates the fields for some equilibration time of E steps, and then calculates the expectation value of $\Omega$ in Equation 4.3 from the next T configurations as
$$ \langle \Omega \rangle \approx \frac{1}{T} \sum_{t=E+1}^{E+T} \Omega(\phi_t). $$
The statistical error in Monte Carlo behaves as $1/\sqrt{N}$, where N is the number of statistically independent configurations, $N = T/\tau$, where $\tau$ is known as the autocorrelation time. This autocorrelation time can easily be large, and most of the computer time is spent in generating effectively independent configurations. The operator measurements then become a small overhead on the whole calculation.
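The effect of the autocorrelation time on error bars is commonly exposed by a binning analysis. The sketch below is illustrative: the helper name is hypothetical, and a simple AR(1) series stands in for correlated Monte Carlo measurements.

```python
import math
import random

def binned_error(data, bin_size):
    """Naive error of the mean after averaging `data` into bins.

    As bin_size grows past the autocorrelation time, this estimate
    plateaus at the true statistical error ~ 1/sqrt(N_independent),
    which is larger than the naive bin_size = 1 estimate."""
    nbins = len(data) // bin_size
    bins = [sum(data[i * bin_size:(i + 1) * bin_size]) / bin_size
            for i in range(nbins)]
    mean = sum(bins) / nbins
    var = sum((b - mean) ** 2 for b in bins) / (nbins - 1)
    return math.sqrt(var / nbins)

# Toy correlated series (AR(1); autocorrelation time ~ (1+r)/(1-r))
random.seed(2)
r, x, series = 0.9, 0.0, []
for _ in range(20000):
    x = r * x + random.gauss(0.0, 1.0)
    series.append(x)

naive = binned_error(series, 1)       # underestimates the error
plateau = binned_error(series, 200)   # bins longer than tau
```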
QCD describes the interactions between quarks in high energy physics. Currently, we know of five types (referred to as ``flavors'') of quark: up, down, strange, charm, and bottom; and expect one more (top) to show up soon. In addition to having a ``flavor,'' quarks can carry one of three possible charges known as ``color'' (this has nothing to do with color in the macroscopic world!); hence, quantum chromodynamics. The strong color force is mediated by particles called gluons, just as photons mediate the electromagnetic force. Unlike photons, though, gluons themselves carry a color charge and, therefore, interact with one another. This makes QCD a nonlinear theory, which is impossible to solve analytically. Therefore, we turn to the computer for numerical solutions.
QCD is an example of a ``gauge theory.'' These are quantum field theories that have a local symmetry described by a symmetry (or gauge) group. Gauge theories are ubiquitous in elementary particle physics: The electromagnetic interaction between electrons and photons is described by quantum electrodynamics (QED) based on the gauge group U(1); the strong force between quarks and gluons is believed to be explained by QCD based on the group SU(3); and there is a unified description of the weak and electromagnetic interactions in terms of the gauge group $SU(2) \times U(1)$. The strength of these interactions is measured by a coupling constant. This coupling constant is small for QED, so very precise analytical calculations can be performed using perturbation theory, and these agree extremely well with experiment. However, for QCD, the coupling appears to increase with distance (which is why we never see an isolated quark, since they are always bound together by the strength of the coupling between them). Perturbative calculations are therefore only possible at short distances (or large energies). In order to solve QCD at longer distances, Wilson [ Wilson:74a ] introduced lattice gauge theory, in which the space-time continuum is discretized and a discrete version of the gauge theory is derived which keeps the gauge symmetry intact. This discretization onto a lattice, which is typically hypercubic, gives a nonperturbative approximation to the theory that is successively improvable by increasing the lattice size and decreasing the lattice spacing, and provides a simple and natural way of regulating the divergences which plague perturbative approximations. It also makes the gauge theory amenable to numerical simulation by computer.
To put QCD on a computer, we proceed as follows [ Wilson:74a ], [ Creutz:83a ]. The four-dimensional space-time continuum is replaced by a four-dimensional hypercubic periodic lattice of size $N_s^3 \times N_t$, with the quarks living on the sites and the gluons living on the links of the lattice. $N_s$ is the spatial and $N_t$ the temporal extent of the lattice. The lattice has a finite spacing a. The gluons are represented by complex $3 \times 3$ SU(3) matrices associated with each link in the lattice. The 3 in SU(3) reflects the fact that there are three colors of quarks, and SU means that the matrices are unitary with unit determinant (i.e., ``special unitary''). This link matrix describes how the color of a quark changes as it moves from one site to the next. For example, as a quark is transported along a link of the lattice it can change its color from, say, red to green; hence, a red quark at one end of the link can exchange colors with a green quark at the other end. The action functional for the purely gluonic part of QCD is
$$ S_G = \beta \sum_\Box \left( 1 - \frac{1}{3} \, {\rm Re} \, {\rm Tr} \, U_\Box \right), $$
where $\beta$ is a coupling constant and
$U_\Box$ is the product of link matrices around an elementary square or plaquette on the lattice-see Figure 4.8. Essentially all of the time in QCD simulations of gluons is spent multiplying these SU(3) matrices together. The main component of this is the complex $3 \times 3$ matrix multiplication kernel, which most supercomputers can do very efficiently. As the action involves interactions around plaquettes, in order to satisfy detailed balance we can update only half the links in any one dimension simultaneously, as shown in Figure 4.9 (in two dimensions for simplicity). The partition function for full-lattice QCD including quarks is
$$ Z = \int [dU][d\bar\psi][d\psi] \; e^{-S_G(U) - \bar\psi M(U) \psi}, $$
where $M(U)$ is a large sparse matrix the size of the lattice squared. Unfortunately, since the quark or fermion variables are anticommuting Grassmann numbers, there is no simple representation for them on the computer. Instead, they must be integrated out, leaving a highly non-local fermion determinant:
$$ Z = \int [dU] \; \det M(U) \; e^{-S_G(U)}. $$
This is the basic integral one wants to evaluate numerically.
Figure 4.8: A Lattice Plaquette
Figure 4.9: Updating the Lattice
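The plaquette of Figure 4.8 is just an ordered product of four link matrices; a minimal sketch of that kernel (with illustrative helper names, checked on identity links) is:

```python
def mat_mul(a, b):
    """3x3 complex matrix product -- the kernel on which gluonic
    QCD simulations spend essentially all of their time."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def dagger(a):
    """Hermitian conjugate of a 3x3 complex matrix."""
    return [[a[j][i].conjugate() for j in range(3)] for i in range(3)]

def plaquette(u1, u2, u3, u4):
    """Ordered product U1 U2 U3^dagger U4^dagger of the four links
    around an elementary square of the lattice (Figure 4.8)."""
    return mat_mul(mat_mul(u1, u2), mat_mul(dagger(u3), dagger(u4)))

def re_trace(a):
    return sum(a[i][i] for i in range(3)).real

# With all links set to the identity, Re Tr of the plaquette is 3,
# so each term of the action S_G vanishes, as it should.
identity = [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]
```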
The biggest stumbling block preventing large QCD simulations with quarks is the presence of the determinant in the partition function. There have been many proposals for dealing with the determinant. The first algorithms tried to compute the change in the determinant when a single link variable was updated [ Weingarten:81a ]. This turned out to be prohibitively expensive. So instead, the approximate method of pseudo-fermions [ Fucito:81a ] was used. Today, however, the preferred approach is the so-called Hybrid Monte Carlo algorithm [ Duane:87a ], which is exact. The basic idea is to invent some dynamics for the variables in the system in order to evolve the whole system forward in (simulation) time, and then do a Metropolis accept/reject for the entire evolution on the basis of the total energy change. The great advantage is that the whole system is updated in one fell swoop. The disadvantage is that if the dynamics are not correct, the acceptance will be very small. Fortunately (and this is one of very few fortuitous happenings where fermions are concerned), good dynamics can be found: the Hybrid algorithm [ Duane:85a ]. This is a neat combination of the deterministic microcanonical method [ Callaway:83a ], [ Polonyi:83a ] and the stochastic Langevin method [ Parisi:81a ], [ Batrouni:85a ], which yields a quickly evolving, ergodic algorithm for both gauge fields and fermions. The computational kernel of this algorithm is the repeated solution of systems of equations of the form
$$ M^\dagger M \, x = \phi, $$
where $x$ and $\phi$ are vectors that live on the sites of the lattice. To solve these equations, one typically uses a conjugate gradient algorithm or one of its cousins, since the fermion matrix is sparse. For more details, see [ Gupta:88a ]. Such iterative matrix algorithms have as their basic component the sparse matrix-vector multiplication kernel, so again computers which do this efficiently will run QCD well.
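A minimal sketch of conjugate gradient in the matrix-free style used for the fermion matrix: the `apply_A` routine stands in for multiplication by the sparse Hermitian positive-definite operator (e.g., $M^\dagger M$); the names are illustrative, and a tiny dense example replaces the lattice operator.

```python
def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for Hermitian positive-definite A, given only a
    routine apply_A(v) that returns A v (the matrix itself is never
    formed, which is what makes the method suit sparse problems)."""
    x = [0.0] * len(b)
    r = b[:]                      # residual b - A x, with x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

On a lattice, `apply_A` would loop over sites and gather nearest-neighbor terms, which is exactly the communication pattern discussed below for parallel machines.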
Lattice QCD is truly a ``grand challenge'' computing problem. It has been estimated that it will take on the order of a TeraFLOP-year of dedicated computing to obtain believable results for the hadron mass spectrum in the quenched approximation, and adding dynamical fermions will involve many orders of magnitude more operations. Where is the computer power needed for QCD going to come from? Today, the biggest resources of computer time for research are the conventional supercomputers at the NSF and DOE centers. These centers are continually expanding their support for lattice gauge theory, but it may not be long before they are overtaken by several dedicated efforts involving concurrent computers. It is a revealing fact that the development of most high-performance parallel computers-the Caltech Cosmic Cube, the Columbia Machine, IBM's GF11, APE in Rome, the Fermilab Machine and the PAX machines in Japan-was actually motivated by the desire to simulate lattice QCD [ Christ:91a ], [ Weingarten:92a ].
As described already, Caltech built the first hypercube computer, the Cosmic Cube or Mark I, in 1983. It had 64 nodes, each of which was an Intel 8086/87 microprocessor with of memory, giving a total of about (measured for QCD). This was quickly upgraded to the Mark II hypercube with faster chips, twice the memory per node, and twice the number of nodes in 1984. Then, QCD was run on the last internal Caltech hypercube, the 128-node Mark IIIfp (built by JPL), at sustained [ Ding:90b ]. Each node of the Mark IIIfp hypercube contains two Motorola 68020 microprocessors, one for communication and the other for calculation, with the latter supplemented by one 68881 co-processor and a 32-bit Weitek floating point processor.
Norman Christ and Anthony Terrano built their first parallel computer for doing lattice QCD calculations at Columbia in 1984 [ Christ:84a ]. It had 16 nodes, each of which was an Intel 80286/87 microprocessor, plus a TRW 22-bit floating point processor with of memory, giving a total peak performance of . This was improved in 1987 using Weitek rather than TRW chips so that 64 nodes gave peak. In 1989, the Columbia group finished building their third machine: a 256-node, , lattice QCD computer [ Christ:90a ].
QCDPAX is the latest in the line of PAX (Parallel Array eXperiment) machines developed at the University of Tsukuba in Japan. The architecture is very similar to that of the Columbia machine. It is a MIMD machine configured as a two-dimensional periodic array of nodes, and each node includes a Motorola 68020 microprocessor and a 32-bit vector floating-point unit. Its peak performance is similar to that of the Columbia machine; however, it achieves only half the floating-point utilization for QCD code [ Iwasaki:91a ].
Don Weingarten initiated the GF11 project in 1984 at IBM. The GF11 is a SIMD machine comprising 576 Weitek floating point processors, each performing at to give the total peak implied by the name. Preliminary results for this project are given in [Weingarten:90a;92a].
The APE (Array Processor with Emulator) computer is basically a collection of 3081/E processors (which were developed by CERN and SLAC for use in high energy experimental physics) with Weitek floating-point processors attached. However, these floating-point processors are attached in a special way-each node has four multipliers and four adders, in order to optimize the calculations, which form the major component of all lattice QCD programs. This means that each node has a peak performance of . The first small machine-Apetto-was completed in 1986 and had four nodes yielding a peak performance of . Currently, they have a second generation of this machine with peak from 16 nodes. By 1993, the APE collaboration hopes to have completed the 2048-node ``Apecento,'' or APE-100, based on specialized VLSI chips that are software compatible with the original APE [ Avico:89a ], [ Battista:92a ]. The APE-100 is a SIMD machine with the architecture based on a three-dimensional cubic mesh of nodes. Currently, a 128-node machine is running with a peak performance of .
Table 4.3: Peak and Real Performances in MFLOPS of ``Homebrew'' QCD Machines
Not to be outdone, Fermilab has also used its high energy experimental physics emulators to construct a lattice QCD machine called ACPMAPS. This is a MIMD machine, using a Weitek floating-point chip set on each node. A 16-node machine, with a peak rate of , was finished in 1989. A 256-node machine, arranged as a hypercube of crates, with eight nodes communicating through a crossbar in each crate, was completed in 1991 [ Fischler:92a ]. It has a peak rate of , and a sustained rate of about for QCD. An upgrade of ACPMAPS is planned, with the number of nodes being increased and the present processors being replaced with two Intel i860 chips per node, giving a peak performance of per node. These performance figures are summarized in Table 4.3 . (The ``real'' performances are the actual performances obtained on QCD codes.)
Major calculations have also been performed on commercial SIMD machines, first on the ICL Distributed Array Processor (DAP) at Edinburgh University during the period from 1982 to 1987 [ Wallace:84a ], and now on the TMC Connection Machine (CM-2); and on commercial distributed memory MIMD machines like the nCUBE hypercube and Intel Touchstone Delta machines at Caltech. Currently, the Connection Machine is the most powerful commercial QCD machine available, running full QCD at a sustained rate of approximately on a CM-2 [ Baillie:89e ], [ Brickner:91b ]. However, simulations have recently been performed at a rate of on the experimental Intel Touchstone Delta at Caltech. This is a MIMD machine made up of 528 Intel i860 processors connected in a two-dimensional mesh, with a peak performance of for 32-bit arithmetic. These results compare favorably with performances on traditional (vector) supercomputers. Highly optimized QCD code runs at about per processor on a CRAY Y-MP, or on a fully configured eight-processor machine.
The current generation of commercial parallel supercomputers, represented by the CM-5 and the Intel Paragon, has a peak performance of over . There was a proposal for the development of a TeraFLOPS parallel supercomputer for QCD and other numerically intensive simulations [ Christ:91a ], [ Aoki:91a ]. The goal was to build a machine based on the CM-5 architecture in collaboration with Thinking Machines Corporation, which would be ready by 1995 at a cost of around $40 million.
It is interesting to note that when the various groups began building their ``home-brew'' QCD machines, it was clear they would outperform all commercial (traditional) supercomputers; however, now that commercial parallel supercomputers have come of age [ Fox:89n ], the situation is not so obvious. To emphasize this, we describe QCD calculations on both the home-grown Caltech hypercube and on the commercially available Connection Machine.
To make good use of MIMD distributed-memory machines like hypercubes, one should employ domain decomposition. That is, the domain of the problem should be divided into subdomains of equal size, one for each processor in the hypercube; and communication routines should be written to take care of data transfer across the processor boundaries. Thus, for a lattice calculation, the N sites are distributed among the processors using a decomposition that ensures that processors assigned to adjacent subdomains are directly linked by a communication channel in the hypercube topology. Each processor then independently works through its subdomain of sites, updating each one in turn, and only communicating with neighboring processors when doing boundary sites. This communication enforces ``loose synchronization,'' which stops any one processor from racing ahead of the others. Load balancing is achieved with equal-size domains. If the nodes contain at least two sites of the lattice, all the nodes can update in parallel, satisfying detailed balance, since loose synchronicity guarantees that all nodes will be doing black, then red sites alternately.
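The red-black (checkerboard) ordering that makes simultaneous updates safe can be sketched as follows; this is a two-dimensional illustration with hypothetical helper names, not the production decomposition code:

```python
def checkerboard(L):
    """Split the sites of an L x L periodic lattice (L even) into
    'red' and 'black' sets.  No two sites in the same set are
    nearest neighbors, so all sites of one color may be updated
    simultaneously without violating detailed balance."""
    red, black = [], []
    for x in range(L):
        for y in range(L):
            (red if (x + y) % 2 == 0 else black).append((x, y))
    return red, black

def owner(site, L, p_per_dim):
    """Map a site to the processor owning its equal-size subdomain
    (equal sizes give the load balance described in the text)."""
    x, y = site
    side = L // p_per_dim
    return (x // side, y // side)
```

Each processor then sweeps its own subdomain twice per iteration, first over its red sites, then over its black sites, exchanging boundary sites with neighbors in between.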
The characteristic timescale of the communication, $t_{\rm comm}$, corresponds roughly to the time taken to transfer a single SU(3) matrix from one node to its neighbor. Similarly, we can characterize the calculational part of the algorithm by a timescale, $t_{\rm calc}$, which is roughly the time taken to multiply together two SU(3) matrices. For all hypercubes built without floating-point accelerator chips, $t_{\rm comm} \ll t_{\rm calc}$ and, hence, QCD simulations are extremely ``efficient,'' where efficiency $\epsilon$ (Equations 3.10 and 3.11) is defined by the relation
$$ \epsilon = \frac{T_1}{k \, T_k}, $$
where $T_k$ is the time taken for k processors to perform the given calculation. Typically, such calculations have efficiencies close to one, which means they are ideally suited to this type of computation, since doubling the number of processors nearly halves the total computational time required for solution. However, as we shall see (for the Mark IIIfp hypercube, for example), the picture changes dramatically when fast floating-point chips are used; then $t_{\rm comm}$ is no longer small compared to $t_{\rm calc}$, and one must take some care in coding to obtain maximum performance.
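The efficiency relation can be illustrated with a toy cost model; the site counts and timings below are hypothetical, not Mark IIIfp measurements:

```python
def efficiency(n_sites, k, t_calc, t_comm, boundary_sites):
    """epsilon = T_1 / (k * T_k) for a simple cost model in which
    each of the k nodes updates n_sites/k sites and pays t_comm
    for each of its boundary_sites."""
    t_1 = n_sites * t_calc
    t_k = (n_sites / k) * t_calc + boundary_sites * t_comm
    return t_1 / (k * t_k)
```

With `t_comm` negligible the model gives an efficiency of one; as `t_comm` approaches `t_calc`, which is what fast floating-point chips effectively do, the efficiency falls and communication must be optimized.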
Rather than describe every calculation done on the Caltech hypercubes, we shall concentrate on one calculation that has been done several times as the machine evolved-the heavy quark potential calculation (``heavy'' because the quenched approximation is used).
QCD provides an explanation of why quarks are confined inside hadrons, since lattice calculations reveal that the inter-quark potential rises linearly as the separation between the quarks increases. Thus, the attractive force (the derivative of the potential) is independent of the separation, unlike other forces, which usually decrease rapidly with distance. This force, called the ``string tension,'' is carried by the gluons, which form a kind of ``string'' between the quarks. On the other hand, at short distances, quarks and gluons are ``asymptotically free'' and behave like electrons and photons, interacting via a Coulomb-like force. Thus, the quark potential V is written as
$$ V(R) = -\frac{\alpha}{R} + \sigma R, $$
where R is the separation of the quarks, $\alpha$ is the coefficient of the Coulombic potential, and $\sigma$ is the string tension. In fitting experimental charmonium data to this Coulomb plus linear potential, Eichten et al. [ Eichten:80a ] estimated these parameters, obtaining a string tension $\sigma = 0.18 \, {\rm GeV}^2$. Thus, a goal of the lattice calculations is to reproduce these numbers. En route to this goal, it is necessary to show that the numbers from the lattice are ``scaling,'' that is, if one calculates a physical observable on lattices with different spacings, then one gets the same answer. This means that the artifacts due to the finiteness of the lattice spacing have disappeared and continuum physics can be extracted.
The first heavy quark potential calculation using a Caltech hypercube was performed on the 64-node Mark I in 1984 on a lattice with ranging from to [ Otto:84a ]. The value of was found to be and the string tension (converting to the dimensionless ratio) . The numbers are quite a bit off from the charmonium data but the string tension did appear to be scaling, albeit in the narrow window .
The next time around, in 1986, the 128-node Mark II hypercube was used on a lattice with [ Flower:86b ]. The dimensionless string tension decreased somewhat to 83, but clear violations of scaling were observed: The lattice was still too coarse to see continuum physics.
Therefore, the last (1989) calculation using the Caltech/JPL 32-node Mark IIIfp hypercube concentrated on one value, , and investigated different lattice sizes: , , , [ Ding:90b ]. Scaling was not investigated; however, the values of and , that is, = 0.15GeV , compare favorably with the charmonium data. This work is based on about 1300 CPU hours on the 32-node Mark IIIfp hypercube, which has a performance of roughly twice a CRAY X-MP processor. The whole 128-node machine performs at . As each node runs at , this corresponds to a speedup of 100 and, hence, an efficiency of 78 percent. These figures are for the most highly optimized code. The original version of the code written in C ran on the Motorola chips at and on the Weitek chips at . The communication time, which is roughly the same for both, is less than a 2 percent overhead for the former but nearly 30 percent for the latter. When the computationally intensive parts of the calculation are written in assembly code for the Weitek, this overhead becomes almost 50 percent. This cost of communication, shown in lines two and three of Table 4.4, is dominated by the hardware/software message startup overhead (latency), because for the Mark IIIfp the node-to-node communication time, $t_{\rm comm}$, is given by
$$ t_{\rm comm} = t_{\rm latency} + W \, t_{\rm word}, $$
where W is the number of words transmitted. To speed up the communication, we update all even (or odd) links (eight in our case) in each node, allowing us to transfer eight matrix products at a time, instead of just sending one in each message. This reduces the message startup overhead by a factor of eight. On all hypercubes with fast floating-point chips-and on most hypercubes without these chips for less computationally intensive codes-such ``vectorization'' of communication is often important. In Figure 4.10, the speedups for many different lattice sizes are shown. For the largest lattice size, the speedup is 100 on the 128-node machine. The speedup is almost linear in the number of nodes. As the total lattice volume increases, the speedup increases, because the ratio of calculation to communication increases. For more information on this performance analysis, see [ Ding:90c ].
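The benefit of batching follows directly from the latency-dominated cost model; the numbers below are illustrative only (18 words stands for one $3 \times 3$ complex matrix in 32-bit precision, and the latency of 100 time units is hypothetical):

```python
def transfer_time(n_messages, words_each, t_latency, t_word):
    """Total communication time for n separate messages vs. one
    batched message, under t_comm = t_latency + W * t_word."""
    separate = n_messages * (t_latency + words_each * t_word)
    batched = t_latency + n_messages * words_each * t_word
    return separate, batched

# Eight 3x3 complex matrices sent one per message vs. in one message
sep, bat = transfer_time(8, 18, 100.0, 1.0)
```

The startup cost is paid once instead of eight times, so when latency dominates, the batched transfer approaches an eightfold saving.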
Table 4.4: Link Update Time (msec) on Mark IIIfp Node for Various Levels of Programming
Figure 4.10: Speedups for QCD on the Mark IIIfp
The Connection Machine Model CM-2 is also very well suited for large-scale simulations of QCD. The CM-2 is a distributed-memory, single-instruction, multiple-data (SIMD), massively parallel processor comprising up to 65,536 ($2^{16}$) processors [Hillis:85a;87a]. Each processor consists of an arithmetic-logic unit (ALU), 64 Kbits or 256 Kbits of random-access memory (RAM), and a router interface to perform communications among the processors. There are 16 processors and a router per custom VLSI chip, with the chips interconnected as a 12-dimensional hypercube. Communications among processors within a chip work essentially like a crossbar interconnect. The router can do general communications, but only local communication is required for QCD, so we use the fast nearest-neighbor communication software called NEWS. The processors deal with one bit at a time. Therefore, the ALU can compute any two Boolean functions as output from three inputs, and all datapaths are one bit wide. In the current version of the Connection Machine (the CM-2), groups of 32 processors (two chips) share a 32-bit (or 64-bit) Weitek floating-point chip and a transposer chip, which changes 32 bits stored bit-serially within 32 processors into 32 32-bit words for the Weitek, and vice versa.
The high-level languages on the CM, such as *Lisp and CM-Fortran, compile into an assembly language called Parallel Instruction Set (Paris). Paris regards the bit-serial processors as the fundamental units in the machine. However, floating-point computations are not very efficient in the Paris model. This is because in Paris, 32-bit floating-point numbers are stored ``fieldwise''; that is, successive bits of the word are stored at successive memory locations of each processor's memory. However, 32 processors share one Weitek chip, which deals with words stored ``slicewise''-that is, across the processors, one bit in each. Therefore, to do a floating-point operation, Paris loads in the fieldwise operands, transposes them slicewise for the Weitek (using the transposer chip), does the operation, and transposes the slicewise result back to fieldwise for memory storage. Moreover, every operation in Paris is an atomic process; that is, two operands are brought from memory and one result is stored back to memory, so no use is made of the Weitek registers for intermediate results. Hence, to improve the performance of the Weiteks, a new assembly language called CM Instruction Set (CMIS) has been written, which models the local architectural features much better. In fact, CMIS ignores the bit-serial processors and thinks of the machine in terms of the Weitek chips. Thus, data can be stored slicewise, eliminating all the transposing back and forth. CMIS allows effective use of the Weitek registers, creating a memory hierarchy, which, combined with the internal buses of the Weiteks, offers increased bandwidth for data motion.
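The fieldwise-to-slicewise conversion described above is simply a 32x32 bit-matrix transpose. A software sketch of the operation (on the CM-2 the dedicated transposer chip does this in hardware; the function name is illustrative):

```python
def transpose_bits(words, width=32):
    """Fieldwise -> slicewise conversion as a bit-matrix transpose:
    bit j of input word i becomes bit i of output word j, turning
    32 bit-serial fields into 32 word-wide slices (and back)."""
    out = [0] * width
    for i, w in enumerate(words):
        for j in range(width):
            if (w >> j) & 1:
                out[j] |= 1 << i
    return out
```

Applying the transpose twice recovers the original layout, which is why Paris must pay the conversion cost in both directions around every floating-point operation, and why the slicewise CMIS storage eliminates it entirely.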
When the arithmetic part of the program is rewritten in CMIS (just as on the Mark IIIfp when it was rewritten in assembly code), the communications become a bottleneck. Therefore, we need also to speed up the communication part of the code. On the CM-2, this is done using the ``bi-directional multi-wire NEWS'' system. As explained above, the CM chips (each containing 16 processors) are interconnected in a 12-dimensional hypercube. However, since there are two CM chips for each Weitek floating-point chip, the floating-point hardware is effectively wired together as an 11-dimensional hypercube, with two wires in each direction. This makes it feasible to do simultaneous communications in both directions of all four space-time directions in QCD-bidirectional multiwire NEWS-thereby reducing the communication time by a factor of eight. Moreover, the data rearrangement necessary to make use of this multiwire NEWS further speeds up the CMIS part of the code by a factor of two.
In 1990-1992, the Connection Machine was the most powerful commercial QCD machine available: the ``Los Alamos collaboration'' ran full QCD at a sustained rate of on a CM-2 [ Brickner:91a ]. As was the case for the Mark IIIfp hypercube, in order to obtain this performance, one must resort to writing assembly code for the Weitek chips and for the communication. Our original code, written entirely in the CM-2 version of *Lisp, achieved around [ Baillie:89e ]. As shown in Table 4.5 , this code spends 34 percent of its time doing communication. When we rewrote the most computationally intensive part in the assembly language CMIS, this rose to 54 percent. Then when we also made use of ``multi-wire NEWS'' (to reduce the communication time by a factor of eight), it fell to 30 percent. The Intel Delta and Paragon, as well as Thinking Machines CM-5, passed the CM-2 performance levels in 1993, but here optimization is not yet complete [ Gupta:93a ].
Table 4.5: Fermion Update Time (sec) on Connection Machine for Various Levels of Programming
The status of lattice QCD may be summed up as: under way. Already there have been some nice results in the context of the quenched approximation, but the lattices are still too coarse and too small to give definitive results. Results for full QCD are going to take orders of magnitude more computer time, but we now have an algorithm-Hybrid Monte Carlo-which puts real simulations within reach.
When will the computer power be sufficient? In Figure 4.11 , we plot the horsepower of various QCD machines as a function of the year they started to produce physics results. The performance plotted in this case is the real sustained rate on actual QCD codes. The surprising fact is that the rate of increase is very close to exponential, yielding a factor of 10 every two years! On the same plot, we show our estimate of the computer power needed to redo correct quenched calculations on a lattice. This estimate is also a function of time, due to algorithm improvements.
Figure 4.11:
MFLOPS for QCD Calculations
Extrapolating these trends, we see the outlook for lattice QCD is rather bright. Reasonable results for the phenomenologically interesting physical observables should be available within the quenched approximation in the mid-1990s. With the same computer power, we will be able to redo today's quenched calculations using dynamic fermions (but still on today's size of lattice). This will tell us how reliable the quenched approximation is. Finally, results for the full theory with dynamical fermions on a lattice should follow early in the next century (!), when computers are two or three orders of magnitude more powerful.
Spin models are simple statistical models of real systems, such as magnets, which exhibit the same behavior and hence provide an understanding of the physical mechanisms involved. Despite their apparent simplicity, most of these models are not exactly soluble by present theoretical methods. Hence, computer simulation is used. Usually, one is interested in the behavior of the system at a phase transition; the computer simulation reveals where the phase boundaries are, what the phases on either side are, and how the properties of the system change across the phase transition. There are two varieties of spins: discrete or continuous valued. In both cases, the spin variables are put on the sites of the lattice and only interact with their nearest neighbors. The partition function for a spin model is

Z = Σ_{configurations} e^(−S),

with the action being of the form

S = −β Σ_⟨ij⟩ s_i s_j,

where ⟨ij⟩ denotes nearest neighbors, s_i is the spin at site i, and β is a coupling parameter which is proportional to the interaction strength and inversely proportional to the temperature. A great deal of work has been done over the years in finding good algorithms for computer simulations of spin models; recently some new, much better, algorithms have been discovered. These so-called cluster algorithms are described in detail in Section 12.6 . Here, we shall describe results obtained from using them to perform large-scale Monte Carlo simulations of several spin models-both discrete and continuous.
The Ising model is the simplest model for ferromagnetism that predicts phase transitions and critical phenomena. The spins are discrete and have only two possible states. This model, introduced by Lenz in 1920 [ Lenz:20a ], was solved in one dimension by Ising in 1925 [ Ising:25a ], and in two dimensions by Onsager in 1944 [ Onsager:44a ]. However, it has not been solved analytically in three dimensions, so Monte Carlo computer simulation methods have been one of the methods used to obtain numerical solutions. One of the best available techniques for this is the Monte Carlo Renormalization Group (MCRG) method [ Wilson:80a ], [ Swendsen:79a ]. The Ising model exhibits a second-order phase transition in d=3 dimensions at a critical temperature T_c. As T approaches T_c, the correlation length ξ diverges as a power law with critical exponent ν:

ξ ~ |T − T_c|^(−ν),

and the pair correlation function at T_c falls off to zero with distance r as a power law defining the critical exponent η:

G(r) ~ r^(−(d−2+η)).

T_c, ν, and η determine the critical behavior of the 3-D Ising model, and it is their values we wish to determine using MCRG.
In 1984, this was done by Pawley, Swendsen, Wallace and Wilson [ Pawley:84a ] in Edinburgh on the ICL DAP computer with high statistics. They ran on four lattice sizes- , , and -measuring seven even and six odd spin operators. We are essentially repeating their calculation on the new AMT DAP computer. Why should we do this? First, to investigate finite size effects-we have run on the biggest lattice used by Edinburgh, , and on a bigger one, . Second, to investigate truncation effects-qualitatively the more operators we measure for MCRG, the better, so we have included 53 even and 46 odd operators. Third, we are making use of the new cluster-updating algorithm due to Swendsen and Wang [ Swendsen:87a ], implemented according to Wolff [ Wolff:89b ]. Fourth, we would like to try to measure another critical exponent more accurately-the correction-to-scaling exponent , which plays an important role in the analysis.
The idea behind MCRG is that the correlation length diverges at the critical point, so that certain quantities should be invariant under ``renormalization'', which here means a transformation of the length scale. On the lattice, we can double the lattice size by, for example, ``blocking'' the spin values on a square plaquette into a single spin value on a lattice with 1/4 the number of sites. For the Ising model, the blocked spin value is given the value taken by the majority of the 4 plaquette spins, with a random tie-breaker for the case where there are 2 spins in either state. Since quantities are only invariant under this MCRG procedure at the critical point, this provides a method for finding the critical point.
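As a sketch of the blocking step just described (illustrative Python, not the DAP code), majority-rule blocking of a 2 x 2 plaquette of +-1 spins, with a random tie-breaker, looks like:

```python
import random

def block_spins(spins, rng=None):
    """Majority-rule blocking of an L x L array of +-1 Ising spins
    (L even) down to (L/2) x (L/2); 2-2 ties are broken randomly."""
    rng = rng or random.Random(0)
    L = len(spins)
    out = [[0] * (L // 2) for _ in range(L // 2)]
    for i in range(0, L, 2):
        for j in range(0, L, 2):
            s = spins[i][j] + spins[i][j+1] + spins[i+1][j] + spins[i+1][j+1]
            if s > 0:
                out[i//2][j//2] = 1
            elif s < 0:
                out[i//2][j//2] = -1
            else:                      # 2 up, 2 down: random tie-breaker
                out[i//2][j//2] = rng.choice([-1, 1])
    return out

blocked = block_spins([[1, 1, -1, -1],
                       [1, -1, -1, -1],
                       [1, 1, 1, -1],
                       [1, 1, -1, 1]])
# 4x4 -> 2x2; the top-left plaquette has 3 up, 1 down -> +1
```

Iterating this map halves the lattice at each blocking level, which is how the different levels compared in the MCRG analysis are generated.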
In order to calculate the quantities of interest using MCRG, one must evaluate the spin operators. In [ Pawley:84a ], the calculation was restricted to seven even spin operators and six odd; we evaluated 53 and 46, respectively [ Baillie:91d ]. Specifically, we decided to evaluate the most important operators in a cube [ Baillie:88h ]. To determine the critical coupling (or inverse temperature), K_c, one performs independent Monte Carlo simulations on a large lattice L and on smaller lattices S, and compares the operators measured on the large lattice blocked m times more than the smaller lattices; the coupling is critical when they are the same. Since the effective lattice sizes are the same, unknown finite size effects should cancel. The critical exponents are obtained directly from the eigenvalues λ of the stability matrix, which measures changes between different blocking levels, according to λ = 2^y, where y is the associated renormalization group exponent. In particular, the leading even eigenvalue gives ν, and, similarly, η follows from the leading odd eigenvalue.
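The eigenvalue-to-exponent step can be written in a few lines. Assuming a scale factor of two per blocking, an eigenvalue λ corresponds to an exponent y = ln(λ)/ln(2), and ν = 1/y for the leading even eigenvalue. The eigenvalue 3.0 below is an illustrative number, not a measured result:

```python
import math

def exponent_from_eigenvalue(lam, b=2.0):
    """Renormalization group exponent y from a stability-matrix
    eigenvalue lam under blocking with scale factor b: lam = b**y."""
    return math.log(lam) / math.log(b)

# A leading even eigenvalue near 3.0 under scale-factor-2 blocking:
y_t = exponent_from_eigenvalue(3.0)
nu = 1.0 / y_t   # close to the accepted 3-D Ising value of about 0.63
```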
The Distributed Array Processor (DAP) is a SIMD computer consisting of bit-serial processing elements (PEs) configured as a cyclic two-dimensional grid with nearest-neighbor connectivity. The Ising model computer simulation is well suited to such a machine since the spins can be represented as single-bit (logical) variables. In three-dimensions, the system of spins is configured as an simple cubic lattice, which is ``crinkle mapped'' onto the DAP by storing pieces of each of M planes in each PE: , with . Our Monte Carlo simulation uses a hybrid algorithm in which each sweep consists of 10 standard Metropolis [ Metropolis:53a ] spin updates followed by one cluster update using Wolff's single-cluster variant of the Swendsen and Wang algorithm. On the lattice, the autocorrelation time of the magnetization reduces from sets of 100 sweeps for Metropolis alone to sets of 10 Metropolis plus one cluster update for the hybrid algorithm. In order to measure the spin operators, , the DAP code simply histograms the spin configurations so that an analysis program can later pick out each particular spin operator using a look-up table. Currently, the code requires the same time to do one histogram measurement, one Wolff single-cluster update or 100 Metropolis updates. Therefore, our hybrid of 10 Metropolis plus one cluster update takes about the same time as a measurement. On a DAP 510, this hybrid update takes on average 127 secs (13.5 secs) for the ( ) lattices. We have performed simulations on and lattices at two values of the coupling: (Edinburgh's best estimate of the critical coupling) and . We accumulated measurements for each of the simulations and for the so that the total time used for this calculation is roughly 11,000 hours. For error analysis, this is divided into bins of measurements.
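For readers unfamiliar with the cluster step in this hybrid, here is a minimal serial sketch of Wolff's single-cluster update for the two-dimensional Ising model. This is plain Python on a flat spin array, not the bit-packed DAP implementation; the lattice size and coupling below are illustrative:

```python
import math, random

def wolff_update(s, L, beta, rng):
    """One Wolff single-cluster flip on an L x L periodic Ising lattice.
    s is a flat list of +-1 spins; a bond to an aligned neighbour is
    added with probability 1 - exp(-2*beta). Returns the cluster size."""
    p_add = 1.0 - math.exp(-2.0 * beta)
    seed = rng.randrange(L * L)
    old = s[seed]
    s[seed] = -old                 # flipping on addition also marks "visited"
    stack, size = [seed], 1
    while stack:
        i = stack.pop()
        x, y = i % L, i // L
        for nx, ny in (((x + 1) % L, y), ((x - 1) % L, y),
                       (x, (y + 1) % L), (x, (y - 1) % L)):
            j = ny * L + nx
            if s[j] == old and rng.random() < p_add:
                s[j] = -old
                stack.append(j)
                size += 1
    return size

rng = random.Random(1)
L = 16
s = [rng.choice([-1, 1]) for _ in range(L * L)]
sizes = [wolff_update(s, L, beta=0.4407, rng=rng) for _ in range(100)]
```

The hybrid sweep described above simply interleaves a number of Metropolis sweeps with one such cluster flip.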
In analyzing our results, the first thing we have to decide is the order in which to arrange our 53 even and 46 odd spin operators. Naively, they can be arranged in order of increasing total distance between the spins [ Baillie:88h ] (as was done in [ Pawley:84a ]). However, the ranking of a spin operator is determined physically by how much it contributes to the energy of the system. Thus, we did our analysis initially with the operators in the naive order to calculate their energies, then subsequently we used the ``physical'' order dictated by these energies. This physical order of the first 20 even operators is shown in Figure 4.12 with 6 of Edinburgh's operators indicated; the 7th Edinburgh operator (E-6) is our 21st. This order is important in assessing the systematic effects of truncation, as we are going to analyze our data as a function of the number of operators included. Specifically, we successively diagonalize the , , , ( for even, for odd) stability matrix to obtain its eigenvalues and, thus, the critical exponents.
Figure 4.12:
Our Order for Even Spin Operators
We present our results in terms of the eigenvalues of the even and odd parts of the stability matrix. The leading even eigenvalue on the first four blocking levels starting from the lattice is plotted against the number of operators included in the analysis in Figure 4.13 , and on the first five blocking levels starting from the lattice in Figure 4.14 . Similarly, the leading odd eigenvalues for and lattices are shown in Figures 4.15 and 4.16 , respectively. First of all, note that there are significant truncation effects-the values of the eigenvalues do not settle down until at least 30 and perhaps 40 operators are included. We note also that our value agrees with Edinburgh's when around 7 operators are included-this is a significant verification that the two calculations are consistent. With most or all of the operators included, our values on the two different lattice sizes agree, and the agreement improves with increasing blocking levels. Thus, we feel that we have overcome the finite size effects so that a lattice is just large enough. However, the advantage in going to the larger lattice is obvious in Figures 4.14 and 4.16 : There, we can perform one more blocking, which reveals that the results on the fourth and fifth blocking levels are consistent. This means that we have eliminated most of the transient effects near the fixed point in the MCRG procedure. We also see that the main limitation of our calculation is statistics-the error bars are still rather large for the highest blocking level.
Now in order to obtain values for ν and η, we must extrapolate our results from a finite number of blocking levels to an infinite number. This is done by fitting the corresponding eigenvalues at blocking level n according to

λ(n) = λ(∞) + c 2^(−nω),

where λ(∞) is the extrapolated value and ω is the correction-to-scaling exponent. Therefore, we first need to calculate ω, which comes directly from the second leading even eigenvalue. Our best estimate is in the interval -0.85, and we use the value 0.85 for the purpose of extrapolation, since it gives the best fits. The final results are , , where the first errors are statistical and the second errors are estimates of the systematic error coming from the uncertainty in ω.
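With the correction-to-scaling exponent held fixed, the extrapolation reduces to a linear least-squares fit, since a form such as λ(n) = λ(∞) + c·2^(−nω) is linear in the variable x_n = 2^(−nω). A sketch with synthetic data (not our measured eigenvalues):

```python
def extrapolate_eigenvalue(levels, lams, omega=0.85):
    """Least-squares fit of lam(n) = lam_inf + c * 2**(-n*omega) with
    omega held fixed; linear in x_n = 2**(-n*omega)."""
    xs = [2.0 ** (-n * omega) for n in levels]
    n = len(xs)
    mx, my = sum(xs) / n, sum(lams) / n
    c = (sum((x - mx) * (y - my) for x, y in zip(xs, lams))
         / sum((x - mx) ** 2 for x in xs))
    lam_inf = my - c * mx
    return lam_inf, c

# Synthetic check: data generated from lam_inf = 3.0, c = -0.2.
levels = [1, 2, 3, 4, 5]
lams = [3.0 - 0.2 * 2.0 ** (-n * 0.85) for n in levels]
lam_inf, c = extrapolate_eigenvalue(levels, lams)
```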
Figure 4.13:
Leading Even Eigenvalue on
Lattice
Figure 4.14:
Leading Even Eigenvalue on
Lattice
Figure 4.15:
Leading Odd Eigenvalue on
Lattice
Figure 4.16:
Leading Odd Eigenvalue on
Lattice
Finally, perhaps the most important number, because it can be determined the most accurately, is the critical coupling K_c. By comparing the fifth blocking level on the larger lattice to the fourth on the smaller lattice for both coupling values and taking a weighted mean, we obtain our estimate, where again the first error is statistical and the second is systematic.
Thus, MCRG calculations give us very accurate values for the three critical parameters K_c, ν, and η, and give a reasonable estimate for ω. Each parameter is obtained independently and directly from the data. We have shown that truncation and finite-size errors at all but the highest blocking level have been reduced to below the statistical errors. Future high statistics simulations on lattices will significantly reduce the remaining errors and allow us to determine the exponents very accurately.
The q-state Potts model [ Potts:52a ] consists of a lattice of spins s_i, each of which can take q different values, and whose Hamiltonian is

H = −Σ_⟨ij⟩ δ(s_i, s_j).
For q=2 , this is equivalent to the Ising model. The Potts model is thus a simple extension of the Ising model; however, it has a much richer phase structure, which makes it an important testing ground for new theories and algorithms in the study of critical phenomena [ Wu:82a ].
Monte Carlo simulations of Potts models have traditionally used local algorithms such as that of Metropolis et al. [ Metropolis:53a ]. However, these algorithms have the major drawback that near a phase transition, the autocorrelation time (the number of sweeps needed to generate a statistically independent configuration) increases approximately as L^2, where L is the linear size of the lattice. New algorithms have recently been developed that dramatically reduce this ``critical slowing down'' by updating clusters of spins at a time (these algorithms are described in Section 12.6 ). The original cluster algorithm of Swendsen and Wang (SW) was implemented for the Potts model [ Swendsen:87a ], and there is a lot of interest in how well cluster algorithms perform for this model. At present, there are very few theoretical results known about cluster algorithms, and theoretical advances are most likely to come from first studying the simplest possible models.
We have made a high statistics study of the SW algorithm and the single cluster Wolff algorithm [ Wolff:89b ], as well as a number of variants of these algorithms, for the q=2 and q=3 Potts models in two dimensions [ Baillie:90n ]. We measured the autocorrelation time in the energy (a local operator) and the magnetization (a global one) on a range of lattice sizes. About 10 million sweeps were required for each lattice size in order to measure autocorrelation times to within about 1 percent. From these values, we can extract the dynamic critical exponent z, given by τ ~ L^z, where τ is measured at the infinite volume critical point (which is known exactly for the two-dimensional Potts model).
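The exponent z is extracted by fitting the measured autocorrelation times against lattice size on a log-log scale. A minimal sketch (the numbers are synthetic, not our measurements):

```python
import math

def fit_dynamic_exponent(Ls, taus):
    """Least-squares slope of ln(tau) against ln(L): tau ~ L**z
    at the critical point."""
    xs = [math.log(L) for L in Ls]
    ys = [math.log(t) for t in taus]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic check: data generated with z = 0.25 should return that slope.
Ls = [16, 32, 64, 128, 256]
taus = [0.5 * L ** 0.25 for L in Ls]
z = fit_dynamic_exponent(Ls, taus)
```

In practice the hard part is not the fit but the error analysis, since autocorrelation-time estimates on the largest lattices carry the largest uncertainties.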
The simulations were performed on a number of different parallel computers. For the smaller lattice sizes, it is possible to run independent simulations on each processor of a parallel machine, enabling us to obtain 100 percent efficiency by running 10 or 20 runs for each lattice size in parallel, using different random number streams. These calculations were done using a 32-processor Meiko Computing Surface, a 20-processor Sequent Symmetry, a 20-processor Encore Multimax, and a 96-processor BBN GP1000 Butterfly, as well as a network of SUN workstations. The calculations took approximately 15,000 processor-hours. For the largest lattice sizes, a parallel cluster algorithm was required, due to the large amount of calculation (and memory) required. We have used the self-labelling algorithm described in Section 12.6 , which gives fairly good efficiencies of about 70 percent on the machines we have used (an nCUBE-1 and a Symult S2010), by doing multiple runs of 32 nodes each for the smaller lattice, and 64 nodes for the larger. Since this problem does not vectorize, using all 512 nodes of the nCUBE gives a performance approximately five times that of a single-processor CRAY X-MP, while all 192 nodes of the Symult are equivalent to about six CRAYs. The calculations on these machines have so far taken about 1000 hours.
Results for the autocorrelation times of the energy for the Wolff and SW algorithms are shown in Figure 4.17 for q=2 and Figure 4.18 for q=3 . As can be seen, the Wolff algorithm has smaller autocorrelation times than SW. However, the dynamical critical exponents for the two algorithms appear to be identical, being approximately and for q=2 and q=3 respectively (shown as straight lines in Figures 4.17 and 4.18 ), compared to values of approximately 2 for the standard Metropolis algorithm.
Figure 4.17:
Energy Autocorrelation Times,
q=2
Figure 4.18:
Energy Autocorrelation Times,
q=3
Burkitt and Heermann [ Heermann:90a ] have suggested that the increase in the autocorrelation time is a logarithmic one, rather than a power law for the q=2 case (the Ising model), that is, z = 0 . Fits to this are shown as dotted lines in Figure 4.17 . These have smaller χ² values than the power law fits, favoring logarithmic behavior. However, it is very difficult to distinguish between a logarithm and a small power even on the largest lattices. In any case, the performance of the cluster algorithms for the Potts model is quite extraordinary, with autocorrelation times for the largest lattice hundreds of times smaller than for the Metropolis algorithm. In the future, we hope to use the cluster algorithms to perform greatly improved Monte Carlo simulations of various Potts models, to study their critical behavior.
There is little theoretical understanding of why cluster algorithms work so well, and in particular there is no theory which predicts the dynamic critical exponents for a given model. These values can currently only be obtained from measurements using Monte Carlo simulation. Our results, which are the best currently available, are shown in Table 4.6 . We would like to know why, for example, critical slowing down is virtually eliminated for the two-dimensional 2-state Potts model, but z is nearly one for the 4-state model; and why the dynamic critical exponents for the SW and Wolff algorithms are approximately the same in two dimensions, but very different in higher dimensions.
Table 4.6:
Measured Dynamic Critical Exponents for Potts Model Cluster Algorithms.
The only rigorous analytic result so far obtained for cluster algorithms was derived by Li and Sokal [ Li:89a ]. They showed that the autocorrelation time for the energy using the SW algorithm is bounded below (as a function of the lattice size) by the specific heat C_H, that is, τ_SW ≥ const × C_H, which implies that the corresponding dynamic critical exponent is bounded by z_SW ≥ α/ν, where α and ν are critical exponents measuring the divergence at the critical point of the specific heat and the correlation length, respectively. A similar bound has also been derived for the Metropolis algorithm, but with the susceptibility exponent γ substituted for the specific heat exponent.
No such result is known for the Wolff algorithm, so we have attempted to check this result empirically using simulation [ Coddington:92a ]. We found that for the Ising model in two, three, and four dimensions, the above bound appears to be satisfied (at least to a very good approximation); that is, there are constants a and b such that a C_H ≤ τ ≤ b C_H, and thus z ≈ α/ν, for the Wolff algorithm.
This led us to investigate similar empirical relations between dynamic and static quantities for the SW algorithm. The power of cluster update algorithms comes from the fact that they flip large clusters of spins at a time. The average size of the largest SW cluster (scaled by the lattice volume), m , is an estimator of the magnetization for the Potts model, and the exponent characterizing the divergence of the magnetization has values which are similar to our measured values for the dynamic exponents of the SW algorithm. We therefore scaled the SW autocorrelations by m , and found that within the errors of the simulations, this gave either a constant (in three and four dimensions) or a logarithm (in two dimensions). This implies that the SW autocorrelations scale in the same way (up to logarithmic corrections) as the magnetization, that is, .
These simple empirical relations are very surprising, and if true, would be the first analytic results equating dynamic quantities, which are dependent on the Monte Carlo algorithm used, to static quantities, which depend only on the physical model. These relations could perhaps stem from the fact that the dynamics of cluster algorithms are closely linked to the physical properties of the system, since the Swendsen-Wang clusters are just the Coniglio-Klein-Fisher droplets which have been used to describe the critical behavior of these systems [ Fisher:67a ] [ Coniglio:80a ].
We are currently doing further simulations to check whether these relations hold up with larger lattices and better statistics, or whether they are just good approximations. We are also trying to determine whether similar results hold for the general q -state Potts model. However, we have thus far only been able to find simple relations for the q=2 (Ising) model. This work is being done using both parallel machines (the nCUBE-1, nCUBE-2, and Symult S2010) and networks of DEC, IBM, and Hewlett-Packard workstations. These high-performance RISC workstations were especially useful in obtaining good results for the Wolff algorithm, which does not vectorize or parallelize, apart from the trivial parallelism we used in running independent simulations on different processors.
The XY (or O(2)) model consists of a set of continuous valued spins regularly arranged on a two-dimensional square lattice. Fifteen years ago, Kosterlitz and Thouless (KT) predicted that this system would undergo a phase transition as one changed from a low-temperature spin wave phase to a high-temperature phase with unbound vortices. KT predicted an approximate transition temperature, T_KT, and the following unusual exponential singularity in the correlation length and magnetic susceptibility:

ξ ~ exp(b t^(−1/2)),

with

χ ~ ξ^(2−η),

where t = (T − T_KT)/T_KT and the correlation function exponent η is defined by the relation G(r) ~ r^(−η).
Our simulation [ Gupta:88a ] was done on the 128-node FPS (Floating Point Systems) T-Series hypercube at Los Alamos. FPS software allowed the use of C with a software model similar to that used on the hypercubes at Caltech (communication implemented by subroutine call). Each FPS node is built around Weitek floating-point units, and the total machine ran at about twice the performance of one processor of a CRAY X-MP for this application. We use a 1-D torus topology for communications, with each node processing a fraction of the rows. Each row is divided into red/black alternating sites of spins and the vector loop is over a given color. This gives a natural vector data structure for the lattice. The internode communications, in both lattice update and measurement of observables, can be done asynchronously and are a negligible overhead.
Figure 4.19:
Autocorrelation Times for the XY Model
Previous numerical work was unable to confirm the KT theory, due to limited statistics and small lattices. Our high-statistics simulations are done on a range of lattice sizes using a combination of over-relaxed and Metropolis algorithms, which decorrelates much faster than the Metropolis algorithm alone. Each configuration represents a number of over-relaxed sweeps through the lattice followed by Metropolis sweeps. Measurement of observables is made on every configuration. The over-relaxed algorithm consists of reflecting the spin at a given site about the direction of h, where h is the sum of the nearest-neighbor spins, that is,

s′ = 2 (s·h)/(h·h) h − s.
This implementation [ Creutz:87a ], [ Brown:87a ] of the over-relaxed algorithm is microcanonical, and it reduces critical slowing down even though it is a local algorithm. The ``hit'' elements for the Metropolis algorithm are generated from a uniform random number, with the hit size adjusted to give an acceptance rate of 50 to 60 percent. The Metropolis hits make the algorithm ergodic, but their effectiveness is limited to local changes in the energy. In Figure 4.19 , we show the autocorrelation time τ vs. the correlation length ξ, from which we extract the dynamical critical exponent for each algorithm.
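The microcanonical character of the over-relaxation step is easy to verify numerically: reflecting a spin's angle about the direction of the summed neighbour vector leaves the local energy unchanged. A small sketch (illustrative Python, not the FPS code):

```python
import math, random

def overrelax_site(theta, h_angle):
    """Reflect spin angle theta about the direction h_angle of the summed
    nearest-neighbour spin vector: theta -> 2*h_angle - theta (mod 2*pi)."""
    return (2.0 * h_angle - theta) % (2.0 * math.pi)

def local_energy(theta, nbrs):
    # E = -sum_j cos(theta - theta_j) = -|h| * cos(theta - h_angle)
    return -sum(math.cos(theta - t) for t in nbrs)

rng = random.Random(0)
nbrs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(4)]
h_angle = math.atan2(sum(math.sin(t) for t in nbrs),
                     sum(math.cos(t) for t in nbrs))
theta = rng.uniform(0.0, 2.0 * math.pi)
new_theta = overrelax_site(theta, h_angle)
```

Because the local energy depends on theta only through cos(theta − h_angle), the reflection preserves it exactly, which is why the Metropolis hits are needed to make the combined algorithm ergodic.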
Table 4.7: Results of the XY Model Fits: (a) ξ in T, and (b) χ in T, Assuming the KT Form. The fits KT1-3 are pseudominima while KT4 is the true minimum. All data points are included in the fits and we give the χ² for each fit and an estimate of the exponent η.
We ran at 14 temperatures near the phase transition and made unconstrained fits to all 14 data points (four parameter fits according to Equation 4.24 ), for both the correlation length (Figure 4.20 ) and susceptibility (Figure 4.21 ). The key to the interpretation of the data is the fits. We find that fitting programs (e.g., MINUIT, SLAC) move incredibly slowly towards the true minimum from certain points (which we label spurious minima), which, unfortunately, are the attractors for most starting points. We found three such spurious minima (KT1-3) and the true minimum KT4, as listed in Table 4.7 .
Figure 4.20:
Correlation Length for the XY Model
Figure 4.21:
Susceptibility for the XY Model
Thus, our data was found to be in excellent agreement with the KT theory and, in fact, this study provides the first direct measurement of η from both ξ and χ data that is consistent with the KT predictions.
The XY model is the simplest O(N) model, having N=2 , the O(N) model being a set of rotors (N-component continuous valued spins) on an N-sphere. For , this model is asymptotically free [ Polyakov:75a ], and for N=3 , there exist so called instanton solutions. Some of these properties are analogous to those of gauge theories in four dimensions; hence, these models are interesting. In particular, the O(3) model in two dimensions should shed some light on the asymptotic freedom of QCD (SU(3)) in four dimensions. The predictions of the renormalization group for the susceptibility and inverse correlation length (i.e., mass gap) m in the O(3) model are [ Brezin:76a ]
and
respectively. If m and χ vary according to these equations, without the higher-order correction, they are said to follow asymptotic scaling. Previous work was able to confirm that this picture is qualitatively correct, but was not able to probe deep enough in the area of large correlation lengths to obtain good agreement.
The combination of the over-relaxed algorithm and the computational power of the FPS T-Series allowed us to simulate lattices of sizes up to . We were thus able to simulate at coupling constants that correspond to correlation lengths up to 300, on lattices where finite-size effects are negligible. We were also able to gather large statistics and thus obtain small statistical errors. Our simulation is in good agreement with similar cluster calculations [Wolff:89b;90a]. Thus, we have validated and extended these results in a regime where our algorithm is the only known alternative to clustering.
Table 4.8:
Coupling Constant, Lattice Size, Autocorrelation Time, Number of
Overrelaxed Sweeps, Susceptibility, and Correlation Length for the O(3)
Model
We have made extensive runs at 10 values of the coupling constant. At the lowest couplings, several hundred thousand sweeps were collected, while for the largest values of the coupling, between 50,000 and 100,000 sweeps were made. Each sweep consists of between 10 iterations through the lattice at the former end and 150 iterations at the latter. The statistics we have gathered are equivalent to about 200 days' use of the full 128-node FPS machine.
Our results for the correlation length and susceptibility for each coupling and lattice size are shown in Table 4.8 . The autocorrelation times are also shown. The quantities measured on different-sized lattices at the same coupling agree, showing that the infinite volume limit has been reached.
To compare the behavior of the correlation length and susceptibility with the asymptotic scaling predictions, we use the ``correlation length defect'' and ``susceptibility defect'', defined as the ratios of the measured correlation length and susceptibility to their asymptotic scaling predictions, so that asymptotic scaling is seen if both defects go to constants as the coupling grows. These defects are shown in Figures 4.22 and 4.23 , respectively. It is clear that asymptotic scaling does not set in for one of the quantities, but it is not possible to draw a clear conclusion for the other-though the trends of the last two or three points may be toward constant behavior.
Figure 4.22:
Correlation Length Defect Versus the Coupling Constant for the O(3)
Model
Figure 4.23:
Susceptibility Defect Versus the Coupling Constant for the O(3) Model
Figure 4.24:
Decorrelation Time
Versus Number of Over-relaxation Sweeps
for Different Values of
We gauged the speed of the algorithm in producing statistically independent configurations by measuring the autocorrelation time τ. We used this to estimate the dynamical critical exponent z, which is defined by τ ~ ξ^z. However, we discovered that by increasing the number of over-relaxed sweeps in rough proportion to the correlation length, we can improve the performance of the algorithm significantly. To compare the speed of decorrelation between runs with different numbers of over-relaxed sweeps, we define a new quantity, which we call ``effort,'' the product of the number of over-relaxed sweeps and the autocorrelation time. This measures the computational effort expended to obtain a decorrelated configuration. We define a new exponent from the scaling of the effort with the correlation length. We also found that the behavior of the decorrelation time can be approximated over a good range by
A fit to these points shows that the new exponent is significantly lower than z. Figure 4.24 shows the decorrelation time versus the number of over-relaxed sweeps, with the fits shown as solid lines.
Physical systems comprised of discrete, macroscopic particles or grains which are not bonded to one another are important in civil, chemical, and agricultural engineering, as well as in natural geological and planetary environments. Granular systems are observed in rock slides, sand dunes, clastic sediments, snow avalanches, and planetary rings. In engineering and industry they are found in connection with the processing of cereal grains, coal, gravel, oil shale, and powders, and are well known to pose important problems associated with the movement of sediments by streams, rivers, waves, and the wind.
The standard approach to the theoretical modelling of multiparticle systems in physics has been to treat the system as a continuum and to formulate the model in terms of differential equations. As an example, the science of soil mechanics has traditionally focussed mainly on quasi-static granular systems, a prime objective being to define and predict the conditions under which failure of the granular soil system will occur. Soil mechanics is a macroscopic continuum model requiring an explicit constitutive law relating, for example, stress and strain. While very successful for the low-strain quasi-static applications for which it is intended, it is not clear how it can be generalized to deal with the high-strain, explicitly time-dependent phenomena which characterize a great many other granular systems of interest. Attempts at obtaining a generalized theory of granular systems using a differential equation formalism [ Johnson:87a ] have met with limited success.
An alternate approach to formulating physical theories can be found in the concept of cellular automata, first proposed by von Neumann in 1948. In this approach, the space of a physical problem would be divided up into many small, identical cells, each of which would be in one of a finite number of states. The state of a cell would evolve according to a rule that is both local (involves only the cell itself and nearby cells) and universal (all cells are updated simultaneously using the same rule).
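A minimal illustration of such a local, universal rule (a generic one-dimensional binary automaton, not the lattice grain model itself):

```python
def ca_step(cells, rule):
    """One synchronous update of a 1-D binary cellular automaton with
    periodic boundaries; rule maps each (left, centre, right) triple to 0/1."""
    n = len(cells)
    return [rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])
            for i in range(n)]

# Rule 90 (XOR of the two neighbours) as the local, universal rule:
def rule90(left, centre, right):
    return left ^ right

row = [0, 0, 0, 1, 0, 0, 0]
row = ca_step(row, rule90)   # a single seed spreads to both neighbours
```

Every cell is updated from the same rule and the same neighbourhood, which is exactly the locality and universality the lattice grain model inherits from the cellular automata framework.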
The Lattice Grain Model [ Gutt:89a ] (LGrM) we discuss here is a microscopic, explicitly time-dependent, cellular automata model, and can be applied naturally to high-strain events. LGrM carries some attributes of both particle dynamics models [ Cundall:79a ], [Haff:87a;87b], [ Walton:84a ], [ Werner:87a ] (PDM), which are based explicitly on Newton's second law, and lattice gas models [ Frisch:86a ], [ Margolis:86a ] (LGM), in that its fundamental element is a discrete particle, but differs from these substantially in detail. Here we describe the essential features of LGrM, compare the model with both PDM and LGM, and finally discuss some applications.
The purpose of the lattice grain model is to predict the behavior of large numbers of grains (10,000 to 1,000,000) on scales much larger than a grain diameter. In this respect, it goes beyond the particle dynamics calculations of Section 9.2 , which are limited to no more than grains by currently available computing resources [ Cundall:79a ], [Haff:87a;87b], [ Walton:84a ], [ Werner:87a ]. The particle dynamics models follow the motion of each individual grain exactly, and may be formulated in one of two ways depending upon the model adopted for particle-particle interactions.
In one formulation, the interparticle contact times are assumed to be of finite duration, and each particle may be in simultaneous contact with several others [ Cundall:79a ], [ Haff:87a ], [ Walton:84a ], [ Werner:87a ]. Each particle obeys Newton's law, F = ma , and a detailed integration of the equations of motion of each particle is performed. In this form, while useful for applications involving a much smaller number of particles than LGrM allows, PDM cannot compete with LGrM for systems involving large numbers of grains because of the complexity of PDM ``automata.''
In the second, simpler formulation, the interparticle contact times are assumed to be of infinitesimal duration, and particles undergo only binary collisions (the hard-sphere collisional models) [ Haff:87b ]. Hard-sphere models usually rely upon a collision-list ordering of collision events to avoid the necessity of checking all pairs of particles for overlaps at each time step. In regions of high particle number density, collisions are very frequent; thus, in problems where such high-density zones appear, hard-sphere models spend most of their time moving particles through very small distances using very small time steps. In granular flow, zones of stagnation, where particles are very nearly in contact much of the time, are common, and the hard-sphere model is therefore unsuitable, at least in its simplest form, as a model of these systems. LGrM avoids these difficulties because its time-stepping is controlled not by a collision list but by a scan frequency, which in turn is a function of the speed of the fastest particle and is independent of number density. Furthermore, although fundamentally a collisional model, LGrM can also mimic the behavior of consolidated or stagnated zones of granular material in a manner which will be described.
LGrM closely resembles LGM [ Frisch:86a ], [ Margolis:86a ] in some respects. First, for two-dimensional applications, the region of space in which the particles are to move is discretized into a triangular lattice-work, upon each node of which can reside a particle. The particles are capable of moving to neighboring cells at each tick of the clock, subject to certain simple rules. Finally, two particles arriving at the same cell (LGM) or adjacent cells (LGrM) at the same time may undergo a ``collision'' in which their outgoing velocities are determined according to specified rules chosen to conserve momentum.
Each of the particles in LGM has the same magnitude of velocity and is allowed to move in one of six directions along the lattice, so that each particle travels exactly one lattice spacing in each time step. The single-velocity magnitude means that all collisions between particles are perfectly elastic and that energy conservation is maintained simply through particle number conservation. It also means that the temperature of the gas is uniform throughout time and space, thus limiting the application of LGM to problems of low Mach number. An exclusion principle is maintained in which no two particles of the same velocity may occupy one lattice point. Thus, each lattice point may have no more than six particles, and the state of a lattice point can be recorded using only six bits.
LGrM differs from LGM in that it has many possible velocity states, not just six. In particular, in LGrM not only the direction but also the magnitude of the velocity can change in each collision. This is a necessary condition because the collision of two macroscopic particles is always inelastic, so that mechanical energy is not conserved. The LGrM particles satisfy a somewhat different exclusion principle: No more than one particle at a time may occupy a single site. This exclusion principle allows LGrM to capture some of the volume-filling properties of granular material; in particular, it can approximate the behavior of static granular masses.
The determination of the time step is more critical in LGrM than in LGM. If the time step is long enough that some particles travel several lattice spacings in one clock tick, the problem of finding the intersection of particle trajectories arises. This involves much computation and defeats the purpose of an automata approach. A very short time step would imply that most particles would not move even a single lattice spacing. Here we choose a time step such that the fastest particle will move exactly one lattice spacing. A ``position offset'' is stored for each of the slower particles, which are moved accordingly when the offset exceeds one-half lattice spacing. These extra requirements for LGrM automata imply a slower computation speed than expected in LGM simulations; but, as a dividend, we can compute inelastic grain flows of potential engineering and geophysical interest.
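The offset bookkeeping might look like the following sketch (illustrative names; one lattice axis shown, in units of the lattice spacing; the real model also requires the destination node to be empty before moving, a check omitted here):

```c
/* Sketch of the per-particle ``position offset.''  All particles
   share the global time step; a slower particle accumulates offset
   and hops one lattice spacing only when the accumulated offset
   exceeds one-half spacing. */
typedef struct {
    double off;   /* sub-lattice offset, in lattice spacings */
    int    node;  /* integer lattice coordinate              */
} OffsetState;

static void advance(OffsetState *p, double v, double dt)
{
    p->off += v * dt;          /* motion accumulated this tick */
    if (p->off > 0.5) {        /* passed the halfway point:    */
        p->node += 1;          /* hop to the neighboring node  */
        p->off  -= 1.0;        /* and decrement the offset     */
    } else if (p->off < -0.5) {
        p->node -= 1;
        p->off  += 1.0;
    }
}
```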
In order to keep the particle-particle interaction rules as simple as possible, all interparticle contacts, whether enduring contacts or true collisions, will be modelled as collisions. Collisions that model enduring contacts will transmit, in each time step, an impulse equal to the force of the enduring contact multiplied by the time step. The fact that collisions take place between particles on adjacent lattice nodes means that some particles may undergo up to six collisions in a time step. For simplicity, these collisions will be resolved as a series of binary collisions. The order in which these collisions are calculated at each lattice node, as well as the order in which the lattice nodes are scanned, is now an important consideration.
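For equal-mass grains, a binary collision resolved along the line of centers can be sketched as below; the function name and the one-component formulation are ours, not the original model's. Momentum is conserved exactly, while the normal relative velocity is reversed and scaled by the coefficient of restitution e (e = 1 elastic, e < 1 inelastic).

```c
/* Sketch: binary collision of two equal-mass grains, one velocity
   component (along the line of centers) shown. */
static void collide(double *v1, double *v2, double e)
{
    double vcm  = 0.5 * (*v1 + *v2);   /* center-of-mass velocity */
    double vrel = *v1 - *v2;           /* relative velocity       */
    *v1 = vcm - 0.5 * e * vrel;        /* normal component reversed */
    *v2 = vcm + 0.5 * e * vrel;        /* and scaled by e           */
}
```

Because each such binary step conserves momentum exactly, resolving a node's up-to-six simultaneous contacts as a sequence of binary collisions is safe; only the bookkeeping order matters, which is why the scan order becomes an important consideration.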
The rules of the Lattice Grain Model may be summarized as follows:
where:
Once the offset exceeds half the distance to the nearest lattice node, and that node is empty, the particle is moved to that node, and its offset is decremented appropriately. Also, in a collision, the component of the offset along the line connecting the centers of the colliding particles is set to zero.
The transmission of ``static'' contact forces within a mass of grains (as in grains at rest in a gravitational field) is handled naturally within the above framework. Though a particle in a static mass of grains may be nominally at rest, its velocity may be nonzero (due to gravitational or pressure forces); and it will transmit the appropriate force (in the form of an impulse) to the particles under it by means of collisions. When these impulses are averaged over several time steps, the proper weights and pressures will emerge.
Figure 4.25: Definition of Lattice Numbers and Collision Directions
When implementing this algorithm on a computer, what is stored in the computer's memory is information concerning each point in the lattice, regardless of whether or not there is a particle at that lattice point. This allows for very efficient checking of the space around each particle for the presence of other particles (i.e., information concerning the six adjacent points in a triangular lattice will be found at certain known locations in memory). The need to keep information on empty lattice points in memory does not entail as great a penalty as might be thought; many lattice grain model problems involve a high density of particles, typically one for every one to four lattice points, and the memory cost per lattice point is not large. The memory requirements for the implementation of LGrM as described here are five variables per lattice site: two components of position, two of velocity, and one status variable, which denotes an empty site, an occupied site, or a bounding ``wall'' particle. If each variable is stored using four bytes of memory, then each lattice point requires 20 bytes.
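The per-site storage just described can be expressed as a C struct; this is a sketch with illustrative field names, not the original declaration:

```c
#include <stdint.h>

/* Five 4-byte variables per lattice site: two position components,
   two velocity components, and a status flag.  This reproduces the
   20 bytes per lattice point quoted in the text. */
enum { EMPTY = 0, OCCUPIED = 1, WALL = 2 };

typedef struct {
    float   px, py;   /* position (offset) components */
    float   vx, vy;   /* velocity components          */
    int32_t status;   /* EMPTY, OCCUPIED, or WALL     */
} Site;
```

Storing a `Site` for every lattice point, occupied or not, is what makes the six-neighbor lookups simple address arithmetic.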
The standard configuration for a simulation consists of a lattice with a specified number of rows and columns bounded at the top and bottom by two rows of wall particles (thus forming the top and bottom walls of the problem space), and with left and right edges connected together to form periodic boundary conditions. Thus, the boundaries of the lattice are handled naturally within the normal position updating and collision rules, with very little additional programming. (Note: Since the gravitational acceleration can point in an arbitrary direction, the top and bottom walls can become side walls for chute flow. Also, the periodic boundary conditions can be broken by the placement of an additional wall, if so desired.)
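A sketch of this standard configuration, assuming illustrative sizes and names of our own: two rows of wall particles top and bottom, with the left and right edges joined so that column arithmetic is simply modular.

```c
/* Hypothetical sketch of the standard lattice configuration. */
enum { S_EMPTY = 0, S_WALL = 2 };

#define NROWS 8
#define NCOLS 16

static int status[NROWS][NCOLS];

static void init_lattice(void)
{
    for (int r = 0; r < NROWS; r++)
        for (int c = 0; c < NCOLS; c++)
            /* two rows of wall particles at top and bottom */
            status[r][c] = (r < 2 || r >= NROWS - 2) ? S_WALL : S_EMPTY;
}

/* Periodic left/right boundary: the wrap is ordinary modular
   arithmetic, so the update and collision rules need no special
   boundary cases. */
static int wrap_col(int c)
{
    return ((c % NCOLS) + NCOLS) % NCOLS;
}
```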
Because of the nearest-neighbor type interactions involved in the model, the computational scheme was well suited to an nCUBE parallel processor. For the purpose of dividing up the problem, the hypercube architecture is unfolded into a two-dimensional array, and each processor is given a roughly equal-area section of the lattice. The only interaction between sections will be along their common boundaries; thus, each processor will only need to exchange information with its eight immediate neighbors. The program itself was written in C under the Cubix/CrOS III operating system.
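The unfolding into a two-dimensional processor array might be sketched as follows (hypothetical sizes and names; a real CrOS/Express program would obtain this decomposition from the system rather than compute it by hand):

```c
/* Sketch: processors arranged as a PR x PC grid, periodic in both
   directions to match the lattice's wrap-around columns.  Each
   processor exchanges boundary data with its eight neighbors. */
#define PR 4   /* processor rows    */
#define PC 4   /* processor columns */

static int rank_of(int r, int c)
{
    return ((r + PR) % PR) * PC + ((c + PC) % PC);
}

static void neighbors8(int p, int out[8])
{
    int r = p / PC, c = p % PC, k = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++)
            if (dr != 0 || dc != 0)
                out[k++] = rank_of(r + dr, c + dc);
}
```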
The LGrM simulations performed so far have involved from to automata. Trial application runs included two-dimensional, vertical, time-dependent flows in several geometries, of which two examples are given here: Couette flow and flow out of an hourglass-shaped hopper.
The standard Couette flow configuration consists of a fluid confined between two flat parallel plates of infinite extent, without any gravitational accelerations. The plates move in opposite directions with velocities that are equal and parallel to their surfaces, which results in the establishment of a velocity gradient and a shear stress in the fluid. For fluids that obey the Navier-Stokes equation, an analytical solution is possible in which the velocity gradient and shear stress are constant across the channel. If, however, we replace the fluid by a system of inelastic grains, the velocity gradient will no longer necessarily be constant across the channel. Typically, stagnation zones or plugs form in the center of the channel with thin shear-bands near the walls. Shear-band formation in flowing granular materials was analyzed earlier by Haff and others [ Haff:83a ], [ Hui:84a ] based on kinetic theory models.
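For reference, the Newtonian case that the first run is meant to reproduce has a simple closed form (a standard fluid-mechanics result; the coordinate convention, with plates at y = 0 and y = h moving at -U and +U, is our own):

```latex
% Steady planar Couette flow of a Newtonian fluid of viscosity \mu.
% The Navier-Stokes equations reduce to d^2u/dy^2 = 0, so the
% velocity profile is linear and the shear stress is constant:
u(y) = U\left(\frac{2y}{h} - 1\right), \qquad
\tau = \mu\,\frac{du}{dy} = \frac{2\mu U}{h}.
```

It is departures from this constant-gradient profile, such as the plug and shear bands below, that signal genuinely granular behavior.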
The simulation was carried out with 5760 grains, located in a channel 60 lattice points wide by 192 long. Due to the periodic boundary conditions at the left and right ends, the problem is effectively infinite in length. The first simulation is intended to reproduce the standard Couette flow for a fluid; consequently, the particle-particle collisions were given a coefficient of restitution of 1.0 (i.e., perfectly elastic collisions) and the particle-wall collisions were given a 0.75 coefficient of restitution. The inelasticity of the particle-wall collisions is needed to simulate the conduction of heat (which is being generated within the fluid) from the fluid to the walls. The simulation was run until an equilibrium was established in the channel (Figure 4.26 (a)). The average x - and y -components of velocity and the second moment of velocity, as functions of distance across the channel, are plotted in Figure 4.26 (b).
Figure 4.26: (a) Elastic Particle Couette Flow; (b) x-component (1), y-component (2), and Second Moment (3) of Velocity
The second simulation used a coefficient of restitution of 0.75 for both the particle-particle and particle-wall collisions. The equilibrium results are shown in Figure 4.27 (a) and (b). As can be seen from the plots, the flow consists of a central region of particles compacted into a plug, with each particle having almost no velocity. Near each of the moving walls, a region of much lower density has formed in which most of the shearing motion occurs. Note the increase in value of the second moment of velocity (the granular ``thermal velocity'') near the walls, indicating that grains in this area are being ``heated'' by the high rate of shear. It is interesting to note that these flows are turbulent in the sense that shear stress is a quadratic, not a linear, function of shear rate.
Figure 4.27: (a) Inelastic Particle Couette Flow; (b) x-component (1), y-component (2), and Second Moment (3) of Velocity
In the second problem, the flow of grains through a hopper or an hourglass with an opening only a few grain diameters wide was studied; the driving force was gravity. This is an example of a granular system which contains a wide range of densities, from groups of grains in static contact with one another to groups of highly agitated grains undergoing true binary collisions. Here, the number of particles used was 8310, and the lattice was 240 points long by 122 wide. Additional walls were added to form the sloped sides of the bin and to close off the bottom of the lattice to prevent the periodic boundary conditions from reintroducing the falling particles back into the bin (Figure 4.28 (a)). This is a typical feature of automata modelling: it is often easier to configure the simulation to resemble a real experiment, in this case by explicitly ``catching'' spent grains, than to reprogram the basic code to erase such particles.
The hourglass flow in Figure 4.28 (b) showed internal shear zones, regions of stagnation, free-surface evolution toward an angle of repose, and an exit flow rate approximately independent of pressure head, as observed experimentally [ Tuzun:82a ]. It is hard to imagine that one could solve a partial differential equation describing such a complex, multiple-domain, time-dependent problem, even if the right equation were known (which is not the case).
Figure 4.28: (a) Initial Condition of Hourglass; (b) Hourglass Flow after 2048 Time Steps
These exploratory numerical experiments show that an automata approach to granular dynamics problems can be implemented on parallel computing machines. Further work remains to be done to assess more quantitatively how well such calculations reflect the real world, but the prospects are intriguing.
As already noted in Chapter 2 , the initial software used by C P was called CrOS, although its modest functionality hardly justified calling it an operating system. Actually, this is an interesting issue. In our original model, the ``real'' operating system (UNIX in our case) ran on the ``host'' that directly or indirectly (via a network) connected to the hypercube. The nodes of the parallel machine need only provide the minimal services necessary to support user programs. This is the natural mode for all SIMD systems and is still offered by several important MIMD multicomputers. However, systems such as the IBM SP-1, Intel's Paragon series, and Meiko's CS-1 and CS-2 offer a full UNIX (or an equivalent, such as MACH) on each node. This has many advantages, including the ability of the system to be arbitrarily configured; in particular, we can consider a multicomputer with N nodes as ``just'' N ``real'' computers connected by a high-performance network. This would lead to particularly good performance on remote disk I/O, such as that needed for the Network File System (NFS). The design of an operating system for the node is based partly on the programming usage paradigm and partly on the hardware. The original multicomputers all had small node memories ( on the Cosmic Cube) and could not possibly hold UNIX on a node. Current multicomputers such as the CM-5, Paragon, and Meiko CS-2 would consider a normal minimum node memory. This is easily sufficient to hold a full UNIX implementation with the extra functionality needed to support parallel programming. There are some, such as IBM Owego (Execube), Seitz at Caltech (MOSAIC) [Seitz:90a;92a], and Dally at MIT (J-Machine) [Dally:90a;92a], who are developing very interesting families of highly parallel ``small node'' multicomputers for which a full UNIX on each node may be inappropriate.
Essentially none of the applications described in this book is sensitive to these issues, which affect only the convenience of program development and the operating environment. C P's applications were all developed using a simple message-passing system involving C (and less often Fortran) node programs that sent messages to each other via subroutine call. The key function of CrOS and Express, described in Section 5.2 , was to provide this subroutine library.
There are some important capabilities that a parallel computing environment needs in addition to message passing and UNIX services. These include:
We did not perform any major computations in C P that required high-speed input/output. This reflects both our applications mix and the poor I/O performance of the early hypercubes. The applications described in Chapter 18 needed significant, but not high-bandwidth, input/output during computation, as did our analysis of radio astronomy data. The other applications, however, used input/output for interchange of parameters between user and program and, in greatest volume, for checkpoint and restart. This input/output was typically performed between the host and (node 0 of) the parallel ensemble. Section 5.2.7 and, in greater detail, [ Fox:88a ] describe the Cubix system, which we developed to make this input/output more convenient. This system was overwhelmingly preferred by the C P community over the conventional host-node programming style. Curiously, Cubix seems to have made no impact on the ``real world''; we are not aware of any other group that has adopted it.
The evolution of the various message-passing paradigms used on the Caltech/JPL machines involved three generations of hypercubes and many different software concepts, which ultimately led to the development of Express , a general, asynchronous buffered communication system for heterogeneous multiprocessors.
Originally designed, developed, and used by scientists with applications-oriented research goals, the Caltech/JPL system software was written to generate near-term needed capability. Neither hindered nor helped by any preconceptions about the type of software that should be used for parallel processing, we simply built useful software and added it to the system library.
Hence, the software evolved from primitive hardware-dependent implementations into a sophisticated runtime library, which embodied the concepts of ``loose synchronization,'' domain decomposition, and machine independence. By the time the commercial machines started to replace our homemade hypercubes, we had evolved a programming model that allowed us to develop and debug code effectively, port it between different parallel computers, and run with minimal overheads. This system has stood the test of time and, although there are many other implementations, the functionality of CrOS and Express appears essentially correct. Many of the ideas described in this chapter, and the later Zipcode System of Section 16.1 , are embodied in the current message-passing standards activity, MPI [ Walker:94a ]. A detailed description of CrOS and Express will be found in [ Fox:88a ] and [ Angus:90a ], and is not repeated here.
How did this happen?
The original hypercubes described in Chapter 20, the Cosmic Cube [ Seitz:85a ] and Mark II [ Tuazon:85a ] machines, had been designed and built as exploratory devices. We expected to be able to do useful physics and, in particular, were interested in high-energy physics. At that time, we were merely trying to extract exciting physics from an untried technology. These first machines came equipped with ``64-bit FIFOs,'' meaning that at a software level, two basic communication routines were available:

rdELT(packet, chan)
wtELT(packet, chan).
The latter pushed a 64-bit ``packet'' into the indicated hypercube channel, which was then extracted with the rdELT function. If the read happened before the write, the program in the reading node stopped and waited for the data to show up. If the writing node sent its data before the reading node was ready, it similarly waited for the reader.
To make contact with the world outside the hypercube cabinet, a node had to be able to communicate with a ``host'' computer. Again, the FIFOs came into play with two additional calls:

rdIH(packet)
wtIH(packet),

which allowed node 0 to communicate with the host.
This rigidly defined behavior, executed on a hypercubic lattice of nodes, resembled a crystal, so we called it the Crystalline Operating System (CrOS). Obviously, an operating system with only four system calls is quite far removed from most people's concept of the breed. Nevertheless, they were the only system calls available and the name stuck.
We began to build algorithms and methods to exploit the power of parallel computers. With little difficulty, we were able to develop algorithms for solving partial differential equations [ Brooks:82b ], FFTs [ Newton:82a ], [ Salmon:86b ], and high-energy physics problems described in the last chapter [ Brooks:83a ].
As each person wrote applications, however, we learned a little more about the way problems were mapped onto the machines. Gradually, additional functions were added to the list to download and upload data sets from the outside world and to combine the operations of the rdELT and wtELT functions into something that exchanged data across a channel.
In each case, these functions were adopted, not because they seemed necessary to complete our operating system, but because they fulfilled a real need. At that time, debugging capabilities were nonexistent; a mistake in the program running on the nodes merely caused the machine to stop running. Thus, it was beneficial to build up a library of routines that performed common communication functions, which made reinventing tools unnecessary.
Up to this point, our primary concern was with communication between neighboring processors. Applications, however, tended to show two fundamental types of communication: local exchange of boundary condition data, and global operations connected with control or extraction of physical observables.
As seen from the examples in this book, these two types of communication are generally believed to be fundamental to all scientific problems-the modelled application usually has some structure that can be mapped onto the nodes of the parallel computer and its structure induces some regular communication pattern. A major breakthrough, therefore, was the development of what have since been called the ``collective'' communication routines, which perform some action across all the nodes of the machine.
The simplest example is that of `` broadcast ''- a function that enabled node 0 to communicate one or more packets to all the other nodes in the machine. The routine `` concat '' enabled each node to accumulate data from every other node, and `` combine '' let us perform actions, such as addition, on distributed data sets. The routine combine is often called a reduction operator.
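How such a combine maps onto the hypercube can be illustrated by a sequential simulation of the exchange-and-add rounds (our sketch, not the CrOS source): in dimension D the reduction completes in D steps, each node pairing with the partner whose number differs in exactly one bit.

```c
/* Sketch: recursive-doubling sum over the nodes of a D-dimensional
   hypercube.  The D rounds are simulated in one address space; on
   the real machine each node would exchange val with its channel
   partner p ^ (1 << bit) and add. */
#define D 3          /* hypercube dimension */
#define N (1 << D)   /* number of nodes     */

static void combine_sum(double val[N])
{
    double next[N];
    for (int bit = 0; bit < D; bit++) {
        for (int p = 0; p < N; p++)
            next[p] = val[p] + val[p ^ (1 << bit)]; /* exchange + add */
        for (int p = 0; p < N; p++)
            val[p] = next[p];
    }
    /* After D rounds every node holds the global sum. */
}
```

The same D-round pattern, with addition replaced by another associative operation or by message concatenation, yields combine's siblings.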
The development of these functions, and the natural way in which they could be mapped to the hypercube topology of the machines, led to great increases in both productivity on the part of the programmers and efficiency in the execution of the algorithms. CrOS quickly grew to a dozen or more routines.
By 1985, the Mark II machine was in constant use and we were beginning to examine software issues that had previously been of little concern. Algorithms, such as the standard FFT, had obvious implementations on a hypercube [ Salmon:86b ], [ Fox:88a ]: the ``bit-twiddling'' done by the FFT algorithm could be mapped onto a hypercube by ``twiddling'' the bits in a slightly revised manner. More problematic was the issue of two- or three-dimensional problem solving. A two- or three-dimensional problem could easily be mapped into a small number of nodes. However, one cannot so easily perform the mapping of grids onto 128 nodes connected as a seven-dimensional hypercube.
Another issue that quickly became apparent was that C P users did not have a good feel for the `` chan '' argument used by the primitive communication functions. Users wanted to think of a collection of processors each labelled by a number, with data exchanged between them, but unfortunately the software was driven instead by the hypercube architecture of the machine. Tolerance of the explanation of ``Well, you XOR the processor number with one shifted left by the '' was rapidly exceeded in all but the most stubborn users.
Both problems were effectively removed by the development of whoami [ Salmon:84b ]. We used the well-known techniques of binary Gray codes to automate the process of mapping two-, three-, or higher dimensional problems to the hypercube. The whoami function took the dimensionality of the physical system being modelled and returned all the necessary `` chan '' values to make everything else work out properly. No longer did the new user have to spend time learning about channel numbers, XORing, and the ``mapping problem'': everything was done by the call to whoami . Even on current hardware, where long-range communication is an accepted fact, the techniques embodied by whoami result in programs that can run up to an order of magnitude faster than those using less convenient mapping techniques (see Figure 5.1 ).
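The binary-reflected Gray code underlying such mappings is a one-line function; consecutive integers map to node numbers differing in exactly one bit, so ring (and, applied per axis, grid) neighbors become hypercube neighbors. A sketch, with an auxiliary bit-count helper of our own:

```c
/* Binary-reflected Gray code: gray(i) and gray(i+1) always differ
   in exactly one bit. */
static unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}

/* Count set bits -- used to check the one-bit-change property. */
static int popcount(unsigned x)
{
    int n = 0;
    for (; x != 0; x >>= 1)
        n += (int)(x & 1);
    return n;
}
```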
Figure 5.1:
Mapping a Two-dimensional World
Up to this point, we had concentrated on the most obvious scientific problems: FFTs, ordinary and partial differential equations, matrices, and so on, which were all characterized by their amenability to the lock step, short-range communication primitives available. Note that some of these, such as the FFT and matrix algorithms, are not strictly ``nearest neighbor'' in the sense of the communication primitives discussed earlier, since they require data to be distributed to nodes further than one step away. These problems, however, are amenable to the ``collective communication'' strategies.
Based on our success with these problems, we began to investigate areas that were not so easily cast into the crystalline methodology. A long-term goal was the support of event-driven simulations, database machines, and transaction-processing systems, which did not appear to be crystalline .
In the shorter term, we wanted to study the physical process of ``melting'' [ Johnson:86a ] described in Section 14.2 . The melting process is different from the applications described thus far in that it inherently involves some indeterminacies: the transition from an ordered solid to a random liquid involves complex and time-varying interactions. In the past, we had solved such an irregular problem, that of N-body gravity [ Salmon:86b ], by the use of what has since been called the ``long-range-force'' algorithm [ Fox:88a ]. This is a particularly powerful technique and leads to highly efficient programs that can be implemented with crystalline commands.
The melting process differs from the long-range force algorithm in that the interactions between particles do not extend to infinity, but are localized to some domain whose size depends upon the particular state of the solid/fluid. As such, it is very wasteful to use the long-range force technique, but the randomness of the interactions makes a mapping to a crystalline algorithm difficult (see Figure 5.2 ).
Figure 5.2:
Interprocessor Communication Requirements
To address these issues effectively, it seemed important to build a communication system that allowed messages to travel between nodes that were not necessarily connected by ``channels,'' yet didn't need to involve all nodes collectively.
At this point, an enormous number of issues came up-routing, buffering, queueing, interrupts, and so on. The first cut at solving these problems was a system that never acquired a real name, but was known by the name of its central function, `` rdsort '' [ Johnson:85a ]. The basic concept was that a message could be sent from any node to any other node, at any time, and the receiving node would have its program interrupted whenever a message arrived. At this point, the user provided a routine called `` rdsort '' which, as its name implies, needed to read, sort and process the data.
While simple enough in principle, this programming model was not initially adopted (although it produced an effective solution to the melting problem). To users who came from a number-crunching physics background, the concept of ``interrupts'' was quite alien. Furthermore, the issues of sorting, buffering, mutual exclusion, and so on, raised by the asynchronous nature of the resulting programs, proved hard to code. Without debugging tools, it was extremely hard to develop programs using these techniques. Some of these ideas were taken further by the Reactive Kernel [ Seitz:88b ] (see Section 16.2 ), which does not, however, implement ``reaction'' with an interrupt-level handler. The recent development of active messages on the CM-5 has shown the power of the rdsort concepts [ Eiken:92a ].
The advent of the Mark III machine [ Peterson:85a ] generated a rapid development in applications software. In the previous five years, the crystalline system had shown itself to be a powerful tool for extracting maximum performance from the machines, but the new Mark III encouraged us to look at some of the ``programmability'' issues, which had previously been of secondary importance.
The first and most natural development was the generalization of the CrOS system for the new hardware [ Johnson:86c ], [ Kolawa:86d ]. Christened ``CrOS III,'' it offered the flexibility of arbitrary message lengths (rather than multiples of the FIFO size) and hardware-supported collective communication: the Mark III could drive multiple channels simultaneously, which made cube and subcube broadcasts fast [ Fox:88a ]. All of these enhancements, however, maintained the original concept of nearest-neighbor (in a hypercube) communication supported by collective communication routines that operated throughout the machine, or on a subset of it. In retrospect, the hypercube-specific nature of CrOS should not have been preserved in the major redesign that produced CrOS III.
At this point, the programming model for the machines remained pretty much as it had been in 1982. A program running on the host computer loaded an application into the hypercube nodes, and then passed data back and forth with routines similar in nature to the rdIH and wtIH calls described earlier. This remained the only method through which the nodes could communicate with the outside world.
During the Mark II period, it had quickly become apparent that most users were writing host programs that, while differing in detail, were identical in outline and function. An early effort to remove from the user the burden of writing yet another host program was a system called C3PO [ Meier:84b ], which had a generic host program providing a shell in which subroutines could be executed in the nodes. Essentially, this model freed the user from writing an explicit host program, but still kept the locus of control in the host.
Cubix [ Salmon:87a ] reversed this. The basic idea was that the parallel computer, being more powerful than its host, should play the dominant role. Programs running in the nodes should decide for themselves what actions to take and merely instruct the host machine to intercede on their behalf. If, for example, the node program wished to read from a file, it should be able to tell the host program to perform the appropriate actions to open the file and read data, package it up into messages, and transmit it back to the appropriate node. This was a sharp contrast to the older method in which the user was effectively responsible for each of these actions.
The basic outcome was that the user's host programs were replaced with a standard ``file-server'' program called cubix . A set of library routines were then created with a single protocol for transactions between the node programs and cubix , which related to such common activities as opening, reading and writing files, interfacing with the user, and so on.
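The flavor of this single transaction protocol can be sketched in miniature. The request structure, the REQ_* codes, and the function names below are hypothetical illustrations of the idea, not the actual Cubix interface; a queue of structures stands in for the real message channel between nodes and host.

```c
/* A minimal sketch of the Cubix idea: a node-side library call
 * packages an I/O request as a message, and one generic host-side
 * "file-server" loop handles every user's requests.  All names
 * (io_request, node_fopen, cubix_host_serve, REQ_*) are invented. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { REQ_OPEN, REQ_WRITE };

typedef struct {
    int node;             /* which node issued the request */
    int kind;             /* REQ_OPEN, REQ_WRITE, ...      */
    char payload[128];    /* filename or data to write     */
} io_request;

/* In a real system this queue would be a message channel to the host. */
static io_request queue[16];
static int queued = 0;

/* Node-side stand-in for fopen(): rather than touching a (nonexistent)
 * local filesystem, tell the host to open the file on our behalf. */
void node_fopen(int node, const char *name) {
    io_request r = { node, REQ_OPEN, "" };
    strncpy(r.payload, name, sizeof r.payload - 1);
    queue[queued++] = r;
}

/* Host-side server: every hand-written user host program is replaced
 * by this one generic dispatcher.  Returns requests handled. */
int cubix_host_serve(void) {
    int handled = 0;
    for (int i = 0; i < queued; i++) {
        switch (queue[i].kind) {
        case REQ_OPEN:
            /* here the real server would call fopen(queue[i].payload) */
            handled++;
            break;
        case REQ_WRITE:
            handled++;
            break;
        }
    }
    queued = 0;
    return handled;
}
```

The point of the design is visible even in this toy: the node decides what I/O to perform, and the host merely obeys.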
This change produced dramatic results. Now, the node programs could contain calls to such useful functions as printf , scanf , and fopen , which had previously been forbidden. Debugging was much easier, albeit in the old-fashioned way of ``let's print everything and look at it.'' Furthermore, programs no longer needed to be broken down into ``host'' and ``node'' pieces and, as a result, parallel programs began to look almost like the original sequential programs.
Once file I/O was possible from the individual nodes of the machine, graphics soon followed through Plotix [ Flower:86c ], resulting in the development system shown at the heart of the family tree (Figure 5.3 ). The ideas embodied in this set of tools-CrOS III, Cubix, and Plotix-form the basis of the vast majority of C P parallel programs.
Figure 5.3: The C P ``Message-passing'' Family Tree
While the radical change that led to Cubix was happening, the non-crystalline users were still developing alternative communication strategies. As mentioned earlier, rdsort never became popular due to the burden of bookkeeping that was placed on the user and the unfamiliarity of the interrupt concept.
The ``9 routines'' [ Johnson:86a ] attempted to alleviate the most burdensome issues by removing the interrupt nature of the system and performing buffering and queueing internally. The resultant system was a simple generalization of the wtELT concept, which replaced the ``channel'' argument with a processor number. As a result, messages could be sent to non-neighboring nodes. An additional level of sophistication was provided by associating a ``type'' with each message. The recipient of the messages could then sort incoming data into functional categories based on this type.
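The typed-message idea can be illustrated with a small sketch. The names send_to and recv_type are invented for illustration and are not the original nine routines; a single node's incoming queue stands in for the system's internal buffering.

```c
/* A sketch of the "9 routines" style: sends name a destination
 * processor rather than a hardware channel, the system queues
 * messages internally, and the receiver sorts incoming data into
 * functional categories by message "type".  Names are illustrative. */
#include <assert.h>

typedef struct { int src, type, data; } message;

static message mbox[32];   /* incoming queue of one node (node 0 here) */
static int nmsg = 0;

/* Queue a message for the destination; no channel argument needed. */
void send_to(int src, int dest, int type, int data) {
    (void)dest;            /* only one receiving node in this sketch */
    message m = { src, type, data };
    mbox[nmsg++] = m;
}

/* Receive the first queued message of a given type; returns the
 * sender's processor number, or -1 if no such message is waiting. */
int recv_type(int type, int *data) {
    for (int i = 0; i < nmsg; i++) {
        if (mbox[i].type == type) {
            int src = mbox[i].src;
            *data = mbox[i].data;
            for (int j = i; j < nmsg - 1; j++)  /* dequeue it */
                mbox[j] = mbox[j + 1];
            nmsg--;
            return src;
        }
    }
    return -1;
}
```

Note that messages of different types can be consumed out of arrival order, which is exactly the sorting burden that rdsort had left to the user.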
The benefit to the user was a simplified programming model. The only remaining problem was how to handle this newfound freedom of sending messages to arbitrary destinations.
We had originally planned to build a ray tracer from the tools developed while studying melting. There is, however, a fundamental difference between the melting process and the distributed database searching that goes on in a ray tracer. Ray tracing is relatively simple if the whole model can be stored in each processor, but we were considering the case of a geometric database larger than this.
Melting posed problems because the exact nature of the interaction was known only statistically-we might need to communicate with all processors up to two hops away from our node, or three hops, or some indeterminate number. Other than this, however, the problem was quite homogeneous, and every node could perform the same tasks as the others.
The database search is inherently non-deterministic and badly load-balanced because it is impossible to map objects into the nodes where they will be used. As a result, database queries need to be directed through a tree structure until they find the necessary information and return it to the calling node.
A suitable methodology for performing this kind of exercise seemed to be a real multitasking system where ``processes'' could be created and destroyed on nodes in a dynamic fashion which would then map naturally onto the complex database search patterns. We decided to create an operating system.
The crystalline system had been described, at least in the written word, as an operating system. The concept of writing a real operating system with file systems, terminals, multitasking, and so on, was clearly impossible while communication was restricted to single hops across hypercube channels. The new system, however, promised much more. The result was the ``Multitasking, Object-Oriented, Operating System'' (MOOOS, commonly known as MOOSE) [ Salmon:86a ]. The follow-up MOOS II is described in Section 15.2 .
The basic idea was to allow for multitasking-running more than one process per node, with remote task creation, scheduling, semaphores, signals-to include everything that a real operating system would have. The implementation of this system proved quite troublesome and strained the capabilities of our compilers and hardware beyond their breaking points, but these difficulties were nothing compared with the problems encountered by the users.
The basic programming model was of simple, ``lightweight'' processes communicating through message box/pipe constructs. The overall structure was vaguely reminiscent of the standard UNIX multiprocessing system: fork/exec and pipes (Figure 5.4 ). Unfortunately, this was by now completely alien to our programmers, who were all more familiar with the crystalline methods previously employed. In particular, problems were encountered with naming. While a process that created a child would automatically know its child's ``ID,'' it was much more difficult for siblings to identify each other, and hence, to communicate. As a natural result, it was reasonably easy to build tree structures but difficult to perform domain decompositions. Despite these problems, useful programs were developed, including the parallel ray tracer with a distributed database that had originally motivated the design [Goldsmith:86a;87a].
Figure 5.4: A ``MOOSE'' Process Tree
An important problem was that the architectures offered no memory protection between the lightweight processes running on a node. One had to guess how much memory to allocate to each process, which complicated debugging when the user guessed wrong. Later, the Intel iPSC implemented hardware memory protection, which made life simpler ([ Koller:88b ] and Section 15.2 ).
In using MOOSE, we wanted to explore dynamic load-balancing issues. A problem with standard domain decompositions is that irregularities in the work loads assigned to processors lead to inefficiencies since the entire simulation, proceeding in lock step, executes at the speed of the slowest node. The Crystal Router , developed at the same time as MOOSE, offered a simpler strategy.
By 1986, we began to classify our algorithms in order to generalize the performance models and identify applications that could be expected to perform well using the existing technology. This led to the idea of ``loosely synchronous'' programming.
The central concept is one in which the nodes compute for a while, then synchronize and communicate, continually alternating between these two types of activities. This computation model was very well-suited to our crystalline communication system, which enforced synchronization automatically. In looking at some of the problems we were trying to address with our asynchronous communication systems (the ``9 routines'' and MOOSE), we found that although the applications were not naturally loosely synchronous at the level of individual messages, they followed the basic pattern at some higher level of abstraction.
In particular, we were able to identify problems in which it seemed that messages would be generated at a fairly uniform rate, but in which the moment when the data had to be physically delivered to the receiving nodes was synchronized. A load balancer, for example, might use some type of simulated-annealing [ Flower:87a ] or neural-network [ Fox:88e ] approach, as seen in Chapter 11 , to identify work elements that should be relocated to a different processor. As each decision is made, a message can be generated to tell the receiving node of its new data. It would be inefficient, however, to physically send these messages one at a time as the load-balancing algorithm progresses, especially since the results need only be acted upon once the load-balancing cycle has completed.
We developed the Crystal Router to address this problem [Fox:88a;88h]. The idea was that messages would be buffered on their node of origin until a synchronization point was reached when a single system call sent every message to its destination in one pass. The results of this technology were basically twofold.
The resultant system had some of the attractive features of the ``9 routines,'' in that messages could be sent between arbitrary nodes. But it maintained the high efficiency of the crystalline system by performing all its internode communications synchronously. A glossary of terms used is in Figure 5.5 .
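The buffering discipline can be sketched as follows. The names crystal_send and crystal_router are illustrative, and a real implementation would route the buffered messages across hypercube channels in a series of synchronous exchanges rather than copy them directly as this toy does.

```c
/* A sketch of the crystal router discipline: messages are buffered on
 * their node of origin, and nothing moves until a single collective
 * call delivers every queued message in one pass.  Names invented. */
#include <assert.h>

#define MAXMSG 64
#define NODES  8

typedef struct { int dest, data; } pending;

static pending outq[MAXMSG];        /* origin-node buffer       */
static int npending = 0;
static int inbox[NODES][MAXMSG];    /* per-destination inboxes  */
static int ninbox[NODES] = {0};

/* Queue a message locally; nothing is transmitted yet. */
void crystal_send(int dest, int data) {
    pending p = { dest, data };
    outq[npending++] = p;
}

/* The collective synchronization point: all buffered messages are
 * routed to their destinations at once.  Returns messages moved. */
int crystal_router(void) {
    int moved = npending;
    for (int i = 0; i < npending; i++) {
        int d = outq[i].dest;
        inbox[d][ninbox[d]++] = outq[i].data;
    }
    npending = 0;
    return moved;
}
```

A load balancer in the style described above would call crystal_send once per relocation decision, then invoke crystal_router a single time at the end of the balancing cycle.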
The crystal router was an effective system on the early Caltech, JPL, and commercial multicomputers. It minimized latency and interrupt overhead, and used optimal routing. It has not survived as a generally important concept, as it is not needed in this form on modern machines with different topologies and automatic hardware routing.
In all of the software development cycles, one of our primary concerns was portability . We wanted our programs to be portable not only between various types of parallel computers, but also between parallel and sequential computers. It was in this sense that Cubix was such a breakthrough, since it allowed us to leave all the standard runtime system calls in our programs. In most cases, Cubix programs will run either on a supported parallel computer or on a simple sequential machine through the provision of a small number of dummy routines for the sequential machine. Using these tools, we were able to implement our codes on all of the commercially and locally built hypercubes.
The next question to arise, however, concerned possible extensions to alternative architectures, such as shared-memory or mesh-based structures. The crystal router offered a solution. By design, messages in the crystal router can be sent to any other node. This step merely involves construction of a set of appropriate queues. When the interprocessor communication is actually invoked, the system is responsible for routing messages between processors-a step in which potential differences in the underlying hardware architecture can be concealed. As a result, applications using the crystal router can conceivably operate on any type of parallel or sequential hardware.
At the end of 1987, ParaSoft Corporation was founded by a group from C P with the goal of providing a uniform software base-a set of portable programming tools-for all types of parallel processors (Figure 5.6 ).
Figure 5.6: Express System Components
The resultant system, Express [ ParaSoft:88a ], is a merger of the C P message passing tools, developed into a unified system that can be supported on all types of parallel computers. The basic components are:
Additionally, ParaSoft added:
ParaSoft extended the parallel debugger originally developed for the nCUBE hypercube [ Flower:87c ] and created a set of powerful performance analysis tools [ Parasoft:88f ] to help users analyze and optimize their parallel programs. This toolset, incorporating all of the concepts of the original work and available on a wide range of parallel computers, has been widely accepted and is now the most commonly used system at Caltech. It is interesting to note that the most successful parallel programs are still built around the crystalline style of internode communication originally developed for the Mark II hypercube in 1982. While other systems occasionally seem to offer easier routes to working algorithms, we usually find that a crystalline implementation offers significantly better performance.
At the current stage of development, we also believe that parallel processing is reasonably straightforward. The availability of sophisticated debugging tools and I/O systems has resulted in a reduction of several orders of magnitude in debugging time. Similarly, the performance evaluation system has proved very powerful in identifying areas where potential improvements can be made in algorithms.
ParaSoft also supports a range of other parallel computing tools, some of which are described later in this chapter.
It is interesting to compare the work of other organizations with that performed at Caltech. In particular, our problem-solving approach to the art of parallel computing has, in some cases, led us down paths which we have since abandoned but which are still actively pursued by other groups. Yet, a completely fresh look at parallel programming methods may produce a more consistent paradigm than our evolutionary approach. In any case, the choice of a parallel programming system depends on whether the user is more interested in machine performance or ease of programming.
There are several systems that offer some or all of the features of Express, based on long-range communication by message passing. Many are more general operating environments with the features of ``real'' operating systems missing in Express and especially CrOS. We summarize some examples in the following:
JPL developed the Mercury message-passing system [ Lee:86a ] at the same time as we developed the 9 routines at Caltech. Mercury is similar to the 9 routines in that messages can be transmitted between any pair of nodes, irrespective of whether a channel connects them. Messages also have ``types'' and can be sorted and buffered by the system as in the 9 routines or Express. A special type of message allows one node to broadcast to all others.
Centaur is a simulation of CrOS III built on Mercury. This system was designed to allow programmers with crystalline applications the ability to operate either at the level of the hardware with high performance (with the CrOS III library) or within the asynchronous Mercury programming model, which had substantially higher (about a factor of three) message startup latency. When operating in Centaur mode, CrOS III programs may use advanced tools, such as the debugger, which require asynchronous access to the communication hardware.
VERTEX is the native operating system of the nCUBE. It shares with Express, Mercury, and the 9 routines the ability to send messages, with types, between arbitrary pairs of processors. Only two basic functions are supported to send and receive messages. I/O is not supported in the earliest versions of VERTEX, although this capability has been added in support of the second generation nCUBE hypercube.
The Reactive Kernel [ Seitz:88b ] is a message-passing system based on the idea that nodes will normally be sending messages in response to messages coming from other nodes. Like all the previously mentioned systems, the Reactive Kernel can send messages between any pair of nodes with a simple send/receive interface. However, the system call that receives messages does not distinguish between incoming messages. All sorting and buffering must be done by the user. As described in Chapter 16 , Zipcode has been built on top of the Reactive Kernel to provide similar capabilities to Express.
The NX system provided for the Intel iPSC series of multicomputers is also similar in functionality to the previously described long-range communication systems. It supports message types and provides sorting and buffering capabilities similar to those found in Express. No support is provided for nearest-neighbor communication in the crystalline style, although some of the collective communication primitives are supported.
The MACH operating system [ Tevanian:89a ] is a full implementation of UNIX for a shared-memory parallel computer. It supports all of the normally expected operating system facilities, such as multiuser access, disks, terminals, printers, and so on, in a manner compatible with the conventional Berkeley UNIX. MACH is also built with an elegant small (micro) kernel and a careful architecture of the system and user level functionality.
While this provides a strong basis for multiuser processing, it offers only simple parallel processing paradigms, largely based on the conventional UNIX interprocess communication protocols, such as ``pipes'' and ``sockets.'' As mentioned earlier in connection with MOOSE, these types of tools are not the easiest to use in tightly coupled parallel codes. The Open Software Foundation (OSF) has extended and commercialized MACH. They also have an AD (Advanced Development) prototype version for distributed memory machines. The latest Intel Paragon multicomputer offers OSF's new AD version of MACH on every node, but the operating system has been augmented with NX to provide high-performance message passing.
Helios [ DSL:89a ] is a distributed-memory operating system designed for transputer networks-distributed-memory machines. It offers typical UNIX-like utilities, such as compilers, editors, and printers, which are all accessible from the nodes of the transputer system, although fewer than the number supported by MACH. In common with MACH, however, the level of parallel processing support is quite limited. Users are generally encouraged to use pipes for interprocessor communication-no collective or crystalline communication support is provided.
The basic concept used in Linda [ Ahuja:86a ] is the idea of a tuple-space (database) for objects of various kinds. Nodes communicate by dropping objects into the database, which other nodes can then extract. This concept has a very elegant implementation, which is extremely simple to learn, but which can suffer from quite severe performance problems. This is especially so on distributed-memory architectures, where the database searching necessary to find an ``object'' can require intensive internode communication within the operating system.
More recent versions of Linda [ Gelertner:89a ] have extended the original concept by adding additional tuple-spaces and allowing the user to specify to which space an object should be sent and from which it should be retrieved. This new style is reminiscent of a mailbox approach, and is thus quite similar to the programming paradigm used in CrOS III or Express.
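The tuple-space idea can be sketched in miniature. Here ts_out and ts_in are stand-ins for Linda's out() and in() operations (real Linda matches on richer patterns than a single integer tag), and the linear search hints at why matching can become expensive on distributed-memory machines.

```c
/* A toy tuple-space in the spirit of Linda: nodes drop tagged values
 * into a shared space with out(), and any node may later withdraw a
 * matching one with in().  The scan over the space is the operation
 * that, distributed across nodes, generates the internode traffic
 * mentioned in the text.  Names are illustrative. */
#include <assert.h>

typedef struct { int tag, value, live; } tuple;

static tuple space[32];
static int ntuples = 0;

/* Linda-style out(): deposit a tuple into the space. */
void ts_out(int tag, int value) {
    tuple t = { tag, value, 1 };
    space[ntuples++] = t;
}

/* Linda-style in(): destructively withdraw a tuple whose tag matches;
 * returns 1 on success, 0 if no matching tuple is present. */
int ts_in(int tag, int *value) {
    for (int i = 0; i < ntuples; i++) {
        if (space[i].live && space[i].tag == tag) {
            *value = space[i].value;
            space[i].live = 0;   /* tuple is consumed */
            return 1;
        }
    }
    return 0;
}
```

The elegance the text mentions is clear: two operations suffice for all communication. The cost is equally clear: every in() is a search, not a direct channel transfer.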
PVM is a very popular elegant system that is available freely from Oak Ridge [ Sunderam:90a ], [ Geist:92a ]. This parallel virtual machine is notable for its support of a heterogeneous computing environment with, for instance, a collection of disparate architecture computers networked together.
There are several other message-passing systems, including active messages [ Eiken:92a ] discussed earlier, P4 [ Boyle:87a ], PICL [ Geist:90b ], EUI on the IBM SP-1, CSTools from Meiko, Parmacs [ Hempel:91a ], and CMMD on the CM-5 from Thinking Machines. PICL's key feature is the inclusion of primitives to support the gathering of data to support performance visualization (Section 5.4 ). This could be an important feature in such low-level systems.
Most of the ideas in Express, PVM, and the other basic message-passing systems are incorporated in a new Message-Passing Interface (MPI) standard [ Walker:94a ]. This important development tackles basic point-to-point and collective communication. MPI does not address issues such as ``active messages'' or distributed computing and wide-area networks (e.g., what are correct protocols for video-on-demand and multimedia with real-time constraints). Operating systems issues, outside the communication layer, are also not considered in MPI.
The history of our message-passing system work at Caltech is interesting in that its motivation departed significantly from that of most other institutions. Since our original goals were problem-oriented rather than motivated by the desire to do parallel processing research, we tended to build utilities that matched our hardware and software goals rather than our aesthetic sense. If our original machine had had multiplexed DMA channels and specialized routing hardware, we might have started off in a totally different direction. Indeed, this can be seen as motivation for developing some of the alternative systems described in the previous section.
In retrospect, we may have been lucky to have such limited hardware available, since it forced us to develop tools for the user rather than rely on an all-purpose communication system. The resultant decomposition and collective communication routines still provide the basis for most of our successful work-even with the development of Express, we still find that we return again and again to the nearest-neighbor, crystalline communication style, albeit using the portable Express implementation rather than the old rdELT and wtELT calls. Even as we attempt to develop automated mechanisms for constructing parallel code, we rely on this type of technology.
The advent of parallel UNIX variants has not solved the problems of message passing-indeed these systems are among the weakest in terms of providing user-level support for interprocessor communication. We continually find that the best performance, both from our parallel programs and the scientists who develop them, is obtained when working in a loosely synchronous programming environment such as Express, even when this means implementing such a system on top of a native, ``parallel UNIX.''
We believe that the work done by C P is still quite unique, at least in its approach to problem solving. It is amusing to recall the comment of one new visitor to Caltech who, coming from an institution building sophisticated ``parallel UNIXs,'' was surprised to see the low level at which CrOS III operated. From our point of view, however, it gets the job done in an efficient and timely manner, which is of paramount importance.
Relatively little attention was paid in the early days of parallel computers to debugging the resulting parallel programs. We developed our approaches by trial and error during our various experiments in C P, and debugging was never a major research project in C P.
In this section, we shall consider some of the history and current technology of parallel debugging, as developed by C P.
Method 1. Source Scrutiny The way one worked on the early C P machines was to compile the target code, download it to the nodes, and wait. If everything worked perfectly, results might come back. Under any other circumstances, nothing would come out. The only real way to debug was to stare at the source code.
The basic problem was that while the communication routines discussed in the previous chapter were adequate (and in some sense ideal) for the task of algorithm development, they lacked a lot in terms of debugging support. In order to ``see'' the value of a variable inside the nodes, one had to package it up into a message and then send it to the host machine. Similarly, the host code had to be modified to receive this message at the right time and format it for the user's inspection. Even then only node 0 could perform this task directly, and all the other nodes had to somehow get their data to node 0 before it could be displayed.
Given the complexity of this task, it is hardly surprising that users typically stared at their source code rather than attempt it. Ironically, this procedure actually tended to introduce new bugs in the process of detecting the old ones, because incorrect synchronization of the messages in nodes and host would cause the machine to hang, probably somewhere in the new debugging code rather than at the location one was trying to debug. After several hours of fooling around, one would make the painful discovery that the debugging code itself was wrong and would have to start once more.
Method 2. Serial Channels In building the first C P hypercubes, each node had been given a serial RS-232 channel. No one quite knew why this had been done, but it was pointed out that by attaching some kind of terminal, or even a PC, it might be possible to send ``print'' statements out of the back of one or more nodes.
This was quickly achieved but proved less than the dramatic improvement one might have hoped for. The interface was really slow and only capable of printing simple integer values. Furthermore, one had to use it while sitting in the machine room, and it was necessary to attach the serial cable from the debugging terminal to the node being debugged-an extremely hazardous process that could cause other cables to come loose.
A modification of this process that should probably have pointed us in the right direction immediately came when the MS-DOS program DEBUG was adapted for this interface. Finally, we could actually insert breakpoints in the target node code and examine memory!
Unfortunately, this too failed to become popular because of the extremely low level at which it operated. Memory locations had to be specified in hexadecimal and code could only be viewed as assembly language instructions.
A final blow to this method was that our machines still operated in ``single-user'' mode-that is, only a single user could be using the system at any one time. As a result, it was intolerable for a single individual to ``have'' the machine for a couple of hours struggling with the DEBUG program while others waited.
Method 3. Cubix As has been described in the previous section on communication systems, the advent of the Cubix programming style brought a significant improvement to the life of the parallel code developer. For the first time, any node could print out its data values, not using some obscure and arcane functions but with normal printf and WRITE statements. To this extent, debugging parallel programs really did become as simple as debugging sequential ones.
Using this method took us out of the stone age: Each user would generate huge data files containing the values of all the important data and then go to work with a pocket calculator to see what went wrong.
Method 4. Help from the Manufacturer The most significant advance in debugging technology, however, came with the first nCUBE machine. This system embodied two important advances:
The ``real'' kernel was a mixed blessing. As has been pointed out previously, we didn't really need most of its resources and resented the fact that the kernel imposed a message latency almost ten times that of the basic hardware. On the other hand, it supported real debugging capabilities.
Unfortunately, the system software supplied with the nCUBE hadn't made much more progress in this direction than we had with our `` DEBUG '' program. The debugger expected to see addresses in hex and displayed code as assembly instructions. Single stepping was only possible at the level of a single machine instruction.
Method 5. The Node Debugger: ndb Our initial attempt to get something useful out of the nCUBE's debugging potential was something called `` bdb '' that communicated with nCUBE's own debugger through a pipe and attempted to provide a slightly more friendly user interface. In particular, it allowed stack frames to be unrolled and also showed the names of functions rather than the absolute addresses. It was extremely popular.
As a result of this experience, we decided to build a full-blown, user-friendly, parallel programming debugger , finally resulting in the C P and now ParaSoft tool known as `` ndb ,'' the ``node debugger.''
The basics of the design were straightforward, but tedious to code. Much work had to be done building symbol tables from executables, figuring out how line numbers mapped to memory addresses, and so on, but the most important discovery was that a parallel program debugger has to work in a rather different way from normal sequential versions.
Lesson 1. Avoiding Deadlock The first important discovery was that a parallel program debugger couldn't operate in the ``on'' or ``off'' modes of conventional debuggers. In sdb or dbx , for example, either the debugger is in control or the user program is running. There are no other states. Once you have issued a ``continue'' command, the user program continues to run until it either terminates or reaches another breakpoint, at which time you may once again issue debugger commands.
To see how this fails for a parallel program, consider the code outline shown in Figure 5.7 . Assume that we have two nodes, both stopped at breakpoints at line one. At this point, we can do all of the normal debugger activities including examination of variables, listing of the program, and so on. Now assume that we single-step only node 0 . Since line one is a simple assignment we have no problem and we move to line two.
Figure 5.7: Single-Stepping Through Message-Passing Code
Repeating this process is a problem, however, since we now try to send a message to node 1, which is not ready to receive it-node 1 is still sitting at its breakpoint at line one. If we adopted the sequential debugger standard, in which the user program takes control whenever a command is given to step or continue the program, we would now have a deadlock, because node 0 will never return from its single-step command until node 1 does something. On the other hand, node 1 cannot do anything until it is given a debugger command.
In principle, we can get around this problem by redefining the action of the send_message function used in node 0. In the normal definition of our system at that time, this call should block until the receiving node is ready. By relaxing this constraint, we can allow the send_message function to return as soon as the data to be transmitted is safely reusable, without waiting for the receive.
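The relaxed semantics can be illustrated with a toy sketch. The names send_message_nb and recv_message are hypothetical, and a single system buffer stands in for the real message layer; the point is only that the send returns once the user's data has been copied, so the caller may immediately reuse the variable.

```c
/* A sketch of the relaxed send semantics: instead of blocking until
 * the receiver is ready, the send copies the data into a system
 * buffer and returns at once.  Names are illustrative. */
#include <assert.h>
#include <string.h>

static char sysbuf[256];   /* stand-in for the system's message buffer */
static int buffered = 0;

/* Copy the caller's data out, then return immediately; the caller's
 * variable is now "safely reusable". */
void send_message_nb(const void *data, int len) {
    memcpy(sysbuf, data, len);
    buffered = len;
}

/* The matching receive, issued arbitrarily later, sees the value as
 * it was at send time.  Returns the number of bytes delivered. */
int recv_message(void *data, int maxlen) {
    int len = buffered < maxlen ? buffered : maxlen;
    memcpy(data, sysbuf, len);
    buffered = 0;
    return len;
}
```

This buffering rescues the send on line two of the example, but, as the text goes on to show, it cannot rescue a receive whose matching send has never been issued.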
This does not save the debugger. We now expect the single step from line two to line three to return, as will the trivial step to line four. But the single step to line five involves receiving a message from node 1 and no possible relaxing of the communication specification can deal with the fact that node 1 hasn't sent anything.
Deadlock is unavoidable.
The solution to this problem is to simply make debugging a completely autonomous process which operates independently of the user program. Essentially, this means that any debugger command immediately returns and gives the user a new prompt. The single-step command, for example, doesn't wait for anything important to happen but allows the user to carry on giving debugger commands even though the user process may be ``hung'' as a consequence of the single step as shown in Figure 5.7 .
Lesson 1a. Who Gets the Input? As an immediate consequence of the decision to leave the debugger in control of the keyboard at all times, we run into the problem of how to pass input to the program being debugged.
Again, sequential debuggers don't have this problem because the moment you continue or run the program it takes control of the keyboard and you enter data in the normal manner. In ndb , life is not so simple because if you start up your code and it prints on the screen
Enter two integers: [I,J]

or some such, you can't actually type the values because they would be interpreted as debugger commands! One way around this is to have multiple windows on your workstation; in one you type debugger commands; in the other, input to your program. Another solution is to have a debugger command that explicitly switches control to the user program in just the same way that a sequential debugger would: ndb supports both mechanisms.
Lesson 2. Show State Because the debugger operates in the manner just described, it becomes very important to give the user a quick way of seeing when something has really happened. Normal sequential debuggers give you this feedback by simply returning a prompt whenever the user program has encountered a breakpoint or terminated. In our case, we provide a simple command, `` show state ,'' to allow the user to monitor the progress of the node program.
As an example, the output when executed on node 0 at line two might be something like
Node 0: Breakpoint, PC=[foo.c,2] which shows that the node is stopped at a breakpoint at the indicated line of a source file named `` foo.c ''. If we step again, the debugger gives us back a prompt and a very quick `` show state '' command might now show
Node 0: Running, PC=[send.c, 244] showing that the node is now running code somewhere inside a source file called `` send.c ''. Another similar command would probably show something like
Node 0: Breakpoint, PC=[foo.c, 3] indicating that the node had now reached the breakpoint on the next line. If the delay between the first two `` show state '' commands were too long, you might never see the ``Running'' state at all because the node will have performed its ``send'' operation and reached line three.
If you continued with this process of single stepping and probing with the `` show state '' command, you would eventually get to a state where the node would show as ``Running'' in the receive function from which it would never return until node 1 got around to sending its message.
Lesson 3. Sets of Nodes The simplest use of a sequential-style debugger on a parallel program is the one already seen: each command issued by the user is executed on one particular node. Up to now, for example, we have considered only actions on node 0. Obviously, we can't make much progress in the example code shown in Figure 5.7 until node 1 moves from its initial breakpoint at line one.
We might extend the syntax by adding a `` pick '' command that lets us, for example,
pick node 1 and then execute commands there instead of on node 0. This would clearly allow us to make progress in the example we have been studying. On the other hand, it is very tedious to debug this way. Even on as few as four nodes, a sequence such as
pick node 0
show state
pick node 1
show state
pick node 2
show state
pick node 3
show state
is needed frequently and is very painful to type. Running on 512 nodes in this manner is out of the question. The solution adopted for ndb is to use ``node sets.'' In this case, the above effect would be achieved with the command
on all show state
The basic idea is that debugger commands can be applied to more than a single processor at once. In this way, you can obtain global information about your program without spending hours typing commands.
In addition to the simple concepts of a single node and ``all'' nodes, ndb supports other groups such as contiguous ranges of nodes, discontinuous ranges of nodes, ``even'' and ``odd'' parity groups, the ``neighbors'' of a particular node, and user-defined sets of nodes to facilitate the debugging process. For example, the command
on 0, 1, nof 1, even show state executes the `` show state '' command on nodes 0, 1, the neighbors of node 1, and all ``even parity'' nodes.
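The node-set idea can be illustrated with a small sketch. The grammar below is an invention for illustration, not ndb's actual parser, and it assumes a hypercube machine, so that ``even''/``odd'' parity means the parity of the number of 1-bits in the node number and the ``neighbors'' of node k are the nodes whose numbers differ from k in exactly one bit:

```python
def expand_term(term, nnodes):
    """Expand one term of a node-set expression into a set of node ids."""
    term = term.strip()
    dim = nnodes.bit_length() - 1        # hypercube dimension (nnodes a power of two)
    if term == "all":
        return set(range(nnodes))
    if term in ("even", "odd"):
        want = 0 if term == "even" else 1
        return {n for n in range(nnodes) if bin(n).count("1") % 2 == want}
    if term.startswith("nof "):          # neighbors of a node
        k = int(term[4:])
        return {k ^ (1 << b) for b in range(dim)}   # flip each address bit
    if "-" in term:                      # contiguous range such as "0-3"
        lo, hi = map(int, term.split("-"))
        return set(range(lo, hi + 1))
    return {int(term)}                   # a single node

def node_set(expr, nnodes):
    """Union of the comma-separated terms, e.g. "0, 1, nof 1, even"."""
    return set().union(*(expand_term(t, nnodes) for t in expr.split(",")))

print(sorted(node_set("0, 1, nof 1, even", 8)))   # -> [0, 1, 3, 5, 6]
```

On an 8-node hypercube, the example command above thus expands to nodes 0 and 1, the neighbors {0, 3, 5} of node 1, and the even-parity nodes {0, 3, 5, 6}.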
Lesson 4. Smart stepping Once node sets are available to execute commands on multiple processors, another tricky issue comes up concerning single stepping. Going back to the example shown in Figure 5.7, consider the effect of repeatedly issuing a single-step command to both nodes at once, starting from the initial state in which both nodes are at a breakpoint at line one. The intent is fairly obvious: the user wants to single-step over the intermediate lines of code from line one, eventually ending up at line five.
In principle, the objections that gave rise to the independence of debugger and user program should no longer hold, because when we step from line two, both nodes are participating and thus the send/receive combination should be satisfied properly.
The problem, however, is merely passed down to the internal logic of the debugger. While it is true that the user has asked both nodes to step over their respective communication calls, the debugger can hardly be expected to deduce that. If the debugger internally expands the collective single-step so that node 0 steps before node 1, all might be well, since node 0 will step over its ``send'' before node 1 steps over its ``receive,'' a happy result. If, however, the debugger happens to step node 1 before node 0, it will hang just as badly as the original user interaction.
Even though the ``obvious'' fixed ordering works in this case, it is not generally correct; in fact, it fails when stepping from line four to line five in the example, where node 0 must receive before node 1 has sent.
In general, there is no way for the debugger to know how to expand such command sequences reliably, and as a result a much ``smarter'' method of single stepping must be used, such as that shown schematically in Figure 5.8 .
Figure 5.8:
Logic for Single Stepping on Multiple Nodes
The basic idea is to loop over each of the nodes in the set being stepped trying to make ``some'' progress towards reaching the next stopping point. If no nodes can make progress, we check to see if some time-out has expired and if not, continue. This allows us to step over system calls that may take a significant time to complete when measured in machine instructions.
Finally, if no more progress can be made, we attempt to analyze the reason for the deadlock and return to the user anyway.
This process is not foolproof in the sense that we will sometimes ``give up'' on single steps that are actually going to complete, albeit slowly. But it has the great virtue that even when the user program ``deadlocks'', the debugger comes back to the user, often with a correct analysis of the reason for the deadlock.
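The stepping loop of Figure 5.8 can be sketched with a toy machine model. The Node class and its operation lists below are invented stand-ins for live processes; the point is the control logic: cycle round-robin over the node set, keep going while anyone makes progress, and once a time-out expires, give up and report the stuck nodes:

```python
import time

class Node:
    """Toy stand-in for a node process stopped under the debugger."""
    def __init__(self, nid, ops):
        self.nid, self.ops, self.pc = nid, ops, 0
    def done(self):
        return self.pc >= len(self.ops)
    def try_step(self, mailbox):
        """Attempt one step; return True if progress was made."""
        if self.done():
            return False
        op, other = self.ops[self.pc]
        if op == "recv" and other not in mailbox.get(self.nid, []):
            return False                      # blocked: nothing sent yet
        if op == "send":
            mailbox.setdefault(other, []).append(self.nid)
        elif op == "recv":
            mailbox[self.nid].remove(other)   # consume the pending message
        self.pc += 1
        return True

def smart_step(nodes, timeout=0.1):
    """Figure 5.8 logic: loop while any node progresses; on time-out,
    return to the user with the list of stuck nodes (empty if none)."""
    mailbox = {}                              # destination id -> source ids
    deadline = time.monotonic() + timeout
    while any(not n.done() for n in nodes):
        progressed = [n.try_step(mailbox) for n in nodes]
        if any(progressed):
            deadline = time.monotonic() + timeout
        elif time.monotonic() > deadline:
            return [n.nid for n in nodes if not n.done()]
    return []                                 # everyone reached the next stop

# The Figure 5.7 pattern: node 0 sends then receives, node 1 the reverse.
print(smart_step([Node(0, [("send", 1), ("recv", 1)]),
                  Node(1, [("recv", 0), ("send", 0)])]))       # -> []
# Mutual receive: a genuine deadlock, reported rather than hung on.
print(smart_step([Node(0, [("recv", 1)]),
                  Node(1, [("recv", 0)])], timeout=0.01))      # -> [0, 1]
```

Note that no fixed stepping order is assumed: the loop simply retries every node until none can move, which is exactly why it handles both of the expansion orders discussed above.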
Lesson 5. Show queue Another interesting question about the debugger concerns the extensions and/or modifications that one might make to a sequential debugger.
One might be tempted to say that the parallel debugger is so different from its sequential counterparts that a totally new syntax and method of operation is justified. One then takes the chance that no one will invest the time needed to learn the new tool and it will never be useful.
For ndb , we decided to adopt the syntax of the well-known UNIX dbx debugger that was available on the workstations that we used for development. This meant that the basic command syntax was familiar to everyone using the system.
Of course, we have already introduced several commands that don't exist in dbx, simply because sequential debuggers have no need of them. The ``show state'' command is never required in a sequential debugger because the program is either running or it is stopped at a point that the debugger can tell you about. Similarly, one never has to give commands to multiple processors.
Another command whose importance we learned early on was ``show q'', which monitored the messages in transit between processors. Because our parallel programs were really just sequential programs with additional message passing, the ``bugs'' that we were trying to find were not normally algorithmic errors but message-passing ones.
A typical scenario would be that the nodes would compute (correctly) and then reach some synchronization or communication point at which point the logic relating to message transfer would be wrong and everything would hang. At this point, it proved to be immensely useful to be able to go in with the debugger and look at which nodes had actually sent messages to other nodes.
Often one would see something like
Node 0:
Node 1, type 12, len 32
(12 4a 44 82 3e 00 ...)
Node 2, type 12, len 32
(33 4a 5f ff 00 00 ...)
Node 1: No messages
Node 2: No messages
indicating that node 0 has received two messages of type 12 and length 32 bytes, one from node 1 and one from node 2, but that neither node 1 nor node 2 has any.
Armed with this type of information, it is usually extremely easy to detect the commonest type of parallel processing problem.
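A toy rendering of the ``show q'' display may make the idea concrete. The message records here (source node, type, payload) are invented; the real ndb extracts this information from the message-passing system's internal queues:

```python
def show_q(queues, nnodes):
    """Format each node's pending messages in the style shown above."""
    lines = []
    for nid in range(nnodes):
        msgs = queues.get(nid, [])
        if not msgs:
            lines.append(f"Node {nid}: No messages")
            continue
        lines.append(f"Node {nid}:")
        for src, mtype, payload in msgs:
            lines.append(f"  Node {src}, type {mtype}, len {len(payload)}")
    return "\n".join(lines)

# Node 0 holds two unconsumed 32-byte messages; nodes 1 and 2 hold none.
snapshot = {0: [(1, 12, bytes(32)), (2, 12, bytes(32))]}
print(show_q(snapshot, 3))
```

The asymmetry in such a dump (one node's queue full, its partners' queues empty) is often the whole diagnosis.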
Lesson 5a. Message Passing Is Easy An interesting corollary to the debugging style just described is that we learned that debugging message-passing programs was much easier than debugging other styles of parallel programming.
The important advantage that a user-friendly debugger brings to the user is the ability to slow down the execution of the program to the point where the user can ``see'' the things that go wrong. This fits well with the ``message-passing'' methodology since bugs in message passing usually result in the machine hanging. In this state, you have plenty of time to examine what's happening and deduce the error. Furthermore, the problem is normally completely repeatable since it usually relates to a logic error in the code.
In contrast, shared-memory or multiprocessing paradigms are much harder because the bugs tend to depend on the relative timing of various events within the code. As a result, the very act of using the debugger can cause the problem to show up in a different place or even to go away altogether. This is akin to that most frustrating of situations when you are tracking down a bug with print statements, only to find that just as you insert the climactic final statement which will isolate your problem, it disappears entirely!
Lesson 6. How Many Windows? The debugger ndb was originally designed to be driven from a terminal by users typing commands, but with the advent of graphical workstations with windowing systems it was inevitable that someone would want a ``windowing'' version of the debugger.
It is interesting to note that many users' original conception was that it would now be correct to port a sequential debugger and have multiple instances of it, each debugging one node.
This illusion is quickly dispelled, however, when one debugs a program on many nodes with many invocations of a sequential debugger. Not only is it time-consuming to set up all of the windows, but activities such as single stepping become extremely frustrating, since one has to go to each window in turn and type the ``continue'' command. Even providing a ``button'' that can be clicked to achieve this doesn't help much, because you still have to be able to see the button among the overlapping windows, and however fast you are with the mouse, this gets harder and harder as the number of nodes on which your program is running grows.
Our attempt at solving this problem is to have two different window types: an ndb console and a node window. The console is basically a window-oriented version of the standard debugger. The lower panel allows the user to type any of the normal debugger commands and have them behave in the expected fashion. The buttons at the top of the display allow ``shortcuts'' for the often issued commands, and the center panel allows a shortcut for the most popular command of all:
on all show state This button doesn't actually generate the output from this command in the normal mode since, brief as its output is, it can still be tedious watching 512 copies of
Node XXX: Breakpoint, [foo.c, 13] scroll past. Instead, it presents the node state as a colored bar chart in which the various node states each have different colors. In this way, for example, you can ``poll'' until all the nodes hit a breakpoint by continually hitting the `` Update '' button until the status panel shows a uniform color and the message shows that all nodes have reached a breakpoint.
In addition to this usage, the color coding also vividly shows problems such as a node dividing by zero. In this case, the bar chart would show uniform colors except for the node that has died, which might show up in some contrasting shade.
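The logic behind the status panel reduces to collapsing per-node states into counts. A minimal sketch (the state names are those used in the examples above; the function name is an invention):

```python
from collections import Counter

def state_summary(states):
    """Counts per state, plus whether every node has stopped at a breakpoint."""
    counts = Counter(states)
    return counts, set(counts) == {"Breakpoint"}

# 510 nodes at a breakpoint, one still running, one dead: not uniform yet,
# and the lone "Dead" entry is the contrasting bar the text describes.
states = ["Breakpoint"] * 510 + ["Running", "Dead"]
counts, uniform = state_summary(states)
print(counts["Breakpoint"], uniform)   # -> 510 False
```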
The second important use of the `` Update '' button is to synchronize the views presented by the second type of window, the ``node windows.''
Each of these presents a view of a ``group'' of nodes, represented by one chosen member. Thus, for example, you might create a node window for nodes 0-3, represented by node 0. In this case, the upper panel of the node window shows the source code being executed by node 0 while the lower panel automatically directs commands to all four nodes in the group. The small status bar in the center shows a ``smiley'' face if all nodes in the group appear to be at the same source line number and a ``sad'' face if one or more nodes are at different places.
This method allows the user to control large groups of nodes and simultaneously see their source code while also monitoring differences in behavior. A common use of the system, for example, is to start with a single node window reflecting ``all nodes'' and to continue in this way until the happy face becomes sad, at which point additional node windows can be created to monitor those nodes which have departed from the main thread of execution.
The importance of the `` Update '' button in this regard is that the node windows have a tendency to get out of sync with the actual execution of the program. In particular, it would be prohibitively expensive to have each node window constantly tracking the program location of the nodes it was showing, since this would bombard the node program with status requests and also cause constant scrolling of the displayed windows. Instead, ndb chooses its own suitable points to update the displayed windows and can be forced to update them at other times with the `` Update '' button.
Guy Robinson
This section has emphasized the differences between ndb and sequential debuggers, since those are the interesting features from the implementation standpoint. From the user's view, on the other hand, the most striking success of the tool is that it has made the debugging process so little different from that used on sequential codes. This can be traced to the loosely synchronous structure of most C P parallel codes. Debugging fully asynchronous parallel codes can be much more challenging than the sequential case.
In practice, users have to be shown only once how to start up the debugging process, and be given a short list of the new commands that they might want to use. For users who are unfamiliar with the command syntax, the simplest route is to have them play with dbx on a workstation for a few minutes.
After this, the process tends to be very straightforward, mostly because of the programming styles that we tend to use. As mentioned in an earlier section, debugging totally asynchronous programs that generate multiple threads synchronizing with semaphores in a time-dependent manner is not ndb 's forte. On the other hand, debugging loosely synchronous message-passing programs has been reduced to something of a triviality.
In some sense, we can hardly be said to have introduced anything new. The basis on which ndb operates is very conventional, although some of the implications for the implementation are non-trivial. On the other hand, it provides an important and often critical service to the user. The next section will describe some of the more revolutionary steps that were taken to simplify the development process in the areas of performance analysis and visualization.
From the earliest days of parallel computing, the fundamental goal was to accelerate the performance of algorithms that ran too slowly on sequential machines. As has been described in many other places in this book, the effort to do basic research in computer science was always secondary to the need for algorithms that solved practical problems more quickly than was possible on other machines.
One might think that an important prerequisite for this would be advanced profiling technology. In fact, about the most advanced piece of equipment then in use was a wristwatch! Most algorithms were timed on one node, then on two, then on four, and so on. The results of this analysis were then compared with the theoretically derived models for the applications. If all was well, one proceeded to number-crunch; if not, one inserted print statements and timed the gaps between them to see what pieces of code were behaving in ways not predicted by the models.
Even the breakthrough of having a function that a program could call to get at timing information was a long time coming, and even then proved somewhat unpopular, since it had different names on different machines and didn't even exist on the sequential machines. As a result, people tended to just not bother with it rather than mess up their codes with many different timing routines.
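The portability complaint above is the kind of thing a thin shim removes: one project-local timer name, mapped onto whatever the platform provides, so the same calls survive a move between machines. A minimal sketch (the names cp_time and timed are inventions, not the historical routines):

```python
import time

def cp_time():
    """One timer name for every machine: seconds from an arbitrary origin.
    On another platform this body would wrap that platform's clock call."""
    return time.perf_counter()

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    t0 = cp_time()
    result = fn(*args)
    return result, cp_time() - t0

total, elapsed = timed(sum, range(100000))
print(total, elapsed >= 0.0)
```

The point is not the timer itself but the isolation: codes call cp_time everywhere, and only the shim changes per machine.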
Of course, this was all totally adequate for the first few applications that were parallelized, since their behavior was so simple to model. A program solving Laplace's equation on a square grid, for example, has a very simple model that one would actually have to work quite hard not to find in a parallel code. As time passed, however, more complex problems were attempted which weren't so easy to model and tools had to be invented.
Of course, this discussion has missed a rather important point which we also tended to overlook in the early days.
When comparing performance of the problems on one, two, four, eight, and so on nodes, one is really only assessing the efficiency of the parallel version of the code. However, an algorithm that achieves 100 percent efficiency on a parallel computer may still be worthless if its absolute performance is lower than that of a sequential code running on another machine.
Again, this was not so important in the earliest days, since the big battle over architectures had not yet arisen. Nowadays, however, when there is a multitude of sequential and parallel supercomputers, it is extremely important to be able to know that a parallel version of a code is going to outperform a sequential version running on another architecture. It is becoming increasingly important to be able to understand what complex algorithms are doing and why, so that the performance of the software and hardware can both be tuned to achieve best results.
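The distinction can be put in numbers. The timings below are invented, but they show how a code can be perfectly ``efficient'' relative to its own one-node version and still lose, in absolute terms, to a sequential machine:

```python
def speedup(t1, tp):
    """Speedup of the p-node time tp relative to the one-node time t1."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Fraction of ideal speedup achieved on p nodes."""
    return speedup(t1, tp) / p

t1, t64 = 640.0, 10.0     # the parallel code on 1 node and on 64 nodes
t_other = 8.0             # the same problem on some other sequential machine

eff = efficiency(t1, t64, 64)   # 1.0: a "perfectly efficient" parallel code
worthwhile = t64 < t_other      # False: still slower in absolute terms
print(eff, worthwhile)
```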
This section attempts to discuss some of the issues surrounding algorithm visualization, parallelization, and performance optimization, and the tools which C P developed to help in this area. A major recent general tool, PABLO [Reed:91a], has been developed at Illinois by Reed's group, but here we only describe the C P activity. One of the earliest tools was Seecube [Couch:88a;88b].
The first question that must be asked of any algorithm when a parallel version is being considered is, ``What does it do?'' Surprisingly, this question is often quite hard to answer. Vague responses such as ``some sort of linear algebra'' are quite common and even if the name of the algorithm is actually known, it is quite surprising how often codes are ported without anyone actually having a good impression of what the code does.
One attempt to shed light on these issues by providing a data visualization service is vtool . One takes the original (sequential) source code and runs it through a preprocessor that instruments various types of data access. The program is then compiled with a special run time library and run in the normal manner. The result is a database describing the ways in which the algorithm or application makes use of its data.
Once this has been collected, vtool provides a service analogous to a home VCR which allows the application to be ``played back'' to show the memory accesses being made. Sample output is shown in Figure 5.9 .
Figure 5.9:
Analysis of a Sorting Algorithm Using
vtool
The basic idea is to show ``pictures'' of arrays together with a ``hot spot'' that shows where accesses and updates are being made. As the hot spot moves, it leaves behind a trail of gradually fading colors that dramatically shows the evolution of the algorithm. As this proceeds, the corresponding source code can be shown, and the whole simulation can be stopped at any time so that a particularly interesting sequence can be replayed in slow motion or even one step at a time, both forward and backward.
In addition to showing simple access patterns, the display can also show the values being stored into arrays, providing a powerful way of debugging applications.
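The flavor of vtool can be captured by wrapping an array so that every read and write is logged and can later be ``played back.'' The recording scheme below is a stand-in for vtool's preprocessor-based instrumentation, applied here to the sorting example of Figure 5.9:

```python
class TracedArray:
    """An array that logs every read and write, vtool-style."""
    def __init__(self, data):
        self.data = list(data)
        self.trace = []                  # ordered (operation, index) events
    def __getitem__(self, i):
        self.trace.append(("read", i))
        return self.data[i]
    def __setitem__(self, i, value):
        self.trace.append(("write", i))
        self.data[i] = value
    def __len__(self):
        return len(self.data)

def insertion_sort(a):
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]              # shift larger elements right
            j -= 1
        a[j + 1] = key

a = TracedArray([3, 1, 2])
insertion_sort(a)
print(a.data)        # -> [1, 2, 3]
print(a.trace[:3])   # the first few "hot spot" events of the playback
```

Replaying the trace list slowly, with each index lighting up as it is touched, is precisely the VCR-style display described above.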
In the parallel processing arena, this tool is normally used to understand how an algorithm works at the level of its memory references. Since most parallel programs are based on the idea of data distribution, it is important to know how the values at a particular grid point or location in space depend on those of its neighbors. This is fundamental to the selection of a parallelization method. It is also central to understanding how the parallel and sequential versions of the code will differ, which becomes important when the optimization process begins.
It should be mentioned in passing that we have been surprised in using this tool how often people's conceptions of the way that numerical algorithms work are either slightly or completely revised after seeing the visualization system at work.
Hopefully, the visualization system goes some way towards the development of a parallel algorithm. One must then code and debug the application which, as has been described previously, can be a reasonably time-consuming process. Finally, one comes to the ``crisis'' point of actually running the parallel code and seeing how fast it goes.
One of our major concerns in developing performance analysis tools was to make them easy to use. The standard UNIX method of taking the completed program, deleting all its object files, and then recompiling them with special switches seemed to be asking too much for parallel programs because the process is so iterative. On a sequential machine, the profiler may be run once or twice, usually just to check that the authors' impressions of performance are correct. On a parallel computer, we feel that the optimization phase should more correctly be included in the development cycle than as an afterthought, because we believe that few parallel applications perform at their best immediately after debugging is complete. We wanted, therefore, to have a system that could give important information about an algorithm without any undue effort.
The system to be described works with the simple addition of either a runtime switch or the definition of an ``environment'' variable, and makes available about 90% of the capabilities of the entire package. To use some of the most exotic features, one must recompile code.
As an example of the ``free'' profiling information that is available consider the display from the ctool utility shown in Figure 5.10 . This provides a summary of the gross ``overheads'' incurred in the execution of a parallel application divided into categories such as ``calculation,'' ``I/O,'' ``internode communication,'' ``graphics,'' and so on. This is the first type of information that is needed in assessing a parallel program and is obtained by simply adding a command line argument to an existing program.
Figure 5.10:
Overhead Summary from
ctool
At the next level of detail after this, the individual overhead categories can be broken down into the functions responsible for them. Within the ``internode communication'' category, for example, one can ask to be shown the times for each of the high-level communication functions, the number of times each was called and the distribution of message lengths used by each. This output is normally presented graphically, but can also be generated in tabular form (Figure 5.11 ) for accurate timing measurements. Again, this information can be obtained more or less ``for free'' by giving a command line argument.
Figure 5.11:
Tabular Overhead Summary
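The aggregation behind a ctool-style display is straightforward: collect (category, seconds) samples and reduce them to totals and percentages. The categories and timings below are invented for illustration:

```python
def overhead_summary(samples):
    """Reduce (category, seconds) samples to {category: (total, percent)}."""
    totals = {}
    for cat, secs in samples:
        totals[cat] = totals.get(cat, 0.0) + secs
    grand = sum(totals.values())
    return {cat: (t, 100.0 * t / grand) for cat, t in totals.items()}

samples = [("calculation", 8.0), ("communication", 1.5),
           ("calculation", 6.0), ("I/O", 0.5)]
# Print the summary table, largest category first.
for cat, (secs, pct) in sorted(overhead_summary(samples).items(),
                               key=lambda kv: -kv[1][0]):
    print(f"{cat:14s} {secs:6.2f}s {pct:5.1f}%")
```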
The overhead summaries just described answer the important question, ``What are the costs of executing this algorithm in parallel?'' Once this information is known, one typically proceeds to the question, ``Why do they cost this much?''
To answer this question we use etool , the event-tracing profiler.
The purpose of this tool is probably most apparent from its sample output, Figure 5.12 . The idea is that we present timelines for each processor on which the most important ``events'' are indicated by either numbered boxes or thin bars. The former indicate atomic events such as ``calling subroutine foo '' or ``beginning of loop at line 12,'' while the bars are used to indicate the beginning and end of extended events such as a read operation on a file or a global internode communication operation.
Figure 5.12:
Simple Event Traces
The basic idea of this tool is to help understand why the various overheads observed in the previous analysis exist. In particular, one looks for behavior that doesn't fit with that expected of the algorithm.
A common task, for example, is to look for places where a ``loosely synchronous'' operation is held up by the late arrival of one or more processors at the synchronization point. Spotting this is quite simple in etool: an ``optimal'' loosely synchronous event would have bars in each processor that align perfectly in the vertical direction. The impact of a late processor shows up quite vividly, as shown in Figure 5.13.
Figure 5.13:
Sample Application Behavior as Seen by
etool
This normally occurs either because of a poorly constructed algorithm or because of poor load balancing due to data dependencies.
An alternative pattern that shows up remarkably well is the sequential behavior of ``master-slave'' or ``client-server'' algorithms in which one particular node is responsible for assigning work to a number of other processors. These algorithms tend to show patterns similar to that of Figure 5.12 , in which the serialization of the loop that distributed work is quite evident.
Another way that the event-profiling system can be used is to collect statistics regarding the usage of particular code segments. Placing calls to the routine eprof_toggle around a particular piece of code causes information to be gathered describing how many times that block was executed, and the mean and variance of the time spent there. This is analogous to the ``block profiling'' supported by some compilers.
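A self-contained analogue of this block-profiling idea is to bracket a region of code and accumulate the count, mean, and variance of the time spent inside it, here via Welford's online update. The interface below is an invention, not ndb's actual eprof_toggle:

```python
import time

class BlockProfile:
    """Bracket a code block; accumulate count, mean, and variance of its time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0                   # running sum of squared deviations
    def __enter__(self):
        self._t0 = time.perf_counter()
        return self
    def __exit__(self, *exc_info):
        dt = time.perf_counter() - self._t0
        self.n += 1
        delta = dt - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (dt - self.mean)   # Welford's online update
        return False
    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

prof = BlockProfile()
for _ in range(100):
    with prof:                           # the code block being measured
        sum(range(1000))

print(prof.n, prof.mean >= 0.0, prof.variance >= 0.0)
```

A large variance relative to the mean is itself diagnostic: it suggests the block's cost is data dependent, the classic source of the load imbalance seen in Figure 5.13.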
The system first described, vtool , had as its goal the visualization of sequential programs prior to their parallelization. The distribution profiler dtool serves a similar purpose for parallel programs which rely on data distribution for their parallelism. The basic idea is that one can ``watch'' the distribution of a particular data object change as an algorithm progresses. Sample output is shown in Figure 5.14 .
Figure 5.14:
Data Distribution Analysis
At the bottom of the display is a timeline which looks similar to that used in the event profiler, etool . In this case, however, the events shown are the redistribution operations on a particular data object. Clicking on any event with the mouse causes a picture of the data distribution among the nodes to be shown in the upper half of the display. Other options allow for fast and slow replays of a particular sequence of data transformations.
The basic idea of this tool is to look at the data distributions that are used with a view to either optimizing their use or looking for places in which redundant transformations are being made that incur high communication costs. Possible restructuring of the code may eliminate these transformations, thus improving performance. This is particularly useful in conjunction with automatic parallelization tools, which have a tendency to insert redundant communication in an effort to ensure program correctness.
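One analysis that a dtool-style event list makes easy is spotting redistributions that move an object into the distribution it is already in. The event records below (time, object, target distribution) are invented for illustration:

```python
def redundant_redistributions(events):
    """Return the events that re-apply an object's current distribution."""
    current = {}                               # object -> current distribution
    redundant = []
    for t, obj, dist in events:
        if current.get(obj) == dist:
            redundant.append((t, obj, dist))   # pure overhead: no change made
        current[obj] = dist
    return redundant

events = [(0.1, "A", "block-row"),
          (0.4, "A", "block-column"),   # a genuine transpose-style move
          (0.7, "A", "block-column"),   # redundant: already column-distributed
          (0.9, "B", "block-row")]
print(redundant_redistributions(events))   # -> [(0.7, 'A', 'block-column')]
```

Each flagged event is communication that a restructured code, or a smarter parallelizing tool, could simply delete.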
As mentioned earlier, the most often neglected question with parallel applications is how fast they are in absolute terms. It is possible that this is a throwback to sequential computers, where profiling tools, although available, are rarely used. In most cases, if a program doesn't run fast enough when all the compiler's optimization capabilities are exhausted, one merely moves to a higher performance machine. Of course, this method doesn't scale well and doesn't apply at all in the supercomputer arena. Even more importantly, as processor technology becomes more and more complex, the performance gap between the peak speed of a system and that attained by compiled code gets ever wider.
The typical solution on sequential computers is the use of profiling tools like prof or gprof, which provide a tabular listing of the routines in a program and the amount of time spent in each. This avoids the use of the wristwatch, but only goes so far: you can certainly see which routines are the most expensive, but nothing below that level.
The profiler xtool was designed to serve this purpose for parallel computers and in addition to proceed to lower levels of resolution: source code and even machine instructions. Sample displays are shown in Figure 5.15 . At the top is a graphical representation of the time spent executing each of the most expensive routines. The center shows a single routine at the level of its source code and the bottom panel shows individual machine instructions.
Figure 5.15:
Output from the CPU Usage Profiler
The basic goal of this presentation is to allow the user to see where CPU time is being spent at any required level of detail. At the top level, one can use this information to develop or restructure algorithms, while at the lowest level one can see how the processor instructions operate and use this data to rework pieces of code in optimized assembly language.
Note that while the other profiling tools are directed specifically towards understanding the parallel processing issues of an application, this tool is aimed mostly at a thorough understanding of sequential behavior.
One of the most often asked questions about this profiling system is why there are so many separate tools rather than an all-encompassing system that tells you everything you wish to know about the application.
Our fundamental reason for choosing this method was to attempt to minimize the ``self-profiling'' problem that tends to show up in many systems in which the profiling activity actually spends most of its time profiling the analysis system itself. Users of the UNIX profiling tools, for example, have become adept at ignoring entries for routines such as mcount , which correspond to time spent within the profiling system itself.
Unfortunately, this is not so simple in a parallel program. In sequential applications, the effect of the profiling system is merely to slow down other types of operation, an effect which can be compensated for by merely subtracting the known overheads of the profiling operations. On a parallel computer, things are much more complicated, since slowing down one processor may affect another which in turn affects another, and so on until the whole system is completely distorted by the profiling tools.
Our approach to this problem is to package the profiling subsystems into subsets which have more or less predictable effects, and then to let the user decide which to use in which cases. For example, the communication profiler, ctool, incurs very small overheads, typically a fraction of one percent, while the event profiler costs more and the CPU usage profiler, xtool, most of all. In common use, therefore, we tend to run the communication profiler first, and then enable the event traces. If these two trials yield consistent results, we move on to the execution and distribution profilers. We have yet to encounter an application in which this approach has failed, although the fact that we are rarely interested in microsecond accuracy helps in this regard.
Interestingly, we have found problems due to ``clock-skewing'' to have negligible impact on our results. It is true that clock skewing occurs in most parallel systems, but we find that our profiling results are accurate enough to be helpful without taking any special precautions in this regard. Again, this is mostly due to the fact that, for the kinds of performance analysis and optimization in which we are interested, resolution of tens or even hundreds of microseconds is usually quite acceptable.
Our assumption that parallel algorithms are complex entities seems to be borne out by the fact that nearly everyone who has invested the (minimal) time to use the profiling tools on their application has come away understanding something better than before. In some cases, the revelations have been so profound that significant performance enhancements have been made possible.
In general, the system has been found easy to use, given a basic understanding of the parallel algorithm being profiled, and most users have no difficulty recognizing their applications from the various displays. On the other hand, the integration between the different profiling aspects is not yet as tight as one might wish, and we are currently working to improve it.
Another interesting issue that comes up with great regularity is the request from users for a button marked ``Why?'', which would automatically analyze the profile data being presented and then point out a block of source code together with a suggestion for how to improve its performance. In general, this is clearly too difficult, but it is interesting to note that certain types of runtime system are more amenable to this type of analysis than others. The ``distribution profiler,'' for instance, possesses enough information to perform quite complex communication and I/O optimizations on an algorithm, and we are currently exploring ways of implementing these strategies. It is possible that this line of thought may eventually lead us to a more complete programming model than is in use now: one which will be more amenable to the automation of parallel processing that has long been our goal.
Synchronous problems have been defined in Section 3.4 as having the simplest temporal or computational structure. The problems are typically defined by a regular grid, as illustrated in Figure 4.3 , and are parallelized by a simple domain decomposition. A synchronous temporal structure corresponds to each point in the data domain being evolved with an identical computational algorithm, and we summarize this in the caricature shown in Figure 6.1 . We find several important synchronous problems in the academic applications, which formed the core of C P's work. We expect-as shown in Chapter 19 -that the ``real world'' (industry and government) will show fewer problems of the synchronous class. One hopes that a fundamental theory will describe phenomena in terms of simple, elegant, and uniform laws; these are likely to lead to a synchronous computational (temporal) structure. On the other hand, real-world problems typically involve macroscopic phenomenological models as opposed to fundamental theories of the microscopic world. Correspondingly, we find in the real world more loosely synchronous problems that only exhibit macroscopic temporal synchronization.
Figure 6.1: The Synchronous Problem Class
There is no black-and-white definition of synchronous since, practically, we allow some violations of the rigorous microscopic synchronization. This is already seen in Section 4.2 's discussion of the irregularity of Monte Carlo ``accept-reject'' algorithms. A deeper example is irregular geometry problems, such as the partial differential equations of Chapters 9 and 12 with an irregular mesh. The simplest of these can be implemented well on SIMD machines as long as each node can access different addresses. In the High Performance Fortran analysis of Chapter 13 , there is a class of problems lacking the regular grid of Figure 4.3 . They cannot be expressed in terms of Fortran 90 with arrays of values. However, the simpler irregular meshes are topologically rectangular; they can be expressed in Fortran 90 with an array of pointers. The SIMD Maspar MP-1,2 supports this node-dependent addressing and terms it an ``autonomous SIMD'' feature. We believe that just as SIMD is not a precise computer architecture, the synchronous problem class will also inevitably be somewhat vague, with some problems having architectures in a grey area between synchronous and loosely synchronous.
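The ``array of pointers'' formulation can be sketched as a gather through an index array, which is exactly the node-dependent addressing that autonomous SIMD supplies. The following Python/NumPy fragment is purely illustrative (the values and neighbor table are made up); it is not vendor code for the MP-1,2:

```python
import numpy as np

# One value per mesh point, and an index array playing the role of the
# "array of pointers": each point names its (irregular) neighbor.
values = np.array([1.0, 2.0, 4.0, 8.0])
neighbor = np.array([1, 0, 3, 2])

# Every node executes the same instruction, but each reads a different
# address through the index array -- a gather.
gathered = values[neighbor]            # -> [2.0, 1.0, 8.0, 4.0]

# A data-parallel relaxation step using the gathered neighbor values.
new_values = 0.5 * (values + gathered)
print(new_values)                      # -> [1.5 1.5 6.  6. ]
```

Every point runs the same relaxation instruction; only the addresses differ, which is why a topologically rectangular irregular mesh still fits the synchronous class.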
The applications described in Chapter 4 were all run on MIMD machines using the message-passing model of Chapter 5 . Excellent speedups were obtained. Interestingly, even when C P acquired a SIMD CM-2, which also supported this problem class well, we found it hard to move onto this machine because of the different software model-the data parallel languages of Chapter 13 -offered by SIMD machines. The development of High Performance Fortran, reviewed in Section 13.1 , now offers the same data-parallel programming model on SIMD and MIMD machines for synchronous problems. Currently, nobody has efficiently ported the message-passing model to SIMD machines-even with the understanding that it would only be effective for synchronous problems. It may be that, with this restriction accepted, the message-passing model could yet be implemented on SIMD machines.
This chapter includes a set of neural network applications. This is an important class of naturally parallel problems, and represents one approach to answering the question:
``How can one apply massively parallel machines to artificial intelligence (AI)?''
We were asked this many times at the start of C P, since AI was one of the foremost fields in computer science at the time. Today, the initial excitement behind the Japanese fifth-generation project has abated and AI has transitioned to a routine production technology which is perhaps more limited than originally believed. Interestingly, the neural network approach leads to a synchronous structure, whereas the complementary actor or expert system approaches have a very different asynchronous structure. The high-temperature superconductivity calculations in Section 6.3 made a major impact on the condensed matter community. Quoting from Nature [ Maddox:90a ]:
``Yet some progress seems to have been made. Thus Hong-Qiang Ding and Miloje S. Makivic, from California Institute of Technology, now describe an exceedingly powerful Monte Carlo calculation of an antiferromagnetic lattice designed to allow for the simulation of ( Phys. Rev. Lett. 64 , 1,449; 1990). In this context, a Monte Carlo simulation entails starting with an arbitrary arrangement of spins on the lattice, and then changing them in pairs according to rules that allow all spin states to be reached without violating the overall constraints. The authors rightly boast of their access to Caltech's parallel computer system, but they have also devised a new and efficient algorithm for tracing out the evolution of their system. As is the custom in this part of the trade, they have worked with square patches of two-dimensional lattice with as many as 128 lattice spacings to each side. The outcome is a relationship between correlation length-the distance over which order, on the average, persists-and temperature; briefly, the logarithm of the correlation length is inversely proportional to the temperature. That, apparently, contradicts other models of the ordering process. In lanthanum copper oxide, the correlation length agrees well with that measured by neutron diffraction below (where there is a phase transition), provided the interaction energy is chosen appropriately. For what it is worth, that energy is not very different from estimates derived from Raman-scattering experiments, which provide a direct measurement of the energy of interaction by the change of frequency of the scattered light.''
The hypercube allowed much larger high-T_c calculations than the previous state of the art on conventional machines. Curiously, with QCD simulations (described in Section 4.3 ), we were at best only able to match the size of the Cray calculations of other groups. This probably reflects different cultures and computational expectations of the QCD and condensed matter communities. C P had the advantage of dedicated facilities and could devote them to the most interesting applications.
Section 6.2 describes an early calculation, which was a continuation of our collaboration with Sandia on nCUBE applications. They, of course, followed this with a major internal activity, including their impressive performance analysis of 1024-node applications [ Gustafson:88a ]. There were several other synchronous applications in C P that we will not describe in this book. Wasson solved the single-particle Schrödinger equation on a regular grid to study the ground state of nuclear matter as a function of temperature and pressure. His approach used the time-dependent Hartree-Fock method, but was never taken past the stage of preliminary calculations on the early Mark II machines [ Wasson:87a ]. There were also two interesting signal-processing algorithms. Pollara implemented the Viterbi algorithm for convolutional decoding of data sent on noisy communication channels [ Pollara:85a ], [ Pollara:86a ]. This has similarities with the Cooley-Tukey binary FFT parallelization described in [ Fox:88a ]. We also looked at alternatives to this binary FFT in a collaboration with Aloisio from the Italian Space Agency. The prime number (nonbinary) discrete Fourier transform produces a more irregular communication pattern than the binary FFT and, further, the node calculations are less easy to pipeline than in the conventional FFT. Thus, it is hard to achieve the theoretical advantage of the nonbinary FFT, which often needs fewer floating-point operations for a given analysis, whose natural problem size may not be the power of two demanded by the binary FFT [Aloisio:88a;89b;90b;91a;91b]. This parallel discrete FFT was designed for synthetic aperture radar applications for the analysis of satellite data [Aloisio:90c;90d].
The applications in Sections 6.7.3 , 6.5 , and 6.6 use the important multiscale approach to a variety of vision or image processing problems. Essentially, all physical problems are usefully considered at several different length scales, and we will come back to this in Chapters 9 and 12 when we study partial differential equations (multigrid) and particle dynamics (fast multipole).
This work implemented a code on the nCUBE-1 hypercube for studying the evolution of two-dimensional, convectively-dominated fluid flows. An explicit finite difference scheme was used that incorporates the flux-corrected transport (FCT) technique developed by Boris and Book [ Boris:73a ]. When this work was performed in 1986-1987, it was expected that explicit finite difference schemes for solving partial differential equations would run efficiently on MIMD distributed-memory computers, but this had only been demonstrated in practice for ``toy'' problems on small hypercubes of up to 64 processors. The motivation behind this work was to confirm that a bona fide scientific application could also attain high efficiencies on a large commercial hypercube. The work also allowed the capabilities and shortcomings of the newly-acquired nCUBE-1 hypercube to be assessed.
Although first-order finite difference methods are monotonic and stable, they are also strongly dissipative, causing the solution to become smeared out. Second-order techniques are less dissipative, but are susceptible to nonlinear, numerical instabilities that cause nonphysical oscillations in regions of large gradient. The usual way to deal with these types of oscillation is to incorporate artificial diffusion into the numerical scheme. However, if this is applied uniformly over the problem domain, and enough is added to dampen spurious oscillations in regions of large gradient, then the solution is smeared out elsewhere. This difficulty is also touched upon in Section 12.3.1 . The FCT technique is a scheme for applying artificial diffusion to the numerical solution of a convectively-dominated flow problem in a spatially nonuniform way. More artificial diffusion is applied in regions of large gradient, and less in smooth regions. The solution is propagated forward in time using a second-order scheme in which artificial diffusion is then added. In regions where the solution is smooth, some or all of this diffusion is subsequently removed, so the solution there is basically second order. Where the gradient is large, little or none of the diffusion is removed, so the solution in such regions is first order. In regions of intermediate gradient, the order of the solution depends on how much of the artificial diffusion is removed. In this way, the FCT technique prevents nonphysical extrema from being introduced into the solution.
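The diffuse-then-antidiffuse idea can be made concrete with a minimal one-dimensional FCT step for linear advection. This is an illustrative sketch in the Boris-Book spirit, not the production code described in this section; the upwind low-order scheme, the Lax-Wendroff antidiffusive flux, and the limiter follow the standard published construction:

```python
import numpy as np

def fct_advect(u, c):
    """One step of 1D flux-corrected transport for u_t + a u_x = 0.

    c = a*dt/dx is the Courant number, assumed 0 < c < 1.
    """
    um1 = np.roll(u, 1)
    # 1) Monotone but diffusive first-order (upwind) transport.
    utd = u - c * (u - um1)

    # 2) Raw antidiffusive flux at i+1/2:
    #    (Lax-Wendroff flux) - (upwind flux), as a cell difference.
    a_raw = 0.5 * c * (1.0 - c) * (np.roll(u, -1) - u)

    # 3) Boris-Book limiter: remove only as much diffusion as is
    #    possible without creating new extrema in the solution.
    d_up = utd - np.roll(utd, 1)                 # utd[i]   - utd[i-1]
    d_dn = np.roll(utd, -2) - np.roll(utd, -1)   # utd[i+2] - utd[i+1]
    s = np.sign(a_raw)
    a_lim = s * np.maximum(
        0.0, np.minimum.reduce([np.abs(a_raw), s * d_dn, s * d_up]))

    # 4) Apply the corrected antidiffusive fluxes conservatively.
    return utd - (a_lim - np.roll(a_lim, 1))

# Advect a square pulse around a periodic grid: the profile stays
# monotone (no spurious ripples) yet is far less smeared than upwind.
x = np.arange(100)
u = np.where((x > 40) & (x < 60), 1.0, 0.0)
for _ in range(50):
    u = fct_advect(u, 0.5)
print(u.min(), u.max())
```

Dropping step 3 (using the raw fluxes) reproduces the oscillatory second-order scheme; dropping steps 2-4 reproduces the smeared first-order scheme, which is exactly the trade-off the text describes.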
The governing equations are similar to those in Section 12.3.1 , namely, the two-dimensional Euler equations,

∂U/∂t + ∂F/∂x + ∂G/∂y = S,

where,

U = (ρ, ρu, ρv, ρE)^T,
F = (ρu, ρu² + p, ρuv, (ρE + p)u)^T,
G = (ρv, ρuv, ρv² + p, (ρE + p)v)^T,
S = (0, ρF_x, ρF_y, ρ(uF_x + vF_y))^T.

Here ρ is the fluid mass density, E is the specific energy, u and v are the fluid velocities in the x and y directions, F_x and F_y are body force components, and the pressure, p , is given by,

p = (γ - 1)ρ[E - (u² + v²)/2],

where γ is the constant adiabatic index. The motion of the fluid is tracked by introducing massless marker particles and allowing them to be advected with the flow. Thus, the number density of the marker particles, n_m , satisfies,

∂n_m/∂t + ∂(n_m u)/∂x + ∂(n_m v)/∂y = 0.
The equations are solved on a rectilinear two-dimensional grid. Second-order accuracy in time is maintained by first advancing the velocities by a half time step, and then using these velocities to update all values for the full time step. The size of the time step is governed by the Courant condition.
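The Courant condition limits the time step so that no signal crosses more than a fraction of a cell per step. The sketch below is illustrative only; the function name, the safety factor, and the test values are assumptions, not taken from the original code:

```python
import numpy as np

def courant_dt(u, v, p, rho, dx, dy, gamma=1.4, cfl=0.4):
    """Largest stable time step under a CFL safety factor."""
    c = np.sqrt(gamma * p / rho)        # local sound speed
    sx = (np.abs(u) + c) / dx           # inverse x-crossing time
    sy = (np.abs(v) + c) / dy           # inverse y-crossing time
    return cfl / np.max(sx + sy)

# Each processor evaluates this on its own subdomain; a global
# min-reduction then yields the time step shared by all processors.
rho = np.ones((4, 4)); p = np.ones((4, 4))
u = np.zeros((4, 4)); v = np.zeros((4, 4))
print(courant_dt(u, v, p, rho, dx=0.1, dy=0.1))
```

The global reduction is the one place the time-step logic forces communication, which is why it is called out in the parallelization discussion below.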
The basic procedure in each time step is to first apply a five-point difference operator at each grid point to convectively transport the field values. These field values are then diffused in each of the positive and negative x and y directions. The behavior of the resulting fields in the vicinity of each grid point is then examined to determine how much diffusion to remove at that point. In regions where a field value is locally monotonic, nearly all the diffusion previously applied is removed for that field. However, in regions close to extrema, the amount of diffusion removed is less.
The code used in this study parallelizes well for a number of reasons. The discretization is static and regular, and the same operations are applied at each grid point, even though the evolution of the system is nonlinear. Thus, the problem can be statically load balanced at the start of the code by ensuring that each processor's rectangular subdomain contains the same number of grid points. In addition, the physics, and hence the algorithm, is local so the finite difference algorithm only requires communication between nearest neighbors in the hypercube topology. The extreme regularity of the FCT technique means that it can also be efficiently used to study convective transport on SIMD concurrent computers, such as the Connection Machine, as has been done by Oran, et al. [ Oran:90a ].
No major changes were introduced into the sequential code in parallelizing it for the hypercube architecture. Additional subroutines were inserted to decompose the problem domain into rectangular subdomains, and to perform interprocessor communication. Communication is necessary in applying the Courant condition to determine the size of the next time step, and in transferring field values at grid points lying along the edge of a processor's subdomain. Single rows and columns of field values were communicated as the algorithm required. Some inefficiency, due to communication latency, could have been avoided if several rows and/or columns had been communicated at the same time, but this was not done, in order to avoid wasting memory on larger communication buffers: the choice was dictated by the small amount of memory available on each nCUBE-1 processor.
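The two pieces added for the port, splitting the grid evenly into rectangular subdomains and exchanging edge values, can be sketched as follows. This is illustrative only; the "processors" are simulated with arrays, whereas on the nCUBE-1 the edge copy would be a nearest-neighbor message:

```python
from itertools import accumulate
import numpy as np

def decompose(n, p):
    """Split n grid points over p processors as evenly as possible."""
    base, extra = divmod(n, p)
    sizes = [base + (1 if i < extra else 0) for i in range(p)]
    starts = [0] + list(accumulate(sizes))[:-1]
    return list(zip(starts, sizes))      # (first index, count) pairs

print(decompose(10, 4))   # -> [(0, 3), (3, 3), (6, 2), (8, 2)]

# Each subdomain keeps a one-cell ghost frame; here a received edge
# row is copied into the north ghost row. Exchanging one row or
# column at a time, as in the original, pays more message latency but
# needs only a small communication buffer.
local = np.zeros((5, 5))              # 3x3 interior + ghost frame
neighbor_row = np.arange(5.0)         # row sent by the north neighbor
local[0, :] = neighbor_row
```

Equal subdomain sizes are what makes the static load balance of the previous paragraph work.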
As a sample problem, the onset and growth of the Kelvin-Helmholtz instability was studied. This instability arises when the interface between two fluids in shear motion is perturbed; for this problem, the body forces, F_x and F_y, are zero. In Figure 6.2 (Color Plate), we show the development of the Kelvin-Helmholtz instability at the interface of two fluids in shear motion. In these figures, the density of the massless marker particles normalized by the fluid density is plotted on a color map, with red corresponding to a density of one, grading through green and blue to white at a density of zero. Initially, all the marker particles are in the upper half of the domain, and the fluids in the lower- and upper-half domains have a relative shear velocity in the horizontal direction. A rectilinear finite difference grid was used. Vortices form along the interface and interact before being lost to numerical diffusion. By processing the output from the nCUBE-1, a videotape of the evolution of the instability was produced. This sample problem demonstrates that the FCT technique is able to track the physical instability without introducing numerical instability.
Figure 6.2: Development of the Kelvin-Helmholtz instability at the interface of two fluids in shear motion.
Table 6.1: Timing Results in Seconds for a 512-processor and a 1-processor nCUBE-1. The values n_x and n_y represent the numbers of grid points per processor in the x and y directions. The concurrent efficiency, overhead, and speedup are denoted by ε, f , and S .
The code was timed for the Kelvin-Helmholtz problem for hypercubes with dimension ranging from zero to nine. The results for the 512-processor case are presented in Table 6.1 , and show a speedup of 429 for the largest problem size considered. Subsequently, a group at Sandia National Laboratories, using a modified version of the code, attained a speedup of 1009 on a 1024-processor nCUBE-1 for a similar type of problem [ Gustafson:88a ]. The definitions of concurrent speedup, overhead, and efficiency are given in Section 3.5 .
An analytic model of the performance of the concurrent algorithm was developed and, ignoring communication latency, the concurrent overhead was found to be proportional to n^(-1/2), where n is the number of grid points per processor. This is in approximate agreement with the results plotted in Figure 6.3 , which shows the concurrent overhead for a number of different hypercube dimensions and grain sizes.
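The scaling in the analytic model follows from a surface-to-volume argument: communication grows with the subdomain perimeter (proportional to sqrt(n)) while computation grows with its area (proportional to n). The cost constants in this sketch are made-up illustrations, not measured nCUBE-1 values:

```python
import math

def overhead(n, t_comm=1.0, t_calc=0.1):
    """Overhead f for a square subdomain of n grid points."""
    perimeter_cost = 4.0 * math.sqrt(n) * t_comm   # boundary exchange
    compute_cost = n * t_calc                      # interior update
    return perimeter_cost / compute_cost           # ~ 1/sqrt(n)

# Quadrupling the grain size halves the overhead: f ~ n**-0.5.
print(overhead(400) / overhead(1600))   # -> 2.0
```

This is why the curves for different hypercube dimensions in Figure 6.3 collapse when plotted against the grain size.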
The FCT code was ported to the nCUBE-1 by David W. Walker [ Walker:88b ]. Gary Montry of Sandia National Laboratories supplied the original code, and made several helpful suggestions. A videotape of the evolution of the Kelvin-Helmholtz instability was produced by Jeff Goldsmith at the Image Processing Laboratory of the Jet Propulsion Laboratory.
Figure 6.3: Overhead, f , as a Function of n^(-1/2), Where n Is the Number of Grid Points per Processor. Results are shown for nCUBE-1 hypercubes of dimension one to nine. The overhead for the 2-processor case (open circles) lies below that for the higher dimensional hypercubes. This is because the processors only communicate in one direction in the 2-processor case, whereas for hypercubes of dimension greater than one, communication is necessary in both the x and y directions.
Following the discovery of high-temperature superconductivity, two-dimensional quantum antiferromagnetic spin systems have received enormous attention from physicists worldwide. It is generally believed that high-temperature superconductivity occurs in the copper-oxygen (CuO₂) planes shown in Figure 6.4 . Many features can be explained [ Anderson:87a ] in the Hubbard theory of the strongly coupled electrons, which at half-filling reduces to the spin-1/2 antiferromagnetic Heisenberg model:

H = J Σ_{<i,j>} S_i · S_j ,

where the S_i are quantum spin operators and the sum runs over nearest-neighbor pairs. Furthermore, the neutron scattering experiments on the parent compound, La₂CuO₄, reveal a rich magnetic structure which is also modelled by this theory.
Physics in two dimensions (as compared to three dimensions) is characterized by large fluctuations. Many analytical methods work well in three dimensions, but fail in two. For quantum systems, this means additional difficulties in finding solutions to the problem.
Figure 6.4: The Copper-Oxygen Plane, Where the Superconductivity Is Generally Believed to Occur. The arrows denote the quantum spins; the orbital wave functions shown lead to the interactions among them.
Figure 6.5: Inverse Correlation Length of La₂CuO₄ Measured in the Neutron Scattering Experiment, Denoted by Crosses, and Measured in Our Simulation, Denoted by Squares. La₂CuO₄ undergoes a structural transition; the curve is the fit shown in Figure 6.11 .
New analytical methods have been developed to understand the low-T behavior of these two-dimensional systems, and progress has been made. These methods are essentially based on a large-S expansion. Unfortunately, the extreme quantum case, S = 1/2, lies in the least reliable region of these methods. On the other hand, given sufficient computer power, Quantum Monte Carlo simulation [ Ding:90g ] can provide accurate numerical solutions of the model theory and quantitative comparison with the experiment (see Figure 6.5 ). Thus, simulations become a crucial tool in studying these problems. The work described here has made a significant contribution to the understanding of high-T_c materials, and has been well received by the science community [ Maddox:90a ].
Using the Suzuki-Trotter transformation, the two-dimensional quantum problem is converted into a system of three-dimensional classical Ising spins with complicated interactions. The partition function becomes a product of transfer matrices, one for each four-spin interaction. These four-spin squares go in the time direction on the three-dimensional lattice. This transfer matrix serves as the probability basis for a Monte Carlo simulation. The zero matrix elements are the consequence of the quantum conservation law. To avoid generating trial configurations with zero probability, thus wasting CPU time since these trials will never be accepted, one should have the conservation law built into the updating scheme. Two types of local moves may locally change the spin configurations, as shown in Figure 6.6 . A global move in the time direction flips all the spins along a time line; this update changes the magnetization. Another global move in spatial directions changes the winding numbers.
Figure 6.6: (a) A ``Time-Flip.'' The white plaquette is a non-interacting one; the eight plaquettes surrounding it are interacting ones. (b) A ``Space-Flip.'' The white plaquette is a non-interacting one lying in the spatial dimensions; the four plaquettes going in the time direction are interacting ones.
This classical spin system in three dimensions is simulated using the Metropolis Monte Carlo algorithm. Starting with a given initial configuration, we locate a closed loop C of L spins in one of the four moves. After checking that they satisfy the conservation law, we compute the probability before all L spins are flipped, which is a product of the diagonal elements of the transfer matrix, and the probability after the spins are flipped, which is a product of the off-diagonal elements of the transfer matrix along the loop C . The Metropolis procedure is to accept the flip with probability min(1, P_after/P_before).
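The acceptance step can be sketched in a few lines. This is a generic Metropolis kernel with made-up probabilities, not the Mark IIIfp code; in the real simulation the two probabilities would be the diagonal and off-diagonal transfer-matrix products along the loop:

```python
import random

def metropolis_accept(p_before, p_after, rng=random.random):
    """Accept a proposed flip with probability min(1, p_after/p_before)."""
    if p_after >= p_before:          # downhill or equal: always accept
        return True
    return rng() < p_after / p_before

random.seed(1)
# A move to a configuration 10x less probable is accepted ~10% of
# the time, preserving detailed balance.
accepted = sum(metropolis_accept(1.0, 0.1) for _ in range(10000))
print(accepted / 10000)
```

Building the conservation law into the proposal, as the text describes, matters precisely because proposals with p_after = 0 would always be rejected here and the CPU time spent generating them wasted.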
Figure 6.7: A Vectorization of Eight ``Time-Flips.'' Spins along the t -direction are packed into computer words. The two 32-bit words, S1 and S2, contain eight ``time plaquettes,'' indicated by the dashed lines.
We implemented a simple and efficient multispin coding method which facilitates vectorization and saves index calculation and memory space. This is possible because each spin only has two states, up (1) or down (0), which is represented by a single bit in a 32-bit integer. Spins along the t -direction are packed into 32-bit words, so that the boundary communication along the x or y direction can be handled more easily. All the necessary checks and updates can be handled by the bitwise logical operations OR, AND, NOT, and XOR. Note that this is a natural vectorization, since AND operations for the 32 spins are carried out in the single AND operation by the CPU. The index calculations to address these individual spins are also minimized, because one only computes the index once for the 32 spins. The same principles are applied for both local and global moves. Figure 6.7 shows the case for time-loop coding.
The fairly large three-dimensional lattices are partitioned into a ring of M processors, with the x -dimension distributed uniformly among the M processors. The local updates are easily parallelized since the connection is, at most, next-nearest neighbor (for the time-loop update). The needed spin-word arrays from a node's neighbor are copied into local storage by the shift routine in the CrOS communication system [ Fox:88a ] before doing the update. One of the global updates, the time line, can also be done in the same fashion. The communication is very efficient in the sense that a single communication shifts a whole boundary plane of spins at once, instead of the Nt spins at a time that would be required if the lattice were partitioned into a two-dimensional grid. The overhead/latency associated with the communication is thus significantly reduced.
The winding-line global update along the x -direction is difficult to do in this fashion, because it involves spins on all the M nodes. In addition, we need to compute the correlation functions, which present the same difficulty. However, since these operations are not used very often, we devised a fairly elegant way to parallelize them. A set of gather-scatter routines, based on the cread and cwrite in CrOS, was written. In gather , the subspaces on each node are gathered into complete spaces on each node, preserving the original geometric connection. Parallelism is achieved since the global operations are done on each node just as on a sequential computer, with each node only doing the part it originally covers. In scatter , the updated (changed) lattice configuration on a particular node (number zero) is scattered (distributed) back to all the nodes in the ring, exactly according to the original partition. Note that this scheme differs from the earlier decomposition scheme [ Fox:84a ] for the gravitation problem, where the memory size constraint was the main concern.
The hypercube nodes were divided into several independent rings, each ring holding an independent simulation, as shown in Figure 6.8 . At higher temperatures, a moderately sized spin system is enough, so that we can simulate several independent systems at the same time. At low temperatures, one needs larger systems, and all the nodes will then be dedicated to a single large system. This simple parallelism makes the simulation very flexible and efficient. In the simulation, we used a parallel version of the Fibonacci additive random number generator [ Ding:88d ], which has a very long period.
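An additive lagged Fibonacci generator can be sketched as follows. The lags (24, 55) are a classic published choice with period exceeding 2^55; they, the seeding recipe, and the class name are illustrative assumptions, not necessarily the parameters of the generator in [ Ding:88d ]:

```python
class LaggedFibonacci:
    """Additive lagged Fibonacci: x_n = (x_{n-24} + x_{n-55}) mod 2^32."""

    def __init__(self, seed=12345, short=24, longlag=55, m=2 ** 32):
        self.short, self.longlag, self.m = short, longlag, m
        # Fill the lag table with a simple LCG (illustrative seeding;
        # on the hypercube each node would be seeded independently).
        x, table = seed, []
        for _ in range(longlag):
            x = (1103515245 * x + 12345) % m
            table.append(x)
        self.table, self.i = table, 0

    def next(self):
        t, n = self.table, self.longlag
        # t[self.i] currently holds x_{j-55}; the other lag is 24 back.
        val = (t[(self.i - self.short) % n] + t[self.i]) % self.m
        t[self.i] = val                  # overwrite the oldest entry
        self.i = (self.i + 1) % n
        return val

rng = LaggedFibonacci()
vals = [rng.next() for _ in range(5)]
print(all(0 <= v < 2 ** 32 for v in vals))   # True
```

Giving each ring its own independently seeded stream keeps the simulations statistically uncorrelated while requiring no communication for random numbers.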
Figure 6.8: The Configuration of the Hypercube Nodes. In the example, 32 nodes are configured as four independent rings, each consisting of 8 nodes. Each ring does an independent simulation.
We have made a systematic performance analysis by running the code on different lattice sizes and different numbers of nodes. The timing results for a realistic situation (20 sweeps of update, one measurement) were measured [ Ding:90k ]. The speedup, T_1/T_M, where T_1 is the time for a spin system of the same size to run the same number of operations on one node and T_M the time on M nodes, is plotted in Figure 6.9 . One can see that the speedup is quite close to the ideal case, denoted by the dashed line. For the quantum spin system, the 32-node hypercube speeds up the computation by a factor of 26.6, which is a very good result. However, running the same spin system on 16 nodes is more efficient, because we can then run two independent systems on the 32-node hypercube with a total speedup of 29 (each a factor of 14.5). This is better described by the efficiency, defined as speedup/nodes, which is plotted in Figure 6.10 . Clearly, the efficiency of the implementation is very high, generally over 90%.
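The measured quantities follow the definitions of Section 3.5 and can be written out directly; the timings below are hypothetical numbers chosen only to reproduce the 32-node factor quoted in the text:

```python
def speedup(t_1, t_m):
    """S = T_1 / T_M: one-node time over M-node time, same problem."""
    return t_1 / t_m

def efficiency(t_1, t_m, nodes):
    """Efficiency = speedup / number of nodes."""
    return speedup(t_1, t_m) / nodes

# Hypothetical timings giving the quoted 32-node speedup of 26.6.
t_1, t_32 = 266.0, 10.0
print(speedup(t_1, t_32))           # -> 26.6
print(efficiency(t_1, t_32, 32))    # just over 83%
```

The comparison in the text follows immediately: two 16-node runs at a speedup of 14.5 each deliver an aggregate 29, beating the single 32-node run's 26.6.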
Figure 6.9: Speedup of the Parallel Algorithm for Several Lattice Sizes. The dashed line is the ideal case.
Figure 6.10: Efficiency of the Parallel Algorithm.
Comparison with other supercomputers is interesting. For this program, the speed of a single-processor CRAY X-MP is approximately that of a 2-node Mark IIIfp. This indicates that our 32-node Mark IIIfp outperforms the CRAY X-MP by about a factor of (32/2) × 90% ≈ 14! We note that our code is written in C and the vectorization is limited to the 32 bits inside the words. Rewriting the code in Fortran (Fortran compilers on the CRAY are more efficient) and fully vectorizing it, one might gain a factor of about three on the CRAY. Nevertheless, this quantum Monte Carlo code is clearly a good example in which a parallel computer easily (i.e., at the same programming level) outperforms the conventional supercomputers.
We obtained many good results which were previously unknown. Among them, the correlation functions are perhaps the most important. First, the results can be directly compared with experiments, thus providing new understanding of the magnetic structure of the high-temperature superconducting materials. Second, and no less important, the behavior of the correlation function we obtained gives a crucial test of various approximate methods.
In the large spin- S (classical) system, the correlation length grows exponentially with the inverse temperature at low temperatures. This predicts too large a correlation length compared with experimental results. As S approaches the extreme quantum case, the quantum fluctuations in the system become significant. Several approximate methods [ Chakravarty:88a ], [ Auerbach:88a ] predict a similar low- T behavior, differing in a power-law prefactor with p=0 or 1 and in a quantum renormalization constant.
Our extensive quantum Monte Carlo simulations [ Ding:90g ] were performed on large spin-1/2 systems over a wide low-temperature range. The correlation length, as a function of inverse temperature, is plotted in Figure 6.11 . The data points fall onto a straight line surprisingly well throughout the whole temperature range, leading naturally to the pure exponential form of Equation 6.4, in which the logarithm of the correlation length (in units of the lattice constant a ) is proportional to J/T . This provides crucial support to the above-mentioned theories.
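The fitting procedure behind such a plot is a straight-line least-squares fit of ln ξ against J/T. The data points below are synthetic, generated from assumed constants A and B purely to illustrate the method; they are not the Monte Carlo measurements:

```python
import numpy as np

# Synthetic "measurements" from an assumed pure exponential,
# xi/a = A * exp(B * J/T); A_true and B_true are made-up values.
j_over_t = np.linspace(1.0, 4.0, 7)
A_true, B_true = 0.3, 1.25
xi = A_true * np.exp(B_true * j_over_t)

# A linear fit of ln(xi) vs J/T recovers the constants; with real
# data, the quality of this straight line over the whole temperature
# range is what singles out the pure exponential form.
B_fit, lnA_fit = np.polyfit(j_over_t, np.log(xi), 1)
print(round(float(B_fit), 6), round(float(np.exp(lnA_fit)), 6))
```

A power-law prefactor, if present, would show up as systematic curvature in this plot rather than a straight line.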
Figure 6.11: Correlation Length Measured at Various Temperatures. The straight line is the fit.
Direct comparison with experiments will not only test the validity of the Heisenberg model, but also determine the important parameter, the exchange coupling J . Taking the lattice constant to be the spacing between Cu atoms in the plane, the Monte Carlo data are compared with those from neutron scattering experiments [ Endoh:88a ] in Figure 6.5 . The agreement is very good. This provides strong evidence that the essential magnetic behavior is captured by the Heisenberg model. The quantum Monte Carlo result is an accurate first-principles calculation; no adjustable parameter is involved. In comparing directly with the experiment, the only adjustable parameter is J . This gives an independent determination of the effective exchange coupling.
Note that near the phase transition, the experimentally measured correlation length is systematically smaller than the theoretical curve of Equation 6.4. This is a combined result of small effects: frustration, anisotropies, interlayer coupling, and so on.
Various moments of the Raman spectrum have been calculated using series expansions and compared with experiments [ Singh:89a ]. This gives an estimate of J quite close to the value determined above from correlation functions. Raman scattering probes the short-wavelength region, whereas neutron scattering measures the long-range correlations. The agreement of the J 's obtained from these two rather different experiments is another significant indication that the magnetic interactions are dominated by the Heisenberg model.
Equation 6.4 is valid for all quantum AFM spins. The classic two-dimensional antiferromagnetic system discovered twenty years ago [ Birgeneau:71a ] is a spin-one system. Very recently, Birgeneau [ Birgeneau:90a ] fitted the measured correlation lengths to the same exponential form with a weak correction factor. The fit is very good, as shown in Figure 6.12 . The correction factor comes from integration of the two-loop β-function without taking the T → 0 limit, and could be neglected if T is very close to 0. For the spin-1/2 AFM studied in [ Higgins:88a ], Equation 6.4 also describes the data quite well.
Figure 6.12: Correlation Length Measured in the Neutron Scattering Experiment, with the Fit.
A common feature of Figures 6.11 and 6.12 is that the scaling relation, Equation 6.4, which is derived near T = 0, is valid for a wide range of T . This differs drastically from the range of criticality in three-dimensional systems, where the width is usually about 0.2 or less. It is a consequence of the crossover temperature [ Chakravarty:88a ], at which the Josephson length scale becomes comparable with the thermal wavelength, being relatively high. This property is a general character of the low critical dimensions. In the quantum XY model, a Kosterlitz-Thouless transition occurs [ Ding:90b ] and the critical behavior remains valid well above the transition temperature.
As emphasized by Birgeneau, the spin-wave value
S=1 , , fits the experiment quite well, whereas for , the spin-wave value differs significantly from the correct value 1.25 in Equation 6.4 . This indicates that the large quantum fluctuations in the spin- system are not adequately accounted for by spin-wave theory, whereas for the spin-one system they are.
Figure 6.13 shows the energy density at various temperatures. At higher T , the high-temperature series expansion accurately reproduces our data. At low T , E approaches a finite ground-state energy. Another useful thermodynamic quantity is the uniform susceptibility, shown in Figure 6.14 . Again, at high T , the series expansion coincides with our data. The maximum occurs at with . This is useful in determining J and for the material.
Figure 6.13:
Energy Measured as a Function of Temperature. Squares are from
our work. The curve is the 10th order high-T expansion.
Figure:
Uniform Susceptibility Measured as a Function of Temperature.
Symbols are similar to Figure
6.13
.
Guy Robinson
In conclusion, the quantum AFM Heisenberg spins are now well understood theoretically. The data from neutron scattering experiments for both , and S=1 , compare quite well with theory. For , this leads to a direct determination .
Quantum spin simulations are well suited to the hypercube computer. The spatial decomposition is straightforward; the short-range nature of the interactions (apart from the occasional long-range one) makes the extension to large numbers of processors simple. The hypercube connections make use of the node computers efficient and flexible. High speedup can be achieved with reasonable ease, provided one tunes the algorithm to minimize communication.
The work described here is the result of the collaboration between H. Q. Ding and M. S. Makivic.
In this section, we discuss two further important developments that build on the previous section (Section 6.3 ) on isotropic Heisenberg quantum spins. These extensions are important in treating the observed phase transitions in two-dimensional magnetic systems. Theoretically, two-dimensional isotropic Heisenberg quantum spins remain in a paramagnetic state at all temperatures [ Mermin:66a ]. However, all crystals found in nature with strong two-dimensional magnetic character go through phase transitions into ordered states [ Birgeneau:71a ], [ DeJongh:74a ]. These include the recently discovered high- materials, and , despite the presence of large quantum fluctuations in the spin- antiferromagnets.
We consider the cases where the magnetic spins interact through
In the case , the system goes through an Ising-like antiferromagnetic transition, very similar to those that occur in the high- materials. In the case h = -J , that is, the XY model, the system exhibits a Kosterlitz-Thouless type of transition. In both cases, our simulation provides convincing and complete results for the first time.
Through the Matsubara-Matsuda transformation between spin-1/2 operators and bosonic creation/destruction operators and , a general quantum system can be mapped into a quantum spin system. Therefore, the phase transitions described here apply to general two-dimensional quantum systems. These results have broad implications for two-dimensional physical systems in particular, and for statistical systems in general.
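For reference, since the operator symbols were lost in this text, the Matsubara-Matsuda (hard-core boson) mapping in its standard form reads

```latex
S_i^{+} = b_i^{\dagger}, \qquad
S_i^{-} = b_i, \qquad
S_i^{z} = b_i^{\dagger} b_i - \tfrac{1}{2},
```

with the hard-core constraint $(b_i^{\dagger})^2 = 0$, so that site occupation 0 or 1 corresponds to spin down or up. Under this mapping the XY spin coupling becomes boson hopping, which is why XY-model results translate into statements about two-dimensional lattice quantum fluids.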
The popular explanation for the antiferromagnetic ordering transitions in these high- materials emphasizes the very small coupling, , between the two-dimensional layers, , estimated to be about . However, all these systems exhibit some kind of in-plane anisotropy, which is of order . An interesting case is the spin-one crystal, , discovered twenty years ago [ Birgeneau:71a ]. The magnetic behavior of exhibits very strong two-dimensional character with an exchange coupling . It has a Néel ordering transition at , induced by an Ising-like anisotropy, .
Our simulation provides clear evidence supporting the picture that in-plane anisotropy is also quite important in bringing about the observed antiferromagnetic transition in the most interesting, spin- , case. Adding an anisotropy energy as small as induces an ordering transition at . This striking effect and related results agree well with a wide class of experiments, and provide some insight into these types of materials.
In the antiferromagnetic spin system, superexchange leads to the dominant isotropic coupling. One of the higher-order effects, due to the crystal field, is written as , which is a constant for these spin- high- materials. Another second-order effect is the spin-orbit coupling. This effect picks out a preferred direction and leads to an term, which also arises from the lattice distortion in . More complicated terms, like the antisymmetric exchange, can also be generated. For simplicity and clarity, we focus on the antiferromagnetic Heisenberg model with an Ising-like anisotropy, as in Equation 6.12 . The anisotropy parameter h is related to the usual reduced anisotropy energy through . In the past, the anisotropy field model, , has also been considered. However, its origin is less clear and, furthermore, it explicitly breaks the Ising symmetry.
For the large-anisotropy system, h=1 , the specific heat is shown for several spin systems in Figure 6.15 (a). The peak becomes sharper and higher as the system size increases, indicating a divergent peak in the infinite system, as in the two-dimensional Ising model. Defining the transition temperature by the peak of for the finite system, finite-size scaling theory [ Landau:76a ] predicts that is related to through the scaling law
Setting , the Ising exponent, gives a good fit with , as shown in Figure 6.15 (b). A different scaling relation with the same exponent for the correlation length,
is also satisfied quite well, resulting in . The staggered magnetization drops near , although the behavior is rounded off on these finite-size systems. All the evidence clearly indicates that an Ising-like antiferromagnetic transition occurs at , with a divergent specific heat. In the smaller anisotropy case, , similar behavior is found. The scaling for the correlation length is shown in Figure 6.16 , indicating a transition at . However, the specific heat remains finite at all temperatures.
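The scaling laws referred to above are, in standard finite-size-scaling notation (the text's own symbols did not survive conversion),

```latex
T_c(L) - T_c(\infty) \;\propto\; L^{-1/\nu}, \qquad
\xi(T) \;\sim\; |T - T_c|^{-\nu},
```

with $\nu = 1$ for the two-dimensional Ising universality class, the value used in the fit.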
Figure 6.15:
(a) The Specific Heat for Different Size Systems of
h=1
. (b)
Finite Size Scaling for
.
Figure 6.16:
The Inverse Correlation Lengths for
System
(
),
System (
), and
h=0
System
(
) for the Purpose of Comparison. The straight lines are the
scaling relation:
. From it we can pin down
.
The most interesting case is (or , very close to the values in [ Birgeneau:71a ]). Figure 6.17 shows the staggered correlation function at compared with that of the isotropic model [ Ding:90g ]. The measured inverse correlation lengths, together with those for the isotropic model ( h=0 ), are shown in Figure 6.16 . Below , the Ising behavior of a straight line becomes clear. Clearly, the system becomes antiferromagnetically ordered around . The best estimate is
Figure 6.17:
The Correlation Function on the
System at
for
system. It decays with correlation length
. Also shown is the isotropic case
h=0
, which has
.
It may seem a little surprising that a very small anisotropy can lead to a substantial . This may be explained by the following argument. At low T , the spins are highly correlated even in the isotropic case. Since no direction is preferred, the correlated spins fluctuate in all directions, resulting in zero net magnetization. Adding a very small anisotropy introduces a preferred direction, so that the already highly correlated spins fluctuate around this direction, leading to a global magnetization.
More quantitatively, the crossover from isotropic Heisenberg behavior to Ising behavior occurs at , where the correlation length is of the order of some power of the inverse anisotropy. From the scaling arguments [ Riedel:69a ], , where is the crossover exponent. In the two-dimensional model, both and are infinite, but their ratio is approximately 1/2. For , this relation indicates that the Ising behavior is valid for , which is clearly observed in Figure 6.16 . A similar crossover around for is also observed in Figure 6.16 . At low T , for the isotropic quantum model, the correlation length behaves as [ Ding:90g ] , where . Therefore, we expect
where is a spin- S dependent constant of order one. Therefore, even a very small anisotropy will induce a phase transition at a substantial temperature ( ). This crude picture, suggested a long time ago to explain the observed phase transitions, is now confirmed for the first time by extensive quantum Monte Carlo simulation. Note that this problem is an extreme case, both because it is an antiferromagnet (harder to order than a ferromagnet) and because it has the largest quantum fluctuations (spin- ). Since varies slowly with h , we can estimate at :
This simple result correctly predicts for a wide class of crystals found in nature, assuming the same level of anisotropy, that is, . The high- superconductor exhibits a Néel transition at . With , our results give quite a close estimate: . Similarly close predictions hold for other systems, such as the superconductor and the insulator . For the high- material , [ Ding:90g ]. This material undergoes a Néel transition at . Our prediction of is in the same range as , and much better than the naive expectation that . In this crystal there is some degree of frustration (see below), so the actual transition is pushed down. These examples clearly indicate that in-plane anisotropy could be quite important in bringing these high- materials to Néel order. For the S=1 system, , our results predict , quite close to the observed .
These results have direct consequences for the critical exponents. The onset of the transition is entirely due to the Ising-like anisotropy. Once the system becomes Néel-ordered, different layers in the three-dimensional crystal order at the same time. Spin fluctuations in different layers are incoherent, so critical exponents such as , , and will be the two-dimensional, rather than three-dimensional, Ising exponents. and show such behavior clearly. However, the interlayer coupling, although very small (much smaller than the in-plane anisotropy), could induce coherent correlations between the layers, so that the critical exponents will lie somewhere between the two- and three-dimensional Ising exponents. and seem to belong to this category.
Whether the ground state of the spin- antiferromagnet has long-range Néel order is a longstanding problem [ Anderson:87a ]. The existence of Néel order has been rigorously proved for . In the most interesting case, , numerical calculations on small lattices suggested the existence of long-range order. Our simulation establishes the long-range order for .
The fact that near , the spin system is quite sensitive to tiny anisotropies could have a number of important consequences. For example, the correlation lengths measured in are systematically smaller than the theoretical prediction [ Ding:90g ] near . The weaker correlations probably indicate that frustration, due to the next-to-nearest-neighbor interaction, comes into play. This is consistent with the fact that lies below the value suggested by our results.
It is now well known that the two-dimensional (2D) classical (planar) XY model undergoes a Kosterlitz-Thouless (KT) transition [ Kosterlitz:73a ] at [ Gupta:88a ], characterized by exponentially divergent correlation length and in-plane susceptibility. The transition, due to the unbinding of vortex-antivortex pairs, is weak; the specific heat has only a finite peak above .
Does the two-dimensional quantum XY model go through a phase transition? If so, of what type? This is a longstanding problem in statistical physics. The answers are relevant to a wide class of two-dimensional problems, such as magnetic insulators, superfluidity, and melting, and possibly to the recently discovered high- superconducting transition. Physics in two dimensions is characterized by large fluctuations. In passing from the classical model to the quantum model, additional quantum fluctuations (particularly strong in the spin-1/2 case) may alter the physics significantly. A direct consequence is that the already weak KT transition could be washed out completely.
The quantum XY model was first proposed in 1956 [ Matsubara:56a ] to study lattice quantum fluids. Later, high-temperature series studies raised the possibility of a divergent susceptibility for the two-dimensional model. For the classical planar model, the remarkable theory of Kosterlitz and Thouless [ Kosterlitz:73a ] provided a clear physical picture and correctly predicted a number of important properties. However, much less is known about the quantum model; in fact, it has been controversial. Using a large-order high-temperature expansion, Rogiers, et al. [ Rogiers:79a ] suggested a second-order transition at for spin-1/2. Later, real-space renormalization-group analysis was applied to the model, with contradictory and inconclusive results. DeRaedt, et al. [ DeRaedt:84a ] then presented an exact solution and a Monte Carlo simulation, both based on the Suzuki-Trotter transformation with small Trotter number m . Their results, both analytical and numerical, supported an Ising-like (second-order) transition at the Ising point , with a logarithmically divergent specific heat. Loh, et al. [ Loh:85a ] simulated the system with an improved technique. They found that the specific heat peak remains finite, and argued that a phase transition occurs at -0.5 by measuring the change of the ``twist energy'' from the lattice to the lattice. The dispute between DeRaedt, et al., and Loh, et al., centered on the importance of using a large Trotter number m and of the global updates in small systems, which move the system from one subspace to another. Recent attempts to solve this problem have only added fuel to the controversy.
The key to pinning down the existence and type of transition is a study of the correlation length and in-plane susceptibility, because their divergences constitute the most direct evidence of a phase transition. These quantities are much more difficult to measure, and large lattices are required to avoid finite-size effects. These key points were lacking in previous work, and are the focus of our study. By extensive use of the Mark IIIfp Hypercube, we are able to measure spin correlations and thermodynamic quantities accurately on very large lattices ( ). Our work [Ding:90h;92a] provides convincing evidence that a phase transition does occur at a finite temperature in the extreme quantum case, spin- . At the transition point, , the correlation length and susceptibility diverge exactly according to the Kosterlitz-Thouless form (Equation 6.18 ).
We plot the correlation length, , and the susceptibility, , in Figures 6.18 and 6.19 . They show a tendency toward divergence at some finite . Indeed, we fit them to the form predicted by Kosterlitz and Thouless for the classical model
The fit is indeed very good ( per degree of freedom is 0.81), as shown in Figure 6.18 . The fit for correlation length gives
A similar fit for the susceptibility is also very good ( ):
as shown in Figure 6.19 . The good quality of both fits and the closeness of 's obtained are the main results of this work. The fact that these fits also reproduce the expected scaling behavior with
is a further consistency check. These results strongly indicate that the spin-1/2 XY model undergoes a Kosterlitz-Thouless phase transition at . We note that this is consistent with the trend of the ``twist energy'' [ Loh:85a ], and that the rapid increase of vortex density near is due to the unbinding of vortex pairs. Figures 6.18 and 6.19 also indicate that the critical region is quite wide ( ), very similar to the spin-1/2 Heisenberg model, where the behavior holds up to . These two-dimensional phenomena are in sharp contrast to the usual second-order transitions in three dimensions.
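The Kosterlitz-Thouless forms used in these fits are, in standard notation (our reconstruction of the stripped equations),

```latex
\xi(T) = A \, \exp\!\left( \frac{b}{\sqrt{\,T/T_{KT} - 1\,}} \right), \qquad
\chi(T) \;\sim\; \xi(T)^{\,2-\eta}, \quad \eta = \tfrac{1}{4},
```

so a plot of $(\ln \xi)^{-2}$ against $T$ should be a straight line crossing zero at $T_{KT}$, which is precisely the kind of consistency check shown in Figure 6.18(b).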
Figure 6.18:
Correlation Length and Fit. (a)
versus
T
. The vertical
line indicates
diverges at
; (b)
versus
. The straight line indicates
.
Figure:
(a) This figure repeats the plot of Figure
6.18
(a)
showing on a coarser scale both the high temperature expansion (HTE) and the
Kosterlitz-Thouless fit (KT). (b) Susceptibility
and Fit
The algebraic exponent is consistent with the Ornstein-Zernike exponent at higher T . As , shifts down slightly and shows signs of approaching 1/4, the value at for the classical model. This is consistent with Equation 6.21 .
Figure 6.20:
Specific Heat
. For
, the lattice size is
.
We measured the energy and specific heat (for we used a lattice). The specific heat is shown in Figure 6.20 . We find that has a peak above , at around . The peak clearly shifts away from on the much smaller lattice. DeRaedt, et al. [ DeRaedt:84a ] suggested a logarithmically divergent specific heat from their simulation, which is probably an artifact of their small m values. One striking feature in Figure 6.20 is the very steep increase of at . The shape of the curve is asymmetric near the peak. These features differ from those of the classical XY model [ Gupta:88a ].
Quantum fluctuations are capable of pushing the transition point down from in the classical model to in the quantum spin-1/2 case, although they are not strong enough to push it down to zero. They also reduce the constant from 1.67 in the classical case to 1.18 in the spin-1/2 case.
The critical behavior in the quantum case is of the KT type, as in the classical case. This is a little surprising, considering the differences in the spin space. In the classical case, the spins are confined to the X - Y plane (thus the model is conventionally called a ``planar rotator'' model). This is important for the topological order in KT theory. The quantum spins are not restricted to the X - Y plane, because of the commutation relations. The KT behavior found in the quantum case indicates that the extra dimension in spin space (which does not appear in the Hamiltonian) is actually unimportant: the out-of-plane correlations are very weak and short-ranged, and the out-of-plane susceptibility remains small over the whole temperature range.
These results for the XY model, together with those for the quantum Heisenberg model, strongly suggest that although quantum fluctuations at finite T can change the quantitative behavior of these nonfrustrated spin systems with continuous symmetries, the qualitative picture of the classical system persists. This can be understood from universality arguments: near the critical point, the dominant behavior of the system is determined by long-wavelength fluctuations, which are characterized by symmetries and dimensionality. The quantum effects change only the short-range fluctuations which, after being integrated out, enter only as renormalizations of the physical parameters, such as .
Our data also show that, for the XY model, the critical exponents are spin- S independent, in agreement with universality. More specifically, in Equation 6.18 could, in principle, differ from its classical value 1/2. Our data are sufficient to detect any systematic deviation from this value. For this purpose, we plotted versus in Figure 6.18 (b). As expected, the data points all fall well on a straight line (except the point at , where the critical region presumably ends). A systematic deviation from would have produced a slightly curved line instead. In addition, the exponent at seems to be consistent with the value for the classical system.
Our simulations reveal a rich structure, as shown in the phase diagram (Figure 6.21 ) for these quantum spins. The antiferromagnetic ordered region and the topological ordered region are especially relevant to the high- materials.
Figure:
Phase Diagram for the Spin-
Quantum System Shown in
Equation
6.12
. The solid points are from quantum Monte
Carlo simulations. For large
, the system is practically
an Ising system. Near
h=0
or
h=-2
, the logarithmic relation,
Equation
6.16
holds.
Finally, we point out the connection between the quantum XY system and general two-dimensional quantum systems with continuous symmetry. Through the above-mentioned Matsubara-Matsuda transformation, our result implies the existence of Kosterlitz-Thouless condensation in two-dimensional quantum systems. The symmetry in the XY model now becomes , a continuous phase symmetry. This quantum KT condensation may have important implications for the mechanism of the recently discovered high-temperature superconducting transitions.
Vision (both biological and computer-based) is a complex process that can be characterized by multiple stages, in which the original iconic information is progressively distilled and refined. The first researchers to approach the problem underestimated the difficulty of the task: after all, it does not require much effort for a human to open the eyes, form a model of the environment, recognize objects, move, and so on. But in recent years a scientific basis has been given to the first stages of the process ( low- and intermediate-level vision ), and a large set of special-purpose algorithms is available for high-level vision.
It is already possible to execute low-level operations (such as filtering, edge detection, and intensity normalization) in real time (30 frames/sec) using special-purpose digital hardware such as digital signal processors. In contrast, higher level visual tasks tend to be specialized to particular applications, and require general-purpose hardware and software facilities.
Parallelism and multiresolution processing are two effective strategies for reducing the computational requirements of higher visual tasks (see, for example, [Battiti:91a;91b], [ Furmanski:88c ], [ Marr:76a ]). We describe a general software environment for implementing medium-level computer vision on large-grain-size MIMD computers. The purpose has been to implement a multiresolution strategy based on iconic data structures (two-dimensional arrays that can be indexed by pixel coordinates) distributed to the computing nodes using domain decomposition .
In particular, the environment has been applied successfully to the visible-surface reconstruction and discontinuity detection problems. Initial constraints are transformed into a robust and explicit representation of the space around the viewer. In the shape-from-shading problem, the constraints are on the orientation of surface patches, while in the shape-from-motion problem (for example), the constraints are on depth values.
We will describe a way to compute the motion ( optical flow ) from the intensity arrays of images taken at different times in Section 6.7 .
Discontinuities are necessary both to avoid mixing constraints pertaining to different physical objects during the reconstruction, and to provide a primitive perceptual organization of the visual input into different elements related to the human notion of objects.
The purpose of early vision is to undo the image formation process, recovering the properties of visible three-dimensional surfaces from the two-dimensional array of image intensities.
Computationally, this amounts to solving a very large system of equations. In general, the solution is not unique or does not exist (and therefore, one must settle for a suitable approximation).
The class of admissible solutions can be restricted by introducing a priori knowledge: the desired ``typical'' properties are enforced, transforming the inversion problem into the minimization of a functional . This is known as the regularization method [ Poggio:85a ]. Applying the calculus of variations, the stationary points are found by solving the Euler-Lagrange partial differential equations.
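Schematically, regularization replaces the ill-posed inversion by a well-posed minimization of the generic form (standard notation from the regularization literature, not the chapter's lost equations):

```latex
E(z) \;=\; \| A z - d \|^{2} \;+\; \lambda \, \| P z \|^{2},
```

where $d$ are the measured data, $A$ the imaging operator, $P$ a stabilizing (smoothness) operator, and $\lambda$ trades off fidelity to the data against the a priori smoothness assumption.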
In standard methods for solving PDEs, the problem is first discretized on a finite-dimensional approximation space. The very large algebraic system obtained is then solved using, for example, ``relaxation'' algorithms which are local and iterative. The local structure is essential for the efficient use of parallel computation.
Because of the local nature of the relaxation process, solution errors on the scale of the grid step are corrected in a few iterations; larger scale errors, however, are corrected very slowly. Intuitively, to correct them, information must be spread over a large scale by the ``sluggish'' neighbor-to-neighbor influence. If we want a larger spread of influence per iteration, we need large-scale connections for the processing units; that is, we need to solve a simplified problem on a coarser grid.
The pyramidal structure of the multigrid solution grids is illustrated in Figure 6.22 .
Figure 6.22:
Pyramidal Structure for Multigrid Algorithms and General Flow of
Control
This simple idea and its realization in the multigrid algorithm not only lead to asymptotically optimal solution times (i.e., convergence in operations), but also dramatically decrease solution times for a variety of practical problems, as shown in [ Brandt:77a ].
The multigrid ``recipe'' is very simple. First use relaxation to obtain an approximation with smooth error on a fine grid. Then, given the smoothness of the error, calculate corrections to this approximation on a coarser grid; to do this, first relax, then correct recursively on still coarser grids. Optionally, one can also use nested iteration (coarser grids provide a good starting point for finer grids) to speed up the initial part of the computation.
Historically, these ideas were developed starting from the 1960s by Bakhvalov, Fedorenko, and others (see Stüben, et al. [ Stuben:82a ]). The sequential multigrid algorithm has been used for solving PDEs associated with different early vision problems in [ Terzopoulos:86a ].
It is shown in [ Brandt:77a ] that, with a few modifications to the basic algorithm, the actual solution (not just the error) can be stored at each layer (the full approximation storage algorithm ). This method is particularly useful for visual reconstruction, where we are interested not only in the finest-scale result, but also in the multiscale representation developed as a byproduct of the solution process.
Line processes (LPs) [ Marroquin:84a ] are binary variables arranged in a two-dimensional array. An active line process ( ) between two neighboring pixels indicates that there is a physical discontinuity between them. Activation is, therefore, based on a measure of the difference in pixel properties, but must also take into account the presence of other LPs. The idea is that continuous, nonintersecting chains of LPs are preferred to discontinuous and intersecting ones, as shown in Figure 6.23 .
Figure:
The Multiscale Interaction Favors the Formation of Continuous
Chains of Line Processes. The figure on the left sketches the multiscale
interaction of LPs that, together with the local interaction at the
same scale, favors the formation of continuous chains of Line
Processes (LP caused by ``noise'' are filtered out at the coarse
scales, the LPs caused by real discontinuities remain and act on the
finer scales, see Figure
6.24
). On the right,
we show a favored (top) and a penalized (bottom) configuration. On
the left, we see the coarsest scale, with increasing resolution in the two lower outlines of the hand.
We propose to combine the surface reconstruction and discontinuity detection phases in time and scale space . To do this, we introduce line processes at different scales, ``connecting'' them to neighboring depth processes (henceforth DPs) at the same scale and to neighboring LPs at the finer and coarser scales. The reconstruction assigns equal priority to the two process types.
This scheme not only greatly improves convergence speed (the typical multigrid effect) but also produces a more consistent reconstruction of the piecewise smooth surface at the different scales.
Creation of discontinuities must be favored either by the presence of a ``large'' difference in the z values of the nearby DPs, or by the presence of a partial discontinuity structure that can be improved.
To measure the two effects in a quantitative way, it is useful to introduce two functions: cost and benefit . The benefit function for a vertical LP is , and analogously for a horizontal one. The idea is that the activation of one LP is beneficial when this quantity is large.
Cost is a function of the neighborhood configuration. A given LP updates its value in a manner depending on the values of nearby LPs. These LPs constitute the neighborhood, and we will refer to its members as the LPs connected to the original one. The neighborhood is shown in Figure 6.24 .
Figure 6.24:
``Connections'' Between Neighboring Line Processes, at the Same
Scale and Between Different Scales
The updating rule for the LPs derived from the above requirements is:
Because Cost is a function of a limited number of binary variables, we used a look-up table to increase simulation speed and to provide a convenient way of simulating different heuristic proposals.
A specific parametrization for the values in the table is suggested in [ Battiti:90a ].
The multigrid algorithm described in the previous section can be executed in many different ways on a parallel computer. One essential distinction concerns the number of processors available and the ``size'' of a single processor.
The drawback of implementations on fine-grain-size SIMD computers (where we assign one processor to each grid point) is that when the iteration is on a coarse scale, all the nodes in the other scales (i.e., the majority of nodes) are idle, and the efficiency of the computation is seriously compromised.
Furthermore, if the implementation is on a hypercube parallel computer and the mapping is such that all the communication paths in the pyramid are mapped onto hypercube paths of length bounded by two [ Chan:86b ], a fraction of the nodes is never used, because the total number of grid points is not a power of two. This fraction is one third for the two-dimensional problems encountered in vision.
Fortunately, if we use a MIMD computer with powerful processors, sufficient distributed memory, and two-dimensional internode connections (the hypercube contains a two-dimensional mesh), the above problems do not exist.
In this case, a two-dimensional domain decomposition can be used efficiently: a slice of the image, with its associated pyramidal structure, is assigned to each processor. All nodes work all the time, switching between different levels of the pyramid as illustrated in Figure 6.25 .
Figure 6.25:
Domain Decomposition for Multigrid Computation. Processor
communication is on a two-dimensional grid; each processor operates at all
levels of the pyramid.
No modification of the sequential algorithm is needed for points in the interior of the assigned domain. Conversely, points on the boundary need values of points assigned to nearby processors. For this purpose, the assigned domain is extended by a strip of points belonging to nearby processors, and a communication step before each iteration on a given layer updates this strip so that it contains the correct (most recent) values. Two exchanges are sufficient.
The recursive multiscale call mg(lay) is based on an alternation of relaxation steps and discontinuity-detection steps, as follows (the software is written in C):
int mg(lay)
int lay;
{
    int i;
    if (lay == coarsest) step(lay);          /* direct solve on coarsest layer  */
    else {
        i = na; while (i--) step(lay);       /* na pre-relaxation steps         */
        i = nb;
        if (i != 0) {
            up(lay);                         /* restrict to the coarser layer   */
            while (i--) mg(lay - 1);         /* nb recursive coarse corrections */
            down(lay - 1);                   /* interpolate correction back     */
        }
        i = nc; while (i--) step(lay);       /* nc post-relaxation steps        */
    }
}

int step(lay)
int lay;
{
    exchange_border_strip(lay);              /* refresh ghost strip             */
    update_line_processes(lay);
    relax_depth_processes(lay);
}
Each step is preceded by an exchange of data on the border of the assigned domains.
Because the communication overhead is proportional to the linear dimension of the assigned image portion, the efficiency is high as soon as the number of pixels in this portion is large. Detailed results are in [ Battiti:91a ].
Guy Robinson
An iterative scheme for solving the shape from shading problem has been proposed in [ Horn:85a ]. A preliminary phase recovers information about orientation of the planes tangent to the surface at each point by minimizing a functional containing the image irradiance equation and an integrability constraint , as follows:
\[
\iint \left[ \big(I - R(p,q)\big)^2 + \lambda \,(p_y - q_x)^2 \right] dx\,dy ,
\]
where $p = \partial z/\partial x$, $q = \partial z/\partial y$, $I$ = measured intensity, and $R$ = theoretical reflectance function.
After the tangent planes are available, the surface $z$ is reconstructed by minimizing the following functional:
\[
\iint \left[ (z_x - p)^2 + (z_y - q)^2 \right] dx\,dy .
\]
Euler-Lagrange differential equations and discretization are left as an exercise to the reader.
Figure 6.26 shows the reconstruction of the shape of a hemispherical surface starting from a ray-traced image. On one side is the result of standard relaxation after 100 sweeps; on the other, the ``minimal multigrid'' result (with a computation time equivalent to three to four sweeps at the finest resolution).
Figure 6.26:
Reconstruction of Shape From Shading: Standard Relaxation
(top right) Versus Multigrid (bottom right). The original image is
shown on left.
This case is particularly hard for a standard relaxation approach. The image can be interpreted ``legally'' in two possible ways, as either a concave or a convex hemisphere. Starting from random initial values, after some relaxations, some image patches typically ``vote'' for one or the other interpretation and try to extend the local interpretation to a global one. This is slow (given the local nature of the updating rule) and leads to an endless struggle in the regions that mark the border between different interpretations. The multigrid approach resolves this ``democratic impasse'' on the coarsest grids (much faster, because information spreads over large distances) and propagates the decision to the finer grids, which then concentrate their efforts on refining the initial approximation.
In Figure 6.27 , we show the reconstruction of the Mona Lisa face painted by Leonardo da Vinci.
Figure 6.27:
Mona Lisa in Three Dimensions. The right figure shows the
multigrid reconstruction.
For the surface reconstruction problem (see [ Terzopoulos:86a ]), the energy functional combines a smoothness term between neighboring surface values (switched off across active discontinuities), a spring term pulling the surface toward the depth data, and a penalty for each activated discontinuity:
\[
E(z, l) = \sum_{\langle i,j \rangle} (z_i - z_j)^2 \,(1 - l_{ij})
        + c \sum_i (z_i - d_i)^2
        + \alpha \sum_{\langle i,j \rangle} l_{ij} ,
\]
where $z_i$ are the surface values, $d_i$ the depth constraints, and $l_{ij}$ the binary line processes.
A physical analogy is that of fitting the depth constraints with a membrane pulled by springs connected to them. The effect of active discontinuities is that of ``cutting the membrane'' in the proper places.
Figure 6.28:
Simulation Environment for Multigrid Surface Reconstruction
from a Noisy Image. The top screen shows an intermediate, and the
bottom final results. For each screen, the upper part displays the activated
discontinuities; the lower part, the gray-encoded
z
values of the
surface.
Figure 6.29:
The Original Surface (top) and Surface Corrupted by 25%
Noise (bottom)
Figure 6.30:
The Reconstruction of a ``Randomville'' Scenery using
Multigrid Method. Each figure shows a different resolution.
Figures 6.28 through 6.30 show the simulation environment on the SUN workstation and the reconstruction of a ``Randomville'' image (random quadrangular blocks placed in the image plane). The original surface and the surface corrupted by 25% noise are shown in Figure 6.29, while the reconstruction on different scales is shown in Figure 6.30.
For such images with 25% noise, a faithful reconstruction of the surface (within a few percent of the original) is obtained after a single multiscale sweep (with V cycles) on four layers. The total computational time corresponds approximately to the time required by three relaxations on the finest grid. Because of the optimality of multiscale methods, the time increases linearly with the number of image pixels.
The parallel simulation environment was written by Roberto Battiti [ Battiti:90a ]. Geoffrey Fox, Christof Koch, and Wojtek Furmanski contributed many ideas and suggestions [ Furmanski:88c ].
A JPL group [ Synnott:90a ] also used the Mark III hypercube to find three-dimensional properties of planetary objects from the two-dimensional images returned by NASA's planetary missions and by the Hubble Space Telescope. The hypercube was used in a simple parallel mode, with each node assigned the calculations for a subset of the image pixels and no internodal communication required. The estimation uses an iterative linear least-squares approach in which the data are the pixel brightness values in the images, and partial derivatives of theoretical models of these brightness values are computed for use in a square-root formulation of the normal equations. The underlying three-dimensional model of the object consists of a triaxial ellipsoid overlaid with a spherical harmonic expansion to describe low- to mid-spatial-frequency topographic or surface composition variations. The initial results were not followed through into production use for JPL missions, but this is likely to become an important application of parallel computing to image processing from planetary missions.
Much of the current interest in neural networks can be traced to the introduction a few years ago of effective learning algorithms for these systems ([ Denker:86a ], [ Parker:82a ], [ Rumelhart:86a ]). In [ Rumelhart:86a ], Chapter 8, it was shown that for some problems, back-propagation on multilayer perceptrons (MLPs) was capable of finding a solution very reliably and quickly. Back-propagation has since been applied to a number of realistic and complex problems [ Sejnowski:87a ], [ Denker:87a ]. The work of this section is described in [ Felten:90a ].
Real-world problems are inherently structured, so methods incorporating this structure will be more effective than techniques applicable to the general case. In practice, it is very important to use whatever knowledge one has about the form of possible solutions in order to restrict the search space. For multilayer perceptrons, this translates into constraining the weights or modifying the learning algorithm so as to embody the topology, geometry, and symmetries of the problem.
Here, we are interested in determining how automatic learning can be improved by following the above suggestion of restricting the search space of the weights. To avoid high-level cognition requirements, we consider the problem of classifying hand-printed upper-case Roman characters. This is a specific pattern-recognition problem, and has been addressed by methods other than neural networks. Generally the recognition is separated into two tasks: the first one is a pre-processing of the image using translation, dilation, rotations, and so on, to bring it to a standardized form; in the second, this preprocessed image is compared to a set of templates and a probability is assigned to each character or each category of the classification. If all but one of the probabilities are close to zero, one has a high confidence level in the identification. This second task is the more difficult one, and the performance achieved depends on the quality of the matching algorithm. Our focus is to study how well an MLP can learn a satisfactory matching to templates, a task one believes the network should be good at.
In regard to the task of preprocessing, MLPs have been shown capable ([ Rumelhart:86a ], Chapter 8) of performing translations, at least in part, but it is simpler to implement this first step using standard methods. This combination of traditional methods and neural network matching can give us the best of both worlds. In what follows, we suggest and test a learning procedure which preserves the geometry of the two-dimensional image from one length scale to the next, and embodies the difference between coarse- and fine-scale features.
HPFA Applications
There are many architectures for neural networks; we shall work with multilayer perceptrons. These are feed-forward networks, and the network used for our problem is shown schematically in Figure 6.31. There are two processing layers: output and hidden. Each has a number of identical units (or ``neurons''), connected in a feed-forward fashion by wires, often called weights because each one is assigned a real number $w_i$. The input to any given unit is $\sum_i w_i I_i$, where $i$ labels the incoming wires and $I_i$ is the input (or current) on wire $i$. For the hidden layer, $I_i$ is the value of a bit of the input image; for the output layer, it is the output from a unit of the hidden layer.
Figure 6.31:
A Multi-Layer Perceptron
Generally, the output of a unit is a nonlinear, monotonically increasing function of the input. We make the usual choice and take
\[
g(h) = \frac{1}{1 + e^{-(h - \theta)}}
\]
to be our neuron input/output function, where $h$ is the summed input and $\theta$ is the threshold, which can be different for each neuron. The weights and thresholds are usually the only quantities which change during the learning period. We wish to have a network perform a mapping $M$ from the input space to the output space. Introducing the actual output $O(I)$ for an input $I$, one first chooses a metric for the output space and then seeks to minimize $E = \sum_I d\big(O(I), M(I)\big)$, where $d$ is a measure of the distance between the two points. This quantity is also called the error function, the energy, or (the negative of) the harmony function. Naturally, $E$ depends on the weights. One can then apply standard minimization searches like simulated annealing [ Kirkpatrick:83a ] to attempt to change the weights so as to reduce the error. The most commonly used method is gradient descent, which for MLPs is called back-propagation because the calculation of the gradients is performed in a feed-backwards fashion. Improved descent methods may be found in [ Dahl:87a ], [ Parker:87a ], and in Section 9.9 of this book.
The minimization often runs into difficulties because one is searching in a very high-dimensional space, and the minima may be narrow. In addition, the straightforward implementation of back-propagation will often fail because of the many minima in the energy landscape. This process of minimization is referred to as learning or memorization as the network tries to match the mapping M . In many problems, though, the input space is so huge that it is neither conceivable nor desirable to present all possible inputs to the network for it to memorize. Given part of the mapping M , the network is expected to guess the rest: This is called generalization. As shown clearly in [ Denker:87a ] for the case of a discrete input space, generalization is often an ill-posed problem: Many generalizations of M are possible. To achieve the kind of generalization humans want, it is necessary to tell the network about the mapping one has in mind. This is most simply done by constraining the weights to have certain symmetries as in [ Denker:87a ]. Our approach will be similar, except that the ``outside'' information will play an even more central role during the learning process.
To do character recognition using an MLP, we assume the input layer of the network to be a set of image pixels, which can take on analogue (or grey scale) values between 0 and 1. The two-dimensional set of pixels is mapped onto the set of input neurons in a fairly arbitrary way: For an $N \times N$ image, the top row of $N$ pixels is associated with the first $N$ neurons, the next row of $N$ pixels with the next $N$ neurons, and so forth. At the start of the training process, the network has no knowledge of the underlying two-dimensional structure of the problem (that is, if a pixel is on, nearby pixels in the two-dimensional space are also likely to be on). The network discovers the two-dimensional nature of the problem during the learning process.
We taught our networks the alphabet of 26 upper-case Roman characters. To encourage generalization, we show the net many different hand-drawn versions of each character. The 320-image training set is shown in Figure 6.32 . These images were hand-drawn using a mouse attached to a SUN workstation. The output is encoded in a very sparse way. There are only 26 outputs we want the net to give, so we use 26 output neurons and map the output pattern: first neuron on, rest off, to the character ``A;'' second neuron on, rest off, to ``B;'' and so on. Such an encoding scheme works well here, but is clearly unworkable for mappings with large output sets such as Chinese characters or Kanji. In such cases, one would prefer a more compact output encoding, with possibly an additional layer of hidden units to produce the more complex outputs.
Figure 6.32:
The Training Set of 320 Handwritten Characters, Digitized on a
$32\times32$ Grid
As mentioned earlier, we do not feed images directly into the network. Instead, simple, automatic preprocessing is done which dilates the image to a standard size and then translates it to the center of the pixel space. This greatly enhances the performance of the system: It means that one can draw a character in the upper left-hand corner of the pixel space and the system easily recognizes it. If we did not have the preprocessing, the network would be forced to solve the much larger problem of character recognition at all possible sizes and locations in the pixel space. Two other worthwhile preprocessors are rotations (rotate to a standard orientation) and intensity normalization (set linewidths to some standard value). We do not have these in our current implementation.
The MLP is used only for the part of the algorithm where one matches to templates. Given any fixed set of exemplars, a neural network will usually learn this set perfectly, but the performance under generalization can be very poor. In fact, the more weights there are, the faster the learning (in the sense of the number of iterations, not of CPU time), and the worse the ability to generalize. This was in part realized in [ Gullichsen:87a ]. If one has a very fine mesh at the input level, so that a great amount of detail can be seen in the image, one runs the risk of having terrible generalization properties, because the network will tend to focus upon tiny features of the image, ones which humans would consider irrelevant.
We will show one approach to overcoming this problem. We desire the potential power of the large, high-resolution net, but with the stable generalization properties of small, coarse nets. Though not so important for upper-case Roman characters, where a rather coarse grid does well enough (as we will see), a fine mesh is necessary for other problems such as recognition of Kanji characters or handwriting. A possible ``fix,'' similar to what was done for the problem of clump counting [ Denker:87a ], is to hard wire the first layer of weights to be local in space, with a neighborhood growing with the mesh fineness. This reduces the number of weights, thus postponing the deterioration of the generalization. However, for an MLP with a single hidden layer, this approach will prevent the detection of many nonlocal correlations in the images, and in effect this fix is like removing the first layer of weights.
We would like to train large, high-resolution nets. If one tries to do this directly, by simply starting with a very large network and training by the usual back-propagation methods, not only is the training slow (because of the large size of the network), but the generalization properties of such nets are poor. As described above, a large net with many weights from the input layer to the hidden layer tends to ``grandmother'' the problem, leading to poor generalization.
The hidden units of an MLP form a set of feature extractors. Considering a complex pattern such as a Chinese character, it seems clear that some of the relevant features which distinguish it are large, long-range objects requiring little detail while other features are fine scale and require high resolution. Some sort of multiscale decomposition of the problem therefore suggests itself. The method we will present below builds in long-range feature extractors by training on small networks and then uses these as an intelligent starting point on larger, higher resolution networks. The method is somewhat analogous to the multigrid technique for solving partial differential equations.
Let us now present our multiscale training algorithm. We begin with the training set, such as the one shown in Figure 6.32, defined at the high resolution (in this case, $32\times32$). Each exemplar is coarsened by a factor of two in each direction using a simple grey scale averaging procedure: $2\times2$ blocks of pixels in which all four pixels were ``on'' map to an ``on'' pixel, those in which three of the four were ``on'' map to a ``3/4 on'' pixel, and so on. The result is that each exemplar is mapped to an exemplar at half the resolution in such a way as to preserve the large-scale features of the pattern. The procedure is then repeated until a suitably coarse representation of the exemplars is reached. In our case, we stopped after coarsening to $8\times8$.
At this point, an MLP is trained to solve the coarse mapping problem by one's favorite method (back-propagation, simulated annealing, and so on). In our case, we set up an MLP of 64 inputs (corresponding to the $8\times8$ grid), 32 hidden units, and 26 output units. This was then trained on the set of 320 coarsened exemplars using the simple back-propagation method with a momentum term ([ Rumelhart:86a ], Chapter 8). Satisfactory convergence was achieved after approximately 50 cycles through the training set.
We now wish to boost back to a high-resolution MLP, using the results of the coarse net. We use a simple interpolating procedure which works well. We leave the number of hidden units unchanged. Each weight from the input layer to the hidden layer is split, or ``un-averaged,'' into four weights (each now attached to its own pixel), each 1/4 the size of the original. The thresholds are left untouched during this boosting phase. This procedure gives a higher resolution MLP with an intelligent starting point for additional training at the finer scale. In fact, before any training at all is done with the $16\times16$ MLP (boosted from $8\times8$), it recalls the exemplars quite well. This is a measure of how much information was lost when coarsening from $16\times16$ to $8\times8$. The boost-and-train process is repeated to get to the desired $32\times32$ MLP. The entire multiscale training process is illustrated in Figure 6.33.
Figure 6.33:
An Example Flowchart for the Multiscale Training Procedure.
This was the procedure used in this text, but the averaging and boosting
can be continued through an indefinite number of stages.
Here we give some details of our results and compare with the standard approach. As mentioned in the previous section, a $32\times32$ MLP (1024 inputs, 32 hidden units, 26 output units) was trained on the set of Figure 6.32 using the multiscale method. Outputs are never exactly 0 or 1, so we defined a ``successful'' recognition to occur when the output value of the desired letter was greater than 0.9 and all other outputs were less than 0.1. The training on the $8\times8$ grid used back-propagation with a momentum term and went through the exemplars sequentially. The weights are changed to reduce the error function for the current character. The result is that the system does not reach an absolute minimum; rather, at long times the weight values oscillate with a period equal to the time of one sweep through all the exemplars. This is not a serious problem, as the oscillations are very small in practice. Figure 6.34 shows the training curve for this problem. The first part of the curve is the training of the $8\times8$ network; even though the grid is a bit coarse, almost all of the characters can be memorized. Proceeding to the next grid by scaling the mesh size by a factor of two and using the $16\times16$ exemplars, we obtained the second part of the learning curve in Figure 6.34. The net got 315/320 correct. After 12 additional sweeps on the $16\times16$ net, a perfect score of 320/320 was achieved. The third part of Figure 6.34 shows the result of the final boost to $32\times32$. In just two cycles on the $32\times32$ net, a perfect score of 320/320 was achieved and the training was stopped. It is useful to compare these results with a direct use of back-propagation on the $32\times32$ mesh without the multiscale procedure. Figure 6.35 shows the corresponding learning curve, with the result from Figure 6.34 drawn in for comparison. Learning via the multiscale method takes much less computer time. In addition, the internal structure of the resultant network is quite different, and we now turn to this question.
How do these two networks compare for the real task of recognizing exemplars not belonging to the training set? We used as a generalization test set 156 more handwritten characters. Though there are no ambiguities for humans in this test set, the networks did make mistakes. The network from the direct method made errors 14% of the time, and the multiscale network made errors 9% of the time. We feel the improved performance of the multiscale net is due to the difference in quality of the feature extractors in the two cases. In a two-layer MLP, we can think of each hidden-layer neuron as a feature extractor which looks for a certain characteristic shape in the input; the function of the output layer is then to perform the higher level operation of classifying the input based on which features it contains. By looking at the weights connecting a hidden-layer neuron to the inputs, we can determine what feature that neuron is looking for.
Figure 6.34:
The Learning Curve for our Multiscale Training Procedure
Applied to 320 Handwritten Characters. The first part of the curve is
the training on the $8\times8$ net, the second on the $16\times16$ net,
and the last on the full $32\times32$ net. The curve is plotted
as a function of CPU time and not sweeps through the presentation set,
in order to exhibit the speed of training on the smaller networks.
Figure 6.35:
A Comparison of Multiscale Training with the Usual, Direct
Back-propagation Procedure. The curve labelled ``Multiscale'' is the same as
Figure 6.34, only rescaled by a factor of two. The curve
labelled ``Brute Force'' is from directly training a $32\times32$ network,
from a random start, on the learning set. The direct approach does not quite
learn all of the exemplars, and takes much more CPU time.
For example, Figure 6.36 shows the input weights of two neurons in the $8\times8$ net. The neuron of (a) seems to be looking for a stroke extending downward and to the right from the center of the input field. This is a feature common to letters like A, K, R, and X. The feature extractor of (b) seems to be a ``NOT S'' recognizer and, among other things, discriminates between ``S'' and ``Z''.
Figure 6.36:
Two Feature Extractors for the Trained $8\times8$ Net. This figure
shows the connection weights between one hidden-layer neuron and all the
input-layer neurons. Black boxes depict positive weights, while white
depict negative weights; the size of the box shows the magnitude. The
position of each weight in the $8\times8$ grid corresponds to the position
of the input pixel. We can view these pictures as maps of the features which
each hidden-layer neuron is looking for. In (a), the neuron is looking for a
stroke extending down and to the right from the center of the input field;
this neuron fires upon input of the letter ``A,'' for example. In (b), the
neuron is looking for something in the lower center of the picture, but it
also has a strong ``NOT S'' component. Among other things, this neuron
discriminates between an ``S'' and a ``Z''. The outputs of several such
feature extractors are combined by the output layer to classify the original
input.
Figure 6.37:
The Same Feature Extractor as in Figure 6.36(b),
after the Boost to $16\times16$. There is an obvious correspondence between
each connection in Figure 6.36(b) and $2\times2$ clumps
of connections here. This is due to the multiscale procedure, and leads to
spatially smooth feature extractors.
Even at the coarsest scale, the feature extractors usually look for blobs rather than correlating a scattered pattern of pixels. This is encouraging, since it matches the behavior we would expect from a ``good'' character recognizer. The multiscale process accentuates this locality, since a single pixel grows into a local clump of four pixels at each rescaling. This effect can be seen in Figure 6.37, which shows the feature extractor of Figure 6.36(b) after scaling to $16\times16$ and further training. Four-pixel clumps are quite obvious in the network. The feature extractors obtained by direct training on large nets are much more scattered (less smooth) in nature.
Before closing, we would like to make some additional comments on the multiscale method and suggest some possible extensions.
In a pattern-recognition problem such as character recognition, the two-dimensional spatial structure of the problem is important. The multiscale method preserves this structure, so that ``reasonable'' feature extractors are produced. An obvious extension to the present work is to increase the number of hidden units as one boosts the MLP to higher resolution. This corresponds to adding completely new feature extractors. We did not do this in the present case, since 32 hidden units were sufficient; the problem of recognizing upper-case Roman characters is too easy. For a more challenging problem such as Chinese characters, adding hidden units will probably be necessary. We should mention that incrementally adding hidden units is easy to do and works well; we have used it to achieve perfect convergence of a back-propagation network for the problem of tic-tac-toe.
When boosting, the weights are scaled down by a factor of four and so it is important to also scale down the learning rate (in the back-propagation algorithm) by a factor of four.
We defined our ``blocking,'' or coarsening, procedure to be a simple grey scale averaging of $2\times2$ blocks. There are many other possibilities, well known in the field of real-space renormalization in physics. Other interesting blocking procedures include using a scale factor different from two, majority-rule averaging, simple decimation, and so on.
Multiscale methods work well in cases where spatial locality or smoothness is relevant (otherwise, the interpolation approximation is bad). Another way of thinking about this is that we are decomposing the problem onto a set of spatially local basis functions such as gaussians. In other problems, a different set of basis functions may be more appropriate and hence give better performance.
The multiscale method uses results from a small net to help in the training of a large network. The different-sized networks are related by the rescaling or dilation operator. A variant of this general approach would be to use the translation operator to produce a pattern matcher for the game of Go. The idea is that at least some of the complexity of Go is concerned with local strategies. Instead of training an MLP to learn these on the full Go board, do the training on a smaller ``mini-Go'' board. The appropriate way to relate these networks to the full-sized one is not through dilations, but via translations: The same local strategies are valid everywhere on the board.
Steve Otto had the original idea for the multiscale training technique. Otto, Ed Felten, and Olivier Martin developed the method. Jim Hutchinson contributed by supplying the original back-propagation program.
When moving objects in a scene are projected onto an image plane (for example, onto our retina), the real velocity field is transformed into a two-dimensional field, known as the motion field.
By taking more images at different times and calculating the motion field, we can extract parameters such as the time to collision, which is useful for obstacle avoidance. If we know the motion of the camera (or our own ego-motion), we can reconstruct the entire three-dimensional structure of the environment (if the camera translates, near objects will have a larger motion field than distant ones). The depth measurements can be used as starting constraints for a surface reconstruction algorithm like the one described in Section 9.9.
In particular situations, the apparent motion of the brightness pattern, known as the optical flow, provides a sufficiently accurate estimate of the motion field. Although the adaptive scheme that we propose is applicable to different methods, the discussion will be based on the scheme proposed by Horn and Schunck [ Horn:81a ]. They assume that the image brightness of a given point remains constant over time, and that the optical flow varies smoothly almost everywhere. Satisfaction of these two constraints is formulated as the problem of minimizing a quadratic energy functional (see also [ Poggio:85a ]). The appropriate Euler-Lagrange equations are then discretized on a single or multiple grid and solved using, for example, the Gauss-Seidel relaxation method [ Horn:81a ], [ Terzopoulos:86a ]. The resulting system of equations (two for every pixel in the image) is:
\[
u = \bar{u} - \frac{I_x \,(I_x \bar{u} + I_y \bar{v} + I_t)}{\alpha^2 + I_x^2 + I_y^2},
\qquad
v = \bar{v} - \frac{I_y \,(I_x \bar{u} + I_y \bar{v} + I_t)}{\alpha^2 + I_x^2 + I_y^2},
\]
where $u$ and $v$ are the optical flow variables to be determined; $I_x$, $I_y$, $I_t$ are the partial derivatives of the image brightness with respect to space and time; $\bar{u}$ and $\bar{v}$ are local averages computed on a grid of spatial discretization step $h$; and $\alpha$ controls the smoothness of the estimated optical flow.
Now, we need to estimate the partial derivatives in the above equations with discretized formulas starting from brightness values that are quantized (say integers from 0 to n ) and noisy. Given these derivative estimation problems, the optimal step for the discretization grid depends on local properties of the image. Use of a single discretization step produces large errors on some images. Use of a homogeneous multiscale approach, where a set of grids at different resolutions is used, may in some cases produce a good estimation on an intermediate grid and a bad one on the final and finest grid. Enkelmann and Glazer [ Enkelmann:88a ], [ Glazer:84a ] encountered similar problems.
These difficulties can be illustrated with the following one-dimensional example. Let's suppose that the intensity pattern observed is a superposition of two sinusoids of different wavelengths:
\[
E(x,t) = \sin\big(k_l (x - vt)\big) + R \,\sin\big(k_s (x - vt)\big),
\]
where $R$ is the ratio of short- to long-wavelength components. Using the brightness constancy assumption ($dE/dt = 0$, or $E_x v + E_t = 0$; see [ Horn:81a ]), the measured velocity is given by:
\[
v_{meas} = -\,\hat{E}_t / \hat{E}_x ,
\]
where $\hat{E}_x$ and $\hat{E}_t$ are the three-point approximations of the spatial and temporal brightness derivatives.
Now, if we calculate the estimated velocity on two different grids, with spatial step equal to one and two, as a function of the parameter R, we obtain the result illustrated in Figure 6.38.
Figure 6.38:
Measured velocity for superposition of sinusoidal patterns as a
function of the ratio of short- to long-wavelength components. The dashed
and continuous lines correspond to the two spatial discretization steps.
While the correct velocity is obtained on the coarser grid in this case, on the finer one the measured velocity depends on the value of R. In particular, if R is greater than a critical value, we obtain a velocity in the opposite direction!
We propose a method for ``tuning'' the discretization grid to a measure of the reliability of the optical flow derived at a given scale. This measure is based on a local estimate of the errors due to noise and discretization, and is described in [Battiti:89g;91b].
First, a Gaussian pyramid [ Burt:84a ] is computed from the given images. This consists of a hierarchy of images obtained by filtering the original ones with Gaussian filters of progressively larger size.
Then, the optical flow field is computed at the coarsest scale using relaxation, and the estimated error is calculated for every pixel. If this quantity is less than a given threshold, the current value of the flow is interpolated to the finer resolutions without further processing. This is done by setting an inhibition flag contained in the grid points of the pyramidal structure, so that these points do not participate in the relaxation process. If, on the contrary, the error is larger than the threshold, the approximation is relaxed on a finer scale, and the entire process is repeated until the finest scale is reached.