Minutes of the Message Passing Interface Forum
Dallas, Texas
August 11 - 13, 1993

The MPI Forum met August 11-13, 1993, at the Bristol Suites Hotel in North
Dallas.  This was the eighth meeting of the MPIF and the sixth regular
working meeting in Dallas.  There were both general meetings of the
committee as a whole and meetings of several of the subcommittees.  This
meeting included the first reading of the Communication Contexts,
Environmental Management and Subset chapters, the second reading of the
Process Topologies chapter, and formal consideration of various topics in
the Point-to-point and Collective Communication chapters.

There were a substantial number of formal votes taken at this meeting as
well as a few straw votes.  All of the votes are recorded in these minutes
(and can be found by searching for VOTE) and have also been published in
summary form to the mpi-core mailing list.

These minutes were written by Bob Knighten (knighten@ssd.intel.com) and
Rusty Lusk (lusk@anl.gov).

These minutes are quite long.  If you want to see the important topics you
can search for --- and this will quickly lead to each topic (and a few
other things).

The basic documents used at this meeting are:

 + DRAFT Document for a Standard Message-Passing Interface (August 10, 1993)
 + MPI Environmental Management section

Attendees:
---------

Robert G. Babb II       U. of Denver                 babb@cs.du.edu
Doreen Cheng            NASA/Ames                    dcheng@nas.nasa.gov
Lyndon Clarke           University of Edinburgh      lyndon@epcc.ed.ac.uk
James Cownie            Meiko                        jim@meiko.co.uk
Jack Dongarra           UT/ORNL                      dongarra@cs.utk.edu
Anne C. Elster          Cornell University           elster@cs.cornell.edu
Jim Feeney              IBM Endicott                 feenyj@vnet.endicott.ibm.com
Al Geist                ORNL                         gst@ornl.gov
Ian Glendinning         University of Southampton    igl@ecs.soton.ac.uk
Brian K. Grant          LLNL                         bkg@llnl.gov
Adam Greenberg          TMC                          moose@think.com
Leslie Hart             NOAA/FSL                     hart@fsl.noaa.gov
Don Heller              Shell Development            heller@shell.com
Tom Henderson           NOAA/FSL                     hender@fsl.noaa.gov
Alex Ho                 IBM Almaden                  wkh@almaden.ibm.com
Gary Howell             Florida Tech                 howell@zach.fit.edu
Steven Huss-Lederman    SRC                          lederman@super.org
Bob Knighten            Intel SSD                    knighten@ssd.intel.com
Rik Littlefield         PNL                          rj_littlefield@pnl.gov
Rusty Lusk              ANL                          lusk@mcs.anl.gov
Peter Madams            nCube                        pmadams@ncube.com
Dan Nessett             LLNL                         nessett@llnl.gov
Steve Otto              Oregon Graduate Institute    otto@cse.ogi.edu
Peter Pacheco           U. of San Francisco          peter@sun.math.usfca.edu
Anthony Skjellum        Mississippi State U.         tony@cs.msstate.edu
Marc Snir               IBM, T.J. Watson             snir@watson.ibm.com
Alan Sussman            University of Maryland       als@cs.umd.edu
Bob Tomlinson           LANL                         bob@lanl.gov
Eric T. Van de Velde    CalTech                      evdv@ama.caltech.edu
David Walker            ORNL                         walker@msr.epm.ornl.gov
Joel Williamson         Convex Computer              joelw@convex.com

Wednesday, August 11
--------- ---------

-------------------------------------------------------------------------------
General Meeting
-------------------------------------------------------------------------------

Jack Dongarra opened the meeting by presenting the agenda that was previously
sent out by David Walker.
AGENDA
------

Wednesday
  1:00 -  2:00   Subcommittee meetings
  2:00 -  4:00   Point-to-point communications    Snir
  4:00 -  5:00   Collective communications        Geist
  6:00 -  7:30   Dinner
  7:30 - 10:00   Subcommittee meetings

Thursday
  9:00 - 12:00   Context                          Skjellum
 12:00 -  1:30   Lunch
  1:30 -  2:30   Context
  3:00 -  4:00   Subset                           Huss
  4:00 -  6:00   Topology                         Huss
  6:00 -  8:00   Dinner
  8:00 - 10:00   Subcommittees

Friday
  9:00 - 10:30   Environment                      Lusk
 10:30 - 12:00   Language                         Lusk

Status of Readings

 sec\date |  May  June  August  September
 ---------+--------------------------------------------------------------
 p-p      |  2
 coll     |  1  2  2
 profile  |  1  2
 context  |  1  2
 topology |  1  2
 subset   |  1  2  2
 lang     |  1  2
 env      |  1  2

The next meeting will be September 22-24.  It will again be here in Dallas.

Started at 2:10

-------------------------------------------------------------------------------
Report From the Point-to-Point Communication Subcommittee
-------------------------------------------------------------------------------

Marc Snir presided.

Marc reorganized the chapter to make it more readable.  He also added the
material in Section 4.13 (Derived datatypes) in line with the straw vote at
the last meeting.  In response to a question, Marc noted that he welcomes
editorial comments, and asks that they be sent to him in e-mail.

Derived datatypes (4.13)
------------------------

Marc began by describing the ideas behind "Derived datatypes".

What is the relation of this to Fortran 90 data types?  Largely orthogonal.

Organization count: 20

STRAW VOTE: Should "Derived datatypes" be added to MPI?
----------
Yes: 22   No: 0   Abstain: 0

Introduction (4.13.0)
---------------------

VOTE: Approve 4.13.0 (Introduction)?
----
Yes: 19   No: 0   Abstain: 1

Datatype constructors (4.13.1)
------------------------------

Marc gave brief descriptions of the five functions in this section and
contrasted them with the earlier buffer construction functions.

VOTE: Approve 4.13.1 (Datatype constructors)
----
Yes: 20   No: 0   Abstain: 0

Additional functions (4.13.2)
-----------------------------

It was noted that the sentence starting at line 33 on p. 77 is wrong and
contradicts what follows.  Marc agreed and will repair this.

There was some discussion of exactly what is passed in a message using a
datatype that contains gaps.  There was no disagreement and this will be
clarified.

The fact that MPI_ADDRESS has an integer OUT parameter that provides the
byte address of a location is a problem on some architectures; this was
briefly discussed.  One proposal was to use a suitable
implementation-specific definition of the return type.

There was also a discussion of MPI_TYPE_COMMIT, primarily to better
understand what was intended here.  The principal confusion was that
MPI_TYPE_COMMIT has to do with completing the definition of the type, NOT
committing data.

There is an alternative form of MPI_TYPE_COMMIT with only one parameter
(type).  One would then need to commit before communication; the type can
still be used in constructors after the commit.  Yet another alternative is
a lazy commit, i.e. a datatype object becomes committed at first use in a
communication.  There are several issues in considering these alternatives:
ease of use for the programmer (lazy commit is easy); use of resources (the
original allows reclaiming resources as soon as they are not being used);
and the cost of using a datatype buffer in a communication.

Adam Greenberg proposed yet another option - provide an optional function,
MPI_TYPE_COMPILE, which can be used for efficiency but otherwise there is
lazy commit.  The objection to this was efficiency of the communication.
Adam's counter was that he expected this to be in the noise of the general
overhead of communication.

Marc noted that we need to write up full proposals and have a more focused
discussion.
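For illustration, here is a minimal sketch of how a derived datatype might be
built, committed with the one-parameter form of MPI_TYPE_COMMIT discussed
above, and then used in a send.  The constructor name MPI_TYPE_HVECTOR, the
type name MPI_DATATYPE, and the exact argument lists are assumptions made for
the purpose of the example, not the draft binding:

    /* Sketch only: send one column of an n x m C array of doubles as a
       single message, using an "hvector"-style derived datatype.          */

    MPI_DATATYPE column;

    /* n blocks of 1 double each, successive blocks m*sizeof(double)
       bytes apart - the strided pattern a derived datatype can describe.  */
    MPI_TYPE_HVECTOR(n, 1, m * sizeof(double), MPI_DOUBLE, &column);

    /* One-parameter commit: completes the definition of the type (it does
       NOT commit any data); required before the type is used in a send.   */
    MPI_TYPE_COMMIT(&column);

    MPI_SEND(&a[0][col], 1, column, dest, tag, comm);

    /* Resources associated with the datatype can be reclaimed once it is
       no longer needed.                                                   */
    MPI_TYPE_FREE(&column);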
VOTE: Approve 4.13.2 (Additional functions) EXCEPT MPI_TYPE_COMMIT and
----  MPI_TYPE_FREE?
Yes: 16   No: 0   Abstain: 2

Use of general datatypes in communication (4.13.3)
--------------------------------------------------

This section specifies how datatypes are used in send and receive, including
how type matching works.  In particular, datatypes match if they are
structurally equivalent (i.e. the type signatures are the same).

Is there some function that gives a count (of some suitable sort) of elements
in a send when using datatypes?  {{{I am confused?}}}

VOTE: Have a query function that returns the number of top-most elements
----  received?
Yes: 14   No: 0   Abstain: 4

VOTE: Have ONLY a query function that returns the count of top-most level
----  elements in a receive.
Yes:      No:      Abstain:

VOTE: Approve 4.13.3 (Use of general datatypes in communication) as amended.
----
Yes: 15   No: 0   Abstain: 4

Examples (4.13.4)
-----------------

There are no requirements here and so no vote was taken.

Correct use of addresses (4.13.5)
---------------------------------

This section is about dealing with limits in using addresses on systems
which do not have a flat address space.

VOTE: Approve section 4.13.5 (Correct use of addresses)?
----
Yes: 18   No: 0   Abstain: 1

Message data (4.2.1)
--------------------

Marc asked if the table in this section needs to be expanded to include all
of the native C data types.  YES!

Marc reviewed all of the changes (other than order) that he made in this
chapter.

There is more detail in the discussion of the semantics of point-to-point
communication.  He has added Progress (some guaranteed) and Fairness (none
guaranteed) requirements.  {{{Where?}}}  Null can be used when nothing is
needed.

On p. 61 there is a new function, MPI_TESTALL, as suggested by David Walker.
It is included for reasons of completeness and symmetry.
{{{See Discussion on p. 62}}}

VOTE: Allow null pointers in an array of pointers (with the system required
----  to do the right thing)?
Yes: 20   No: 0   Abstain: 1

On p. 63 there is a new function, MPI_PROBE_COUNT, which uses a datatype to
interpret the result of a probe and get a count.

On p. 64 there is a typo in MPI_PROBE.  There should not be a datatype
parameter.

On p. 65, the name of MPI_IS_CANCELLED has been changed to
MPI_TEST_CANCELLED.

The time arrived to make a decision on SENDRECV (section 4.11).  Adam
Greenberg suggested that there should be two tag arguments (one for sending
and one for receiving) rather than only one.  One effect of this is that a
wild card can now be used in the receive_tag.

VOTE: Have two tag arguments (send_tag and receive_tag) in MPI_EXCHANGE?
----
Yes: 6   No: 3   Abstain: 10

Section 4.12 (Null processes) is a proposal responding to the suggestion
from Jon Flower that was supported in a straw vote at the last meeting.
There was a substantial discussion of the utility and cost of MPI_PROCNULL.
Steve Huss-Lederman, as a proxy for Rolf Hempel, argued for the value of
this in use with process topologies.  Various people argued that there would
be a universal overhead if MPI_PROCNULL were allowed as a
source/destination.  Three positions emerged: allow it as a legitimate
source/destination everywhere; allow it only in send/receive/exchange; never
allow it as a source/destination.

VOTE: (1) Allow MPI_PROCNULL as a source or destination for all
----      communication operations.
      (2) Allow MPI_PROCNULL only for MPI_SENDRECV and MPI_EXCHANGE.
      (3) Never allow MPI_PROCNULL as a source or destination in
          communication operations.

  (3)  Yes: 3    No: 9   Abstain: 8
  (2)  Yes: 11   No: 5   Abstain: 6
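To illustrate the topology argument made above, here is a sketch of the shift
idiom that MPI_PROCNULL is meant to simplify.  The MPI_SENDRECV argument list
shown is schematic (variable names and argument order are assumptions), not
the draft binding:

    /* Sketch only: a one-dimensional shift to the right in which the
       boundary processes simply pass MPI_PROCNULL, so every process can
       execute the identical call with no special-case code.              */

    int left  = (rank > 0)          ? rank - 1 : MPI_PROCNULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROCNULL;

    /* The intent is that a send to or a receive from MPI_PROCNULL simply
       completes without transferring any data.                           */
    MPI_SENDRECV(sendbuf, n, MPI_DOUBLE, right, tag,
                 recvbuf, n, MPI_DOUBLE, left,  tag, comm, &status);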
Section 4.14 (Universal communication functions) includes one new
convenience function, MPI_COMM_INIT.  This was added by Marc Snir to make it
part of the base functions in terms of which all other functions can be
defined.  An alternative approach is to put section 4.14 in an Annex and NOT
require these functions as part of MPI.

VOTE: Approve Chapter 4 (Point to Point Communication) as amended.
----
Yes: 18   No: 0   Abstain: 1

Note that all that remains to consider in Chapter 4 are the type_commit and
free functions.

-------------------------------------------------------------------------------
Report From the Collective Communication Subcommittee
-------------------------------------------------------------------------------

Al Geist presided.

This was a continuation of the second reading of this chapter that was begun
at the last meeting.  The number of changes was small.  The formats of the
buffer arguments have all been changed to agree with those in the
Point-to-point Communication chapter.  The material on user_reduce has been
rewritten to provide two variants, one assuming commutativity of the user
operation and one not.

Steve Huss-Lederman asked whether there are ANY collective operations that
are guaranteed to give the same result for repeated runs with the same
initial conditions.  Al Geist replied "Broadcast.  Next question."  A long
discussion ensued.  As in previous discussions of this topic, there was a
wide spectrum of opinions expressed.  Some insisted that reproducibility, at
least in a debug mode, is required.  Others insisted this is a quality of
implementation issue.  Other opinions expressed included that this is
outside the scope of MPI; that an implementation must document the extent to
which reproducibility is available; etc.

In line with the Discussion paragraph on p. 95, it was agreed that the
completely general ALLTOALL is not needed.

Rik Littlefield noted that there is a useful kind of reduce that is missing,
a "scatter-reduce" that does a reduction of sections of an array to an array
of processes.  He will write up a proposal and distribute it to the
collective communication mailing list.

Al Geist noted the change mentioned in the paragraph labeled Missing on
p. 99.  There was some confusion, so Al Geist promised to add an example to
clarify this.

Jim Cownie pointed out that it is important that implementors provide an
implementation guide that specifies which of the collective operations that
may/may not be synchronizing actually are synchronizing.  Adam Greenberg
countered that users must assume that these operations are not synchronizing
and therefore such a document serves no purpose.  Adam Greenberg also
expressed unease with section 5.6.

STRAW VOTE: Should MPI require documentation of the implementation
----------  variations in synchronization properties of collective
            operations?
Yes: 18   No: 2   Abstain: 2

Eric Van de Velde asked why two versions of user_reduce were being provided.
A brief review of the arguments of last time ensued.
VOTE: Approve Chapter 5 (Collective Communications)
----
Yes: 21   No: 0   Abstain: 0

-----------------------------------------------------------------------------
The group adjourned for dinner at 6:10pm
=============================================================================

Thursday, August 12, 1993
-------- ---------------

-------------------------------------------------------------------------------
Report From the Communication Contexts Subcommittee
-------------------------------------------------------------------------------

Anthony Skjellum presented.

Groups, Contexts and Communicators (Chapter 3) - First Reading
--------------------------------------------------------------

Introduction (3.1)
------------------

There was an objection to the use of the terms intra-communication and
inter-communication without definition.  This is editorial and will be
addressed outside the meeting.  There are no requirements in this section,
so no vote was taken.

Context (3.2)
-------------

There was confusion about the term "hypertag".  This too is editorial.
There are no requirements in this section, so no vote was taken.

Groups (3.3)
------------

Predefined Groups (3.3.1)
-------------------------

It was proposed to change the wording describing MPI_GROUP_ALL to be "all
processes at the moment of process creation" and to add another group,
MPI_GROUP_SIBLING, which is "all processes with the same program text."
Don Heller asked about system-defined server processes and the like.  It was
agreed to modify the wording to include these.

Jim Cownie noted that there is a problem with the notion of HOST because the
host would have to have many different versions of MPI_GROUP_HOST.  As an
alternative Jim proposed that the host should be accessible via some
constant or function that would give the rank of the host in MPI_ALL.

After various proposals for additional predefined groups {MPI_GROUP_PEER
"all processes except the host (if there is one)", MPI_GROUP_PARENT "parent
of all children spawned"} it was proposed that this section be revised to
say something like "There are no predefined groups.  The effect of
predefined groups is gotten by using the groups associated with the
predefined communicators."

Adam Greenberg asked that a vote on this section NOT be taken until after
the decision on which communicators are predefined.

Communicators (3.4)
-------------------

Predefined Communicators (3.4.1)
--------------------------------

  MPI_COMM_ALL
  MPI_COMM_SIBLING
  MPI_COMM_PEER
  MPI_COMM_PARENT
  MPI_COMM_SELF

After dealing for a time with the complexity and lack of clarity of this
situation, various alternatives were offered.  The simplest was to have only
MPI_COMM_ALL.

Dan Nessett

Organization count: 22

VOTE: Revise sections 3.3 and 3.4 as follows:
----
  (1) There are no predefined groups.
  (2) The only predefined communicators are MPI_COMM_ALL and MPI_COMM_PEER.
  (3) There is a predefined MPI_HOST_RANK which gives the rank of the host
      in the ALL group.  It is MPI_UNDEFINED if there is no host.
Yes: 17   No: 2   Abstain: 3

VOTE: Approve sections 3.3 and 3.4 as amended.
----
Yes: 18   No: 0   Abstain: 3

Group Management (3.5)
----------------------

Local Operations (3.5.1)
------------------------

There was some discussion of how MPI manages the coordination between
various groups (always by relation to the ALL group), of which of the
inquiry functions in this section properly belong in the environmental
sections (none), and general confusion about what the various functions do.
No changes were made.

VOTE: Approve section 3.5.1 (Local Operations)?
----
Yes: 19   No: 0   Abstain: 3

Local Group Constructors (3.5.2)
--------------------------------

Most of the discussion had to do with clarification and editorial
corrections.

Eric Van de Velde asked for a function to reorder the ranks in a group.
After some discussion as to exactly what is needed and why, it was noted
that MPI_LOCAL_SUBGROUP provides the desired function.  It was agreed to add
a remark to this effect.

It was noted that all of the functions in section 3.5 need descriptions
rather than just names and parameters.

Marc Snir asked for a clarifying note that of the functions in this section
only MPI_LOCAL_SUBGROUP and MPI_LOCAL_SUBGROUP_RANGES can change the ranks.
This has as a side effect that the ranges must not overlap.  This was
agreed.

Lyndon Clarke asked for a clarification that the effect of
MPI_LOCAL_SUBGROUP_RANGES is as though the ranges were expanded to a list of
ranks and MPI_LOCAL_SUBGROUP were called with these ranks.  There should be
a similar statement for the relation between MPI_LOCAL_SUBGROUP_EXCL_RANGES
and MPI_LOCAL_EXCL_SUBGROUP.  This was agreed as well.

VOTE: Approve section 3.5.2 (Local Group Constructors) as clarified?
----
Yes: 17   No: 0   Abstain: 5

Collective Group Constructors (3.5.3)
-------------------------------------

The phrase "a stable sort is used to determine rank order" on line 23 of
p. 18 will be changed to say that in the event of ties the rank in the comm
group will be used to determine the rank in new_group.

After discussion of the meaning of MPI_COLL_SUBGROUP it was proposed to have
instead:

  MPI_COLL_SUBCOMM(comm, key, color, new_comm)

which will then appear in section 3.7.3.  The effect on section 3.5.3 is
that it would simply say that there are no collective group constructors.

VOTE: Approve section 3.5.3 (Collective Group Constructors) as amended?
----
Yes: 18   No: 0   Abstain: 3

--- break 10:30 - 11 ---

Sections 3.6 and 3.7 (pp. 18-21) were handled by giving a
function-by-function discussion followed by an overview of a "tuning"
proposal by Marc Snir.

Operations on Contexts (3.6)
----------------------------

Local Operations (3.6.1)
------------------------

Collective Operations (3.6.2)
-----------------------------

In MPI_CONTEXTS_ALLOC, the len parameter is removed.  The "void *" in the
descriptions is removed and better words will be provided.

The idea of quiescence that was prominent in the context proposal at the
last meeting has largely disappeared.  The manner of dealing with the
problem that quiescence was designed to solve is discussed at length on
p. 19.

A discussion of the relation of point-to-point and collective communication
was prompted by a dispute between Marc Snir and Jim Cownie.  Jim made the
point that the collective communication routines can be written using the
point-to-point communication routines and the material in the context
chapter.  It was noted that there is one collective communication routine in
the context chapter - MPI_CONTEXTS_ALLOC - and some system magic must ensure
that this works correctly.  Marc Snir noted that the system can provide
similar magic throughout for optimization purposes.

Operations on Communicators (3.7)
---------------------------------

Local Communicator Operations (3.7.1)
-------------------------------------

There were no issues in this section.

Local Constructors (3.7.2)
--------------------------

There is an MPI_COMM_BIND function missing.  It was accidentally deleted and
will be added.
The form is:

  MPI_COMM_BIND(group, context, new_comm)
    IN  group
    IN  context
    OUT new_comm

The details will be provided in the next draft.

The name of the function MPI_COMM_UNBIND will be changed to MPI_COMM_FREE
(and the function of this name in the point-to-point chapter will be
renamed).  The frequent reference to MPI_COMM_GROUP(comm) will be changed to
MPI_COMM_GROUP(comm, group).

Collective Communicator Constructors (3.7.3)
--------------------------------------------

The one collective operation for communicators is MPI_COMM_MAKE.

Adam Greenberg noted that as currently written every member of the group
associated with sync_comm gets comm_new, which has ...

At this point Marc Snir presented the following proposal.

An out of context proposal
  - The only use of context is for local creation of communicators.
  - The result can be achieved without an explicit context object (some loss
    of safety).
  - Either case needs rules for coordinated context allocation.

Communication context
  - specified by communicator
  - can be "preallocated" and then locally bound to a communicator.

MPI_CONTEXTS_ALLOC(comm, n) - preallocates n contexts and "caches" them with
  comm.  {This can be called repeatedly and adds the number of contexts
  specified on each call.  This is a collective operation.}

MPI_CONTEXTS_FREE(comm, n) - releases up to n preallocated contexts.  {This
  can be called repeatedly.  It is a local operation.}

MPI_COMM_CONTEXT(comm, n) - queries the number of available preallocated
  contexts.  {This is a local operation.}

MPI_COMM_DUP(comm, new_comm) - duplicates a communicator.  Uses a locally
  cached context if available, otherwise this is a collective operation.
  {It is erroneous if some but not all have a locally cached context
  available.  Note that new_comm does NOT have any cached context.}

MPI_COMM_LDUP(comm, new_comm) - duplicates a communicator.  Uses a locally
  cached context and returns NULL if none is available.

MPI_COMM_MAKE(sync_comm, comm_group, comm_new)
MPI_COMM_LMAKE(sync_comm, comm_group, comm_new) - both of these create a new
  communicator associated with comm_group, which is a subgroup of the group
  of sync_comm.  This must be called by all members of the group of
  sync_comm.

Some unease was expressed about operations being sometimes collective and
sometimes not.  Safety?  In response Marc noted that there could also be
MPI_COMM_GDUP which would always do a collective operation.

Correctness rule
----------------

All processes in a comm group must execute the same sequence of calls to
MPI_CONTEXTS_ALLOC, MPI_CONTEXTS_FREE, MPI_COMM_xDUP, MPI_COMM_xMAKE with
comm as argument.
  - simple to state
  - same as for collective communication
  - too conservative?

Note: This is compatible with the existence of static preallocated contexts.

At this point Lyndon Clarke, having been waving his hand in the air for
several minutes, stood on his chair to try and get Marc to address his
question.  After a brief discussion between Adam, Marc and Tony, Lyndon was
allowed to speak.

Lyndon Clarke noted that there is no way for the system to check for proper
usage of arguments, so this offers no additional security compared with
earlier proposals.  Others noted that this did provide some additional
safety, but it is hard to make a direct comparison.

Steve Huss asked if anyone on the context subcommittee wanted to keep the
current material.  Tony said no, but that would likely not be true if Mark
Sears were here.

STRAW VOTE: Make this into a full proposal to replace the current 3.7.2 &
----------  3.7.3?
Yes: 25   No: 0   Abstain: 2
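As a way of seeing how the proposal would be used, here is a minimal sketch.
The function names are those used in the presentation above; the C argument
types, variable names, and error behaviour are assumptions:

    /* Collective over comm's group: every process preallocates two
       contexts and caches them with comm.                                */
    MPI_CONTEXTS_ALLOC(comm, 2);

    /* Later, e.g. inside a library routine: a purely local duplication
       that consumes one cached context, so no communication is needed.
       (Per the slide, MPI_COMM_LDUP returns NULL in lib_comm if no cached
       context is available.)                                             */
    MPI_COMM_LDUP(comm, &lib_comm);

    /* ... the library communicates safely within lib_comm ... */

    MPI_COMM_FREE(&lib_comm);

    /* Local: release one of the remaining preallocated contexts.  Under
       the correctness rule above, all processes in comm's group must make
       the same sequence of these calls on comm.                          */
    MPI_CONTEXTS_FREE(comm, 1);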
-- break for lunch 12:10 - 1:40 --

Cacheing (3.9)
--------------

Rik Littlefield presented this material.

Attribute Cacheing

Function: Safely attach arbitrary information to groups (and communicators).

Purpose: Allow modules to retain or exchange group-specific information
WITHOUT complicated calling sequences or correctness rules for use of the
module.

Examples

Basic Capabilities:
  - Attributes are local.
  - An attribute value can be a pointer to an arbitrary structure.
  - Attributes are referenced by a key value obtained from MPI.
  - Attributes can be defined and retrieved.
  - A destructor routine is called when the group (communicator) is freed.
  - A propagation routine is called when the group (communicator) is
    duplicated.

Functions:

  MPI_GET_ATTRIBUTE_KEY(OUT keyval)

  MPI_GROUP_ATTR(IN group, OUT attr_set_handle)

  MPI_COMM_ATTR(IN comm, OUT attr_set_handle)

  MPI_SET_ATTRIBUTE(IN attr_set_handle, IN keyval, IN attribute_value,
                    IN *attribute_destructor_routine,
                    IN *attribute_propagation_routine)

  MPI_TEST_ATTRIBUTE(IN attr_set_handle, IN keyval, OUT attribute_value,
                     OUT result_status)

  attribute_destructor_routine(IN attribute_value)

  attribute_propagation_routine(IN attribute_value, ......)

This list is an updated and organized variant of the text.  In particular
the two routines, MPI_ATTRIBUTE_ALLOC and MPI_ATTRIBUTE_FREE, have been
eliminated.

What is the rationale for these functions altogether?  They provide a method
for managing resources associated with groups and communicators.  For
example, this provides facilities to implement the topology facilities on
top of MPI.  Rik observed that this allows one to effectively extend MPI,
e.g. to provide a user-written collective operation that can be safely used
with MPI and which looks like an MPI routine.

Marc Snir asked for a routine to change the value of attributes without
having to provide the destructor and propagation routines.  There was a
question whether this introduced a degree of insecurity.  Jim Cownie noted
that one might well want an attribute with null destructor and propagation
routines.  Such a reset routine will be provided in the next draft of this
chapter.

Do we need attributes for both groups and communicators?  Why not just on
groups?  This would allow elimination of attribute handles.  There do seem
to be situations where it is needed on communicators, not just on groups.

Adam noted that this proposal puts a resource burden on the system, so he
asked about the possibility of providing only a single system slot with the
remainder of the storage provided by the user.  Adam is concerned about the
admixture of resources at both user and system level.  {{{I'm confused}}}

Tony proposed adding MPI_ATTRIBUTE_ALLOC and MPI_ATTRIBUTE_FREE back.  It
was claimed that providing attribute allocate and free routines together
with a callback mechanism associated with the free is sufficient to provide
all of the functions in section 3.9.  Various people countered that this
would introduce new problems of coordination and safety.  In particular each
library might have independent attribute mechanisms and this would require
using multiple callbacks on each call of free.  It was noted that this is
very similar to the problem that was solved in X by using a register of
callback functions.  That can provide a model for this group to use.

STRAW VOTE: Do we want a cacheing mechanism?
----------
Yes: 14   No: 4   Abstain: 7

How would topology need this?  Steve Huss, as virtual Rolf Hempel, noted
that topologies need a variety of information (e.g. dimensions,
periodicities) that needs to be associated with groups (and communicators?).
As topologies are a part of MPI, a general cacheing mechanism is not
required.  But without it there are likely to be conflicts between topology
and user-written libraries.

VOTE: Approve section 3.9 (Cacheing) as amended in the presentation?
----
Yes: 8   No: 7   Abstain: 6
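For concreteness, here is a sketch of how a library module might use the
attribute functions presented above to cache its own per-communicator state.
The helper routines (create_my_state, my_state_destructor,
my_state_propagator), the type name MPI_COMMUNICATOR, and the C argument
types are hypothetical; the calling sequence follows the slide, not a
settled binding:

    static int my_key = -1;            /* key value shared by the module   */

    void my_lib_setup(MPI_COMMUNICATOR comm)
    {
        void *attr_set;
        void *state = create_my_state(comm);    /* module's private data   */

        if (my_key < 0)
            MPI_GET_ATTRIBUTE_KEY(&my_key);      /* obtain a key from MPI   */

        MPI_COMM_ATTR(comm, &attr_set);          /* attribute set of comm   */
        MPI_SET_ATTRIBUTE(attr_set, my_key, state,
                          my_state_destructor,   /* run when comm is freed  */
                          my_state_propagator);  /* run when comm is dup'ed */
    }

    void my_lib_op(MPI_COMMUNICATOR comm)
    {
        void *attr_set, *state;
        int   found;

        MPI_COMM_ATTR(comm, &attr_set);
        MPI_TEST_ATTRIBUTE(attr_set, my_key, &state, &found);
        /* ... if found, use the cached state ... */
    }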
--------
break 2:45 - 3:10

Adam Greenberg asked about having a User's Guide meeting tonight.  There are
enough interested to have a meeting here after dinner.
--------

Introduction to Inter-Communication (3.8)
-----------------------------------------

Tom Henderson presented on Inter-communication.

      ( A ) <=> ( B ) <=> ( C )
        /         /
     Arank     Brank

  Arank: ----- send(..., Brank, tag, commAB)
  Brank: ----- recv(..., Arank, tag, commAB, ...)

We want to be able to send from a process in one group to a process in
another group using the rank in the target group.

ALTERNATIVES
  - The local group has access to the remote group, and there is a rank
    translation in some common ancestor.
  - The user maintains tables and communicators for group-pairs.

{{{SLIDES HANDED OUT}}}

STRAW VOTE: Hear the full proposal?
----------
Yes: 25   No: 1   Abstain: 4

Basic concepts
  Local Group            Remote Group
  local group leader     remote group leader
  Peer-group that contains both group leaders.  {Not to be confused with
  MPI_COMM_PEER introduced earlier today.}

- All members of both groups must call MPI_COMM_PEER_MAKE.

- What is the reason for the tag?  It serves as the identifier for this
  particular inter-communicator.

- Joel Williamson asked why do this rather than just working in MPI_ALL?
  So that one can use names that are convenient for the problem at hand.
  This is also suitable for generalization to a dynamic situation rather
  than the static situation that is now in MPI.

- Why does everyone call PEER_MAKE?  To get a common communicator.

- Discussion slide: Add an IN argument, my_leader_rank, to
  MPI_COMM_PEER_MAKE().  This allows later addition of dynamic process
  creation.  The IN argument peer_comm need only be valid in the local group
  leader.  Only the group leaders need to be members of peer_comm.

- LOOSELY-SYNCHRONOUS INTER-COMMUNICATOR CONSTRUCTOR
- SYNCHRONIZATION PROPERTIES OF MPI_COMM_PEER_MAKE_START() AND
  MPI_COMM_PEER_MAKE_FINISH()
- COMMUNICATOR STATUS (convenience function - make go away)
- Synchronization issue: Can a process using an inter-communicator send a
  message using that inter-communicator as soon as it has the
  inter-communicator?  Something needs to be said.
- INTER-COMMUNICATOR SUPPORT
- EXAMPLE 1
- "UNDER THE HOOD"
- INTRA-COMMUNICATION / INTER-COMMUNICATION
- POSSIBLE IMPLEMENTATION OF MPI_COMM_PEER_MAKE()
- POSSIBLE IMPLEMENTATION OF MPI_COMM_PEER_MAKE_START()
- POSSIBLE IMPLEMENTATION OF MPI_COMM_PEER_MAKE_FINISH()

- What is the comparison with using the common ancestor approach?  To do
  this one would create a union group.

STRAW VOTE: Have inter-communication?
----------
Yes: 14   No: 3   Abstain: 9

VOTE: Approve section 3.8 (Introduction to Inter-Communication) as amended
----  but minus the name service?
Yes: 8   No: 2   Abstain: 10

- ?
- EXAMPLE 2
- EXAMPLE 3
- EXAMPLE 4
- Al Geist noted all of the examples should be expanded to show at least one
  message being sent!
- POSSIBLE IMPLEMENTATION OF MPI_COMM_NAME_MAKE()
- POSSIBLE IMPLEMENTATION OF MPI_COMM_NAME_MAKE_START()
- POSSIBLE IMPLEMENTATION OF MPI_COMM_NAME_MAKE_FINISH()

VOTE: Approve the name service material in section 3.8 (as amended)?
----
Yes: 8   No: 1   Abstain: 10

-------------------------------------------------------------------------------
Report From the Process Topology Subcommittee
-------------------------------------------------------------------------------

Process Topology - Second Reading
---------------------------------

Steve Huss presented.
{He put on a pair of Birkenstocks to emphasize that he was acting as a
virtual Rolf Hempel.}

A couple of proposals that were made verbally at the last meeting were not
written up and so have not been included.

Introduction (6.1)
------------------

VOTE: Approve section 6.1 (Introduction)
----
Yes: 13   No: 0   Abstain: 5

Virtual Topologies (6.2)
------------------------

Terminology - change the name to "process topologies" (Eric Van de Velde) or
"application topologies" (David Walker).

VOTE: Approve section 6.2 (Virtual Topologies)?
----
Yes: 10   No: 0   Abstain: 6

Embedding in MPI (6.3)
----------------------

There were various editorial remarks which Steve recorded to relay to Rolf
Hempel.

VOTE: Approve section 6.3 (Embedding in MPI)?
----
Yes: 9   No: 0   Abstain: 8

Overview of the proposed MPI functions (6.4)
--------------------------------------------

The initial part of 6.4 (before 6.4.1) will go away.

VOTE: Specify that we use row major order?
----
Yes: 4   No: 0   Abstain: 13

Steve Huss pointed out that the extent to which these functions are global
functions is not specified.  Lyndon Clarke offered the amendment that they
be specified as collective.

There was a discussion of the group parameters in these functions.  Steve
agreed to propose to Rolf that they be systematically replaced by
communicators.

VOTE: Replace all group parameters throughout by communicators?
----
Yes: 8   No: 1   Abstain: 7

VOTE: MPI_MAP_CART and MPI_MAP_GRAPH shall be global routines?
----
Yes: 4   No: 2   Abstain: 12

The large number of abstentions in the recent votes led to a discussion of
the value of topology and also to a discussion of our procedures.  There was
no strong interest in discussing whether to include topology.  Neither was
there any strong interest in changing procedures.

VOTE: Postpone the second reading of the topology chapter until the next
----  meeting?
Yes: 3   No: 12   Abstain: 2

VOTE: Approve chapter 6 as amended?
----
Yes: 12   No: 3   Abstain: 2

-------------------------------------------------------------------------------
Report from the Subset Subcommittee
-------------------------------------------------------------------------------

Steve Huss presided.  This was NOT a second reading.

Jim Cownie argued that the profiling material should be included in the
subset because it provides important facilities and the cost of providing it
in an initial implementation is not large.  Several people agreed with this,
so there was a quick vote.

STRAW VOTE: Include profiling in the subset?
----------
Yes: 24   No: 0   Abstain: 1

A discussion of the parts of environmental management and inquiry to be
included led to an agreement that this should be deferred until the
presentation on that material.

There was nothing to be said about language binding - there will be F77 and
C bindings.

STRAW VOTE: Exclude topology from the subset?
----------
Yes: 21   No: 2   Abstain: 3

It was agreed that the list for collective communication in the document is
OK.

In considering the point-to-point functions, the list in the document
includes MPI_SENDRECV but excludes MPI_EXCHANGE.  It was generally agreed
that this is sensible.

Steve Otto argued for including hvec-type functions in the subset because of
common usage.  In considering this and the issue of derived datatypes,
several possibilities were considered.  The one that got general support is
to include all of the material in section 4.13.

It was noted that data conversion is not a subset issue - heterogeneous
systems have to have it; homogeneous systems do not need it.
--- break for dinner at 6:05 ---

Subcommittee meetings - starting about 8:30
  Subset     - immediately after dinner
  User Guide - after subset meeting
  Context    - after subset meeting

=============================================================================

Friday, August 13, 1993
------ --------- ----

-------------------------------------------------------------------------------
Report from the Environmental Management Subcommittee
-------------------------------------------------------------------------------

Rusty Lusk presided.

Rusty began by handing out a new version:

  Environmental Management and Inquiry
    1  Initialization
    2  Environmental query
    3  Others

Initialization and Termination

Current draft:
  MPI_INIT()      "idempotent"
  MPI_FINALIZE()  "last MPI call"

Discussion:
  How does a library know whether to call MPI_FINALIZE?
  Is MPI_FINALIZE optional?
  MPI_INIT requires some state not attached to any object; why not a
  communicator?

A proposed amendment:
  MPI_INIT(old_comm, new_comm)          {if old_comm is null, gets the first
                                         communicator}
  MPI_FINALIZE(current_comm, old_comm)
  Nests MPI invocations
  Attaches MPI state to a communicator

STRAW VOTE: We shall provide a mechanism that allows a library written using
----------  MPI to be called from either within or without MPI?
Yes: 11   No: 8   Abstain: 3

Steve Huss asked what happens if two libraries are invoked using different
numbers of processors - then what is the ALL group?

Jim Cownie offered a very simple proposal - all processes must call MPI_INIT
at the start and all processes must call MPI_FINALIZE at the end.  Note that
the picture is that by a vendor-provided miracle the MPI system is started,
and only after this can MPI_INIT be called.  This is likely to be a global
barrier.

STRAW VOTE: MPI_INIT and MPI_FINALIZE must be called exactly once in each
----------  process.
Yes: 18   No: 0   Abstain: 5

Organization count: 20

Lots of continuing discussion.  It was agreed that, in the context of the
straw vote, any program that violates the requirement is erroneous.

VOTE: Have an MPI_INITIALIZED flag?
----
Yes: 16   No: 1   Abstain: 2

VOTE: MPI_INIT and MPI_FINALIZE must be called exactly once in each process.
----  MPI_INIT is a global operation.  It must be called before any other
      MPI routine.  MPI_FINALIZE is the last MPI call.
Yes: 16   No: 1   Abstain: 3

Rusty offered a proposal for MPI_ABORT(error_code), which terminates every
process in the ALL group.

VOTE: Have MPI_ABORT?
----
Yes: 13   No: 3   Abstain: 2

MPI-Specific (1.1)  [Section numbers from chapter handed out at meeting]
------------------

Why are there communicator arguments in MPI_NumCommunicator?  Rusty did not
know.  No one had a convincing argument.

VOTE: Remove communicator arguments from MPI_ValidTags and
----  MPI_NumCommunicator?
Yes: 14   No: 0   Abstain: 5

There was a fair amount of confusion about the intent and value of the
MPI_BufferParams routine.  In particular, various alternative proposals were
mentioned.  Rik Littlefield has proposed that the user be able to specify
buffering capability.

STRAW VOTE: Should there be some way of asking the system about buffering?
----------
Yes: 15   No: 2   Abstain: 7

STRAW VOTE: Should there be some way of telling the system about buffer
----------  requirements?
Yes: 6   No: 3   Abstain: 15

Rik will provide a proposal at the next meeting.

VOTE: Remove MPI_IOmode?
----
Yes: 18   No: 1   Abstain: 2

Discussion of MPI_Errormode?
There was again uncertainty about the communicator argument, leading to:

VOTE: Remove the communicator argument from MPI_Errormode?
----
Yes: 7   No: 4   Abstain: 8

After further discussion it was realized that while something of this sort
is desirable, there is much more detail (e.g. how error handlers are
established) that is essential before accepting this function.

STRAW VOTE: Should there be a facility to set and query the error mode?
----------
Yes: 18   No: 0   Abstain: 0

There was quick agreement that MPI_Has_Nonblocking and MPI_Has_Heterogeneous
are not useful.

VOTE: Remove MPI_Has_Nonblocking and MPI_Has_Heterogeneous?
----
Yes: 9   No: 0   Abstain: 4

STRAW VOTE: Have functions to inspect the receive queue and other
----------  interesting internal structures?
Yes: 12   No: 3   Abstain: 2

Anne Elster again asked for MPI_LOAD_INFO.  She was proposing a modified
version that had less time-specific information, but no written proposal was
available at the meeting.  A concrete proposal will be seen at the next
meeting.

VOTE: Remove sections 1.2 (Parallel programming) and 1.3 (non-MPI) except
----  keep for MPI_Ge
Yes: 8   No: 3   Abstain: 2

VOTE: Accept the MPI Environmental Management chapter as amended.
----
Yes: 8   No: 2   Abstain: 3

------------------------------------------------------------------------------
MPI Sound Bites
Jim Cownie

  Where's David?
  Oh No!
  Don't think about that one too much
  What's the question again?
  Those in favor of going to the bar?
  Shall we accept the chapter as eviscerated?
------------------------------------------------------------------------------

-------------------------------------------------------------------------------
Report from the Language Binding Subcommittee
-------------------------------------------------------------------------------

Rusty Lusk presided.

Language Binding

7.1-7.4 will go into another chapter (MPI Terms and Conventions).
7.5 will go into an Appendix.

We need to:
  1. Update and read 7.1-7.4.
  2. Decide on principles for binding presentation.
  3. Decide on the format of definitions in the chapters.
  4. Decide on the format and order of definitions in the appendix.
  5. Choose a procedure for agreeing on names of C functions, Fortran
     subroutines, named constants, and types.
  6. Enforce consistency.

1. Modifications to 7.1-7.4 (see draft)

2. Presenting the bindings
   a. Named constants (MPI_SUCCESS)
   b. Named types (MPI_COMMUNICATOR)
   c. Functions and arguments
      i.  C - use ANSI C style
      ii. Fortran - use prototypes and declarations
   d. Consistency of formal argument names
   e. IN arguments before OUT arguments
   f. Return code is the last argument in FORTRAN
   g. Others?

3. Is the current 7.5 OK modulo name updating?

Jim Cownie argued that the principle that "all C functions should return an
error code" should be relaxed for those functions that would best be
implemented using macros.

VOTE: Accept the chapter and annex with the modifications outlined by Rusty.
----
Yes: 12   No: 0   Abstain: 1

Format

In the chapters:
  Current Format     <- match C binding in order and number of arguments
  + Fortran Binding  <- match appendix
  + C Binding        <- match appendix

In the appendix:
  Sort by appearance order?
  Alphabetically within chapter?   <-- this one chosen
  Alphabetically?
  Keep the appendix after using it for a consistency check?  (Note the
  alphabetical index.)

As a technical issue, Steve Otto would like to have the bindings appear in
the chapter source, but only appear printed in the appendices.

There was general agreement that in the appendix the functions should appear
alphabetically within each chapter.

------------------------------------------------------------------------------
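A minimal illustration of the presentation conventions above, applied to a
hypothetical routine.  MPI_EXAMPLE_RANK is not a proposed MPI function; the
constant MPI_SUCCESS and the type name MPI_COMMUNICATOR are taken from items
2a and 2b:

    /* C binding: ANSI C prototype, IN arguments before OUT arguments,
       error code (e.g. MPI_SUCCESS) returned as the function value.      */
    int MPI_Example_rank(MPI_COMMUNICATOR comm, int *rank);

    /* Fortran binding: the same arguments in the same order, with the
       return code added as the last argument.

           MPI_EXAMPLE_RANK(COMM, RANK, IERROR)
           INTEGER COMM, RANK, IERROR                                     */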