One of the most interesting features of MPI is the ability for applications to define their own MPI datatypes. These datatypes can describe almost any C or Fortran data object; the main exception is a C structure containing pointers, since there is no easy way to automatically ``follow'' a pointer.
User-defined datatypes are part of MPI for two basic reasons: they allow automatic data conversion in heterogeneous environments, and they permit certain performance optimizations. The PVM Pack/Unpack approach also provides automatic conversion in heterogeneous environments; the MPI approach additionally allows optimizations such as using special hardware or a coprocessor to perform scatter/gather, or pipelining scatter/gather with message transfer.
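To make the idea concrete, here is a minimal sketch of a user-defined datatype describing a pointer-free C structure. The Particle layout and the helper function are invented for illustration, and the call shown is MPI_Type_create_struct (older MPI-1 codes would use MPI_Type_struct):
\begin{verbatim}
#include <mpi.h>
#include <stddef.h>

typedef struct {
    double coords[3];
    int    tag;
} Particle;

/* Build and commit an MPI datatype matching the layout of Particle. */
MPI_Datatype make_particle_type(void)
{
    MPI_Datatype ptype;
    int          blocklens[2] = { 3, 1 };
    MPI_Aint     displs[2]    = { offsetof(Particle, coords),
                                  offsetof(Particle, tag) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Type_create_struct(2, blocklens, displs, types, &ptype);
    MPI_Type_commit(&ptype);
    return ptype;  /* usable as the datatype argument of MPI_Send etc. */
}
\end{verbatim}
(When sending arrays of such structures one would normally also fix the extent with MPI_Type_create_resized to account for trailing padding.)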
Unfortunately, while this is a nice idea in principle, most implementations do not perform these optimizations, and using certain MPI datatypes can dramatically slow down communication. The two important cases are:
This is not the ``MPI way'' of doing things and may not always be the most efficient. For the foreseeable future, compiled user code will usually be able to pack data faster than an MPI library (though not faster than specialized hardware). The wildcard is multithreaded MPI implementations, in which a thread running on another processor can overlap packing with computation.
One exception to the above rule is sending strided arrays on the T3E, where specialized hardware can send strided arrays faster than an MPI program can pack them. However, the hardware kicks in only if you use MPI_TYPE_VECTOR, not a contiguous array of building blocks each containing data followed by ``holes.''
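As a sketch of the strided case, the following sends one column of a row-major matrix using MPI_Type_vector; the matrix dimensions, function name, and tag are placeholders:
\begin{verbatim}
#include <mpi.h>

#define NROWS 100
#define NCOLS 100

/* Send column `col' of a row-major NROWS x NCOLS matrix to `dest'. */
void send_column(double a[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    /* NROWS blocks of 1 double each, NCOLS doubles apart. */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&a[0][col], 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}
\end{verbatim}
Because MPI matches messages by basic-type signature, the receiver may simply receive NROWS contiguous MPI_DOUBLEs.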
I therefore reluctantly suggest that MPI applications implement two different methods for sending non-contiguous data: one should use the ``MPI way'' with non-contiguous datatypes, the other should pack the data into a user-managed buffer, and the choice should be made at compile time or at run time. This is not an elegant solution, but it is the best available at this time.
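A rough outline of the two methods, reusing the column example above and selecting between them with a hypothetical USE_MPI_DATATYPES compile-time macro (a run-time flag would serve equally well):
\begin{verbatim}
#include <mpi.h>
#include <stdlib.h>

#define NROWS 100
#define NCOLS 100

void send_column2(double a[NROWS][NCOLS], int col, int dest, MPI_Comm comm)
{
#ifdef USE_MPI_DATATYPES
    /* The "MPI way": describe the column with a derived datatype. */
    MPI_Datatype column;
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&a[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
#else
    /* Pack into a user-managed buffer and send it contiguously. */
    double *buf = malloc(NROWS * sizeof(double));
    int     i;
    for (i = 0; i < NROWS; i++)
        buf[i] = a[i][col];
    MPI_Send(buf, NROWS, MPI_DOUBLE, dest, 0, comm);
    free(buf);
#endif
}
\end{verbatim}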