T3E MPI is derived from the T3D implementation developed at the Edinburgh Parallel Computing Centre, which was in turn derived from the Chimp implementation. Although the T3D version reportedly suffered from performance and robustness problems, these appear to have been fixed in the T3E implementation.
T3E MPI is robust and well integrated with its environment. The operating system understands parallel jobs as distinct entities and manages them directly, rather than through layered software. Many standard tools (e.g. ps, accounting) understand parallel applications. T3E MPI is generally quite easy to use.
In terms of performance, an interesting feature is that MPI can take advantage of special hardware on the T3E for sending strided arrays; this is discussed in more detail in Section 7.4. On the other hand, the T3E copies non-aligned data slowly, so make sure that message buffers are 8-byte aligned. This is automatic in the usual case of sending double-precision data.
There are a few minor but longstanding problems. For instance, the tools for showing which parallel applications are running are primitive, and there is no flexibility in how standard I/O is handled.
As for tools, the Totalview debugger is available for debugging parallel programs. This is not the debugger from Dolphin Interconnect Solutions, but a Cray product with a common ancestor. Cray Totalview has fallen behind its counterpart in ease of use and functionality (in particular, it cannot display message queues), but it is still useful. There are no tools to extract or view message traces.