One of the questions most often asked about this profiling system is why it comprises so many separate tools, rather than a single all-encompassing system that tells you everything you wish to know about the application.
Our fundamental reason for choosing this approach was to minimize the ``self-profiling'' problem that arises in many systems, in which the analysis ends up spending much of its time measuring the profiling system itself. Users of the UNIX profiling tools, for example, have become adept at ignoring entries for routines such as mcount, which correspond to time spent within the profiling system itself.
Unfortunately, this is not so simple in a parallel program. In a sequential application, the effect of the profiling system is merely to slow down other operations, an effect which can be compensated for simply by subtracting the known overheads of the profiling operations. On a parallel computer, things are much more complicated: slowing down one processor may affect another, which in turn affects another, and so on, until the behavior of the whole system is distorted by the profiling tools.
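The sequential-case correction mentioned above can be sketched as follows. This is an illustrative sketch only, not part of the tools described here: it calibrates the average cost of one profiling probe by timing many empty probes, then subtracts the accumulated probe cost from a measured interval.

```python
import time

def calibrate_overhead(probe, trials=1000):
    """Estimate the average cost of one profiling probe (hypothetical
    helper) by invoking it many times and averaging the elapsed time."""
    start = time.perf_counter()
    for _ in range(trials):
        probe()
    return (time.perf_counter() - start) / trials

def corrected_time(measured_seconds, probe_count, per_probe_overhead):
    """Remove the known instrumentation cost from a measured interval:
    each of the probe_count probes added per_probe_overhead seconds."""
    return measured_seconds - probe_count * per_probe_overhead
```

As the text notes, this correction is only valid for a single sequential stream of execution; on a parallel machine the delays interact across processors and cannot be subtracted out independently.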
Our approach to this problem is to package the profiling subsystems into subsets with more or less predictable effects, and then to let the user decide which subsystems to use in which cases. For example, the communication profiler, ctool, incurs very small overheads (typically a fraction of 1%), while the event profiler costs more, and the CPU usage profiler, xtool, most of all. In common use, therefore, we tend to run the communication profiler first, and then enable the event traces. If these two trials yield consistent results, we move on to the execution and distribution profilers. We have yet to encounter an application in which this approach has failed, although the fact that we are rarely interested in microsecond accuracy helps in this regard.
Interestingly, we have found problems due to ``clock skew'' to have negligible impact on our results. Clock skew does occur in most parallel systems, but we find that our profiling results are accurate enough to be helpful without taking any special precautions in this regard. Again, this is largely because, for the kinds of performance analysis and optimization in which we are interested, a resolution of tens or even hundreds of microseconds is usually quite acceptable.