This section provides some ramblings on commonly encountered problems. Note that it has not yet been updated for the alpha test release of the MPIBLACS.

General problems:

  1. Undefined BLACS symbols during link.

PVM-specific problems:

  1. Spawned processes do not check in.
  2. Strange behavior while using blacs_setup.dat.
  3. Got message like "pvm error #XXX".
  4. Code hangs when run on RS6000.
  5. Code hangs when run on multiple machines.

Undefined BLACS symbols during link

The BLACS routines are roughly divided into two categories: the top-level routines (i.e., the ones callable by the user), and the routines that these top-level routines call. The top-level routines provide the interface for the library and are referred to as interface routines; they are the ones documented in the manual or quick reference guide. Non-interface routines are referred to as internal routines. Internal routines have names that vary from system to system, and many end in the suffix 00. Examples include Smpath_bs, Asend00, and itrpack00. If you have a library produced by UT, you can discover the names of the interface routines by doing an ls in your ...BLACS/SRC/<ARCH>/ directory. Internal routines are in ...BLACS/SRC/<ARCH>/INTERNAL/.

If all of the missing symbols are interface routines, then you probably have an interface problem. If they are all internal routines, then you probably have an internal problem. If the missing symbols include both internal and interface routines, you are probably pointing at an invalid library.
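The rules of thumb above can be written down as a small classifier. This is only a sketch: the function name and the result strings are made up for illustration, and the two name lists must be supplied by you (for example from the directory listings mentioned earlier), since the actual names vary from system to system.

```python
def diagnose_missing_symbols(missing, interface_names, internal_names):
    """Classify a set of undefined symbols using the rules of thumb above.

    All three arguments are collections of symbol names. The helper and its
    return strings are hypothetical; they are not part of the BLACS.
    """
    missing = set(missing)
    if not missing:
        return "nothing missing"

    missing_interface = missing & set(interface_names)
    missing_internal = missing & set(internal_names)

    # Both kinds missing: the linker is probably looking at the wrong library.
    if missing_interface and missing_internal:
        return "invalid library"
    # Only interface routines missing.
    if missing_interface == missing:
        return "interface problem"
    # Only internal routines missing.
    if missing_internal == missing:
        return "internal problem"
    # Some symbols are in neither list, so the rules do not apply cleanly.
    return "unrecognized symbols"


# Illustrative name lists only; real lists depend on your system.
interface = {"blacs_gridinit", "dgesd2d"}
internal = {"Asend00", "itrpack00"}
print(diagnose_missing_symbols(["Asend00", "itrpack00"], interface, internal))
```

Feeding it the undefined symbols reported by your linker gives a first guess at which of the cases described below to read.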

Invalid library problems are usually fairly straightforward. You may be pointing at a BLACS library that does not exist. If you are maintaining BLACS for several different systems, you may be pointing at the wrong system's libraries. For instance, you may want the HP version of the PVM BLACS, but you are pointing at the SUN4 version.

A couple of things commonly cause interface problems. The most likely is that a particular interface was not installed. The BLACS may be installed for Fortran and/or C, and the installer may have chosen to install only one.

If the missing symbols all begin with C, then you need to build the C interface library. This can be done from the top-level makefile by typing make intface=Clib. If it is instead the Fortran interface that has not been installed, it can be built from the top-level makefile by typing make intface=F77lib.

Another possibility is that the BLACS library was compiled with an incorrect Bmake.inc file. The BLACS are written in C, but made to be callable from Fortran77. The method for calling a C routine from Fortran is system specific. Therefore, Bmake.inc allows the user to vary the BLACS naming scheme, so that the routines may be called from Fortran.

Bmake.inc contains a macro called INTFACE, which performs this function. Some systems (e.g. SUN4, CM-5, and Intel) require an underscore to be postfixed to a C routine name for it to be callable from Fortran. This type of interface should be indicated by defining INTFACE = -DAdd_.

Others (e.g. HP, RS6000) let Fortran and C share the same name space, so that no change is required to call a C routine from Fortran. This type of interface should be indicated by defining INTFACE = -DNoChange.

Finally, some systems (primarily CRAY) require that the C routine name be in upper case for it to be callable from Fortran. This type of interface should be indicated by defining INTFACE = -DUpCase.
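To make the three naming schemes concrete, the sketch below maps a base routine name to the symbol a Fortran caller would look for under each INTFACE setting. It only illustrates the convention described above; the function name is hypothetical, and this is not how Bmake.inc or the BLACS sources implement the renaming.

```python
def fortran_visible_name(base_name, intface):
    """Return the symbol name implied by an INTFACE setting.

    base_name is the routine name without any decoration, and intface is
    one of "Add_", "NoChange", or "UpCase" (the values used with -D above).
    This helper is hypothetical and exists only to illustrate the schemes.
    """
    if intface == "Add_":
        # e.g. SUN4, CM-5, Intel: an underscore is appended
        return base_name + "_"
    if intface == "NoChange":
        # e.g. HP, RS6000: Fortran and C share the same name space
        return base_name
    if intface == "UpCase":
        # primarily CRAY: the name must be in upper case
        return base_name.upper()
    raise ValueError("unknown INTFACE setting: " + repr(intface))


for setting in ("Add_", "NoChange", "UpCase"):
    print(setting, fortran_visible_name("blacs_gridinit", setting))
```

If the undefined symbols reported at link time follow one scheme (say, trailing underscores) while the library was built with another, the INTFACE setting in Bmake.inc is the first thing to check.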

Having only the internal routines missing happens most often to PVM users, when a previous platform's internals have been compiled into the current library (e.g., the internals for HP are compiled into a SUN4 library). If this is the case, remove the library, do a make clean, and rebuild.


Spawned processes do not check in

This usually indicates that there is insufficient memory to spawn the required number of processes. The pvm_spawn succeeds, but when the system attempts to allocate the process's memory, it fails, and thus the process never truly gets started. The usual fix for this is to add more hosts, or to reduce the size of your executable.


Strange behavior while using blacs_setup.dat

Old blacs_setup.dat files often remain around after their use, and when you run your next program, the old file is accessed, causing the wrong executable to be spawned. The BLACS do not require you to use blacs_setup.dat, and it is recommended that you do not. If you require the extra power blacs_setup.dat gives (e.g., you need to spawn with debug), then of course it should be used.
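One way to catch a leftover file before it causes trouble is to check for it before launching a run. The sketch below is an assumption-laden example: the helper name is made up, and it assumes the file is looked up in the directory you pass in (the current directory by default), which may not match where your installation actually reads it from.

```python
import os
import time


def find_stale_setup_file(directory=".", name="blacs_setup.dat"):
    """Return (path, age_in_seconds) if a setup file exists, else None.

    Hypothetical helper: the directory to check is an assumption, so point
    it at wherever your runs actually pick up blacs_setup.dat.
    """
    path = os.path.join(directory, name)
    if not os.path.isfile(path):
        return None
    # Age is measured from the file's last modification time.
    age = time.time() - os.path.getmtime(path)
    return path, age


found = find_stale_setup_file()
if found is not None:
    print("warning: existing setup file %s (%.0f seconds old)" % found)
```

An old file that you did not intend to use can then be removed or renamed before the next program is started.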

Got message like "pvm error #XXX"

The BLACS may encounter a PVM error that they are not designed to handle. In that case, they simply abort after reporting the error. The number printed is a PVM error number. The meanings of these error numbers can be found in the PVM manual or quick reference guide.

Code hangs when run on RS6000

This should only happen when you use an old PVM version and perform some very strenuous communication, such as running the BLACS tester. The recommended fix is to download the newest version of PVM.

Code hangs when run on multiple machines

Usually this is caused by PVM dying because there were too many messages, and the network was too busy to service them. Eventually, one of the pvmd3's will go down, and the network will get confused. Examine your pvml.<user id> file on each system for clues. Usually, you'll get something like "lost track of master, you're screewwwweeed".
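When several machines are involved, scanning each log by hand gets tedious. The following sketch pulls out lines that contain a given phrase. The function name is hypothetical, the default keyword is taken from the message quoted above, and the path to each pvml.<user id> file must be supplied by you.

```python
def find_pvmd_clues(log_path, keywords=("lost track",)):
    """Return (line_number, line) pairs from a log that mention a keyword.

    Hypothetical helper: the keyword list is only a starting point, and
    matching is case-insensitive substring search, nothing smarter.
    """
    hits = []
    with open(log_path, errors="replace") as log:
        for number, line in enumerate(log, start=1):
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling it once per machine, with the path of that machine's log file, and printing the returned pairs gives a quick view of where each pvmd3 started to fail.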