As a simple example of data decomposition, consider the addition of
two vectors, A[1..N] and B[1..N], to produce the result vector,
C[1..N]. If we assume that P processes are working on this problem, data partitioning
involves allocating N/P elements of each vector to each process,
which then computes the corresponding N/P elements of the result vector.
This data partitioning may be done either ``statically,''
where each process knows *a priori* (at least in terms of
the variables N and P) its share of the workload,
or ``dynamically,'' where a control process (e.g., the master process)
allocates subunits of the workload to processes as and when they
become free. The principal difference between these two approaches
lies in ``scheduling.''
With static scheduling, individual process workloads are fixed;
with dynamic scheduling, they vary as the computation progresses. In
most multiprocessor environments, static scheduling is effective for
problems such as the vector addition example; however, in the
general PVM environment, static scheduling is not necessarily beneficial.
The reason is
that PVM environments based on networked clusters are susceptible to
external influences; therefore, a statically scheduled, data-partitioned
problem might encounter one or more processes that complete their portion
of the workload much faster or much slower than the others. This
situation could also arise when the machines in a PVM system are
heterogeneous, with varying CPU speeds, memory capacities, and other
system attributes.
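
The static scheme described above can be sketched as follows. This is an illustrative simulation, not actual PVM code: the helper names (`partition`, `add_vectors`) are our own, and the assumption is simply that each process can compute its index range from N, P, and its own rank alone.

```python
def partition(n, p, rank):
    """Return the half-open index range [lo, hi) owned by process `rank`
    out of `p` processes, splitting n elements as evenly as possible
    (the first n % p processes receive one extra element)."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def add_vectors(a, b, p):
    """Simulate P statically scheduled processes, each computing its
    roughly N/P share of the result vector C = A + B."""
    n = len(a)
    c = [0] * n
    for rank in range(p):
        lo, hi = partition(n, p, rank)
        for i in range(lo, hi):      # this inner loop is one process's workload
            c[i] = a[i] + b[i]
    return c
```

Because `partition` depends only on N, P, and the rank, no communication is needed to decide who computes what; that is precisely what makes the schedule static.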

In a real execution of even this trivial vector addition problem, an issue that cannot be ignored is input and output. In other words, how do the processes described above receive their workloads, and what do they do with the result vectors? The answer to these questions depends on the application and the circumstances of a particular run, but in general:

1. Individual processes generate their own data internally,
for example, using random numbers or statically known values. This
is possible only in very special situations or for program testing purposes.
2. Individual processes independently input their data subsets
from external devices. This method is meaningful in many cases, but
is possible only when parallel I/O facilities are supported.
3. A controlling process sends individual data subsets to each process.
This is the most common scenario, especially when parallel I/O facilities
do not exist. It is also appropriate when input data
subsets are derived from a previous computation within the same application.
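
The third method can be sketched as below. This is a hedged, serial stand-in for what would really happen with message passing: the `worker` function plays the role of a slave process, and the "send" and "gather" steps are plain function calls and list operations rather than actual PVM communication.

```python
def scatter_compute_gather(a, b, p):
    """Sketch of method 3: a controlling process splits the input
    vectors, 'sends' each subset to a worker (here, a function call),
    and gathers the partial results into the full result vector C."""
    n = len(a)
    chunk = (n + p - 1) // p            # ceil(n / p) elements per worker

    def worker(sub_a, sub_b):           # stand-in for a slave process
        return [x + y for x, y in zip(sub_a, sub_b)]

    c = []
    for rank in range(p):
        lo, hi = rank * chunk, min((rank + 1) * chunk, n)
        c.extend(worker(a[lo:hi], b[lo:hi]))
    return c
```

In a dynamically scheduled variant, the controlling process would hold the remaining chunks in a queue and hand one out each time a worker reports back, so faster machines simply return for more work sooner.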

The third method of allocating individual workloads is also consistent with dynamic scheduling in applications where interprocess interactions during computations are rare or nonexistent. However, nontrivial algorithms generally require intermediate exchanges of data values, and therefore only the initial assignment of data partitions can be accomplished by these schemes.

For example, consider the data partitioning method depicted in Figure 4.2. In order to multiply two matrices A and B, a group of processes is first spawned, using the master-slave or node-only paradigm. This set of processes is considered to form a mesh; the matrices to be multiplied are divided into subblocks, which also form a mesh. Each subblock of A and B is placed on the corresponding process by utilizing one of the data decomposition and workload allocation strategies listed above. During computation, subblocks need to be forwarded or exchanged between processes, thereby transforming the original allocation map, as shown in the figure. At the end of the computation, however, result matrix subblocks are situated on the individual processes in conformance with their respective positions on the process grid, consistent with a data-partitioned map of the resulting matrix C.

The foregoing discussion illustrates the basics of data decomposition. In a later chapter, example programs highlighting details of this approach will be presented.
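
The subblock decomposition can be sketched serially as follows. This is an illustrative model, with our own helper names (`split_blocks`, `block_matmul`): each pair of loop indices (i, j) stands for one process on a q-by-q mesh, and the loop over k corresponds to the step at which, in a real parallel run, subblocks of A and B would be forwarded between processes.

```python
def split_blocks(m, q):
    """Split an n-by-n matrix (n divisible by q) into a q-by-q mesh of
    subblocks, mirroring the placement of blocks on a process mesh."""
    n = len(m)
    s = n // q
    return [[[row[j*s:(j+1)*s] for row in m[i*s:(i+1)*s]]
             for j in range(q)] for i in range(q)]

def block_matmul(a, b, q):
    """Compute C = A * B block by block: the 'process' at mesh position
    (i, j) accumulates A[i][k] * B[k][j] over k.  In a parallel run,
    the k loop is where subblocks are exchanged between processes."""
    n = len(a)
    s = n // q
    ab, bb = split_blocks(a, q), split_blocks(b, q)
    c = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):              # subblock exchange step
                for r in range(s):
                    for t in range(s):
                        c[i*s + r][j*s + t] += sum(
                            ab[i][k][r][u] * bb[k][j][u][t]
                            for u in range(s))
    return c
```

Note that the result blocks of C end up indexed by the same (i, j) mesh positions that hold the corresponding blocks of A and B, matching the data-partitioned map of C described above.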