Next: Acknowledgments and Other Handshaking
Up: The Distributed Parallel Multipole
Previous: The Distributed Parallel Multipole
Below is a description of the structure of the parallel implementation
for the PMTA algorithm for a distributed memory multi-processor. The
parallel algorithm is based upon previous work by
Leathrum[9], Board[2] and
Elliott[5].
Further descriptions of parallel implementations of the FMA are found in
[7], [10], [12] and [13].
- Step 0: Initialization of Slave Processes
- The master process spawns all of the required slave processes.
The slaves then computes their interaction and inverse interaction
lists.
Each slave then process precomputes constant arrays used
for computing the multipole expansions, and allocates memory to hold
the particle and cell data structures.
- Step 1: Particle Distribution
- The master process sends the particle position data to each of the
slave processes. Each slave receives data for only the particles in
the cells owned by that slave.
- Step 2: Particle Redistribution
- Each slave process sends particle data to the other slave processes
that require the data for their direct calculations. The
contents of the messages are determined from the inverse interaction
lists sent to each slave.
- Step 3: Calculate the Multipole Expansions
- For each cell at the lowest tree level, the slave process
computes the multipole expansion at the center of that box due to all
particles within that box. No communication or synchronization is
necessary as each process relies only on data already assigned to
that specific slave.
- Step 4: Upward Pass
- Starting at the lowest tree level, the multipole expansion for each
cell owned by the slave process is shifted to the center of the parent
cell and the results accumulated in the multipole expansion for that
parent cell. Then, if the parent cell is not owned by the
current slave process, it sends this multipole expansion
to the slave process that owns the parent cell.
- Step 5: Particle Reception
- Each slave process receives and stores the particle information sent
out in Step 2.
- Step 6: Multipole Distribution
- Each slave process sends the multipole expansion data computed in
Steps 2 and 3 to the other slave processes
that require the data for their multipole calculations. The
contents of the messages are determined from the inverse interaction
lists.
- Step 7: Direct computation
- Each particle must interact directly with the particles in all nearby
cells as specified by its interaction list.
For each cell that a slave owns, it computes the particle to particle
interactions with the other cells that it owns, updating the force and
potential vectors for the particles in both cells.
For particle to particle interaction between cells not owned by the same
slave process only the force and potential values for the local cell
(the cell owned by the processor) are updated.
- Step 8: Multipole Reception
- Each slave process receives and stores the multipole expansion information
sent out in Step 7.
- Step 9: Downward Pass
- Each cell takes local expansion information from its parent cell,
and computes its own local expansion based upon the parent's local
expansion and the multipole expansion of each cell in its interaction list.
If a child cell of the current cell is owned by a different slave
process, then the local expansion information is passed to the
appropriate slave process.
- Step 10: Local Expansion Evaluation
- Each slave process evaluates its local expansion for each particle in
the cells at the lowest level of the tree. These forces are added to the
forces calculated using the direct method to give a total force for each
particle due to all other particles in the simulation region.
- Step 11: Collect Results
- Each slave process sends the force information for its particles back
to the master process for collection. The master process receives this
information, which concludes the processing cycle.
Key to performance of the parallel algorithm is the interleaving of
message passing and computation. By performing as much computation as
possible between the time messages are sent and received, the
effect of message passing latency can be well hidden.
In addition, the use of the inverse interaction list mechanism allows
the slave processes to have a priori knowledge of which data will be
needed by other processes and send that specific data to the other
slaves without having to be prompted. This greatly reduces the
overhead of sharing data in a distributed multiprocessor environment.
Note to Author: More detail needs to be provided on the integration of
the algorithms to compute the Macroscopic Assemblies and Virial
Tensor as well as perform Load Balancing.
Next: Acknowledgments and Other Handshaking
Up: The Distributed Parallel Multipole
Previous: The Distributed Parallel Multipole
Bill Rankin
2002-04-04