next up previous
Next: Acknowledgments and Other Handshaking Up: The Distributed Parallel Multipole Previous: The Distributed Parallel Multipole

Implementation

Below is a description of the structure of the parallel implementation for the PMTA algorithm for a distributed memory multi-processor. The parallel algorithm is based upon previous work by Leathrum[9], Board[2] and Elliott[5]. Further descriptions of parallel implementations of the FMA are found in [7], [10], [12] and [13].

Step 0: Initialization of Slave Processes
The master process spawns all of the required slave processes. The slaves then computes their interaction and inverse interaction lists. Each slave then process precomputes constant arrays used for computing the multipole expansions, and allocates memory to hold the particle and cell data structures.

Step 1: Particle Distribution
The master process sends the particle position data to each of the slave processes. Each slave receives data for only the particles in the cells owned by that slave.

Step 2: Particle Redistribution
Each slave process sends particle data to the other slave processes that require the data for their direct calculations. The contents of the messages are determined from the inverse interaction lists sent to each slave.

Step 3: Calculate the Multipole Expansions
For each cell at the lowest tree level, the slave process computes the multipole expansion at the center of that box due to all particles within that box. No communication or synchronization is necessary as each process relies only on data already assigned to that specific slave.

Step 4: Upward Pass
Starting at the lowest tree level, the multipole expansion for each cell owned by the slave process is shifted to the center of the parent cell and the results accumulated in the multipole expansion for that parent cell. Then, if the parent cell is not owned by the current slave process, it sends this multipole expansion to the slave process that owns the parent cell.

Step 5: Particle Reception
Each slave process receives and stores the particle information sent out in Step 2.

Step 6: Multipole Distribution
Each slave process sends the multipole expansion data computed in Steps 2 and 3 to the other slave processes that require the data for their multipole calculations. The contents of the messages are determined from the inverse interaction lists.

Step 7: Direct computation
Each particle must interact directly with the particles in all nearby cells as specified by its interaction list. For each cell that a slave owns, it computes the particle to particle interactions with the other cells that it owns, updating the force and potential vectors for the particles in both cells. For particle to particle interaction between cells not owned by the same slave process only the force and potential values for the local cell (the cell owned by the processor) are updated.

Step 8: Multipole Reception
Each slave process receives and stores the multipole expansion information sent out in Step 7.

Step 9: Downward Pass
Each cell takes local expansion information from its parent cell, and computes its own local expansion based upon the parent's local expansion and the multipole expansion of each cell in its interaction list. If a child cell of the current cell is owned by a different slave process, then the local expansion information is passed to the appropriate slave process.

Step 10: Local Expansion Evaluation
Each slave process evaluates its local expansion for each particle in the cells at the lowest level of the tree. These forces are added to the forces calculated using the direct method to give a total force for each particle due to all other particles in the simulation region.

Step 11: Collect Results
Each slave process sends the force information for its particles back to the master process for collection. The master process receives this information, which concludes the processing cycle.

Key to performance of the parallel algorithm is the interleaving of message passing and computation. By performing as much computation as possible between the time messages are sent and received, the effect of message passing latency can be well hidden.

In addition, the use of the inverse interaction list mechanism allows the slave processes to have a priori knowledge of which data will be needed by other processes and send that specific data to the other slaves without having to be prompted. This greatly reduces the overhead of sharing data in a distributed multiprocessor environment.

Note to Author: More detail needs to be provided on the integration of the algorithms to compute the Macroscopic Assemblies and Virial Tensor as well as perform Load Balancing.


next up previous
Next: Acknowledgments and Other Handshaking Up: The Distributed Parallel Multipole Previous: The Distributed Parallel Multipole
Bill Rankin 2002-04-04