To allow more particles in our runs and to improve the performance of our implementation, we implemented a very simple load-balancing scheme. We build the tree and then have each processor walk it, copying its "subset" of the bodies into a duplicate array. Each processor then exchanges its real bodies for this copy, and the algorithm proceeds as before.
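The scheme can be illustrated with a minimal serial sketch. It rests on several assumptions that the text does not state: the tree is a 2D quadtree over the unit square, each processor's "subset" is an equal-sized contiguous slice of the depth-first leaf order, and the exchange is modeled by simply returning the slices. The names `Node`, `build`, `walk`, and `balanced_subsets` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical quadtree node: a leaf holds bodies, an internal node holds 4 children.
    bodies: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build(bodies, x0=0.0, y0=0.0, size=1.0, leaf_cap=1, depth=0, max_depth=32):
    """Build a quadtree over (x, y) bodies lying in the square [x0, x0 + size)^2."""
    node = Node()
    if len(bodies) <= leaf_cap or depth >= max_depth:
        node.bodies = list(bodies)  # small enough (or too deep): store as a leaf
        return node
    half = size / 2.0
    quads = [[], [], [], []]
    for b in bodies:
        ix = 1 if b[0] >= x0 + half else 0
        iy = 1 if b[1] >= y0 + half else 0
        quads[2 * iy + ix].append(b)
    for q, sub in enumerate(quads):
        cx = x0 + half * (q % 2)
        cy = y0 + half * (q // 2)
        node.children.append(build(sub, cx, cy, half, leaf_cap, depth + 1, max_depth))
    return node


def walk(node):
    """Yield bodies in depth-first tree order."""
    if not node.children:
        yield from node.bodies
    else:
        for child in node.children:
            yield from walk(child)


def balanced_subsets(bodies, nprocs):
    """Assumed policy: processor p copies the p-th contiguous slice of the tree order."""
    ordered = list(walk(build(bodies)))
    n = len(ordered)
    return [ordered[p * n // nprocs:(p + 1) * n // nprocs] for p in range(nprocs)]
```

Because neighbors in tree order are usually close in space, contiguous slices give each processor a compact region as well as an equal body count. In a real parallel code, each processor would walk only far enough to copy its own slice.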
Unfortunately, because of the caching algorithm, we can run out of memory while trying to build the initial tree for the bodies. For this reason, our implementation builds subsets of the particles into a tree and exchanges those subsets. It slowly increases the size of the subsets it builds into trees, relying on the fact that balancing the smaller subsets will improve the balance of the larger ones. We find that in practice this works very well and allows us to run with much larger data sets. For the random sphere surface distribution, we can only reach 512k particles without load balancing; with load balancing, we can reach 2M particles. The performance increases are also substantial: load balancing roughly doubles the performance of tree build and quadruples the performance of tree-walk for the random sphere surface distribution. Unfortunately, our load balancing does not balance the work in tree-walk, and hence the percentage of imbalance in our algorithm increases when the total time has been decreased.
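One plausible reading of the incremental scheme is sketched below, again serially. The batch schedule (`first_batch`, `growth`) is an assumption, since the text only says the subsets grow slowly. Sorting by a Morton (Z-order) key stands in for building and walking a quadtree. The sketch shows only the schedule of growing trees and does not model the cache memory pressure that motivates it. All names are hypothetical.

```python
def morton_key(body, bits=16):
    """Interleave the bits of quantized x and y for a body in [0, 1)^2.
    Sorting by this key is used here as a stand-in for quadtree depth-first order."""
    x = int(body[0] * (1 << bits))
    y = int(body[1] * (1 << bits))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def rebalance(active, nprocs):
    """Split all active bodies, in tree order, into equal contiguous slices."""
    ordered = sorted((b for part in active for b in part), key=morton_key)
    n = len(ordered)
    return [ordered[p * n // nprocs:(p + 1) * n // nprocs] for p in range(nprocs)]


def incremental_balance(local_bodies, first_batch=64, growth=2):
    """Balance growing subsets; return the final parts and each round's tree size."""
    nprocs = len(local_bodies)
    pending = [list(part) for part in local_bodies]  # bodies not yet in any tree
    active = [[] for _ in range(nprocs)]             # bodies already balanced
    batch, tree_sizes = first_batch, []
    while any(pending):
        for p in range(nprocs):
            # Each processor adds up to `batch` of its pending bodies this round.
            active[p].extend(pending[p][:batch])
            del pending[p][:batch]
        tree_sizes.append(sum(len(part) for part in active))
        active = rebalance(active, nprocs)  # exchange so the subset is balanced
        batch *= growth                     # assumed schedule: geometric growth
    return active, tree_sizes
```

Starting with 1000 bodies all on one of four processors and `first_batch=100`, the successive trees hold 100, 300, 700, and 1000 bodies, and each processor ends with 250.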