Barnes-Hut: An Implementation and Study
Our implementation of caching
We rely on a number of properties of the algorithm:
- During tree-build, once a node has been split it will not change
- During tree-downwalk, no node in the tree will be modified
- All walks of the tree (that need caching) start at the root and go down
- The tree will be modified between tree-build and tree-downwalk in the
upwalk step
- The tree will be entirely re-created at the beginning of each timestep
Given these properties, we perform complete caching. Any time a walk of
the tree reaches a node, it checks whether the node is cached. If it is
not, the node is added to the cache, and the parent's pointer to the node
is overwritten with a pointer to the cached copy. For this reason, we
cache even nodes that are stored locally. We took care to ensure that this
overwriting would not corrupt the tree. During the tree-build stage, we
only cache nodes which have been split (we call these internal nodes).
During the tree-downwalk stage we cache both internal and leaf nodes.
After the tree-build stage, and again after the tree-downwalk stage, each
processor discards its entire cache. Between the tree-build stage and the
tree-downwalk stage, each processor keeps a pointer to the root of the
"real" tree, so that it can perform the walk and rebuild the cache.
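To make the mechanism concrete, here is a minimal C sketch of the
walk-time step, using an illustrative node layout and a toy linear-scan
cache table (our real data structures and cache differ):

    #include <stddef.h>
    #include <string.h>

    typedef struct node node_t;
    struct node {
        int     is_internal;
        double  pos[3];
        double  mass;
        node_t *child[8];    /* meaningful only for internal nodes */
    };

    /* Toy cache: a linear table mapping remote nodes to local copies.
     * No bounds checking or hashing in this sketch. */
    #define CACHE_MAX 4096
    static struct { node_t *remote; node_t local; } cache[CACHE_MAX];
    static size_t cache_n;

    static node_t *cache_find_or_copy(node_t *remote)
    {
        size_t i;
        for (i = 0; i < cache_n; i++)
            if (cache[i].remote == remote)
                return &cache[i].local;              /* already cached */
        cache[cache_n].remote = remote;
        memcpy(&cache[cache_n].local, remote, sizeof *remote);
        return &cache[cache_n++].local;
    }

    /* One walk step: look up child i, cache it if necessary, and
     * overwrite the parent's pointer with the cached copy's address. */
    node_t *fetch_child(node_t *parent, int i)
    {
        node_t *local = cache_find_or_copy(parent->child[i]);
        parent->child[i] = local;
        return local;
    }

In this sketch, discarding the cache at a stage boundary is just resetting
cache_n to zero; the saved pointer to the root of the "real" tree is what
lets the next walk rebuild the cache.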
Between the tree-build and tree-downwalk stages, only the position and mass
of each leaf and internal node change. Even so, our implementation wastes
a lot of bandwidth: only 32 bytes of data have changed, but we re-copy the
entire 128-byte internal/leaf union data structure. We did this because it
was easier to guarantee correctness with this implementation.
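A sketch of the trade-off, with an assumed field layout (the changing
fields placed first; the real union layout differs):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        double pos[3];     /* 24 bytes: rewritten by the upwalk */
        double mass;       /*  8 bytes: rewritten by the upwalk */
        char   rest[96];   /* stand-in for the rest of the 128-byte union */
    } node128_t;

    /* What we do: re-copy all 128 bytes. */
    void refresh_whole(node128_t *local, const node128_t *remote)
    {
        memcpy(local, remote, sizeof *local);
    }

    /* What would save bandwidth: re-copy only the 32 changed bytes. */
    void refresh_prefix(node128_t *local, const node128_t *remote)
    {
        memcpy(local, remote, offsetof(node128_t, rest));
    }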
More detailed measurements of tree-build caching are given in the results
section. The executive summary is that caching gives a 2-5x speedup.
Useful Support for Caching in Architectures and Languages
Note: These ideas have not been carefully examined for feasibility or
usefulness. They should therefore be considered very preliminary.
Given the vast improvement provided by caching, it is clear that we want
some sort of support for caching. While caching was not especially
difficult to implement in our code, we took substantial advantage of the
structure of the algorithm, which makes our implementation suboptimal in
the general case. See the section on load balancing for more comments on
this. We therefore thought about possible support for caching in both the
architecture and in the languages.
Architecture support for caching
Caching support in architectures will probably look like an optimization
of the memory hierarchy. Work on shared-memory multiprocessors shows the
initial interface: processors access memory, which is automatically cached,
and the processors and caches cooperate to keep memory consistent. Work on
distributed shared-memory multiprocessors has shown that it is hard to keep
scaling the size of the machine while retaining the single (short)
distance-to-memory model. There is a substantial performance difference
between hitting in the local cache, in local memory, and in remote memory.
Therefore, to achieve optimal performance, operations need to be provided
to control where data is located.
Control of the data's home location
Each piece of data implicitly has a home in some processor's local memory.
Unfortunately, in algorithms like Barnes-Hut, data will drift between
processors. Therefore, the architecture will have to support operations
that copy a chunk of memory from one local memory to another and change
that chunk's home location. For performance, this is likely to be done at
page granularity. For our implementation it would be easier if home
locations could be specified per object, but we could probably work with
per-page specification of home locations.
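As a sketch, such an operation might look like the following; this is
entirely hypothetical, since no existing machine exposes exactly this
interface:

    #include <stddef.h>

    /* Hypothetical primitive: copy the chunk [addr, addr+len) into
     * node new_home's local memory and make that node its home. */
    int migrate_home(void *addr, size_t len, int new_home);

    /* Page granularity, which hardware would likely prefer:
     *     migrate_home(page_base, PAGE_SIZE, target);
     * Per-object granularity, which would suit our implementation:
     *     migrate_home(tree_node, sizeof *tree_node, target);        */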
Problems with cache-line alignment
Distributed shared-memory multiprocessors generally share data in
cache-line units. This can cause both false sharing and conflict misses
when the data being accessed does not fill a full cache line. I do not see
any obvious way for the architecture to help with this problem; I suspect
it requires language help. This is especially true because different
machines have different cache line sizes, and hence require differently
laid-out versions of the code. It would be very painful for the programmer
to re-lay out the data for every new machine. For more information on the
importance of cache-line alignment, see Eric Anderson's report on
Cache-Friendly Lists.
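A small example of the machine-dependence, with an illustrative body
structure: the pad size is fixed at compile time and must be redone for
every cache-line size.

    #define CACHE_LINE 64   /* 32 on some machines, 64 or 128 on others */

    struct body {
        double pos[3], vel[3];
        double mass;        /* 56 bytes total with this layout */
    };

    /* Padding each body out to a full line keeps two processors'
     * bodies from falsely sharing a line, but the layout silently
     * changes whenever CACHE_LINE does. */
    struct padded_body {
        struct body b;
        char pad[CACHE_LINE - (sizeof(struct body) % CACHE_LINE)];
    };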
Language support for caching
The following operations would have been useful for our project:
- Cache-Object(object,cache-group)
- UnCache(cache-group)
- Partially-UnCache(cache-group)
We could get away with just these operations because in Barnes-Hut the
tree data structure is read-only for most of the algorithm. Because of
this, we could specify cache-groups: groups of data which are manipulated
as a single unit. Raph Levien's master's thesis will be about a similar
idea for handling memory allocation by regions; a similar idea could
possibly be adapted for caching. A partially-uncache operation would have
let us avoid re-fetching all of the child pointers in the data structure;
only the part of the tree that changed would need to be re-fetched.
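In C, these operations might look roughly like the following; the
signatures are our invention, and only the names come from the list above:

    typedef int cache_group_t;

    /* Cache an object into a group; returns a pointer to the local copy. */
    void *Cache_Object(void *object, cache_group_t group);

    /* Throw away every cached copy in the group. */
    void UnCache(cache_group_t group);

    /* Throw away only the cached copies whose home copies have changed,
     * so that (for example) unchanged child pointers stay valid. */
    void Partially_UnCache(cache_group_t group);

In Barnes-Hut terms: cache tree nodes into one group during the walks,
call UnCache on the group at each stage boundary, and between tree-build
and tree-downwalk call Partially_UnCache so that only the nodes whose
position and mass changed get re-fetched.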
Problems with language support
Multiple pointers
The language support will somehow have to deal with multiple pointers to
the same object. This is a problem because you do not want the same object
to be cached multiple times (that wastes memory). One possibility is to
attach to each cacheable object a list of the processors that have cached
it; a second request for the object could then simply return the
already-cached copy.
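One guess at a representation, illustrative only (a real design would need
a scalable sharer list rather than a fixed bitmap):

    #include <stdint.h>

    struct cacheable {
        uint64_t cached_on;   /* bit p set => processor p holds a copy */
        /* ... object data ... */
    };

    /* Returns 1 if the object still needs to be copied to proc,
     * 0 if proc already holds a copy. */
    int note_cached(struct cacheable *o, int proc)
    {
        uint64_t bit = (uint64_t)1 << proc;
        if (o->cached_on & bit)
            return 0;         /* already cached: nothing to ship */
        o->cached_on |= bit;
        return 1;             /* caller should now copy the data */
    }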
Freeing parts of the cache
In our implementation, we only free the cache at algorithmic boundaries.
This guarantees that we only take cold misses in our cache. However, it
also means that we cannot handle really large problems, because the cache
tries to use more memory than we have available. Therefore, the language
should support evicting unused data from the cache. This could potentially
be done through garbage collection or something similar. However, I do not
know what the correct solution is for a language like C, which would not
support the pointer re-writing necessary to evict objects from the cache
(pointers to things in the cache would have to be replaced with the
original pointers). Perhaps an implementation using handles (double
pointers) would make this possible, since eviction would then only have to
update the handles, which are small.
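A sketch of the handle idea, with illustrative names: clients dereference
through the handle, so evicting a cached copy rewrites one word instead of
every client pointer.

    #include <stdlib.h>

    typedef struct {
        void *ptr;            /* current location: cached copy or original */
    } handle_t;

    void *handle_deref(handle_t *h)
    {
        return h->ptr;
    }

    /* Evict: free the cached copy and point the handle back at the
     * original (home) object. Holders of the handle are unaffected. */
    void handle_evict(handle_t *h, void *cached_copy, void *original)
    {
        if (h->ptr == cached_copy) {
            free(cached_copy);
            h->ptr = original;
        }
    }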
Consistency without algorithmic support
The language should support some model for enforcing consistency of writes
beyond the read-only model that we used for Barnes-Hut. This would require
keeping some sort of list of the processors caching each object, so that
either invalidates or updates could be sent to each of them. This part of
the language support would probably be very similar to what distributed
shared-memory multiprocessors do. The main advantage would be that caching
could be done on a per-object rather than per-cache-line basis.
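A sketch of what per-object consistency might look like, reusing the
bitmap sharer list from above and a hypothetical send_invalidate messaging
primitive:

    #include <stdint.h>

    void send_invalidate(int proc, void *object);   /* assumed primitive */

    /* Before writing an object, invalidate every cached copy recorded
     * in its sharer bitmap; an update protocol would ship the new
     * data here instead. */
    void write_barrier(void *object, uint64_t *sharers, int nprocs)
    {
        int p;
        for (p = 0; p < nprocs; p++)
            if (*sharers & ((uint64_t)1 << p))
                send_invalidate(p, object);
        *sharers = 0;         /* no cached copies remain valid */
        /* ... perform the write on the home copy ... */
    }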
Continue to Load-Balancing
-----
maintained by <hodes@cs.berkeley.edu>