Barnes-Hut: An Implementation and Study
Our implementation of caching
We rely on a number of properties of the algorithm:
- During tree-build, once a node has been split it will not change
- During tree-downwalk, no node in the tree will be modified
- All walks of the tree (that need caching) start at the root and go down
- The tree will be modified between tree-build and tree-downwalk in the
upwalk step
- The tree will be entirely re-created at the beginning of each timestep
Given these properties, we perform complete caching. Any time a walk of
the tree reaches a node, it checks whether the node is cached. If it is
not, the node is added to the cache, and the parent's pointer to the node
is overwritten with a pointer to the cached copy. For this reason, we
cache even nodes that are stored locally. We took care to ensure that this
overwriting would not corrupt the tree. During the tree-build stage, we
only cache nodes which have been split (we call these internal nodes).
During the tree-downwalk stage we cache both internal and leaf nodes.
After the tree-build stage, and again after the tree-downwalk stage, each
processor discards its entire cache. Between the tree-build stage and the
tree-downwalk stage, each processor keeps a pointer to the root of the
"real" tree, so that it can perform the walk and rebuild the cache.
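To make the mechanism concrete, here is a minimal C sketch of the
walk-time step, using an illustrative node layout and a toy linear-scan
cache table (our real data structures and cache differ):

    #include <stddef.h>
    #include <string.h>

    typedef struct node node_t;
    struct node {
        int     is_internal;
        double  pos[3];
        double  mass;
        node_t *child[8];    /* meaningful only for internal nodes */
    };

    /* Toy cache: a linear table mapping remote nodes to local copies.
     * No bounds checking or hashing in this sketch. */
    #define CACHE_MAX 4096
    static struct { node_t *remote; node_t local; } cache[CACHE_MAX];
    static size_t cache_n;

    static node_t *cache_find_or_copy(node_t *remote)
    {
        size_t i;
        for (i = 0; i < cache_n; i++)
            if (cache[i].remote == remote)
                return &cache[i].local;              /* already cached */
        cache[cache_n].remote = remote;
        memcpy(&cache[cache_n].local, remote, sizeof *remote);
        return &cache[cache_n++].local;
    }

    /* One walk step: look up child i, cache it if necessary, and
     * overwrite the parent's pointer with the cached copy's address. */
    node_t *fetch_child(node_t *parent, int i)
    {
        node_t *local = cache_find_or_copy(parent->child[i]);
        parent->child[i] = local;
        return local;
    }

In this sketch, discarding the cache at a stage boundary is just resetting
cache_n to zero; the saved pointer to the root of the "real" tree is what
lets the next walk rebuild the cache.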
Between the tree-build and tree-downwalk stages, only the position and mass
of each leaf and internal node change. Even so, our implementation wastes
a lot of bandwidth: only 32 bytes of data have changed, but we re-copy the
entire 128-byte internal/leaf union data structure. We did this because it
was easier to guarantee correctness with this implementation.
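A sketch of the trade-off, with an assumed field layout (the changing
fields placed first; the real union layout differs):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        double pos[3];     /* 24 bytes: rewritten by the upwalk */
        double mass;       /*  8 bytes: rewritten by the upwalk */
        char   rest[96];   /* stand-in for the rest of the 128-byte union */
    } node128_t;

    /* What we do: re-copy all 128 bytes. */
    void refresh_whole(node128_t *local, const node128_t *remote)
    {
        memcpy(local, remote, sizeof *local);
    }

    /* What would save bandwidth: re-copy only the 32 changed bytes. */
    void refresh_prefix(node128_t *local, const node128_t *remote)
    {
        memcpy(local, remote, offsetof(node128_t, rest));
    }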
More detailed measurements of tree-build caching are given in the results
section. The executive summary is that caching gives a 2-5x speedup.
Useful Support for Caching in Architectures and Languages
Note: These ideas have not been carefully examined for feasibility or
usefulness. They should therefore be considered very preliminary.
Given the vast improvement provided by caching, it is clear that we want
some sort of support for caching. While caching was not especially
difficult to implement in our code, we took substantial advantage of the
structure of the algorithm, which makes our implementation suboptimal in
the general case. See the section on load balancing for more comments on
this. We therefore thought about possible support for caching in both the
architecture and in the languages.
Architecture support for caching
Caching support in architectures will probably look like an optimization
of the memory hierarchy. Work on shared-memory multiprocessors shows the
initial interface: processors access memory, which is automatically cached,
and the processors and caches cooperate to keep memory consistent. Work on
distributed shared-memory multiprocessors has shown that it is hard to keep
scaling the size of the machine while retaining the single (short)
distance-to-memory model. There is a substantial performance difference
between hitting in the local cache, in local memory, and in remote memory.
Therefore, to achieve optimal performance, operations need to be provided
to control where data is located.
Control of the data's home location
Each piece of data implicitly has a home in some processor's local memory.
Unfortunately, in algorithms like Barnes-Hut, data will drift between
processors. Therefore, the architecture will have to support operations
that copy a chunk of memory from one local memory to another and change
that chunk's home location. For performance, this is likely to be done at
page granularity. For our implementation it would be easier if home
locations could be specified per object, but we could probably work with
per-page specification of home locations.
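As a sketch, such an operation might look like the following; this is
entirely hypothetical, since no existing machine exposes exactly this
interface:

    #include <stddef.h>

    /* Hypothetical primitive: copy the chunk [addr, addr+len) into
     * node new_home's local memory and make that node its home. */
    int migrate_home(void *addr, size_t len, int new_home);

    /* Page granularity, which hardware would likely prefer:
     *     migrate_home(page_base, PAGE_SIZE, target);
     * Per-object granularity, which would suit our implementation:
     *     migrate_home(tree_node, sizeof *tree_node, target);        */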
Problems with cache-line alignment
Distributed shared-memory multiprocessors generally share data in
cache-line units. This can cause both false sharing and conflict misses
when the data being accessed does not fill a full cache line. I do not see
any obvious way for the architecture to help with this problem; I suspect
it requires language help. This is especially true because different
machines have different cache line sizes, and hence require differently
laid-out versions of the code. It would be very painful for the programmer
to re-lay out the data for every new machine. For more information on the
importance of cache-line alignment, see Eric Anderson's report on
Cache-Friendly Lists.
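A small example of the machine-dependence, with an illustrative body
structure: the pad size is fixed at compile time and must be redone for
every cache-line size.

    #define CACHE_LINE 64   /* 32 on some machines, 64 or 128 on others */

    struct body {
        double pos[3], vel[3];
        double mass;        /* 56 bytes total with this layout */
    };

    /* Padding each body out to a full line keeps two processors'
     * bodies from falsely sharing a line, but the layout silently
     * changes whenever CACHE_LINE does. */
    struct padded_body {
        struct body b;
        char pad[CACHE_LINE - (sizeof(struct body) % CACHE_LINE)];
    };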
Language support for caching
The following operations would have been useful for our project:
- Cache-Object(object,cache-group)
- UnCache(cache-group)
- Partially-UnCache(cache-group)
We could get away with just these operations because in Barnes-Hut the
tree data structure is read-only for most of the algorithm. Because of
this, we could specify cache-groups: groups of data which are manipulated
as a single unit. Raph Levien's master's thesis will be about a similar
idea for handling memory allocation by regions; a similar idea could
possibly be adapted for caching. A partially-uncache operation would have
let us avoid re-fetching all of the child pointers in the data structure;
only the part of the tree that changed would need to be re-fetched.
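In C, these operations might look roughly like the following; the
signatures are our invention, and only the names come from the list above:

    typedef int cache_group_t;

    /* Cache an object into a group; returns a pointer to the local copy. */
    void *Cache_Object(void *object, cache_group_t group);

    /* Throw away every cached copy in the group. */
    void UnCache(cache_group_t group);

    /* Throw away only the cached copies whose home copies have changed,
     * so that (for example) unchanged child pointers stay valid. */
    void Partially_UnCache(cache_group_t group);

In Barnes-Hut terms: cache tree nodes into one group during the walks,
call UnCache on the group at each stage boundary, and between tree-build
and tree-downwalk call Partially_UnCache so that only the nodes whose
position and mass changed get re-fetched.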
Problems with language support
Multiple pointers
The language support will somehow have to deal with multiple pointers to
the same object. This is a problem because you do not want the same object
to be cached multiple times (that wastes memory). One possibility is to
attach to each cacheable object a list of the processors that have cached
it; a second request for the object could then simply return the
already-cached copy.
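One guess at a representation, illustrative only (a real design would need
a scalable sharer list rather than a fixed bitmap):

    #include <stdint.h>

    struct cacheable {
        uint64_t cached_on;   /* bit p set => processor p holds a copy */
        /* ... object data ... */
    };

    /* Returns 1 if the object still needs to be copied to proc,
     * 0 if proc already holds a copy. */
    int note_cached(struct cacheable *o, int proc)
    {
        uint64_t bit = (uint64_t)1 << proc;
        if (o->cached_on & bit)
            return 0;         /* already cached: nothing to ship */
        o->cached_on |= bit;
        return 1;             /* caller should now copy the data */
    }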
Freeing parts of the cache
In our implementation, we only free the cache at algorithmic boundaries.
This guarantees that we only take cold misses in our cache. However, it
also means that we cannot handle really large problems, because the cache
tries to use more memory than we have available. Therefore, the language
should support evicting unused data from the cache. This could potentially
be done through garbage collection or something similar. However, I do not
know what the correct solution is for a language like C, which would not
support the pointer re-writing necessary to evict objects from the cache
(pointers to things in the cache would have to be replaced with the
original pointers). Perhaps an implementation using handles (double
pointers) would make this possible, since eviction would then only have to
update the handles, which are small.
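A sketch of the handle idea, with illustrative names: clients dereference
through the handle, so evicting a cached copy rewrites one word instead of
every client pointer.

    #include <stdlib.h>

    typedef struct {
        void *ptr;            /* current location: cached copy or original */
    } handle_t;

    void *handle_deref(handle_t *h)
    {
        return h->ptr;
    }

    /* Evict: free the cached copy and point the handle back at the
     * original (home) object. Holders of the handle are unaffected. */
    void handle_evict(handle_t *h, void *cached_copy, void *original)
    {
        if (h->ptr == cached_copy) {
            free(cached_copy);
            h->ptr = original;
        }
    }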
Consistency without algorithmic support
The language should support some model for enforcing consistency of writes
beyond the read-only model that we used for Barnes-Hut. This would require
keeping some sort of list of the processors caching each object, so that
either invalidates or updates could be sent to each of them. This part of
the language support would probably be very similar to what distributed
shared-memory multiprocessors do. The main advantage would be that caching
could be done on a per-object rather than per-cache-line basis.
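A sketch of what per-object consistency might look like, reusing the
bitmap sharer list from above and a hypothetical send_invalidate messaging
primitive:

    #include <stdint.h>

    void send_invalidate(int proc, void *object);   /* assumed primitive */

    /* Before writing an object, invalidate every cached copy recorded
     * in its sharer bitmap; an update protocol would ship the new
     * data here instead. */
    void write_barrier(void *object, uint64_t *sharers, int nprocs)
    {
        int p;
        for (p = 0; p < nprocs; p++)
            if (*sharers & ((uint64_t)1 << p))
                send_invalidate(p, object);
        *sharers = 0;         /* no cached copies remain valid */
        /* ... perform the write on the home copy ... */
    }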
Continue to Load-Balancing
-----
maintained by <hodes@cs.berkeley.edu>