Introduction
Overview of relevant concepts.
Hashing
Producing a signature of an object based on its contents.
Hash Consing
Constructing objects through a hash table keyed by content. The constructor consults the table: if a structurally equal object already exists, a reference to the existing object is returned instead of allocating a new one. All construction must go through this path, so equal contents always yield the same object.
Strengths
- Pointer equality: structurally equal objects share one address, so equality checks reduce to a single pointer comparison.
- Saving space: each distinct structure is stored only once.
Issues
- Overhead on construction (hashing and table lookup).
- Hash table can become large.
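A minimal sketch of hash consing over strings, with an illustrative (hypothetical) HashConsTable type; real implementations typically hold weak references, discussed under Implementation below.

use std::collections::HashMap;
use std::rc::Rc;

struct HashConsTable {
    table: HashMap<String, Rc<String>>,
}

impl HashConsTable {
    // The only way to obtain an object: reuse an existing one if present.
    fn intern(&mut self, value: String) -> Rc<String> {
        if let Some(existing) = self.table.get(&value) {
            return Rc::clone(existing);
        }
        let shared = Rc::new(value.clone());
        self.table.insert(value, Rc::clone(&shared));
        shared
    }
}

fn main() {
    let mut table = HashConsTable { table: HashMap::new() };
    let a = table.intern("x + y".to_string());
    let b = table.intern("x + y".to_string());
    // Hash consing makes structural equality checkable by pointer equality.
    assert!(Rc::ptr_eq(&a, &b));
}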
Immutability
Strengths
- Great for shared data. Multiple threads can share the same data with no synchronization issues.
Issues
- Updates require copy-on-write, which can cause copy overhead.
Reference Counting
Reference Counting @ Wikipedia
Strengths
- Shared ownership of data.
- Prevents use-after-free. A reference is necessary to use the data, and the data will be there as long as at least one reference exists.
- References are cheap to pass and copy.
Issues
- Cannot reclaim reference cycles.
- Overhead for incrementing/decrementing reference counts, which must be atomic because the count is shared.
- Necessary indirection, even when the data is small. This may decrease performance by adding unprefetched random memory accesses.
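As a sketch, Rust's Arc demonstrates these trade-offs: cloning a reference copies a pointer and atomically increments the count, and the data lives as long as any reference does.

use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(vec![1, 2, 3]);
    // Cloning an Arc copies a pointer and atomically increments the count;
    // the vector itself is not copied.
    let shared = Arc::clone(&data);
    let worker = thread::spawn(move || shared.iter().sum::<i32>());
    // The data outlives every reference, so use-after-free cannot occur.
    assert_eq!(worker.join().unwrap(), 6);
    assert_eq!(Arc::strong_count(&data), 1); // Worker's reference was dropped.
}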
Persistent Data Structures
Persistent Data Structures @ Wikipedia
Strengths
- Avoid storing duplicate information.
- Can make copies and updates faster, by reducing copying of data.
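A minimal sketch using a persistent cons list: prepending allocates one new head and shares the entire existing tail, so old and new versions coexist cheaply.

use std::rc::Rc;

// A persistent singly linked list: "updates" build new versions that
// share structure with old ones instead of copying them.
enum List {
    Nil,
    Cons(i32, Rc<List>),
}

fn prepend(tail: &Rc<List>, value: i32) -> Rc<List> {
    Rc::new(List::Cons(value, Rc::clone(tail)))
}

fn main() {
    let empty = Rc::new(List::Nil);
    let base = prepend(&empty, 1); // [1]
    let v1 = prepend(&base, 2);    // [2, 1]
    let v2 = prepend(&base, 3);    // [3, 1]
    // v1 and v2 coexist and share the [1] tail; only the heads were allocated.
    let _ = (v1, v2);
}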
Directed Acyclic Graph
Directed Acyclic Graph @ Wikipedia
A directed graph is a structure composed of nodes/vertices connected by edges, each edge having a direction. A directed acyclic graph is a directed graph with no directed cycles.
Strengths
- Clearer ownership compared to a graph with cycles
Merkle Tree
Strengths
- Integrity: the root hash covers the entire contents, so any change is detectable.
Issues
- Overhead of performing hashing and storing hashes
- Requires data to be immutable
- Updates require copy-on-write, which can cause copy overhead.
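A sketch of the Merkle construction: each node's hash covers its content and its children's hashes, so a change anywhere in the tree changes the root hash. A real Merkle tree would use a cryptographic hash; std's DefaultHasher stands in for brevity.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct MerkleNode {
    content: String,
    children: Vec<MerkleNode>,
    hash: u64, // Computed once at construction; nodes are immutable.
}

impl MerkleNode {
    fn new(content: String, children: Vec<MerkleNode>) -> Self {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        // Fold in the children's hashes, not the children's addresses.
        for child in &children {
            child.hash.hash(&mut hasher);
        }
        let hash = hasher.finish();
        MerkleNode { content, children, hash }
    }
}

fn main() {
    let leaf_a = MerkleNode::new("a".into(), vec![]);
    let leaf_b = MerkleNode::new("b".into(), vec![]);
    let root = MerkleNode::new("root".into(), vec![leaf_a, leaf_b]);
    // Equal root hashes imply (with high probability) equal trees.
    println!("root hash: {:x}", root.hash);
}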
Rewriting
Relevant search terms: "Term rewriting", "DAG rewriting", "Rewrite rules" (results for this one are typically overwhelmed by URL rewrite rules for web servers).
Strengths
- Systematic modification of structured data
Memoization
Strengths
- Avoiding redundant work
Issues
- Cache invalidation problem
Hirpdag
Data structures which are:
- Hash Consed
- Immutable
- Reference Counted
- Persistent
- Directed Acyclic Graph
These data structures are also Merkle Trees, and amenable to DAG rewriting.
Strengths
Time
Memoization, Persistence, Incremental processing, Easy Caching (no invalidation)
Space
Hash Consing can massively reduce space.
Repeatedly referencing the same content is a common way of creating a large amount of complexity from a small amount of source material.
Synergies
Immutability and Directed Acyclic Graphs
Immutability naturally ensures graph construction produces Directed Acyclic Graphs. An object's address cannot be known before it is constructed, so creating a cycle would require mutating an already-constructed object.
Immutability and Reference Counting
One of the weaknesses of reference counting is that it cannot reclaim reference cycles. Immutability makes it impossible to construct a reference cycle, which prevents this issue.
Immutability and Memoization
One of the weaknesses of Memoization is cache invalidation. Data which is immutable and referentially transparent (no semantic variation with context) avoids the problem of cache invalidation: cached relationships between objects remain valid.
Implementation
Hashing
Hashing must be stable: hash object contents (and the stable hashes of referenced objects), never pointer values, which differ between runs.
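A sketch of stable hashing where each object caches its hash at construction, and parents fold in child hashes rather than child addresses. DefaultHasher is used only for illustration; it is not guaranteed stable across Rust versions, so a real implementation would pin a specific hash function.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

struct Object {
    payload: u32,
    children: Vec<Rc<Object>>,
    stable_hash: u64, // Computed once at construction.
}

// Hash the contents plus the children's stable hashes; never an address.
fn compute_stable_hash(payload: u32, children: &[Rc<Object>]) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    for child in children {
        child.stable_hash.hash(&mut h);
    }
    h.finish()
}

impl Object {
    fn new(payload: u32, children: Vec<Rc<Object>>) -> Rc<Object> {
        let stable_hash = compute_stable_hash(payload, &children);
        Rc::new(Object { payload, children, stable_hash })
    }
}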
Hashconsing
Weak references
Intrusive counters with weak count
Reference counts live in the same allocation as the data. When the strong ref count hits zero, destroy the data. Until the weak ref count also hits zero, the allocation remains to indicate that there are zero strong refs, so weak refs cannot access the data.
Consider adding indirection to the data if the data is large and weak pointers may survive a long time.
Good cache coherency, fewer allocations.
Separate counters with weak count
Reference counts live in a separate allocation from the data. When the strong ref count hits zero, the allocation for the data can be freed. The allocation for the counts remains, indicating the pointer is now invalid, until the weak ref count also hits zero.
Reference objects have two pointers: one to the counters and one to the object. Strong refs can access the data pointer directly. Weak refs must access the counters first and check that the strong count is nonzero before upgrading.
Less cache coherency, more allocations.
Weak reference with intrusive linked list
All weak refs to an object form an intrusive linked list. The head of the linked list is stored with the reference counts. As part of deleting an object, traverse all the weak references to that object via the intrusive linked list and set them to null.
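A sketch of a hash-consing table using weak references, built on Rust's Arc/Weak (which use the intrusive layout above: both counts share the data's allocation).

use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

struct WeakConsTable {
    table: Mutex<HashMap<String, Weak<String>>>,
}

impl WeakConsTable {
    fn intern(&self, value: String) -> Arc<String> {
        let mut table = self.table.lock().unwrap();
        if let Some(weak) = table.get(&value) {
            // Upgrading checks that the strong count is still nonzero.
            if let Some(strong) = weak.upgrade() {
                return strong;
            }
        }
        let strong = Arc::new(value.clone());
        // The table holds only a weak reference, so it does not keep
        // otherwise-dead objects alive. (A full implementation would also
        // purge dead entries, e.g. from the object's destructor.)
        table.insert(value, Arc::downgrade(&strong));
        strong
    }
}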
Reference Counting Optimizations
Cache Invalidation and Immutability
Updating a reference count modifies a cache line. One of the strengths of immutable data, such as Hirpdag objects, is that it can be shared between threads; reference counts stored adjacent to objects will dirty otherwise-unmodified shared cache lines. Keeping the counts elsewhere lets the unmodified cache lines remain in the Shared state in every core's cache.
Coalescing and Elision
Coalescing reference count modifications together can reduce update operations by 50-90%. Only the first increment and the last decrement need to remain in place; intermediate increment/decrement pairs cancel out. This kind of statement reordering and combining would need compiler support to be applied comprehensively.
Comparisons
Pointer Equality
Compare object pointers for equality or inequality.
Ordering
Object pointers cannot be used for ordering in many contexts; they are unstable between runs and carry no meaning.
A deterministic ordering is desirable, which reflects a partial order of the global Hirpdag object DAG.
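One workable scheme (an assumption of this sketch, not the only option) orders by cached DAG height first and stable hash second; since every object's height exceeds its children's, the result is deterministic and consistent with the DAG's partial order (up to hash collisions).

use std::cmp::Ordering;

struct Meta {
    dag_height: u32, // Longest path to a leaf; children are always shorter.
    stable_hash: u64,
}

// A child always sorts before any object that references it.
fn order(a: &Meta, b: &Meta) -> Ordering {
    a.dag_height
        .cmp(&b.dag_height)
        .then(a.stable_hash.cmp(&b.stable_hash))
}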
Normalization
Performed on construction, so only normalized objects ever exist.
Rewriting
Apply rewrite rule to self. Rewrite all children. Construct a replacement for self, if anything changed.
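A sketch of this procedure over a toy expression type; rewrite_rule stands in for an arbitrary rule set (here: collapse single-operand sums), and self is only reconstructed when a child actually changed, preserving the identity of untouched subgraphs.

use std::rc::Rc;

enum Expr {
    Var(String),
    Sum(Vec<Rc<Expr>>),
}

// One illustrative rule: a Sum with a single operand collapses to it.
fn rewrite_rule(expr: &Rc<Expr>) -> Option<Rc<Expr>> {
    match expr.as_ref() {
        Expr::Sum(operands) if operands.len() == 1 => Some(Rc::clone(&operands[0])),
        _ => None,
    }
}

fn rewrite(expr: &Rc<Expr>) -> Rc<Expr> {
    // Apply rewrite rules to self.
    let expr = rewrite_rule(expr).unwrap_or_else(|| Rc::clone(expr));
    // Rewrite all children.
    if let Expr::Sum(operands) = expr.as_ref() {
        let new_operands: Vec<Rc<Expr>> = operands.iter().map(rewrite).collect();
        // Construct a replacement for self only if anything changed;
        // otherwise keep the existing object (and its identity).
        let changed = operands
            .iter()
            .zip(&new_operands)
            .any(|(old, new)| !Rc::ptr_eq(old, new));
        if changed {
            return Rc::new(Expr::Sum(new_operands));
        }
    }
    expr
}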
Memoization
Enabled by immutability and reference counting.
Hash map of reference to reference. Key is input, value is output.
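A sketch of the memoization table, assuming Rc-based references. With hash consing, an object's address is a stable identity, so the raw pointer can serve as the key; the input is stored alongside the output so its address cannot be freed and reused.

use std::collections::HashMap;
use std::rc::Rc;

struct Memo<T> {
    // Key is the input object's address; value keeps both input and output alive.
    cache: HashMap<*const T, (Rc<T>, Rc<T>)>,
}

impl<T> Memo<T> {
    fn get_or_compute(&mut self, input: &Rc<T>, f: impl FnOnce(&Rc<T>) -> Rc<T>) -> Rc<T> {
        let key = Rc::as_ptr(input);
        if let Some((_input, output)) = self.cache.get(&key) {
            return Rc::clone(output); // Cache hit: no recomputation.
        }
        let output = f(input);
        self.cache.insert(key, (Rc::clone(input), Rc::clone(&output)));
        output
    }
}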
Serialization
Leaf objects should appear before other objects which use them. The serialization ordering should be a valid bottom up partial order. A post-order traversal will produce one possible linear ordering.
It is good to mark the root reference handles in the serialization format.
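A sketch of post-order serialization over a toy node type: children are emitted before parents, and the seen-set ensures each shared object is written only once.

use std::collections::HashSet;
use std::rc::Rc;

struct Node {
    label: String,
    children: Vec<Rc<Node>>,
}

// Post-order traversal: children appear before their parents, and each
// shared object is emitted only once, giving a valid bottom-up ordering.
fn serialize(node: &Rc<Node>, seen: &mut HashSet<*const Node>, out: &mut Vec<String>) {
    if !seen.insert(Rc::as_ptr(node)) {
        return; // Already emitted.
    }
    for child in &node.children {
        serialize(child, seen, out);
    }
    out.push(node.label.clone());
}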
Cache coherent datastructures
Immutable data will typically make non-contiguous data structures (such as tree-sets, tree-maps, and linked-lists) less appealing.
Cache lines are 64 bytes on most modern x86 systems.
$ getconf LEVEL1_DCACHE_LINESIZE
64
Object Metadata
Each Hirpdag Object has a small amount of metadata attached to it. This includes:
- DAG Height
- DAG Count
- Content flags
The content flags are intended to provide a hint to avoid unnecessary traversals.
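A sketch of the metadata layout; the field widths and the exact meaning of the flags are illustrative assumptions.

struct ObjectMetadata {
    dag_height: u32,    // Longest path from this object down to a leaf.
    dag_count: u32,     // Number of objects reachable from this one.
    content_flags: u16, // Bitflags summarizing the subgraph's contents,
                        // e.g. "contains a node of kind X", so traversals
                        // that cannot find anything can be skipped.
}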
Reference Count Update Elision
Incrementing and decrementing reference counts can be expensive, and these operations may occur frequently. Because the count is shared, the updates must be atomic.
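Compilers for reference-counted languages (e.g. Swift's ARC) perform such elision automatically. In Rust, some updates can be elided by hand by borrowing a reference instead of cloning it; a sketch:

use std::sync::Arc;

// Taking the Arc by value forces the caller to clone: one atomic
// increment at the call site and one decrement when `data` drops here.
fn sum_by_value(data: Arc<Vec<i32>>) -> i32 {
    data.iter().sum()
}

// Borrowing elides both updates: the caller's reference keeps the
// object alive for the duration of the call.
fn sum_by_borrow(data: &Arc<Vec<i32>>) -> i32 {
    data.iter().sum()
}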
Techniques
Techniques for using Hirpdag to maximize effectiveness.
Referential Transparency
Referential Transparency @ Wikipedia
Hirpdag objects should generally be designed to have referential transparency.
Objects which are identical should not have different meanings in a different context/environment.
Common Normalization
Hirpdag objects should apply normalization to increase the effectiveness of hashconsing.
Normalization is important for pointer inequality to correspond with semantic inequality. Fast pointer-equality comparison is one of the key features hash consing provides. Without good normalization, deeper comparisons are needed and the pointer equality benefit of hashconsing is lost.
Order Normalization
y+x → x+y
Sort commutative operands. Prefer flatter expression trees to make this easier.
Semantic Normalization
x+x → 2*x
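A sketch of both normalizations on a flat n-ary sum (the types and names are illustrative): sort the operands, then collapse runs of equal operands into multiplications.

#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Expr {
    Var(String),
    Mul(u32, Box<Expr>),
    Sum(Vec<Expr>),
}

fn normalize_sum(mut operands: Vec<Expr>) -> Expr {
    // Order normalization: sort commutative operands, so y+x and x+y
    // produce the identical object and hash-cons together.
    operands.sort();
    // Semantic normalization: collapse runs of equal operands, x+x -> 2*x.
    let mut normalized: Vec<Expr> = Vec::new();
    for op in operands {
        match normalized.last_mut() {
            Some(Expr::Mul(count, prev)) if **prev == op => *count += 1,
            Some(prev) if *prev == op => *prev = Expr::Mul(2, Box::new(op)),
            _ => normalized.push(op),
        }
    }
    Expr::Sum(normalized)
}

Invoked from the Sum constructor, this guarantees that only normalized sums are ever hash consed.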
Structuring for Persistence and Normalization
The structure of Hirpdag objects can have a big impact on the effectiveness of normalization and persistence.
Prefer Flatter Structures
Consider:
A=a+b+d+e
As a binary tree (before normalization) it might look like:
A=Sum(a, Sum(b, Sum(d, e)))
A=Sum(Sum(a, b), Sum(d, e))
A=Sum(Sum(Sum(a, b), d), e)
With a binary tree representation, the first question is: which of these semantically equivalent structures is the normalized form?
Consider:
B=a+d+b+e
As a binary tree (before normalization) it might look like:
B=Sum(Sum(a, d), Sum(b, e))
B is semantically equivalent to A, and should normalize to the same thing.
In this case, the order of the operands needs to change.
With a binary tree representation, the second question is: what is necessary to normalize the operand order? Traversing the existing tree is necessary to gather these operands for sorting. Performance-wise, this is similar to traversing a linked list (i.e. bad).
As an n-ary tree: A=Sum(a, b, d, e)
More contiguous. Easier to sort. Easier to traverse. Easier to construct. Easier to normalize (just sort).
When used as a persistent data structure, this means changing one n-ary Sum object rather than several binary Sum objects.
In general, if a Hirpdag object can refer to other Hirpdag objects of the same type and ordering is not important, this suggests you should consider changing their structure to combine them into one flattened Hirpdag object.
Not too big, not too small
If a Hirpdag object has too much information, deduplication opportunities will be unlikely.
If a Hirpdag object has too little information, encoding a useful piece of information will require many objects. This will have a negative impact on performance due to:
- Worse memory access patterns chasing pointers (like a linked list).
- More time spent allocating/deallocating.
Consider which fields may be large (e.g. a vector field may grow large). Consider which fields will need to mutate together.
Encoding Graphs
If the graph to store is acyclic, it could be directly constructed.
If the nodes or edges carry some information, they should likely be separate nodes. This makes the graph better for persistence.
An adjacency list or edge list can encode the graph structure itself.
type NodeIndex = u32;

struct Node {
    name: String, // Information carried by the node.
}

struct Graph {
    nodes: Vec<Node>,                   // Node data, indexed by NodeIndex.
    edges: Vec<(NodeIndex, NodeIndex)>, // Sorted list of (from, to) pairs.
}