Introduction
Overview of relevant concepts.
Hashing
Producing a signature of an object based on its contents.
Hash Consing
Constructing objects through a hash table keyed by content. The constructor consults the table: if a structurally equal object already exists, a reference to the existing object is returned instead of allocating a new one. All construction must go through this path, so equal contents always yield the same object.
Strengths
- Pointer equality: structurally equal objects share one address, so equality checks reduce to a single pointer comparison.
- Saving space: each distinct structure is stored only once.
Issues
- Overhead on construction (hashing and table lookup).
- Hash table can become large.
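A minimal sketch of hash consing over strings, with an illustrative (hypothetical) HashConsTable type; real implementations typically hold weak references, discussed under Implementation below.

use std::collections::HashMap;
use std::rc::Rc;

struct HashConsTable {
    table: HashMap<String, Rc<String>>,
}

impl HashConsTable {
    // The only way to obtain an object: reuse an existing one if present.
    fn intern(&mut self, value: String) -> Rc<String> {
        if let Some(existing) = self.table.get(&value) {
            return Rc::clone(existing);
        }
        let shared = Rc::new(value.clone());
        self.table.insert(value, Rc::clone(&shared));
        shared
    }
}

fn main() {
    let mut table = HashConsTable { table: HashMap::new() };
    let a = table.intern("x + y".to_string());
    let b = table.intern("x + y".to_string());
    // Hash consing makes structural equality checkable by pointer equality.
    assert!(Rc::ptr_eq(&a, &b));
}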
Immutability
Strengths
- Great for shared data. Multiple threads can share the same data with no synchronization issues.
Issues
- Updates require copy-on-write, which can cause copy overhead.
Reference Counting
Reference Counting @ Wikipedia
Strengths
- Shared ownership of data.
- Prevents use-after-free. A reference is necessary to use the data, and the data will be there as long as at least one reference exists.
- References are cheap to pass and copy.
Issues
- Cannot reclaim reference cycles.
- Overhead for incrementing/decrementing reference counts, which must be atomic because the count is shared.
- Necessary indirection, even when the data is small. This may decrease performance by adding unprefetched random memory accesses.
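As a sketch, Rust's Arc demonstrates these trade-offs: cloning a reference copies a pointer and atomically increments the count, and the data lives as long as any reference does.

use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(vec![1, 2, 3]);
    // Cloning an Arc copies a pointer and atomically increments the count;
    // the vector itself is not copied.
    let shared = Arc::clone(&data);
    let worker = thread::spawn(move || shared.iter().sum::<i32>());
    // The data outlives every reference, so use-after-free cannot occur.
    assert_eq!(worker.join().unwrap(), 6);
    assert_eq!(Arc::strong_count(&data), 1); // Worker's reference was dropped.
}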
Persistent Data Structures
Persistent Data Structures @ Wikipedia
Strengths
- Avoid storing duplicate information.
- Can make copies and updates faster, by reducing copying of data.
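A minimal sketch using a persistent cons list: prepending allocates one new head and shares the entire existing tail, so old and new versions coexist cheaply.

use std::rc::Rc;

// A persistent singly linked list: "updates" build new versions that
// share structure with old ones instead of copying them.
enum List {
    Nil,
    Cons(i32, Rc<List>),
}

fn prepend(tail: &Rc<List>, value: i32) -> Rc<List> {
    Rc::new(List::Cons(value, Rc::clone(tail)))
}

fn main() {
    let empty = Rc::new(List::Nil);
    let base = prepend(&empty, 1); // [1]
    let v1 = prepend(&base, 2);    // [2, 1]
    let v2 = prepend(&base, 3);    // [3, 1]
    // v1 and v2 coexist and share the [1] tail; only the heads were allocated.
    let _ = (v1, v2);
}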
Directed Acyclic Graph
Directed Acyclic Graph @ Wikipedia
A directed graph is a structure composed of nodes/vertices connected by edges, each edge having a direction. A directed acyclic graph is a directed graph with no directed cycles.
Strengths
- Clearer ownership compared to a graph with cycles
Merkle Tree
Strengths
- Integrity: the root hash covers the entire contents, so any change is detectable.
Issues
- Overhead of performing hashing and storing hashes
- Requires data to be immutable
- Updates require copy-on-write, which can cause copy overhead.
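A sketch of the Merkle construction: each node's hash covers its content and its children's hashes, so a change anywhere in the tree changes the root hash. A real Merkle tree would use a cryptographic hash; std's DefaultHasher stands in for brevity.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct MerkleNode {
    content: String,
    children: Vec<MerkleNode>,
    hash: u64, // Computed once at construction; nodes are immutable.
}

impl MerkleNode {
    fn new(content: String, children: Vec<MerkleNode>) -> Self {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        // Fold in the children's hashes, not the children's addresses.
        for child in &children {
            child.hash.hash(&mut hasher);
        }
        let hash = hasher.finish();
        MerkleNode { content, children, hash }
    }
}

fn main() {
    let leaf_a = MerkleNode::new("a".into(), vec![]);
    let leaf_b = MerkleNode::new("b".into(), vec![]);
    let root = MerkleNode::new("root".into(), vec![leaf_a, leaf_b]);
    // Equal root hashes imply (with high probability) equal trees.
    println!("root hash: {:x}", root.hash);
}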
Rewriting
Relevant search terms: "Term rewriting", "DAG rewriting", "Rewrite rules" (results for this one are typically overwhelmed by URL rewrite rules for web servers).
Strengths
- Systematic modification of structured data
Memoization
Strengths
- Avoiding redundant work
Issues
- Cache invalidation problem
Hirpdag
Data structures which are:
- Hash Consed
- Immutable
- Reference Counted
- Persistent
- Directed Acyclic Graph
These data structures are also Merkle Trees, and amenable to DAG rewriting.
Strengths
Time
Memoization, Persistence, Incremental processing, Easy Caching (no invalidation)
Space
Hash Consing can massively reduce space.
Repeatedly referencing the same content is a common way of creating a large amount of complexity from a small amount of source material.
Synergies
Immutability and Directed Acyclic Graphs
Immutability naturally ensures graph construction produces Directed Acyclic Graphs. An object's address cannot be known before it is constructed, so creating a cycle would require mutating an already-constructed object.
Immutability and Reference Counting
One of the weaknesses of reference counting is that it cannot reclaim reference cycles. Immutability makes it impossible to construct a reference cycle, which prevents this issue.
Immutability and Memoization
One of the weaknesses of Memoization is cache invalidation. Data which is immutable and referentially transparent (no semantic variation with context) avoids the problem of cache invalidation: cached relationships between objects remain valid.
Implementation
Hashing
Hashing must be stable: hash object contents (and the stable hashes of referenced objects), never pointer values, which differ between runs.
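A sketch of stable hashing where each object caches its hash at construction, and parents fold in child hashes rather than child addresses. DefaultHasher is used only for illustration; it is not guaranteed stable across Rust versions, so a real implementation would pin a specific hash function.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

struct Object {
    payload: u32,
    children: Vec<Rc<Object>>,
    stable_hash: u64, // Computed once at construction.
}

// Hash the contents plus the children's stable hashes; never an address.
fn compute_stable_hash(payload: u32, children: &[Rc<Object>]) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    for child in children {
        child.stable_hash.hash(&mut h);
    }
    h.finish()
}

impl Object {
    fn new(payload: u32, children: Vec<Rc<Object>>) -> Rc<Object> {
        let stable_hash = compute_stable_hash(payload, &children);
        Rc::new(Object { payload, children, stable_hash })
    }
}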
Hashconsing
Weak references
Intrusive counters with weak count
Reference counts live in the same allocation as the data. When the strong ref count hits zero, destroy the data. Until the weak ref count also hits zero, the allocation remains to indicate that there are zero strong refs, so weak refs cannot access the data.
Consider adding indirection to the data if the data is large and weak pointers may survive a long time.
Good cache coherency, fewer allocations.
Separate counters with weak count
Reference counts live in a separate allocation from the data. When the strong ref count hits zero, the allocation for the data can be freed. The allocation for the counts remains, indicating the pointer is now invalid, until the weak ref count also hits zero.
Reference objects have two pointers: one to the counters and one to the object. Strong refs can access the data pointer directly. Weak refs must access the counters first and check that the strong count is nonzero before upgrading.
Less cache coherency, more allocations.
Weak reference with intrusive linked list
All weak refs to an object form an intrusive linked list. The head of the linked list is stored with the reference counts. As part of deleting an object, traverse all the weak references to that object via the intrusive linked list and set them to null.
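A sketch of a hash-consing table using weak references, built on Rust's Arc/Weak (which use the intrusive layout above: both counts share the data's allocation).

use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

struct WeakConsTable {
    table: Mutex<HashMap<String, Weak<String>>>,
}

impl WeakConsTable {
    fn intern(&self, value: String) -> Arc<String> {
        let mut table = self.table.lock().unwrap();
        if let Some(weak) = table.get(&value) {
            // Upgrading checks that the strong count is still nonzero.
            if let Some(strong) = weak.upgrade() {
                return strong;
            }
        }
        let strong = Arc::new(value.clone());
        // The table holds only a weak reference, so it does not keep
        // otherwise-dead objects alive. (A full implementation would also
        // purge dead entries, e.g. from the object's destructor.)
        table.insert(value, Arc::downgrade(&strong));
        strong
    }
}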
Reference Counting Optimizations
Cache Invalidation and Immutability
Updating a reference count modifies a cache line. One of the strengths of immutable data, such as Hirpdag objects, is that it can be shared between threads; reference counts stored adjacent to objects will dirty otherwise-unmodified shared cache lines. Keeping the counts elsewhere lets the unmodified cache lines remain in the Shared state in every core's cache.
Coalescing and Elision
Coalescing reference count modifications together can reduce update operations by 50-90%. Only the first increment and the last decrement need to remain in place; intermediate increment/decrement pairs cancel out. This kind of statement reordering and combining would need compiler support to be applied comprehensively.
Comparisons
Pointer Equality
Compare object pointers for equality or inequality.
Ordering
Object pointers cannot be used for ordering in many contexts; they are unstable between runs and carry no meaning.
A deterministic ordering is desirable, which reflects a partial order of the global Hirpdag object DAG.
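One workable scheme (an assumption of this sketch, not the only option) orders by cached DAG height first and stable hash second; since every object's height exceeds its children's, the result is deterministic and consistent with the DAG's partial order (up to hash collisions).

use std::cmp::Ordering;

struct Meta {
    dag_height: u32, // Longest path to a leaf; children are always shorter.
    stable_hash: u64,
}

// A child always sorts before any object that references it.
fn order(a: &Meta, b: &Meta) -> Ordering {
    a.dag_height
        .cmp(&b.dag_height)
        .then(a.stable_hash.cmp(&b.stable_hash))
}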
Normalization
Performed on construction, so only normalized objects ever exist.
Rewriting
Apply rewrite rule to self. Rewrite all children. Construct a replacement for self, if anything changed.
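A sketch of this procedure over a toy expression type; rewrite_rule stands in for an arbitrary rule set (here: collapse single-operand sums), and self is only reconstructed when a child actually changed, preserving the identity of untouched subgraphs.

use std::rc::Rc;

enum Expr {
    Var(String),
    Sum(Vec<Rc<Expr>>),
}

// One illustrative rule: a Sum with a single operand collapses to it.
fn rewrite_rule(expr: &Rc<Expr>) -> Option<Rc<Expr>> {
    match expr.as_ref() {
        Expr::Sum(operands) if operands.len() == 1 => Some(Rc::clone(&operands[0])),
        _ => None,
    }
}

fn rewrite(expr: &Rc<Expr>) -> Rc<Expr> {
    // Apply rewrite rules to self.
    let expr = rewrite_rule(expr).unwrap_or_else(|| Rc::clone(expr));
    // Rewrite all children.
    if let Expr::Sum(operands) = expr.as_ref() {
        let new_operands: Vec<Rc<Expr>> = operands.iter().map(rewrite).collect();
        // Construct a replacement for self only if anything changed;
        // otherwise keep the existing object (and its identity).
        let changed = operands
            .iter()
            .zip(&new_operands)
            .any(|(old, new)| !Rc::ptr_eq(old, new));
        if changed {
            return Rc::new(Expr::Sum(new_operands));
        }
    }
    expr
}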
Memoization
Enabled by immutability and reference counting.
Hash map of reference to reference. Key is input, value is output.
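A sketch of the memoization table, assuming Rc-based references. With hash consing, an object's address is a stable identity, so the raw pointer can serve as the key; the input is stored alongside the output so its address cannot be freed and reused.

use std::collections::HashMap;
use std::rc::Rc;

struct Memo<T> {
    // Key is the input object's address; value keeps both input and output alive.
    cache: HashMap<*const T, (Rc<T>, Rc<T>)>,
}

impl<T> Memo<T> {
    fn get_or_compute(&mut self, input: &Rc<T>, f: impl FnOnce(&Rc<T>) -> Rc<T>) -> Rc<T> {
        let key = Rc::as_ptr(input);
        if let Some((_input, output)) = self.cache.get(&key) {
            return Rc::clone(output); // Cache hit: no recomputation.
        }
        let output = f(input);
        self.cache.insert(key, (Rc::clone(input), Rc::clone(&output)));
        output
    }
}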
Serialization
Leaf objects should appear before other objects which use them. The serialization ordering should be a valid bottom up partial order. A post-order traversal will produce one possible linear ordering.
It is good to mark the root reference handles in the serialization format.
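A sketch of post-order serialization over a toy node type: children are emitted before parents, and the seen-set ensures each shared object is written only once.

use std::collections::HashSet;
use std::rc::Rc;

struct Node {
    label: String,
    children: Vec<Rc<Node>>,
}

// Post-order traversal: children appear before their parents, and each
// shared object is emitted only once, giving a valid bottom-up ordering.
fn serialize(node: &Rc<Node>, seen: &mut HashSet<*const Node>, out: &mut Vec<String>) {
    if !seen.insert(Rc::as_ptr(node)) {
        return; // Already emitted.
    }
    for child in &node.children {
        serialize(child, seen, out);
    }
    out.push(node.label.clone());
}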
Cache coherent datastructures
Immutable data will typically make non-contiguous data structures (such as tree-sets, tree-maps, and linked-lists) less appealing.
Cache lines are 64 bytes on most modern x86 systems.
$ getconf LEVEL1_DCACHE_LINESIZE
64
Object Metadata
Each Hirpdag Object has a small amount of metadata attached to it. This includes:
- DAG Height
- DAG Count
- Content flags
The content flags are intended to provide a hint to avoid unnecessary traversals.
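A sketch of the metadata layout; the field widths and the exact meaning of the flags are illustrative assumptions.

struct ObjectMetadata {
    dag_height: u32,    // Longest path from this object down to a leaf.
    dag_count: u32,     // Number of objects reachable from this one.
    content_flags: u16, // Bitflags summarizing the subgraph's contents,
                        // e.g. "contains a node of kind X", so traversals
                        // that cannot find anything can be skipped.
}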
Reference Count Update Elision
Incrementing and decrementing reference counts can be expensive, and these operations may occur frequently. Because the count is shared, the updates must be atomic.
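Compilers for reference-counted languages (e.g. Swift's ARC) perform such elision automatically. In Rust, some updates can be elided by hand by borrowing a reference instead of cloning it; a sketch:

use std::sync::Arc;

// Taking the Arc by value forces the caller to clone: one atomic
// increment at the call site and one decrement when `data` drops here.
fn sum_by_value(data: Arc<Vec<i32>>) -> i32 {
    data.iter().sum()
}

// Borrowing elides both updates: the caller's reference keeps the
// object alive for the duration of the call.
fn sum_by_borrow(data: &Arc<Vec<i32>>) -> i32 {
    data.iter().sum()
}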
Techniques
Techniques for using Hirpdag to maximize effectiveness.
Referential Transparency
Referential Transparency @ Wikipedia
Hirpdag objects should generally be designed to have referential transparency.
Objects which are identical should not have different meanings in a different context/environment.
Common Normalization
Hirpdag objects should apply normalization to increase the effectiveness of hashconsing.
Normalization is important for pointer inequality to correspond with semantic inequality. Fast pointer-equality comparison is one of the key features hash consing provides. Without good normalization, deeper comparisons are needed and the pointer equality benefit of hashconsing is lost.
Order Normalization
y+x → x+y
Sort commutative operands. Prefer flatter expression trees to make this easier.
Semantic Normalization
x+x → 2*x
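A sketch of both normalizations on a flat n-ary sum (the types and names are illustrative): sort the operands, then collapse runs of equal operands into multiplications.

#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Expr {
    Var(String),
    Mul(u32, Box<Expr>),
    Sum(Vec<Expr>),
}

fn normalize_sum(mut operands: Vec<Expr>) -> Expr {
    // Order normalization: sort commutative operands, so y+x and x+y
    // produce the identical object and hash-cons together.
    operands.sort();
    // Semantic normalization: collapse runs of equal operands, x+x -> 2*x.
    let mut normalized: Vec<Expr> = Vec::new();
    for op in operands {
        match normalized.last_mut() {
            Some(Expr::Mul(count, prev)) if **prev == op => *count += 1,
            Some(prev) if *prev == op => *prev = Expr::Mul(2, Box::new(op)),
            _ => normalized.push(op),
        }
    }
    Expr::Sum(normalized)
}

Invoked from the Sum constructor, this guarantees that only normalized sums are ever hash consed.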
Structuring for Persistence and Normalization
The structure of Hirpdag objects can have a big impact on the effectiveness of normalization and persistence.
Prefer Flatter Structures
Consider:
A=a+b+d+e
As a binary tree (before normalization) it might look like:
A=Sum(a, Sum(b, Sum(d, e)))
A=Sum(Sum(a, b), Sum(d, e))
A=Sum(Sum(Sum(a, b), d), e)
With a binary tree representation, the first question is: which of these semantically equivalent structures is the normalized form?
Consider:
B=a+d+b+e
As a binary tree (before normalization) it might look like:
B=Sum(Sum(a, d), Sum(b, e))
B is semantically equivalent to A, and should normalize to the same thing.
In this case, the order of the operands needs to change.
With a binary tree representation, the second question is: what is necessary to normalize the operand order? Traversing the existing tree is necessary to gather these operands for sorting. Performance-wise, this is similar to traversing a linked list (i.e. bad).
As an n-ary tree: A=Sum(a, b, d, e)
More contiguous. Easier to sort. Easier to traverse. Easier to construct. Easier to normalize (just sort).
When used as a persistent data structure, this means changing one n-ary Sum object rather than several binary Sum objects.
In general, if a Hirpdag object can refer to other Hirpdag objects of the same type and ordering is not important, this suggests you should consider changing their structure to combine them into one flattened Hirpdag object.
Not too big, not too small
If a Hirpdag object has too much information, deduplication opportunities will be unlikely.
If a Hirpdag object has too little information, encoding a useful piece of information will require many objects. This will have a negative impact on performance due to:
- Worse memory access patterns chasing pointers (like a linked list).
- More time spent allocating/deallocating.
Consider which fields may be large (e.g. a vector field may grow large). Consider which fields will need to mutate together.
Encoding Graphs
If the graph to store is acyclic, it could be directly constructed.
If the nodes or edges carry some information, they should likely be separate nodes. This makes the graph better for persistence.
An adjacency list or edge list can encode the graph structure itself.
type NodeIndex = u32;

struct Node {
    name: String, // Information carried by the node.
}

struct Graph {
    nodes: Vec<Node>,                   // Node data, indexed by NodeIndex.
    edges: Vec<(NodeIndex, NodeIndex)>, // Sorted list of (from, to) pairs.
}