Simplicity: Sharing of Witness and Disconnect
Blockstream Research Simplicity Language

Simplicity: Sharing of Witness and Disconnect

Andrew Poelstra

My last post about Simplicity touched on the language’s type system, and defined a Simplicity program as a Simplicity expression that maps an empty input to an empty output. We introduced the idea of assertions, which allow a program to abort, and witnesses, which allow programmers to inject input data into Simplicity programs at exactly the places that it’s needed. Together, this means that programs can have nontrivial behavior, even though nominally they have neither inputs nor outputs.

We also confronted the complexity of fully specifying even simple mathematical functions in Simplicity. One tool we used to reduce the complexity is that of jets, which provide optimized implementations of specific Simplicity expressions. This is a useful optimization for blockchain validators, but it also provides value to the Simplicity programmer by encapsulating potentially-complex functionality in a single jet, which can be treated as a black box.

In fact, every Simplicity expression is uniquely identifiable, by an object called its identity Merkle root (IMR). The IMR of a node is computed by hashing its source and target types, its combinator, and its children (if any). Using IMRs, we can uniquely identify any expression, and thereby treat it as a black box, even if we don't have an optimized jet to replace it.

Bit-Flipping, Revisited


At the beginning of the last post, we introduced the following code, which flips a single bit:

bit0 := injl unit : 1 * 1 -> 2
-- imr 31059b22...
bit1 := injr unit : 1 * 1 -> 2
-- imr 9864ae9d...

padR := pair iden unit : 2 -> 2 * 1
-- imr 457121c2...
bitflip := case bit1 bit0 : 2 * 1 -> 2
-- imr 7d5ff8d0...
main := comp padR bitflip : 2 -> 2
-- imr a6ecc3dd...

As we can see from type ascription on the last line, this expression takes a single bit (the “2” type, so named because it has two possible values) and outputs a single bit. In this listing, we have also computed the IMR of each expression. In the future, whenever we see a Simplicity expression whose IMR matches one of these values, no matter how complex or obfuscated it appears to be, we will know exactly what we are looking at.

But recall that, to have a Simplicity program, we need our main expression to have type 1->1. Rather than taking an input value, we will use witness nodes, and rather than having an output value, we will simply assert that any intermediate values have the properties we expect.

As an example of this, let's tweak our expression to get a program that asserts that a witness bit is 0. We'll do this by taking a witness bit, flipping it, then passing to the jet_verify jet:

witbit := witness : 1 -> 2
-- imr ???

-- Assume we have the above expression, with the name bitflip.
-- Notice that its IMR is the same as above.

bitflip :: 2 -> 2
-- imr a6ecc3dd...
zerovfy := comp bitflip jet_verify : 2 -> 1
-- imr b3bca637...

main := comp witbit zerovfy : 1 -> 1
-- imr ???

Here we have an interesting observation. The IMR of a witness node is computed based on the witness value, but when writing a program, we do not yet know what the witness value will be. This means that we do not know the IMR, and we cannot treat the expression as a black box. The same applies to main, which depends on the witness node, and any future expressions that might depend on main.

The same considerations apply to disconnected expressions. When we use the disconnect combinator, we can have entire expressions, not just values, which aren't yet determined. We will talk about disconnect in a later post.

IMRs and CMRs

Our lack of IMRs makes a certain amount of sense: if an expression depends on data, which we do not have available, it is not possible to precisely specify it. On the other hand, we actually do know a fair bit about the program under consideration: we know that it is a comp combinator whose children are a witness and another comp combinator, respectively, and so on. In this case, we even know the exact source and target types of all nodes.

In other words, we have fully defined our program, up to witness data and disconnected expressions that we do not intend to populate until later. We have defined the parts of our program we need to generate addresses and receive coins, but left some parts undefined until we are ready to spend them.

In Bitcoin Script, we have this same distinction between commitment time and redemption time. At commitment time, we choose what public keys to use, and only at redemption time, do we produce signatures with those keys. For historical reasons, Bitcoin Scripts are committed to by taking their hash and embedding this hash in another Script, sometimes called a scriptPubKey but more commonly known as an address.

Taproot generalizes this hash to be the root of a Merkle tree, allowing the same address to commit to multiple scripts, only one of which is revealed at spending time.

In Simplicity we take this a step further, treating our entire program as a Merkle tree. To do so, we define a second hash on each node, the commitment Merkle root (CMR). Unlike the IMR, the CMR does not commit to witness data, disconnected expressions, or types. It only commits to the structure of the program.

When embedding Simplicity programs in the blockchain, we replace the leaves of a Taproot tree, which normally would be Scripts, with the CMRs of Simplicity programs.

We can write out the CMR of each node in our bit-flipping program:

witbit := witness : 1 -> 2
-- cmr bf12681a...
bitflip : 2 -> 2
-- cmr 8f214cce...
zerovfy := comp bitflip jet_verify : 2 -> 1
-- cmr ab61cadb...

main := comp witbit zerovfy : 1 -> 1
-- cmr 58655557...

Every node, including witness and disconnect nodes, has a CMR. And every witness node, in every program, has the same CMR.

Naming and Sharing


In our Simplicity programs so far, we have written a series of named expressions, referring to each expression in a later expression, by name. Naturally, we are not limited to referring to names only once. This concept is essential to using Simplicity. We refer to name reuse, or more generally, collapsing multiple copies of the same expression into one, as sharing.

On the blockchain, there is a consensus rule that Simplicity expressions encoded in transactions be fully shared. That is, it is actually illegal to encode two nodes that have the same IMR. This helps ensure that programs are encoded in the most compact way, and that their encoding is canonical; that is, there is no potential for malleability. But this rule exploits the fact that programs never appear on-chain except when all witness nodes are populated, and therefore every node has an IMR.

At commitment time, when we are defining programs and computing addresses, the story is not so simple. Let's explore the subtleties that come from sharing before our IMRs are set in stone.

Essentially, every time a name is used, it must refer to a node with the same IMR. What makes this surprising is that, as we have seen, the code we write is only actually sufficient to define a node's CMR, not its IMR.

Let's see a simple example of where this might lead to trouble. Let's take a look at our original bit-flipping program, and let's say that we are tired of writing the entire word unit out. So we define an alias

u := unit  -- cmr 62274a89...

and try to use it in our program:

bit0 := injl u : 1 * 1 -> 2
-- cmr bd0cce93...
bit1 := injr u : 1 * 1 -> 2
-- cmr 79a70c6a...

padR := pair iden u : 2 -> 2 * 1
-- cmr 7751cd1c...
bitflip := case bit1 bit0 : 2 * 1 -> 2
-- cmr 1d4aa8f4...
main := comp padR bitflip : 2 -> 2
-- cmr 8f214cce...

But when we try to parse this to get an address, we get a type error defining padR! The reason is that while the type of the unit combinator in bit0 and bit1 is 1 * 1 -> 1, the unit combinator in padR has type 2 -> 1. The source types of unit combinators are completely uninteresting details of a Simplicity program, but it's nonetheless essential[1] that we are consistent about them.

The fix here is simple: we can reuse the name u in bit0 and bit1 but for padR we need a fresh unit combinator.

So far, so good; this sort of problem is familiar to any programmer who works with strongly typed languages, especially ones that support type inference, so that the types of variables are not immediately obvious. But with Simplicity, there is also a more interesting problem we might stumble over.

Consider the following program, which checks whether the user has found a SHA256 "fixed point", i.e., an input whose SHA256 hash is equal to itself. Because SHA256 is a cryptographic hash function, this should be impossible, so coins stored at this program should be unspendable.

    -- Witnesses
    -- CMR: bf12681a...
    -- IMR: [none]
    wit_input := witness

    -- All jets have source type 1; but to use the `pair` combinator
    -- we want one with source type 2^256. To get this, we compose
    -- it with unit.
    -- CMR: a520761f…
    -- IMR: 21182050...
    sha256_init : 2^256 -> _
    sha256_init := comp unit jet_sha_256_ctx_8_init

    -- Using this, we can write a self-contained "take 32 bytes and
    -- compute their sha2 hash" function.
    -- IMR: 8e341445...
    -- CMR: bf70ec35...
    sha256 : 2^256 -> 2^256
    sha256 := comp
        comp
            pair sha256_init iden
            jet_sha_256_ctx_8_add_32
        jet_sha_256_ctx_8_finalize

    -- Finally, our main function takes an input witness, hashes it,
    -- and compares the input and output to check for a fixpoint.
    -- CMR: 9b3b900a...
    -- IMR: [none]
    main := comp
        comp
            pair (comp wit_input sha256) wit_input
            jet_eq_256
        jet_verify

Here we have reused the name wit_input. Our rule is that we can only reuse names when all uses have the same IMR, but here we don't even have an IMR to compare. Will this be a problem?

Well, consider this alternate version of main, which takes a second witness, wit_output, and compares this to the hash of wit_input. This new main is trivial to satisfy: provide anything at all for wit_input and provide its SHA256 hash for wit_output.

    -- Witnesses
    -- CMR: bf12681a...
    -- IMR: [none]
    wit_input := witness
    wit_output := witness

    [...rest of program elided...]

    -- CMR: 9b3b900a...
    main := comp
        comp
            pair (comp wit_input sha256) wit_output
            jet_eq_256
        jet_verify

The important thing to notice here is that these two programs, our fixed-point-checking program and the trivially-solvable program, have the same CMR and, therefore, the same Bitcoin address. This is because Simplicity does not commit to sharing. This subtle, but crucial, fact, means that our fixed-point program, rather than being impossible to spend coins from, is trivial.

Fortunately, Simplicity prevents us from making this mistake. If we attempt to assemble our original version of the program, we get the following error:

Error: witness/disconnect node `wit_input` was accessible by two distinct paths from the same root

What this is telling us, in somewhat jargony language, is that we are simply not allowed to share witness or disconnect nodes, or any expressions which refer to them. This is because such nodes, even when they have identical types and CMRs, may not end up the same.

At redemption time, when witness data and disconnected expressions are present, every node has an IMR and it is unambiguous whether nodes are the same. At that time, if two witnesses were present and had the same types and values, they would be shared. But this sharing depends on the particular values chosen for each witness node, which are not decided at commitment time, and therefore do not reflect anything about the actual structure of the program.

To fix our program, there are a couple of strategies that we could take. One thing we could do is to define two witness nodes, with different names, and use the jet_eq_256 jet to assert that they are both equal. This would work, but increase the complexity of code, making it harder to review or to extend. A more direct approach is to duplicate the output of a single witness node.

We do this by pulling our check out of main into its own expression, whose input value is accessed by the iden combinator twice:

    -- IMR: 5ddb65c2...
    -- CMR: 198bfeb5...
    assert_fixpoint : 2^256 -> 1
    assert_fixpoint := comp
        comp
            pair (comp iden sha256) iden
            jet_eq_256
        jet_verify

    -- CMR: ac8df50b...
    -- IMR: [none]
    main := comp wit_input assert_fixpoint

This program is accepted without error.

Conclusion


As with ordinary Taproot transactions, addresses are derived from commitments to programs, but the programs themselves are not revealed until spending time. Also, as with Taproot, in Simplicity, we reveal only those parts of the program that are actually used.

Bitcoin Script has a strict separation between program code and witness data, so the program revealed on-chain is the same as the program committed to in addresses (aside from pruning). Witness data is encoded separately and provided to the program code as input. Script does not have any notion of sharing, or of disconnected expressions, so functionality such as delegation is missing (and functionality such as loops are very inefficient to implement).

In Simplicity, we instead attach witness data directly to our programs, and when encoding programs, we insist that nodes are never repeated. This combination requires us to have a surprisingly subtle notion of "identity," which, when applied to text-encoded programs, allows us to reuse names in a mostly-natural way. But there are potential pitfalls here, which our tooling helps to avoid.

In our next post, we will explore pruning, which is the process by which unused nodes are not revealed, and its interactions with witnesses and the type system.

Until then, join the Simplicity discussions on GitHub to connect with the community, and follow us at @blksresearch on Twitter to stay updated with the latest Simplicity announcements.


[1]In theory, Simplicity could have supported polymorphic types, which would allow the same unit combinator to be used with different type signatures. But aside from increasing the complexity of using and analyzing the language, this would have increased the computational complexity of type inference, a task that must be done by all validators on all programs.

If you have specific preferences, please, mark the topic(s) you would like to read: