Introduction
Some quick thoughts and comments from Andrew Poelstra. This week AJ Towns wrote a thoughtful, if sometimes blunt, critique of Liquid’s recent and future consensus changes. As he observes, Liquid’s functionary source code is unavailable to the public, forcing him to speculate about specific things that went wrong with our recent deployment of Dynamic Federations. As somebody privy to the source code, architecture and history of the system, I’d like to write a response clarifying some things and responding to specific points that AJ raised. I don’t mean this to downplay Liquid’s recent outage, but rather to give some context for what went wrong and how AJ’s suggestions might have changed things.
The rest of this article will focus on what I think are the main two thrusts of AJ’s critique. One is that Liquid changes could be designed to cleanly “fail open” such that the network continues operating even in the presence of unexpected problems. The other is that Liquid should have well-defined, transparent processes for introducing changes, and that its functionary codebase should be open sourced. Finally, I’ll address a few minor points, such as Liquid’s experimental opcodes conflicting with potential future Bitcoin opcodes.
By the way, AJ, while I love your contributions to the Bitcoin and Lightning ecosystems, if you are looking for a change, Blockstream is hiring.
Train Wreck or Emergency Stop?
On Monday, October 4th, the Liquid network attempted to activate the long-awaited Dynamic Federations update. While the hard fork “succeeded”, in the sense that the consensus rules changed for all nodes in the same way at the same time, the block-signing software running on the functionaries shut down, resulting in a 22-hour outage of the Liquid network. During this time, the watchman software, responsible for running the distributed multisignature wallet holding the Liquid funds, continued to operate, although with no blocks to signal pegins or pegouts, it merely idly swept funds to prevent the coins’ emergency timeout clause from activating.
The reason for the shutdown was that in Dynamic Federations, certain consensus parameters are encoded in the block headers. This is the “raw data” needed for network nodes to validate transactions: the Elements Script representing the blocksigning policy, a limit on the size of blocksigning witnesses used as an anti-denial-of-service measure, the Bitcoin Script representing the watchman policy, and the set of PAK keys used to authorize pegouts.
The functionary software has a richer set of data representing the same thing, called a “consensus parameter entry” (CPE). CPEs contain a full Miniscript encoding of the blocksigning and watchman scripts, using abstract keys which are mapped to functionary peers. Without this abstract form, the functionaries would be unable to produce blocks; they need to be able to match keys to functionaries to determine which functionaries should be signing which blocks (and in which order), know how to contact these functionaries (network addresses and authentication keys), and know when to signal transitions from one CPE to another. If a functionary sees parameters activated on the Liquid blockchain which do not correspond to any CPE, that functionary will shut down, since it no longer understands what is happening on the network.
Under normal circumstances, the parameters active on the Liquid chain will be ones that were explicitly proposed by functionaries in block headers over a multi-week period. However, the initial set of parameters is special: it is computed by the network consensus rules from the original set of blocksigner, watchmen and PAK keys.
The problem happened because in the consensus code, the blocksigner witness limit is hardcoded to the value 1416. However, in the functionary logic, this value is computed from the blocksigner miniscript using rust-miniscript (in fact, using a shim which interprets the non-Miniscript blocksigning script as a Miniscript one), and was computed as 1325. The result: Liquid activated parameters, on the live network, which did not correspond to any CPE that the functionaries knew about. In response to this, a majority of functionaries shut down.
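To make the mismatch concrete, here is a minimal Python sketch of the lookup described above. It is not the functionary code: the names `HeaderParams`, `Cpe`, `match_cpe` and `UnknownParameters` are invented for illustration, the scripts and keys are placeholders, and a real CPE carries far more (Miniscript policies, peer addresses, authentication keys). Only the two witness limits, 1416 and 1325, come from the incident itself.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class HeaderParams:
    """Raw consensus parameters as committed to in block headers."""
    signblock_script: bytes
    signblock_witness_limit: int
    watchman_script: bytes
    pak_keys: Tuple[bytes, ...]


@dataclass(frozen=True)
class Cpe:
    """Hypothetical, heavily simplified consensus parameter entry."""
    name: str
    signblock_script: bytes
    watchman_script: bytes
    pak_keys: Tuple[bytes, ...]
    # In the real system this is derived from the blocksigner miniscript;
    # here it is simply stored.
    computed_witness_limit: int

    def to_header_params(self) -> HeaderParams:
        return HeaderParams(
            self.signblock_script,
            self.computed_witness_limit,
            self.watchman_script,
            self.pak_keys,
        )


class UnknownParameters(Exception):
    """Raised when on-chain parameters match no known CPE."""


def match_cpe(active: HeaderParams, known: Sequence[Cpe]) -> Cpe:
    """Return the CPE whose header form equals the active parameters."""
    for cpe in known:
        if cpe.to_header_params() == active:
            return cpe
    raise UnknownParameters("active parameters match no known CPE")


# Placeholder scripts and keys; only the two limits are taken from the text.
initial = Cpe("initial", b"<signblock script>", b"<watchman script>",
              (b"<pak key>",), computed_witness_limit=1325)
on_chain = HeaderParams(b"<signblock script>", 1416,
                        b"<watchman script>", (b"<pak key>",))

try:
    match_cpe(on_chain, [initial])
    outcome = "keep signing"
except UnknownParameters:
    outcome = "shut down"  # fail closed rather than guess
```

Every field except the witness limit agrees, but the comparison is all-or-nothing, so a single differing integer is enough to leave the functionary without a CPE it recognizes.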
This failure mode illustrates a fundamental difference between Liquid and Bitcoin’s philosophy, which I think AJ is not considering: in Liquid, we “fail closed,” meaning in the presence of problems we prefer to shut down the network rather than continuing in a degraded state. This is very different from Bitcoin, which always tries to “fail open,” making sure that the network stays alive at all costs. There are two reasons for this:
- As AJ says, the HSMs responsible for Liquid blocksigning will not sign a blockchain fork longer than 1 block. This means that if more than ⅓ of the HSMs are induced to sign two invalid blocks in a row, we cannot course-correct and would be forced to hard fork the rest of the network to agree with the confused HSMs. It would be a thorny problem to resolve.
- More philosophically, all the value inside Liquid ultimately resides in the “always on” Bitcoin network. So we do not have an imperative “stay up at all costs” mandate the way that Bitcoin does — in fact, the opposite is true. In Liquid, all of the coins are custodied on Bitcoin by the Liquid watchmen, who follow instructions from the Liquid blockchain which in turn is constructed by the blocksigners. In other words, the solvency of the system depends on a majority of functionaries operating correctly, and given a choice between “operating incorrectly” and “not operating at all” we will pick the latter, every time.
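The first reason is easier to see with a toy model of the HSM rule. The sketch below is an assumption-laden illustration, not the HSM logic (which is proprietary): `HsmSigner`, `ReorgTooDeep` and the height-only bookkeeping are all hypothetical, and a real signer would also track block hashes and validate contents. It only models the idea of refusing to replace more than one already-signed block.

```python
class ReorgTooDeep(Exception):
    """Raised when a signing request would replace too many signed blocks."""


class HsmSigner:
    # Assumed limit, taken from the "fork longer than 1 block" rule above.
    MAX_FORK_DEPTH = 1

    def __init__(self) -> None:
        self.highest_signed = 0  # treat genesis as height 0

    def sign(self, height: int) -> str:
        # Number of blocks this signer has already signed that the
        # requested block would replace.
        depth = self.highest_signed - height + 1
        if depth > self.MAX_FORK_DEPTH:
            raise ReorgTooDeep(
                f"would replace {depth} signed blocks, limit is {self.MAX_FORK_DEPTH}"
            )
        self.highest_signed = max(self.highest_signed, height)
        return f"signature-for-height-{height}"


hsm = HsmSigner()
hsm.sign(1)
hsm.sign(2)       # two blocks in a row, possibly both invalid
hsm.sign(2)       # replacing the tip is still allowed (1-block fork)
try:
    hsm.sign(1)   # replacing both blocks is not
    refused = False
except ReorgTooDeep:
    refused = True
```

Once a signer like this has signed two bad blocks in a row, no software-level instruction will make it sign the chain that replaces both, which is why continuing in a degraded state is so much riskier than stopping.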
Because of this design philosophy, the Liquid functionaries are designed to shut down under a wide variety of circumstances, most of which would be the wrong tradeoff for any critical Bitcoin infrastructure.
We tested this transition on multiple internal Liquid test networks, including a physical HSM-enabled test network of the same size as production. These tests did not expose the bug, which came down to a network-specific hard-coded constant being incorrect.
As a final observation, even if we did want to design for extreme uptime, we would have our work cut out for us, because unlike Bitcoin, we require a majority of functionaries to operate (and be reliably communicating with each other) for the network to progress at all. Bitcoin’s proof of work is memoryless, which means that Bitcoin could cheerfully adapt to 90% of its hashrate disappearing instantly. Not Liquid.
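A back-of-the-envelope sketch of that contrast, in Python. The function names are made up for illustration, and the federation size and threshold used in the usage note are placeholders rather than Liquid's actual parameters. The proof-of-work side uses Bitcoin's 600-second target interval and assumes difficulty has not yet readjusted.

```python
def pow_expected_interval(target_seconds: float, remaining_fraction: float) -> float:
    """Expected time between blocks when only `remaining_fraction` of the
    hashrate is left and difficulty has not yet adjusted."""
    if not 0 < remaining_fraction <= 1:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return target_seconds / remaining_fraction


def federation_can_progress(online: int, threshold: int) -> bool:
    """A threshold-signed chain produces blocks only if at least
    `threshold` signers are online and reachable."""
    return online >= threshold
```

With 10% of the hashrate left, `pow_expected_interval(600, 0.1)` gives blocks roughly every 100 minutes: slow, but alive. By contrast, for a hypothetical federation needing 8 signers, `federation_can_progress(7, 8)` is simply `False`; there is no slow mode, only stopped.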
Source Code and Transparency
The second theme of AJ’s post was that Liquid does not deploy changes with the kind of transparent and careful processes that Bitcoin does. To some extent this is true; Liquid has a much smaller development team and tries to move faster than Bitcoin does, so no matter what, we are not going to live up to Bitcoin’s standards. I agree with AJ’s sentiment that even Bitcoin should be doing better on this front, so, like many developers with a view of the low-level detail, I am disquieted by this situation. But this is the nature of the beast.
AJ makes a couple of great points. One is that the functionary software ought to be open source. Believe it or not, my intention was for this code to have been open-sourced over a year ago. (To answer AJ’s speculation about whether the functionary operators have access to it: yes, it is AGPL-licensed to them.)
As AJ says, there is some value in security by obscurity, although usually less value than there would be in the “many eyes” of open-source development, but in my view this applies only to the logic running directly on the HSM component. Internally, this logic is separate from the rest of the functionary logic, and the interface between them is a binary protocol which we would have no qualms about revealing to the world. So we could release the functionary codebase while keeping the HSM logic proprietary.
However, the HSM and functionary logic exist within the same monorepo, which means that to open source the functionary software, we would need to first separate these repos, which is difficult for several reasons:
- We would need to construct a new monorepo to keep the HSM and functionary repos in sync when doing simultaneous updates, and update our development processes to use this (in many cases this would mean splitting functionary+HSM PRs into two, in different repos with different central servers).
- We would need to physically separate the repos, and verify that we had done so correctly and without revealing anything that we didn’t want to.
- We would need to migrate all the existing issues and PRs to the new public repository, manually separating ones that were overly HSM-focused or security sensitive.
Our goal was to do this “after dynafed was deployed” when there were fewer critical code changes happening at once that would be disrupted. As it has turned out, Dynamic Federations is a much larger project than we’d originally anticipated and there is still work left to do.
Beyond discussing the functionary source code, AJ also observes that consensus changes to Liquid, such as the new opcodes that will be included in Taproot, are done in less time and with less formal documentation than in Bitcoin. As he points out, these changes are open sourced and are accompanied by in-repo documentation, so it is not like we are working in secret and then springing changes on the world.
My first thought, upon reading AJ’s post, was “we are only a couple of people working full-time to maintain the existing documentation, how could we be expected to maintain a Liquid analogue of the Bitcoin BIPs repo?”
But on reflection I think AJ is totally right about this, for two reasons:
- Although we have a small development team, we are blessed to have a large number of highly-skilled reviewers from the Bitcoin community who are willing to spend time and energy reviewing our designs. This includes AJ, of course, but also Dmitry Petukhov, Kalle Woof, Stepan Snigirev, Thomas Eizinger and many others. These reviewers have been immeasurably helpful, and have saved our bacon more than once by privately reporting security issues. Anything we can do to make their lives easier, we should.
- Consensus changes we make on Elements are (usually) ones we hope to one day propose for Bitcoin. (The sole exception to this would be things related to multi-asset support.) In fact, in many cases we try to propose things first for Bitcoin (such as Schnorr signatures) before implementing them on Liquid, both to maximize our compatibility with Bitcoin and to take advantage of the huge and motivated Bitcoin review community. Given this, for features such as the new Taproot opcodes, maybe we should start by writing them in a BIP format, proposing them to an Elements-specific repo if the result was inapplicable to Bitcoin. After all, we’ll eventually have to write a BIP anyway if we want to propose our changes for Bitcoin.
As a further observation, we do have an unfortunate habit of discussing designs on our Blockstream-internal chat system, rather than in public on IRC or on GitHub. We try not to do this so much, but we need to try harder.
Miscellaneous Observations
I have a few small comments for AJ which didn’t really fit into the main body of this article, and I am not on Twitter, so I will just write them here:
- AJ suggests that with a better fork design, we could have avoided a network outage, by having the blocksigners fall back to producing old-style blocks. Aside from our “fail closed” design philosophy outlined above, I think this wouldn’t have worked, because from a consensus point of view, the fork actually succeeded. The reason the blocksigners shut down was that they didn’t like the fork.
- AJ discusses the potential that the new Elements opcodes may conflict with future Bitcoin opcodes, something that “could have been avoided pretty easily” if we had spent more time seeking active feedback on the proposal. We did get a fair bit of outside feedback, and were working in public on this design for the better part of a year, and I don’t think this could’ve been so easily avoided. We spent a lot of time internally discussing ways to avoid these conflicts, including multiple multi-byte opcode designs, and they all seemed to lead to a lot of complexity and extra difference between Bitcoin and Elements in consensus-critical code. Given this cost, and the in-practice unlikelihood of conflicts (how many new opcodes is Bitcoin realistically going to include in this tapleaf version?), and the fairly minor consequences of such conflicts (Bitcoin would just need to change some opcode values when porting from Elements, or vice-versa), we decided to just live with it.
- It is worth noting that the actual near-hard-fork in Elements, where the yet-unreleased Elements 0.21 would not have been able to sync the post-dynafed chain, happened entirely in open-source code. I agree that our functionary code should be open, but this is a reminder that open source is not a panacea.
- In the interest of transparency, by the way, this happened because this PR to Elements 0.18 was not forward-ported to the development branch. We have changed our processes to ensure this does not happen again.
In conclusion, I’d like to thank AJ for digging into Liquid and for providing his feedback. We are all still learning in this space and this dialog is very valuable.
Note: This blog was originally posted at https://medium.com/blockstream/response-to-aj-towns-blog-post-about-liquid-consensus-changes-931c1055dee5