THORChain Consensus Failure Review
Nine Realms takes a deep dive into THORChain's recent consensus failure, covering detection, timeline of events, escalation, remediation, prevention and more.
On Friday, Nov 12, THORChain suffered a consensus failure because an iteration over a map errored at different indexes on different nodes. This caused the chain to halt. After a few initial approaches were tried, the full resync method was chosen. Consensus was restored on Nov 17.
After the network was restored, a secondary issue arose when trading was resumed before every node's bifrost had reached the tip of each chain. This caused some nodes to receive slash points for not observing transactions. Trading was halted on these chains until they caught up.
The detailed report can be read here.
The code has already been reviewed for usages where a map is iterated over, but another review would be a good idea.
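As an illustration of what such a review looks for, here is a minimal Go sketch. Go does not guarantee the order in which a map is ranged over, so a loop that returns on the first error can stop at a different entry on each node. Sorting the keys first makes the stopping point the same everywhere. The names processInOrder, checkNonNegative and the sample balances are hypothetical and are not taken from the THORChain codebase.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errNegative is a hypothetical validation error used for this sketch.
var errNegative = errors.New("negative balance")

// checkNonNegative is a hypothetical per-entry check that can fail.
func checkNonNegative(key string, value int64) error {
	if value < 0 {
		return errNegative
	}
	return nil
}

// processInOrder applies fn to every entry of m in ascending key order.
// It returns the key at which fn first failed, so every node running it
// over the same data stops at the same entry.
func processInOrder(m map[string]int64, fn func(key string, value int64) error) (string, error) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order, independent of map internals

	for _, k := range keys {
		if err := fn(k, m[k]); err != nil {
			return k, err
		}
	}
	return "", nil
}

func main() {
	// Hypothetical data with two invalid entries.
	balances := map[string]int64{"node-c": -5, "node-a": 10, "node-b": -1}

	failedAt, err := processInOrder(balances, checkNonNegative)
	fmt.Println(failedAt, err)
}
```

With a plain `for k, v := range balances` loop, the first failing key could be either node-b or node-c depending on the run. With sorted keys it is always node-b.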
Eliminating this class of errors is almost impossible, but there is a lot to do to make recovery faster going forward. Now that we have a process in place for this, we should try to ensure nodes have nightly snapshots by default.
There should also be better review and less urgency when issuing non-time-critical mimir commands. While this is mostly a human process for now, node-mimir will eventually make this a reality. A better community dashboard for chain health could have helped here.
Other tasks are:
- Alerting for chain halt or slow block times: thorchain/devops/node-launcher#56
- Soft fork: #1166
- Restore slash logic: #1167
- Investigate automatic snapshots for nodes: thorchain/devops/node-launcher#55
- Easily halt trading when the network is down; this currently requires manual tuning via nginx
- Easy mocknet setup with custom build
- Community dashboard including bifrost status (for mimir change inputs)
- Create node-mimir for start commands? (In addition to regular mimir for now, transition period)
- Investigate THORChain sync speedup
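
For the alerting task in the list above, the core check can be sketched in Go as a classification of how old the latest block is. This is only a sketch under stated assumptions: the thresholds, type names and function name are hypothetical, and fetching the latest block time from a node and delivering the alert are left out.

```go
package main

import (
	"fmt"
	"time"
)

// ChainStatus is a coarse health label derived from the age of the latest block.
type ChainStatus string

const (
	StatusOK     ChainStatus = "ok"
	StatusSlow   ChainStatus = "slow"
	StatusHalted ChainStatus = "halted"
)

// Hypothetical thresholds; tune them to the chain's real block time.
const (
	defaultSlowAfter   = 30 * time.Second
	defaultHaltedAfter = 5 * time.Minute
)

// classifyBlockAge maps the time elapsed since the latest block to a status.
func classifyBlockAge(age, slowAfter, haltedAfter time.Duration) ChainStatus {
	switch {
	case age >= haltedAfter:
		return StatusHalted
	case age >= slowAfter:
		return StatusSlow
	default:
		return StatusOK
	}
}

func main() {
	// Illustrative timestamps only, not taken from the incident timeline.
	lastBlock := time.Date(2021, time.November, 12, 12, 0, 0, 0, time.UTC)
	now := lastBlock.Add(10 * time.Minute)

	fmt.Println(classifyBlockAge(now.Sub(lastBlock), defaultSlowAfter, defaultHaltedAfter))
}
```

Keeping the classification separate from the data source means the same check could feed both an alert and a community dashboard.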