Nine Realms reviews the recent THORChain consensus failure, with details on detection, the timeline of events, escalation, remediation, prevention and more.
.@THORChain community: here’s a full review of the last week’s events: https://t.co/k1R1A2UZXZ
— Nine Realms (@ninerealms_cap) November 20, 2021
As we continue to improve the process of developing THORChain, you can expect future post-flights that share learnings, improvements, and next steps.
On Friday, Nov 12, THORChain suffered a consensus failure because an iteration over a map returned errors at different indexes on different nodes. This caused the chain to halt. After a few initial approaches were tried, a full resync was chosen. Consensus was restored on Nov 17.
After the network was restored, a secondary issue arose when trading was resumed before every node's Bifrost had reached the tip of each chain. As a result, some nodes received slash points for not observing transactions. Trading was halted on the affected chains until the nodes caught up.
The detailed report can be read here.
The code has already been reviewed for usages where a map is iterated over, but another review would be a good idea.
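To illustrate the class of bug involved: Go deliberately randomizes map iteration order, so a loop that stops at the first error can stop at a different entry on each node. The sketch below is not THORChain code; `processSorted` and the sample balances are hypothetical names chosen for illustration. It shows one common mitigation, which is to collect the keys, sort them, and iterate in that fixed order so every node fails at the same entry.

```go
package main

import (
	"fmt"
	"sort"
)

// processSorted is a hypothetical helper. It visits map entries in ascending
// key order so that every caller stops at the same entry when an error occurs.
func processSorted(balances map[string]int64) ([]string, error) {
	// Collect the keys first; ranging over the map directly would yield
	// them in an unspecified order.
	keys := make([]string, 0, len(balances))
	for k := range balances {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	visited := make([]string, 0, len(keys))
	for _, k := range keys {
		if balances[k] < 0 {
			// With sorted keys, this is always the same entry on every run.
			return visited, fmt.Errorf("negative balance for %s", k)
		}
		visited = append(visited, k)
	}
	return visited, nil
}

func main() {
	// Illustrative data only: two entries are invalid.
	balances := map[string]int64{"BTC": 5, "ETH": -1, "BNB": 7, "LTC": -3}
	visited, err := processSorted(balances)
	fmt.Println(visited, err) // prints "[BNB BTC] negative balance for ETH"
}
```

Had the loop ranged over `balances` directly, one run could fail on `ETH` and another on `LTC`, leaving different partial results behind, which is the kind of divergence that breaks consensus.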
Eliminating this class of errors is almost impossible, but there is a lot to do to make recovery faster going forward. Now that we have a process in place for this, we should try to ensure nodes have nightly snapshots by default.
There should also be better review and less urgency when issuing non-time-critical mimir commands. While this is mostly a human process for now, node-mimir will eventually enforce it. A better community dashboard for chain health could also have helped here.
Other tasks are: