December 12th Post Mortem
Over the past month, Coinbase has continued to prioritize scaling our backend systems so that we can provide a reliable customer experience during periods of high traffic. Despite our improvements, on December 12th, all-time-high levels of traffic led to periods of slowness and elevated error rates.
On December 11th at 22:00 PST, Coinbase underwent scheduled maintenance to improve database performance. One of the techniques we used during this maintenance window was to distribute database load by splitting existing datasets across new clusters.
On December 12th, starting at 04:00 PST, traffic began to climb as a result of large Litecoin price movements. At 04:15 PST, two of the new clusters from the maintenance window began to experience slowness as a result of the slower network-attached disks used by these instances to complete the migration. This database slowness resulted in elevated response times and high error rates. We resolved these issues by 05:42 PST by failing over to high-performance nodes, which we had begun provisioning the night before.
Following these improvements, site traffic continued to climb, reaching all-time-high levels by 06:14 PST. Starting at 05:58 PST, one of our primary clusters began to experience degraded performance as a result of overwhelming query volume. We worked throughout the day to reduce pressure on this cluster by deploying application-level improvements that cache database queries more efficiently. By 14:00 PST, services were fully restored.
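For readers unfamiliar with the idea, application-level query caching can be sketched roughly as follows. This is a minimal illustration and not the code we deployed: the `TTLQueryCache` class, its `get` method, and the `load` callback are hypothetical names chosen for this example. It assumes that a slightly stale result is acceptable for a few seconds, so repeated identical queries can be answered from memory instead of reaching the database.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLQueryCache:
    """Hypothetical read-through cache with a per-entry time to live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        # Maps a query key to (expiry time, cached result).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable, load: Callable[[], Any]) -> Any:
        """Return the cached result for key, calling load() only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database round trip
        value = load()  # miss or expired: run the real query once
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a cache like this in front of a hot read path, many requests for the same key within the TTL window result in a single database query. The trade-off is that results can be out of date by up to the TTL, so the approach only suits data where that is tolerable.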
Scaling is hard, but we’re working hard to support the millions of people who are discovering cryptocurrency for the first time. We are looking to hire senior backend engineers in San Francisco, London, and New York. If working on this sort of challenge excites you, please see our careers page.