Postmortem: Outage of a Popular Online Shopping Site

Eric Okemwa
May 14, 2023

--

As the saying goes, “To err is human, but to foul things up requires a computer.” In the world of technology, unexpected outages and system failures are a fact of life. Whether you’re a seasoned developer or a newbie, you know it’s not a question of if something will go wrong but when. In this postmortem, we’ll look at a recent outage I faced and explore how we resolved it and prevented it from happening again. So buckle up, grab a cup of coffee, and let’s dive into the world of debugging!

Duration: 3:00 PM UTC to 7:00 PM UTC, June 1, 2022 (4 hours)

Impact: The online shopping site experienced a complete outage; users could not access the site or complete any transactions. The site averaged 500,000 users per hour, all of whom were affected by the outage.

Root Cause: The root cause of the outage was a misconfiguration in the load balancer that resulted in high CPU usage on the web servers. The increased CPU usage caused the servers to become unresponsive, taking the site down.
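
The exact production configuration isn’t worth reproducing here, but the failure mode is easy to illustrate: when a load balancer’s weights or health checks are wrong, most traffic piles onto a few backends and their CPUs saturate while the rest sit idle. Below is a minimal Python sketch of that effect; the server names, weights, and weighted-random routing policy are illustrative assumptions, not our actual setup.

```python
import random
from collections import Counter

# Hypothetical backend pool. With a correct configuration every web server
# carries an equal share of traffic; the broken weights below push almost
# all requests onto web-1 and web-2.
correct_weights = {"web-1": 1, "web-2": 1, "web-3": 1, "web-4": 1}
broken_weights = {"web-1": 10, "web-2": 10, "web-3": 0, "web-4": 0}


def route_requests(weights, n_requests=100_000):
    """Simulate weighted-random routing and return requests per backend."""
    servers = list(weights)
    return Counter(
        random.choices(servers, weights=[weights[s] for s in servers], k=n_requests)
    )


print("correct:", route_requests(correct_weights))
print("broken: ", route_requests(broken_weights))
# With the broken weights, two servers absorb all of the traffic,
# which is roughly the CPU pile-up we saw on the affected web servers.
```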

Timeline:

  • 3:00 PM UTC: The issue was detected when monitoring tools reported a sudden increase in CPU usage on the web servers (a simplified version of this kind of threshold check is sketched after the timeline).
  • 3:10 PM UTC: The site reliability engineer on-call was alerted to the issue and began investigating.
  • 3:30 PM UTC: The engineer discovered that the load balancer was misconfigured, causing an overload on the web servers.
  • 4:00 PM UTC: The team began taking action to resolve the issue by correcting the load balancer configuration.
  • 4:30 PM UTC: Further investigation revealed that the issue had also caused data inconsistencies in the database.
  • 5:00 PM UTC: The team began working on a fix for the data inconsistencies while also continuing to work on resolving the load balancer issue.
  • 6:00 PM UTC: The site was still down, and the team decided to escalate the issue to senior management.
  • 6:30 PM UTC: Senior management was briefed on the situation and provided additional resources to resolve the issue.
  • 7:00 PM UTC: The issue was resolved, and the site was restored to normal functionality.
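
For context on the 3:00 PM detection: the alert that fired is conceptually just a CPU threshold check. The sketch below shows the idea in Python for a Unix-like host; the threshold, polling interval, and notify() hook are placeholder values rather than our real monitoring configuration.

```python
import os
import time

CPU_LOAD_THRESHOLD = 8.0  # illustrative: alert when the 1-minute load average exceeds this
CHECK_INTERVAL_S = 30     # illustrative polling interval


def notify(message: str) -> None:
    """Placeholder for alerting; the real system pages the on-call engineer."""
    print(f"ALERT: {message}")


def watch_cpu() -> None:
    """Poll the 1-minute load average and alert on spikes (Unix-like hosts only)."""
    while True:
        load_1m, _, _ = os.getloadavg()
        if load_1m > CPU_LOAD_THRESHOLD:
            notify(f"1-minute load average {load_1m:.1f} exceeds {CPU_LOAD_THRESHOLD}")
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    watch_cpu()
```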

Misleading investigation/debugging paths: During the investigation, the team initially suspected a DDoS attack because of the sudden spike in CPU usage. However, further investigation revealed that the issue was caused by the misconfigured load balancer, not an attack.
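
A quick way to separate the two hypotheses is to look at the request distribution in the access logs: a volumetric attack usually shows abnormal total volume or a few dominant client IPs, while a load-balancer misconfiguration shows ordinary traffic concentrated onto too few servers. Here is a rough version of that check; the log path and the assumption that the client IP is the first field are illustrative and depend on your log format.

```python
from collections import Counter


def top_clients(log_path: str, n: int = 10) -> list[tuple[str, int]]:
    """Count requests per client IP, assuming the IP is the first
    whitespace-separated field on each access-log line."""
    counts: Counter[str] = Counter()
    with open(log_path) as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts.most_common(n)


# A DDoS usually shows a handful of IPs (or an abnormal total volume)
# dominating the log; ordinary shopping traffic is spread much more evenly.
for ip, hits in top_clients("/var/log/nginx/access.log"):
    print(f"{hits:8d}  {ip}")
```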

Escalation: The incident was initially escalated to leadership within the site reliability engineering team. However, due to the severity of the outage and its impact on users, it was later escalated to senior management within the company.

Resolution: The load balancer misconfiguration was corrected, and the data inconsistencies in the database were resolved. The site was then restored to normal functionality.

Root cause and resolution: The misconfigured load balancer caused high CPU usage on the web servers, which resulted in the outage. The issue was resolved by correcting the load balancer configuration and repairing the data inconsistencies in the database.
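
Fixing this kind of data inconsistency usually amounts to a reconciliation pass: find records left in a half-written state during the outage window and re-queue them for processing. The sketch below shows the shape of such a pass; the orders and payments records are hypothetical and stand in for whatever your schema actually uses.

```python
# Hypothetical records written during the outage window. In production these
# would come from the database, not hard-coded lists.
orders = [
    {"order_id": 101, "status": "pending"},
    {"order_id": 102, "status": "pending"},
    {"order_id": 103, "status": "pending"},
]
payments = [
    {"order_id": 101, "captured": True},
    {"order_id": 103, "captured": True},
]

# Orders with no captured payment are the inconsistent ones.
paid_ids = {p["order_id"] for p in payments if p["captured"]}
orphaned = [o for o in orders if o["order_id"] not in paid_ids]

for order in orphaned:
    # In production this would re-queue the order for reconciliation
    # rather than just printing it.
    print(f"re-queue order {order['order_id']} for reconciliation")
```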

Corrective and preventative measures:

  • Conduct a thorough review of the load balancer configuration to prevent similar issues from occurring in the future.
  • Improve monitoring tools to detect issues with the load balancer more quickly (see the health-check sketch after this list).
  • Increase redundancy in the system to minimize the impact of any future outages.
  • Review the incident response process to identify areas for improvement.
  • Conduct regular training for all team members to ensure they are familiar with the incident response process and prepared to respond to similar incidents.
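
On the monitoring point above: alongside CPU alerts, it helps to probe each backend behind the load balancer directly, so a server that is up but drained or misweighted still gets noticed. A minimal sketch follows; the backend hostnames and the /health endpoint are placeholders.

```python
import urllib.error
import urllib.request

# Placeholder backend list; in practice this would come from the load
# balancer's API or service discovery rather than being hard-coded.
BACKENDS = ["http://web-1.internal:8080", "http://web-2.internal:8080"]


def check_backend(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the backend's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for backend in BACKENDS:
        status = "healthy" if check_backend(backend) else "UNHEALTHY"
        print(f"{backend}: {status}")
```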
