Fastly - website outages show the importance of bug testing
For an hour or so late on Tuesday night, some of the most-visited websites on the internet displayed error messages when people tried to visit them.
Most of us in Aotearoa were preparing for bed, so the disruption was minimal. But social media lit up with reports from overseas of people trying and failing to access Amazon, the BBC, CNN, Twitch and Reddit, all sites with a major digital media focus.
The problem turned out to be down to the US company Fastly, which runs an extensive content distribution network (CDN) for companies all over the world. It allows them to mirror their content (text, images and videos) on servers in 26 countries, improving website access times for their customers and cutting the bandwidth costs associated with running highly trafficked websites.
In May, Fastly pushed out a software update to its servers which went off without a hitch. But earlier this week, a Fastly customer made a configuration change to their content network settings that ended up taking most of Fastly's customers offline.
The timeline of events (Fastly blog)
"On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances," Fastly's head of engineering Nick Rockwell explained in a blog post.
"Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors," he added.
It is hugely embarrassing for Fastly and exposes the vulnerabilities of content distribution resting in the hands of a small number of global players, including Akamai, Cloudflare, Fastly and Amazon's own CloudFront.
"We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal," Rockwell explained.
"We should have anticipated it," he admitted.
"We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support."
At least Fastly was able to take swift action, which appears to have spared it major reputational damage and a hit to its share price. But the incident serves as a lesson to all software developers to undertake thorough software testing before deploying updates.
Fastly has vowed to get to the bottom of how its processes failed. It said it would:
Deploy the bug fix across its network as quickly and safely as possible.
Conduct a complete post mortem of the processes and practices followed during this incident.
Figure out why it didn't detect the bug during software quality assurance and testing processes.
Evaluate ways to improve its remediation time.