Facebook: ‘Routine Maintenance’ Job Disconnected Data Centers Globally
‘That’s where an MSP adds value’ for smaller businesses, Phil Walker, CEO of Network Solutions Provider -- a Microsoft partner based in Manhattan Beach, Calif., and member of CRN’s 2021 Managed Service Provider 500 -- tells CRN in an interview. ‘We can have the redundant plan ready to jump in and save the day.’
Facebook said routine maintenance work caused the hours-long global outage that took down its namesake social network as well as its Instagram and WhatsApp platforms.
“This outage was triggered by the system that manages our global backbone network capacity,” said Santosh Janardhan, VP, Engineering and Infrastructure at Facebook, in a blog post Tuesday. “The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.”
Janardhan said a “command was issued with the intention to assess the availability of global backbone capacity,” which “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”
The backbone’s removal from operation then made Facebook’s DNS servers unreachable. The outage lasted for hours because the DNS loss broke the internal tools Facebook would normally use to investigate and resolve outages.
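The cascade described above can be illustrated with a minimal sketch. The dependency map and service names below are hypothetical, chosen only to mirror the article's description; they are not Facebook's actual architecture. The function walks the map to find everything that goes down, directly or indirectly, when one component fails.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each service lists the components it directly depends on.
DEPENDS_ON = {
    "backbone": [],
    "dns": ["backbone"],
    "internal_tools": ["dns"],
    "remote_access": ["backbone"],
    "user_services": ["dns", "backbone"],
}


def affected_by(failed, depends_on):
    """Return the set of services that become unavailable when `failed` goes down."""
    # Invert the map so each component points at the services that rely on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk over everything downstream of the failed component.
    down = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


if __name__ == "__main__":
    print(sorted(affected_by("backbone", DEPENDS_ON)))
```

In this toy model, losing the backbone takes out DNS, and losing DNS takes out the internal tools that would be used for recovery, which is the pattern the post describes.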
“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems,” according to the Facebook post. “But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
Once Facebook restored backbone network connectivity across data center regions, the company then brought its services back online. It plans to simulate global backbone outage events and create a quicker recovery plan in case the event happens again, according to the post.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” Janardhan wrote in the post.
Phil Walker, CEO of Network Solutions Provider -- a Microsoft partner based in Manhattan Beach, Calif., and member of CRN’s 2021 Managed Service Provider 500 -- told CRN in an interview that Facebook’s outages are a reminder to even smaller businesses to think about multi-cloud, redundant networks and multiple paths to business in case a vendor experiences an outage.
“There’s a way of doing 1A, 1B, in order to fit people’s budgets,” Walker said. “But that’s where an MSP adds value. We can have the redundant plan ready to go to jump in and save the day.”
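The “1A, 1B” idea Walker describes can be sketched as a simple ordered failover. This is a minimal illustration under stated assumptions: the provider names and the health-check callback are hypothetical and do not represent any particular MSP's tooling.

```python
def first_healthy(providers, is_healthy):
    """Return the first provider, in priority order, that passes the health check.

    `providers` is an ordered list (primary first, then backups).
    `is_healthy` is a caller-supplied function; here it stands in for a
    real probe such as a ping or an HTTP status check (an assumption).
    Returns None when every provider is down.
    """
    for provider in providers:
        if is_healthy(provider):
            return provider
    return None


if __name__ == "__main__":
    # Hypothetical example: the primary path (1A) is down, the backup (1B) is up.
    status = {"1A-primary": False, "1B-backup": True}
    active = first_healthy(["1A-primary", "1B-backup"], lambda name: status[name])
    print(active)
```

The design choice is that the backup path is defined and checked ahead of time, so switching over does not depend on the failed vendor being reachable.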
Facebook is continuing to investigate the outages “so we can continue to make our infrastructure more resilient,” according to the post. The outage lasted more than five hours, during which employees had trouble making calls from work-issued cellphones and receiving external emails and were unable to use an internal communications platform called Workplace, according to The New York Times.
Hundreds of thousands of users trying to access Facebook, WhatsApp, Instagram and Facebook Messenger reported outages Monday, according to Downdetector.
Facebook’s post on the cause of the outages included an apology to users.
“To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by” Monday’s outages, the company said in a separate blog post on Monday. “We’ve been working as hard as we can to restore access, and our systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
Facebook’s stock dropped to $323.13 Monday afternoon after opening at $335.52. Tuesday morning, it was trading at about $334 a share.