The 15 Biggest Cloud Outages Of 2022
Configuration errors, record-setting summer heat in the U.K. and ‘an act of vandalism’ were among the culprits affecting cloud use this year.
Although nothing this year reached the level of the December 2021 Amazon Web Services outage in terms of scale, 2022 saw plenty of disruption to cloud use from major vendors such as Google, Oracle and Microsoft and tools including Zoom, Slack and IT Glue.
Configuration errors, record-setting summer heat and even “an act of vandalism” were among the culprits vendors blamed for downing cloud services.
But “compared to on-site hardware, cloud-based infrastructure results in more frequent downtime but with less severity,” according to an October report from IT services provider PhoenixNAP. The company recommends data backups, disaster recovery and similar services to avoid downtime, which can cost thousands of dollars a minute.
2022’s Biggest Cloud Outages
In an October report, research firm Gartner said that worldwide end-user spending on public cloud services is forecast to grow 20.7 percent to total $591.8 billion in 2023, up from $490.3 billion in 2022.
Infrastructure as a Service (IaaS) should see the highest end-user spending growth in 2023 at 29.8 percent with IT budgets still susceptible to a down economy and inflationary pressures, according to Gartner.
Platform as a Service (PaaS) and Software as a Service (SaaS) should see the most effects from inflation due to staffing challenges and the focus on margin protection, while still achieving more than 15 percent growth each next year, according to Gartner.
Here are some of the biggest cloud outages of 2022.
January Google Cloud Outage
Google started the year with three hours and 22 minutes of increased latency on Jan. 8 at its U.S. West 1B region in Oregon. The outage lasted from 2:15 p.m. Pacific to 5:36 p.m. Pacific.
The outage was caused by “a routine maintenance event” performed on a software-defined networking (SDN) component and by the reconciliation of a checkpoint that was missing configuration information. Switches subsequently crashed, requiring on-call engineers to conduct repairs.
The incident affected the Mountain View, Calif.-based company’s services including Google Cloud Networking, Google Cloud DNS, Cloud Run, Cloud Spanner and Google Compute Engine, according to the company.
Two January IBM Outages
IBM Cloud had a tough start to the year with an issue with its classic infrastructure network, which “provides connectivity across a global footprint of over 60 IBM Cloud data centers and 28 points of presence (PoPs),” according to the tech giant.
Armonk, N.Y.-based IBM started investigating the issue on Jan. 2 and resolved it in about five hours, according to a report. IBM Cloud services users in the Dallas area were affected.
The next day, an issue with IBM’s virtual private cloud offering lasted for about an hour, according to the company. The issue affected users in Washington, D.C.; Japan; London; Dallas; Toronto; Germany and other areas, according to an IBM report.
February Slack Outage
Salesforce subsidiary Slack experienced multiple incidents throughout 2022, but the only issue labeled an outage by the company happened in February.
In its summary of the outage, Slack reported that users couldn’t access the collaboration application from 6 a.m. Pacific to 9:14 a.m. Pacific.
The San Francisco-based company explained that a configuration change led to an increase in database infrastructure activity, causing the databases to fail to serve requests to connect to Slack. Tighter rate limits were introduced, blocking Slack for those not already connected. Once the system stabilized, the company lifted the rate limits, then reimposed them when the system was overwhelmed again.
Almost 11,000 reports of a Slack outage were logged on Downdetector at 6:19 a.m. Tuesday.
March Google Cloud Outage
On March 8, users of Google’s Traffic Director tool experienced “elevated service errors for 2 hours and 35 minutes,” according to the cloud giant. Services such as Spotify and Discord were hit by the outage.
“A change to the Traffic Director code that processes the configuration was updated,” according to a post by Mountain View, Calif.-based Google. “The code change assumed that the configuration data format migration was fully completed. In fact, the data migration had not completed.”
The post continued: “It would inadvertently delete the configurations which caused the downstream clients to lose their programming and deconfigure the data plane.”
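The failure mode Google describes, cleanup code acting on the assumption that a migration has finished, can be illustrated with a minimal sketch. This is not Google's code: the two dictionaries and the function names `migration_complete` and `retire_old_format` are hypothetical, and the sketch only shows the general idea of checking the migration before deleting anything.

```python
def migration_complete(old_store: dict, new_store: dict) -> bool:
    """Return True only if every old-format entry has a new-format copy."""
    return all(key in new_store for key in old_store)


def retire_old_format(old_store: dict, new_store: dict) -> None:
    """Delete old-format entries, but only after verifying the migration.

    Hypothetical guard: refuse to delete while any entry is unmigrated,
    rather than assuming the migration has finished.
    """
    if not migration_complete(old_store, new_store):
        missing = sorted(set(old_store) - set(new_store))
        raise RuntimeError(f"migration incomplete, still missing: {missing}")
    old_store.clear()


# Example: one entry has not been migrated yet, so nothing is deleted.
old_configs = {"route-a": "v1", "route-b": "v1"}
new_configs = {"route-a": "v2"}
try:
    retire_old_format(old_configs, new_configs)
except RuntimeError as err:
    print(err)
```

In this sketch the cleanup fails loudly and leaves the old entries in place until the remaining entry is migrated.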
April Atlassian Outage
An Atlassian outage started on April 5, with some customers restoring services by April 8 and the rest waiting until April 18, according to the company.
The cloud tools provider, which has offices in Australia and San Francisco, said the outage was due to a “communication gap” between teams working on deleting a stand-alone legacy application and “insufficient system warnings.”
“Although this was a major incident, no customer lost more than five minutes of data,” according to the company, whose most notable products include Jira and Trello. “In addition, over 99.6 percent of our customers and users continued to use our cloud products without any disruption during the restoration activities.”
To prevent the issue in the future, the company has plans for universal “soft deletes” across all systems; adding more customers to its automatic restoration program for multi-site, multi-product deletion events; and creating a large-scale incident communications playbook.
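A “soft delete” generally means marking a record as deleted instead of erasing it, so the deletion can be reversed. The following is a minimal sketch of that idea, not Atlassian's implementation; the `Site` and `SiteStore` classes and their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Site:
    """Hypothetical customer record; not Atlassian's data model."""
    site_id: str
    deleted_at: Optional[datetime] = None  # None means the site is live


@dataclass
class SiteStore:
    sites: Dict[str, Site] = field(default_factory=dict)

    def add(self, site_id: str) -> None:
        self.sites[site_id] = Site(site_id)

    def soft_delete(self, site_id: str) -> None:
        # Mark the record as deleted instead of removing it.
        self.sites[site_id].deleted_at = datetime.now(timezone.utc)

    def restore(self, site_id: str) -> None:
        # Undo a soft delete by clearing the marker.
        self.sites[site_id].deleted_at = None

    def live_sites(self) -> List[str]:
        # Readers only see records without a deletion marker.
        return sorted(s.site_id for s in self.sites.values() if s.deleted_at is None)


store = SiteStore()
store.add("acme")
store.add("globex")
store.soft_delete("acme")
print(store.live_sites())  # "acme" is hidden but still recoverable
store.restore("acme")
print(store.live_sites())
```

Because the data is never physically removed at deletion time, a mistaken delete in this sketch is undone by clearing one field rather than by restoring from backups.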
Spring IT Glue Outages
IT documentation software vendor IT Glue, part of Kaseya, sustained multiple outages this year.
The longest outage appears to have happened on March 31. The vendor posted at 5:51 a.m. Pacific to say an issue prevented access to IT Glue’s North American Data Center and caused 502 or “bad gateway” error messages. IT Glue said it resolved the issue by 8:12 a.m. Pacific, only to post a message 45 minutes later saying the issue happened again.
By 11:36 a.m. Pacific, almost six hours after the original post, IT Glue said the issue was finally solved.
Although the outage didn’t appear to get much attention on Reddit, later outages in the season elicited dozens of comments on the r/MSP forum of the social media network.
On April 4 at 6:35 a.m. Pacific, IT Glue posted to its status page to say once again users might receive “bad gateway” messages due to an issue with the North American Data Center. The issue was resolved by 7:26 a.m. Pacific.
A post on the r/MSP subreddit about the outage garnered 157 upvotes and 105 comments. “This is beyond unacceptable at this point,” one user wrote.
A Reddit post about an IT Glue outage on May 11 garnered 86 upvotes and 93 comments. “Glad we pay for this service,” the original poster wrote.
The vendor posted on its status page at 12:20 p.m. Pacific to acknowledge that users were seeing “502/500 error pages on certain pages in the app.” Vancouver, B.C.-based IT Glue resolved the incident 28 minutes later.
June Microsoft Outages
On June 7, customers had trouble connecting to resources hosted in the East U.S. 2 region, located in Virginia, according to Microsoft. The issue lasted for about 12 hours and should not have affected customers with always-available or zone-redundant services.
The Redmond, Wash.-based tech giant blamed the outage on “an unplanned power oscillation in one of our datacenters within one of our Availability Zones in the East US 2 region,” according to a Microsoft report.
It continued: “Components of our redundant power system created unexpected electrical transients, which resulted in the Air Handling Units (AHUs) detecting a potential fault, and therefore shutting themselves down pending a manual reset.”
The outage affected Application Insights, Log Analytics, Managed Identity Service, Media Services and NetApp Files, according to the report.
Microsoft is working on ways to “improve our tooling and processes to flag anomalies more quickly” and “fine-tuning our alerting to inform on-site data center operators more comprehensively,” according to the report.
The company is also “developing a plan for fault injection testing relevant critical environment systems, in partnership with our industry partners, to be even more proactive in identifying and remediating potential risks” and “expanding how many Azure services support Availability Zones, so that customers can opt for automatic replication and/or architect their own resiliency across services.”
On June 21, Microsoft tweeted that it was investigating delays and connection issues with Exchange Online. About two hours later, the company tweeted that it “determined multiple Microsoft 365 services are experiencing delays, connection and search issues,” responding by rerouting traffic.
About nine hours later, Microsoft tweeted that “rerouting traffic combined with targeted infrastructure restarts has successfully restored service access and functionality.”
June Cloudflare Outage
An accidental outage at Cloudflare in June caused major disruptions across large swaths of the internet, reportedly hitting popular sites such as Discord, Shopify, Grindr, Fitbit and Peloton.
IsDown reported that 230 services, including GitLab, Notion, HubSpot, DigitalOcean, Monday.com and Recurly, registered incidents during the outage.
The San Francisco-based vendor, which offers security and performance services for cloud deployments, said the problem was the result of “our error” and was fixed within about an hour and 15 minutes.
In a blog post, Cloudflare said the outage in the early hours of Tuesday affected traffic in 19 of its data centers.
“Unfortunately, these 19 locations handle a significant proportion of our global traffic,” the company said. “A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.”
The company concluded in the introduction of its blog post: “We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”
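The timestamps Cloudflare quoted can be checked against the “about an hour and 15 minutes” figure with a short calculation. The sketch below uses only the three UTC times from the blog post; the variable and function names are illustrative.

```python
from datetime import datetime

FMT = "%H:%M"

# UTC times quoted in Cloudflare's blog post (all on the same day).
outage_start = datetime.strptime("06:27", FMT)
first_site_back = datetime.strptime("06:58", FMT)
all_sites_back = datetime.strptime("07:42", FMT)


def minutes_between(earlier: datetime, later: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


time_to_first_recovery = minutes_between(outage_start, first_site_back)
total_outage = minutes_between(outage_start, all_sites_back)

print(f"First data center back after {time_to_first_recovery} minutes")
print(f"All data centers back after {total_outage} minutes")
```

The total works out to 75 minutes, consistent with the company's description, and the first data center was back after 31 minutes.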
A post about the outage on social media network Reddit garnered 800 upvotes and 101 comments.
July Microsoft Teams Outage
On July 20 at 6:47 p.m. Pacific, Microsoft reported that its Teams collaboration application was inaccessible.
In a tweet, the vendor blamed “a recent deployment” that “contained a broken connection to an internal storage service, which has resulted in impact.” Multiple Microsoft 365 services with Teams integration were affected, including Word, Office Online and SharePoint Online.
Downdetector.com showed more than 4,800 incidents in the U.S. and more than 18,200 in Japan, according to Reuters.
The company tweeted at 5:02 a.m. Pacific to say that “the majority of services have recovered.”
July Heat In London Burns Google, Oracle
Record summer heat forced Google and Oracle into outages when cooling systems failed at data centers in London.
On July 19 at about 4 p.m. local time, Austin, Texas-based Oracle reported that two data center cooler units failed at its U.K. South data center, according to the vendor’s report on the incident.
“As a result, temperatures in the data center began to climb causing a subset of Compute infrastructure to go into protective shutdown,” according to the report.
It continued: “A subset of Oracle Cloud Infrastructure customers experienced a delay in recovering access to their resources hosted in the UK South (London) region with dependencies on the affected Compute Infrastructure.”
The issue was resolved by 10 a.m. local time the next day.
Google posted about its own issues at the Europe West 2 region at 6 p.m. local time and said the problem was fixed by 7 a.m. local time the next day.
Google services hit by the outage include Cloud Memorystore, Cloud SQL, Cloud Storage, BigQuery, Managed Service for Microsoft Active Directory and Google Kubernetes Engine.
“One of the data centers that hosts zone europe-west2-a could not maintain a safe operating temperature due to a simultaneous failure of multiple, redundant cooling systems combined with the extraordinarily high outside temperatures,” according to the vendor. “We powered down this part of the zone to prevent an even longer outage or damage to machines. This caused a partial failure of capacity in that zone, leading to instance terminations, service degradation, and networking issues for a subset of customers.”
A post about the outages on Reddit gained 5,000 upvotes and 257 comments.
July AWS Region Outage
On July 28, Seattle-based AWS experienced a power loss in a single availability zone of the U.S. East 2 region—located in Ohio—that lasted about 20 minutes but knocked out third-party services for up to three hours, according to a report from ThousandEyes.
The loss of power started at 9:57 a.m. Pacific and was restored at 10:19 a.m. Pacific, according to the report. Customers with redundancy across multiple availability zones likely failed over to a working zone.
The outage affected Amazon’s Elastic Compute Cloud (EC2), Webex and Okta, among other services and ISVs, according to the report.
In its own report on the incident, Metrist said the outage affected AWS’ CloudFront, CloudWatch, Amazon Elastic Kubernetes Service (EKS) and Lambda services, among others.
Metrist also credited the outage with affecting service from ISVs including Zoom and New Relic.
August Google Outage
On Aug. 9, a Google data center in Iowa experienced an electrical explosion, causing injuries to three people.
However, the vendor said the explosion was unrelated to more than 30,000 reports to Downdetector of issues with Google Search, YouTube and Google Maps.
The outage “was the result of an internal error, and we don’t have further details to share,” Google said at the time.
September Zoom Outage
Downdetector reported that an outage for videoconferencing application Zoom began at 10:31 a.m. Eastern on Sept. 15, with more than 34,000 reports of outages as of 11:11 a.m. Eastern.
The website showed outages were reported around the U.S. in Boston, New York City, Washington, D.C., and San Francisco.
By 11:49 a.m. Eastern, San Jose, Calif.-based Zoom reported on its service status page that the “incident has been resolved.”
“Thank you all for your patience and our sincere apologies for the disruption,” Zoom tweeted.
October Zscaler Outage
On Oct. 20, cybersecurity vendor Zscaler blamed vandalism for a severed fiber cable in the south of France that affected internet users for almost an entire day in the U.S., Europe and Asia.
The incident started Oct. 19 around 3 p.m. Eastern and was resolved around 2 p.m. Eastern the next day.
Zscaler CEO Jay Chaudhry took to LinkedIn to say “our investigation identified that the issues were a result of a severed fiber cable in Marseille, France” and that the incident was “an act of vandalism.”
December Rackspace Ransomware Attack
The aftermath of a ransomware attack against Rackspace continues to play out, with Rackspace posting online Wednesday to say that more than two-thirds of customers have moved to Microsoft 365 environments.
Rackspace also said that third-party cybersecurity vendor CrowdStrike, enlisted to help with the aftermath, confirmed that the attack was limited to hosted Microsoft Exchange environments and that there has been no attacker activity since Dec. 2.
San Antonio-based Rackspace also said that the FBI is investigating the attack.
On Dec. 5, a Monday, Rackspace slowly began restoring email services to thousands of Microsoft 365 customers after a security incident caused a massive weekend-long outage tied to its hosted Exchange.
The company initially didn’t say on Friday that the outage was caused by a security incident. But by 1:57 a.m. on Saturday, Rackspace confirmed it.
“On Friday, Dec 2, 2022, we became aware of an issue impacting our Hosted Exchange environment,” the company said in a post update.
“We proactively powered down and disconnected the Hosted Exchange environment while we triaged to understand the extent and the severity of the impact. After further analysis, we have determined that this is a security incident.”