Coinbase reports service issues, points finger at AWS cloud outage

A localised cooling failure at an Amazon Web Services (AWS) data centre is being blamed for a disruption that crippled services for cryptocurrency exchange Coinbase and derivatives giant CME Group. According to a report by news agency Reuters, the issue, which AWS attributed on its Health Dashboard to “increased temperatures” within a single data centre in its Northern Virginia region, forced the cloud giant to scramble for additional cooling capacity to prevent a wider meltdown.

What Coinbase is claiming

Coinbase is pointing to a failure “in the AWS US-EAST-1 Region”, which is the US East (N. Virginia) region. Here’s Coinbase’s post on X:

On May 7th Coinbase experienced service disruptions. Here’s a quick summary of what happened:

→ Around 8PM ET, Coinbase systems flagged high error rates across multiple services.
→ We traced these errors to amazon failures in Availability Zone (use1-az4) in the AWS US-EAST-1 Region.
→ Coinbase systems are designed to be resilient to a single zone outage, and are designed to recover quickly if this happens.
→ In this case, we observed failures impacting multiple AWS zones, which caused an extended outage of core trading services.
→ Coinbase users experienced an extended outage while the AWS team worked to restore temperature controls and other Amazon Managed Services.

This primary issue is now fully resolved – thank you for your patience. If you have any outstanding questions about your account, please reach out to Coinbase Support, we’re ready to help.

Our team will conduct a full analysis. Details may change as our investigation progresses and more information is received from AWS’s official retrospective, once published.
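Coinbase’s claim of single-zone resilience rests on spreading capacity across Availability Zones. As a rough illustration only (not Coinbase’s actual tooling), a short boto3 script can show how an account’s running EC2 instances are distributed across zones, mapping the friendly zone names to the zone IDs (such as use1-az4) that AWS status updates reference. The region and credentials below are assumptions.

```python
# Illustrative sketch, not Coinbase's tooling: count running EC2 instances
# per Availability Zone to spot concentration in a single zone.
import boto3
from collections import Counter

ec2 = boto3.client("ec2", region_name="us-east-1")

# Zone names (us-east-1a) map to different zone IDs (use1-az4) per account;
# AWS status updates reference the zone IDs.
zones = ec2.describe_availability_zones()["AvailabilityZones"]
name_to_id = {z["ZoneName"]: z["ZoneId"] for z in zones}

counts = Counter()
for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            az_name = instance["Placement"]["AvailabilityZone"]
            counts[name_to_id.get(az_name, az_name)] += 1

for zone_id, n in sorted(counts.items()):
    print(f"{zone_id}: {n} running instances")
```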

What AWS Health Dashboard says

According to the AWS Health Dashboard, there is an increased error rate and latency at the time of writing. Here are the updates:

Increased Error Rate and Latency

May 08 1:32 AM PDT: Mitigation efforts remain underway to resolve the impaired EC2 instances and degraded EBS volumes in a single Availability Zone (use1-az4) in the US-EAST-1 Region. These EC2 instances and EBS volumes were impacted due to a loss of power during the thermal event. The work to bring additional cooling system capacity online, which will enable us to recover the remaining affected infrastructure in a controlled and safe manner, is taking longer than we had initially anticipated. Some services, such as IoT Core, ELB, NAT Gateway, and Redshift, have seen significant improvements in the recovery of their workflows. However, some customers will continue to see their affected EC2 instances and EBS volumes as impaired until we achieve full recovery. While we do not currently have an ETA for full recovery, we are prioritizing this issue and will provide another update by 3:30 AM PDT or sooner if additional information becomes available.

May 07 11:38 PM PDT: We continue to make progress in resolving the impaired EC2 instances in the affected Availability Zone (use1-az4) in the US-EAST-1 Region, and are working towards full recovery. We are actively working to bring additional cooling system capacity online, which will enable us to recover the remaining affected racks in a controlled and safe manner. In the impacted Availability Zone, EC2 Instances, EBS Volumes, and other AWS Services may continue to experience elevated error rates and latencies for some workflows. Customers will continue to see some of their affected EC2 instances and EBS volumes as impaired until we achieve full recovery. We will provide an update by May 8, 1:30 AM PDT, or sooner if we have additional information to share.

May 07 10:11 PM PDT: We are observing early signs of recovery. We continue to work towards restoring temperatures to normal levels and bring impacted racks back online in the affected Availability Zone (use1-az4) in the US-EAST-1 Region. We have been able to get additional cooling system capacity online, which has allowed us to recover some affected racks and are actively working to recover additional racks in a controlled and safe manner. In the impacted Availability Zone, EC2 Instances, EBS Volumes, and other AWS Services may continue to experience elevated error rates and latencies for some workflows until full recovery is achieved. We will provide an update by 11:30 PM PDT, or sooner if we have additional information to share.

May 07 8:06 PM PDT: We are actively working to restore temperatures to normal levels in the affected Availability Zone (use1-az4) in the US-EAST-1 Region, though progress is slower than originally anticipated. Since our last update we have made incremental progress to restore cooling systems within the affected AZ, which will not be visible to external customers but are required for the restoration of affected services. In the impacted Availability Zone, EC2 Instances, EBS Volumes, and other AWS Services are also experiencing elevated error rates and latencies for some workflows. As part of our recovery effort, we have shifted traffic away from the impacted Availability Zone for most services. We recommend customers utilize one of the other Availability Zones in the US-EAST-1 Region, as existing instances in other AZs remain unaffected by this issue. If immediate recovery is required, we recommend customers restore from EBS Snapshots and/or replace affected resources by launching new replacement resources in one of the unaffected zones. We will provide an update by 10:00 PM PDT, or sooner if we have additional information to share.

May 07 6:47 PM PDT: We continue to work towards mitigating the increased temperatures to its normal levels in the affected Availability Zone (use1-az4) in the US-EAST-1 Region. Other AWS services that depend on the affected EC2 instances and EBS volumes in this Availability Zone, may also experience impairments. We have weighed away traffic for most services at this time. We recommend customers utilize one of the other Availability Zones in the US-EAST-1 Region at this time, as existing instances in other AZ’s remain unaffected by this issue. Customers may experience longer than usual provisioning times. We will provide an update by 7:45 PM PDT, or sooner if we have additional information to share.

May 07 5:53 PM PDT: We continue to investigate instance impairments to a single Availability Zone (use1-az4) in the US-EAST-1 Region. We have experienced an increase in temperatures within a single data center, which in some cases has caused impairments for instances in the Availability Zone. EC2 instances and EBS volumes hosted on impacted hardware are affected by the loss of power during the thermal event. Other AWS services that depend on the affected EC2 instances and EBS volumes in this Availability Zone, may also experience impairments. We will continue to provide updates as recovery continues.

May 07 5:25 PM PDT: We are investigating instance impairments in a single Availability Zone (use1-az4) in the US-EAST-1 Region. Other Availability Zones are not affected by the event and we are working to resolve the issue.
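AWS’s guidance above, restoring from EBS snapshots and launching replacement resources in an unaffected zone, can be scripted. The sketch below is only an illustration of that recommendation under stated assumptions, not an official AWS runbook; the snapshot ID, AMI, subnet, instance type, and device name are placeholders, and us-east-1b stands in for any healthy Availability Zone.

```python
# Illustrative sketch of the dashboard's recommendation: restore an EBS
# volume from a snapshot into an unaffected AZ and launch a replacement
# instance there. All resource IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Restore the data volume from an existing snapshot into a healthy AZ.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # placeholder snapshot ID
    AvailabilityZone="us-east-1b",         # any unaffected zone
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# 2. Launch a replacement instance in the same healthy AZ.
run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder AMI
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",   # a subnet in us-east-1b
)
instance_id = run["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 3. Attach the restored volume to the replacement instance.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId=instance_id,
    Device="/dev/sdf",
)
```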




By sushil
