Cloud World


Wednesday, 20 July 2011

Postmortem: Java App Engine outage, July 14, 2011

Posted on 12:24 by Unknown

Summary




Last week, we posted about a limited outage on July 14, 2011. Now that our internal postmortem is complete, we thought you would also like to get more detail about what went wrong and what we are going to do to ensure this doesn't happen again.






Root Cause and Analysis




The main lesson is that we need to improve our live traffic testing: a relatively minor bug triggered a corner case for some of our customers. The bug was in a new release of the infrastructure for the App Engine Java execution environment. During development, testing, and qualification, the bug was essentially hidden from view because it manifested only under specific load patterns. During the outage, requests to affected applications failed with errors when traffic was routed to affected instances. Application logs would have shown that affected instances experienced high latency or high error rates, or were not reachable from the Internet. The bug could have been caught by letting the live traffic testing run longer.
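To make the point about test duration concrete, here is a minimal sketch of a "soak gate" for a release that is receiving live traffic. It is not the tooling the App Engine team uses; the `Window` shape, the `soak_verdict` function, the thresholds, and the sample numbers are all assumptions made for illustration. The idea is that a release is promoted only after it has been observed for a minimum time and no monitoring window has exceeded an error-rate limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """One monitoring interval for the release under test (hypothetical shape)."""
    start_minute: int  # minutes since the release began taking live traffic
    requests: int
    errors: int


def soak_verdict(windows: List[Window],
                 min_soak_minutes: int,
                 max_error_rate: float) -> str:
    """Return 'rollback', 'keep-testing', or 'promote'.

    A release is promoted only once it has been observed for at least
    min_soak_minutes and no window has exceeded max_error_rate.
    """
    for w in windows:
        if w.requests > 0 and w.errors / w.requests > max_error_rate:
            return "rollback"
    if not windows:
        return "keep-testing"
    observed = max(w.start_minute for w in windows)
    if observed < min_soak_minutes:
        return "keep-testing"
    return "promote"


# Illustrative numbers only, not measurements from the incident.
early = [Window(0, 1000, 1), Window(120, 1000, 2), Window(240, 1000, 2)]
late = early + [Window(510, 1000, 40)]

print(soak_verdict(early, min_soak_minutes=180, max_error_rate=0.01))
print(soak_verdict(early, min_soak_minutes=600, max_error_rate=0.01))
print(soak_verdict(late, min_soak_minutes=600, max_error_rate=0.01))
```

With a short soak requirement the early data alone would allow promotion. With a longer one the gate keeps testing, and the later window with a higher error rate triggers a rollback instead.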




In order for live traffic testing to work properly, we need to improve our monitoring as well. In this case, having more points from which to do black box monitoring would have helped immensely. We are currently working on much broader monitoring for App Engine and will be integrating more extensive black box testing in upcoming quarters.
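As a rough illustration of black box monitoring from several points, the sketch below probes a URL the way an outside user would and then flags vantage points where too many probes failed or were slow. The function names, vantage point names, thresholds, and sample results are hypothetical; in practice each vantage point would run `http_probe` itself and report its results to a central aggregator.

```python
import time
import urllib.request
from typing import Dict, List, Tuple

ProbeResult = Tuple[bool, float]  # (succeeded, latency in seconds)


def http_probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Fetch url from the outside; any exception counts as a failed probe."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        # Timeouts, connection errors, and HTTP error statuses all land here.
        ok = False
    return ok, time.monotonic() - started


def unhealthy_vantage_points(results: Dict[str, List[ProbeResult]],
                             max_failure_ratio: float,
                             max_latency_s: float) -> List[str]:
    """Names of vantage points whose probes failed or ran slow too often."""
    flagged = []
    for name, probes in sorted(results.items()):
        if not probes:
            flagged.append(name)  # no data from a vantage point is itself a signal
            continue
        bad = sum(1 for ok, latency in probes
                  if not ok or latency > max_latency_s)
        if bad / len(probes) > max_failure_ratio:
            flagged.append(name)
    return flagged


# Hypothetical results as they might be reported by two vantage points.
results = {
    "probe-us-west": [(True, 0.2), (True, 0.3), (True, 0.25), (True, 0.2)],
    "probe-europe": [(True, 0.4), (False, 5.0), (True, 3.2), (True, 0.5)],
}
print(unhealthy_vantage_points(results, max_failure_ratio=0.25, max_latency_s=2.0))
```

Evaluating each vantage point separately, rather than averaging everything together, is what lets a problem that is visible from only some locations stand out.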




Once again, we’d like to point out that we could have done a much better job of communicating issues to all of you. While we strive to strike a balance between letting you know about major issues and not bothering you about day-to-day operations, we clearly should have communicated this incident to you sooner. Rest assured you’ll be better informed of issues in the future.







Timeline




July 14, 2011 - 11:30 AM US/Pacific - The new Java execution environment is released to production.




July 14, 2011 - 5:00-6:00 PM US/Pacific - The previously scheduled Master/Slave read-only maintenance period takes place.




July 14, 2011 - 8:00-9:30 PM US/Pacific - Monitoring shows error rates and latency for Java applications using the Master/Slave datastore are slowly increasing across the entire system. Investigation reveals that the new Java execution environment is malfunctioning.




July 14, 2011 - 9:30 PM US/Pacific - Rollback of the Java execution environment to the previous version begins. Latency and error rates begin to fall.




July 14, 2011 - 11:30 PM US/Pacific - Rollback of the Java execution environment to the previous version completes. Java Master/Slave applications are functioning normally.






Remediation





  • Faster notification on our status site and downtime-notify mailing list

  • More live traffic stress tests for new releases

  • Better black box monitoring to detect small impacts more quickly





[Edit] Clarification: no HR datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.






Posted by Wesley Chun, Google App Engine Team