Cloud World


Wednesday, 20 July 2011

Postmortem: Java App Engine outage, July 14, 2011

Posted on 12:24 by Unknown

Summary




Last week, we posted about a limited outage on July 14, 2011. Now that our internal postmortem is complete, we thought you would also like to get more detail about what went wrong and what we are going to do to ensure this doesn't happen again.






Root Cause and Analysis




The main lesson is that our live traffic testing needs to improve: a relatively minor bug triggered a corner case for some of our customers. The bug was in a new release of the infrastructure in the App Engine Java execution environment. During development, testing, and qualification, it stayed hidden because it only manifested under specific load patterns. During the outage, requests to affected applications failed with errors when traffic was routed to affected instances. Application logs would have shown that affected instances had high latency or high error rates, or were not reachable from the Internet. Letting the live traffic testing run longer could have caught this.
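To illustrate why the length of a live traffic test matters, here is a minimal sketch of a promotion gate that refuses to approve a release until it has been observed for a minimum time and every interval stayed within an error budget. It is not App Engine's release tooling: the names `Sample`, `should_promote`, and `hypothetical_samples`, the thresholds, and all traffic numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One monitoring interval for a release serving live traffic."""
    minute: int    # minutes since the release started taking traffic
    requests: int  # requests served during the interval
    errors: int    # failed requests during the interval


def should_promote(samples: List[Sample],
                   min_soak_minutes: int,
                   max_error_rate: float) -> bool:
    """Approve only if the release soaked long enough and every
    interval stayed at or below the error budget."""
    if not samples or samples[-1].minute < min_soak_minutes:
        return False  # not enough observation time yet
    for s in samples:
        if s.requests and s.errors / s.requests > max_error_rate:
            return False  # at least one interval blew the budget
    return True


def hypothetical_samples(hours: int) -> List[Sample]:
    """Made-up hourly data: errors stay flat, then climb after hour 8."""
    return [
        Sample(minute=h * 60,
               requests=10_000,
               errors=5 if h < 9 else 5 + 100 * (h - 8))
        for h in range(1, hours + 1)
    ]


if __name__ == "__main__":
    # A 2-hour soak sees only healthy intervals and approves the release.
    print(should_promote(hypothetical_samples(2), 120, 0.001))   # True
    # A 12-hour soak observes the slow climb and blocks it.
    print(should_promote(hypothetical_samples(12), 720, 0.001))  # False
```

With a load-dependent bug that takes hours to surface, only the longer soak window gives the gate a chance to see the failing intervals.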




In order for live traffic testing to work properly, we need to improve our monitoring as well. In this case, having more points from which to do black box monitoring would have helped immensely. We are currently working on much broader monitoring for App Engine and will be integrating more extensive black box testing in upcoming quarters.
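As a rough sketch of what more points of black box monitoring buys, the snippet below aggregates probe outcomes per vantage point and reports which ones exceed a failure threshold. The probing itself (for example, an HTTP request from outside the serving infrastructure) is left out. The function name `failing_vantage_points`, the vantage point names, the threshold, and the sample outcomes are all assumptions for illustration, not a description of our monitoring system.

```python
from typing import Dict, List


def failing_vantage_points(results: Dict[str, List[bool]],
                           max_failure_ratio: float) -> List[str]:
    """Return the vantage points whose probe failure ratio exceeds the limit.

    `results` maps a vantage point name to its recent probe outcomes,
    where True means the probe succeeded.
    """
    failing = []
    for name, outcomes in sorted(results.items()):
        if not outcomes:
            continue  # no data from this vantage point, nothing to judge
        failure_ratio = outcomes.count(False) / len(outcomes)
        if failure_ratio > max_failure_ratio:
            failing.append(name)
    return failing


if __name__ == "__main__":
    # Hypothetical outcomes: only one vantage point sees the problem.
    results = {
        "probe-a": [True] * 10,
        "probe-b": [True] * 7 + [False] * 3,
        "probe-c": [True] * 10,
    }
    print(failing_vantage_points(results, 0.1))  # ['probe-b']
```

In this made-up data, averaging across all three points gives a 10% failure ratio, which would not exceed the 10% limit, so a small impact is easier to spot when each point is evaluated on its own.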




Once again, we’d like to point out that we could have done a much better job of communicating issues to all of you. While we strive to strike a balance between letting you know about major issues and not bothering you about day-to-day operations, we clearly should have communicated this incident to you sooner. Rest assured you’ll be better informed of issues in the future.







Timeline




July 14, 2011 - 11:30 AM US/Pacific - The new Java execution environment is released to production.




July 14, 2011 - 5:00-6:00 PM US/Pacific - The previously scheduled Master/Slave read-only maintenance period takes place.




July 14, 2011 - 8:00-9:30 PM US/Pacific - Monitoring shows error rates and latency for Java applications using the Master/Slave datastore are slowly increasing across the entire system. Investigation reveals that the new Java execution environment is malfunctioning.




July 14, 2011 - 9:30 PM US/Pacific - Rollback of the Java execution environment to the previous version begins. Latency and error rates begin to fall.




July 14, 2011 - 11:30 PM US/Pacific - Rollback of the Java execution environment to the previous version completes. Java Master/Slave applications are functioning normally.






Remediation





  • Faster notification on our status site and downtime-notify mailing list

  • More live traffic stress tests for new releases

  • Better black box monitoring to detect small impacts more quickly





[Edit] Clarification: no High Replication (HR) datastore apps were affected. Overall, the outage resulted in a 1.9% error rate, affecting approximately 0.005% of all App Engine traffic at peak.






Posted by Wesley Chun, Google App Engine Team