Google AppEngine - A Story of Failure and Redemption

I'm a Google AppEngine developer (Go2.me), as well as a user of a friend's AppEngine application (puzzazz.com).  On Saturday morning, I noticed that I could not Sign In to any AppEngine application - I was greeted with a very opaque "Server Error".  This looked like a very serious problem - so I waited a few hours thinking that Google would be "on it", and tried again ... still broken.

I then started looking for an official Google report on this bug. Google uses several channels to communicate with AppEngine developers:

  1. AppEngine Service Status Page
  2. Google Developer Group (Forum)
  3. Google Service Downtime Notification Group

To my dismay, none of these pages had ANYTHING to say about the service outage. Doing some more investigation, I found someone reporting on the developer group that they could not log in on an iPhone. I had been trying to log in on my Android (Google) phone; it had not occurred to me that an error like this would be specific to mobile phone access, but it was. I could log in normally via a desktop browser, but not via a mobile phone.

With no real way to reach a person at Google, I had to just wait and hope that Google would recognize this error and fix it.  Finally, on Monday afternoon, a Google engineer responded on the forum that they were looking into the issue.  Within hours, the problem was fixed.

Mature, well-run engineering organizations have several mechanisms to combat failures like this one:
  1. Automated testing before each software release that covers the broad range of capabilities (including emulation of Sign In via a mobile web browser, for example).
  2. A well-defined error reporting and response system that tells users when problems are identified and gives an estimated time to repair.

While you could argue that AppEngine is a "beta" product, I was surprised and disappointed by this incident for several reasons:
  1. Apparently mobile (iPhone/Android) users lost the ability to log in via Google Sign In on every AppEngine application.
  2. This problem persisted for 60+ hours.
  3. There seemed to be no clear way for developers to report this to Google (there were two bug reports on the Google AppEngine Group, but the group is not guaranteed to be monitored, and posting there does not create an Issue Ticket).
  4. There were no errors logged in our error logs on AppEngine (though 500 errors were clearly being returned and so Google should have "known" about this as soon as they started happening).
  5. At no time was this issue ever acknowledged on the App Engine service status page.

I decided to send my thoughts on this to the engineers who fixed this problem.  I was pleasantly surprised to hear back from Chris Beckmann:

Thanks for reaching out to us. I'm one of the product managers for App Engine.

We take the login bug seriously and it was the topic of significant discussion at our teamwide engineering meeting today. As you can imagine, many of Google's external services are actually composed of the efforts of many underlying teams, and in this case a change made by another team affected login for App Engine apps. Generally speaking, there are several methods for catching changes that break another service including static tests, however, in this case, it managed to slip by the other team undetected. 

Concretely, we're working with the other Google team to make sure they incorporate additional tests specific to App Engine, as well figuring out some additional monitoring within the App Engine service to discover these kind of errors. From a monitoring perspective, logins are relatively rare compared to overall traffic, and since this outage affected only mobile logins, the increase in 500s didn't immediately raise the alarm.

You also touched on an important question regarding why this wasn't picked up earlier as there was a thread on the group. Just by way of explanation, the problem really emerged over the weekend when there's fewer internal folks monitoring the group, so it's more difficult to separate real problems from typical discussions or other noise. That said, we're working on making improvements to how we react to community input, regardless of the day or hour at which it is received. Our best option right now is to post to the group with a link to the Issue Tracker (http://code.google.com/p/googleappengine/issues/list) so that other developers can verify whether they are also experiencing these problems and escalate in general.

I hope that addresses some of your concerns. Thanks for reaching out to us and feel free to contact me with any other feedback or questions. 
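
To make the monitoring point in Chris's email concrete, here is a minimal Python sketch of why a spike in 500s on a rare kind of request can hide inside the overall error rate. Everything in it is an assumption for illustration: the segment names, the traffic counts, the 5% threshold, and the function names are made up, and this is not a description of how Google actually monitors App Engine.

```python
from collections import Counter

# Hypothetical traffic sample: (segment, HTTP status) pairs.
# The segment names and counts are invented for illustration only.
SAMPLE_TRAFFIC = (
    [("page", 200)] * 9800
    + [("login/desktop", 200)] * 150
    + [("login/mobile", 500)] * 50
)


def error_rates(requests):
    """Return (overall_rate, per_segment_rates) of 5xx responses."""
    totals = Counter()
    errors = Counter()
    for segment, status in requests:
        totals[segment] += 1
        if 500 <= status <= 599:
            errors[segment] += 1
    grand_total = sum(totals.values())
    # Guard against an empty sample so we never divide by zero.
    overall = sum(errors.values()) / grand_total if grand_total else 0.0
    per_segment = {seg: errors[seg] / totals[seg] for seg in totals}
    return overall, per_segment


def segments_over_threshold(requests, threshold=0.05):
    """Return the sorted segments whose 5xx rate exceeds the threshold."""
    _, per_segment = error_rates(requests)
    return sorted(seg for seg, rate in per_segment.items() if rate > threshold)


if __name__ == "__main__":
    overall, per_segment = error_rates(SAMPLE_TRAFFIC)
    print(f"overall 5xx rate: {overall:.1%}")
    print("segments over 5%:", segments_over_threshold(SAMPLE_TRAFFIC))
```

With these made-up numbers, the overall 5xx rate is 0.5%, well under a 5% alert threshold, even though every single mobile login is failing. Tracking the rate per segment is one way such a failure could be surfaced.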

What a great response.  While recognizing a failure in their systems, they reassured me that they are taking the problem seriously and are working hard to avoid repeating the mistake in the future.  I also get the feeling that they really are the high-quality development organization that I would expect from Google, and that they are trying to do things "right".  Chris even agreed to let me share his email publicly.

This just goes to show the power of good customer service: treating customers with respect and openness goes a long way toward building trust in, and loyalty to, a brand.

I remain a huge Google fan!

I've used a half-dozen APIs from Google over the years, and I've also used APIs from Amazon, Microsoft, and other services. By far, Google is the worst of them all: poor documentation, erratic behavior and, worst of all, changing the format/response of the API out from under you.

Google has a long way to go before they can create a serious developer following.

One more complaint about Google's APIs is their unnecessary complexity. They are not putting on the "consumer hat" when they develop them.

That's not been my experience at all. AppEngine APIs seem very simple, well documented, and not at all overly complicated. I'd like to see examples of what you're talking about.

Another big-time failure today. Even the server status page is down! The only info coming out on the AppEngine downtime today is in the Service Downtime Notification group.

AppEngine is sure having some growing pains. Yet more failures of the service. Especially embarrassing is that the Service Status Page has been down for 12 hours - it doesn't even respond right now.

-------

During the past 18 hours, App Engine has experienced two brief outages:

Yesterday, from approximately 6:20 pm to 6:30 pm PDT (ten minutes total), there was a partial outage affecting 50% of incoming requests; in particular, traffic from the West Coast of North America was most affected.

Early this morning, from 6:30 am to 6:55 am PDT (twenty five minutes total), there was a full outage for the datastore, during which all operations timed out.

No customer code or data was affected during these incidents.

Additionally, the App Engine Status Site has been unreachable for the past 12 hours. We are actively working on resolving this issue.

Mike Repass
App Engine Team

http://groups.google.com/group/google-appengine-downtime-notify/browse_thread/thread/88300ea785305c?hl=en