“In aviation, a chain of events, often called the error chain, is a term referring to the concept that many contributing factors typically lead to an accident, rather than one single event” (from Wikipedia)

Good news. YouMail is back in the Android Market, or will be shortly.

Bad news. What led to its removal is this chain of events – break the chain, and there would have been no issue.

  • First, a subset of the YouMail Android installs has a real problem. It looks like 15,000 users who went straight from 1.8.3 (an old version) to 2.0.45 (the one that was in the market and was taken down) got into a situation where the app polls our servers continuously (the polling time got set to zero). Of course, this leads to a host of issues for those clients, such as bad battery life and a boatload of transactions eating up network bandwidth. T-mobile saying that we disrupted their network is fair, though we were causing that unknowingly.
  • Second, T-mobile did try to tell us that they were seeing an issue. Unfortunately, it was in a way that was almost guaranteed to be ineffective, and is probably not how businesses should communicate. As far as we can tell, someone on their engineering team sent an e-mail to our free customer support e-mail address in early November, and one of our support team replied, in effect, that it was fixed in the next release and treated it as resolved, without reporting it to anyone else. With thousands of e-mails a week from over two million registered users, random users threatening every week to pull us from various stores, and lots of users with tmobile.com email addresses, it was easy for this one message to get lost in the shuffle.
  • Third, after almost 30 days with no response from us, T-mobile went to Google with charts showing the traffic our bad apps were generating, and said that we were unresponsive and that the traffic was growing quickly. Google then immediately cut us off – without ever sending us an e-mail beforehand, or giving us any way to contact someone at T-mobile. That left us wondering what the heck was going on – and having a hard time figuring it out.
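The first link in that chain, a polling time that ends up at zero after an upgrade, is the kind of thing a small guard can catch. Here is a minimal sketch of such a guard, not our actual client code: the function name, the 60-second floor, and the 15-minute default are all assumptions made up for illustration.

```python
# Hypothetical names and values throughout; this is an illustrative sketch,
# not YouMail's actual client code.
MIN_POLL_SECONDS = 60            # assumed floor: never poll more often than this
DEFAULT_POLL_SECONDS = 15 * 60   # assumed fallback for missing or invalid settings


def safe_poll_interval(stored_value):
    """Return a polling interval in seconds that can never be zero or negative."""
    try:
        seconds = int(stored_value)
    except (TypeError, ValueError):
        # The setting is missing or unreadable, e.g. after an upgrade that
        # changed how preferences are stored.
        return DEFAULT_POLL_SECONDS
    if seconds <= 0:
        # A zero interval is what turns periodic polling into continuous polling.
        return DEFAULT_POLL_SECONDS
    # Valid but too aggressive values are raised to the floor.
    return max(seconds, MIN_POLL_SECONDS)
```

With a check like this sitting between the stored setting and the scheduler, a value that got zeroed out during an upgrade falls back to the default instead of hammering the servers, and any legitimate value the user picked is left alone.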

As we worked through the appeal process yesterday afternoon, we eventually were given a contact for T-mobile. Within a very cordial 30-minute phone call, we understood the issue and had a joint plan that included T-mobile testing the app, getting our beta into the store, and driving users of the problematic app to upgrade to fix the issue.

Interestingly, we knew we had an issue, but not the scale of it – an unfortunate side effect of having a really scalable infrastructure. On top of that, the newest release (which has been in beta for a couple of weeks) not only fixes the issue, but has also been tightly optimized to minimize battery use, network use, and memory use. We just hadn’t pushed it because we weren’t aware it was becoming that big an issue, and we were working on polishing areas like contact upload where the app wasn’t up to snuff (the main area where we did have a boatload of complaints).

So, anyways, the problem is effectively getting resolved, and our users are being pushed to upgrade as fast as possible.

We think there are some lessons learned for all of us.

  • For startups like YouMail. First, paranoia isn’t always good, despite Andy Grove’s advice. When you’re the mouse, it’s easy to believe the elephants all around you are out to squash you when they’re just lumbering along. But that’s not necessarily true, and it’s probably not what you should assume. And second, when you actually find success and start having millions of people on your apps, you have to strike a better balance between innovating and blocking and tackling: making sure that releases don’t have these issues, and fixing them faster when they do occur through smaller, more iterative updates.
  • For the carriers. When there’s an issue, don’t assume you’re being ignored. Just a bit more effort on T-mobile’s part to communicate with us would have led to this getting fixed right away – all it ultimately involved was uploading a new APK we had sitting around anyways. Our phone number is on the web site (800-374-0013), and we pay attention to tweets and to comments on our blog and Facebook page. While there’s an argument that it’s not the carrier’s responsibility, when you’re dealing with applications at scale, with a million or more users depending on them daily, it’s probably fair to do more than drop an e-mail. But to make it easy, we’ve added an e-mail address on our Contact Us page specifically for carriers. And we are working to get contacts at each of the major carriers, so they know who to talk to if something like this ever happens again.
  • For the app stores. First, the app stores need to have a way to take down an app by carrier. It’s a problem that a single carrier – even with a legitimate issue on their network – can shut the app down on other networks. In the meantime, it would have been easy enough for the app store folks to simply “uncheck” the T-mobile box (or have us do it) while the problem was being dealt with. Second, there needs to be a way to treat apps that become popular and the basis of real businesses a bit differently than apps that haven’t gone anywhere (for what it’s worth, according to Flurry we have over 500k active Android users, and we’ve had nearly a million downloads of our Android app alone from the various stores). For example, had Google simply notified our developer address prior to the takedown (“a carrier is upset, here’s their email address”), we would have contacted T-mobile immediately. The shoot-first, ask-questions-later process doesn’t work – at least not for apps at scale. It’s too reminiscent of the old days when someone was accused of being an e-mail spammer, a message went to a “postmaster” e-mail address that wasn’t well monitored, and they suddenly found themselves on a blacklist without understanding why.

Not a fun day for anyone yesterday – but hopefully our experience and lessons learned will prove valuable to all in preventing these types of situations going forward.