“Hey WP Engine, how does your uptime stack up against other WordPress hosts?”
“What about industry-wide?”
“Do you guys have the most uptime in the industry?”
Questions about uptime are some of the most commonly asked when folks evaluate WP Engine’s managed WordPress hosting platform, or any hosting solution for that matter. Nobody wants downtime, and customers who are paying for a hosting solution have a right to ask and be informed about a company’s history of downtime.
In particular, people want to know who has the least downtime, or whether any company has achieved 100% uptime. The reality is that 100% uptime, while the goal every company sets its sights on, is unattainable.
In the past two weeks, four well-known WordPress hosting providers all had similar amounts of downtime, each for different reasons and in different datacenters. A rash of downtime across four major WordPress hosting providers inside the same two-week period is uncommon, but it indirectly helps answer the question, “How does your uptime stack up against [insert other hosting provider]?”
Here’s what happened:
- On February 12th, ServerBeach physically cut a fiber line between their San Antonio datacenter and their Dallas POP. Severing this line caused a few hours of downtime for 1% of our customers, as well as brief downtime for millions of Automattic’s WordPress.com customers while they gracefully switched over to another datacenter.
- On February 2nd, Page.ly experienced just under 2 hours of downtime because of a hardware failure at FireHost’s Dallas datacenter. The downtime affected all of Page.ly’s customers in the FireHost datacenter.
- On February 16th, ZippyKid had several hours of downtime for hundreds of customers because of human error between ZippyKid and Rackspace.
- WordPress VIP had two bouts of downtime in the past week, affecting TechCrunch, which wrote about it (see link). The second bout of downtime was due to a code bug that was pushed into production (but swiftly remedied).
What can we learn from these situations?
Uptime is never 100%. A world of factors conspires against 100% uptime and can disrupt the flow of bits from the server to your browser. But despite the number of factors, most hosting companies are at or above 99.9% uptime.
There isn’t a single hosting provider with 100% uptime. Amazon AWS, one of the most robust operations, is (famously) not 100%. GMail isn’t. Facebook isn’t. Twitter definitely isn’t. Rackspace isn’t. ServerBeach isn’t. FireHost isn’t. We could keep naming folks, but no hosting provider, including WP Engine, has achieved 100% uptime over a meaningful time-scale (like years).
Are all these companies “stupid?” Is each of these companies unable to hire top system engineering talent? TechCrunch had some choice things to say about this in their post after 15 minutes of downtime (referenced above). Since none of these industry leaders achieves 100% uptime, does that mean they don’t care?
Of course not.
As we mentioned, most of these companies maintain an over-99% uptime rate. They often reach 99.99% uptime, and sometimes a bit more.
So what’s the difference, in terms of cost and technical complexity, between 99% and 99.9% uptime? What about the difference between 99.9% and 99.99%?
First off, 99.9% uptime sounds like a lot, but it’s nearly nine hours of downtime per year.
Can you imagine how you’d react if you had nine straight hours of downtime on your site? Not well.
99.99% uptime is still a non-trivial 50 minutes of downtime per year.
Every “9” you add to uptime (e.g. 99%, 99.9%, 99.99%) means an order of magnitude less downtime, and it’s often a multiple more complex and expensive to achieve. At some point, trying to eliminate a few minutes of downtime now and then means doubling or tripling the cost of the service.
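To make those numbers concrete, here’s a minimal sketch (in Python, purely for illustration) that converts an uptime percentage into the downtime it allows per year:

```python
# Convert an uptime percentage into the downtime it permits per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_per_year(uptime_percent):
    """Return the minutes of downtime allowed per year at a given uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR

for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year(nines)
    print(f"{nines}% uptime allows {minutes / 60:.1f} hours ({minutes:.0f} minutes) of downtime per year")

# 99%     -> ~87.6 hours per year
# 99.9%   -> ~8.8 hours per year
# 99.99%  -> ~53 minutes per year
# 99.999% -> ~5 minutes per year
```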
But let’s break that cost down for a moment.
For example, to avoid hardware and software downtime on a single server, you can run several other servers in a cluster that the first can fall back on. Running multiple servers instead of a single one multiplies the cost by the number of servers. Nearly any hosting company will have some quantity of redundant servers, but some providers pack in more customers than their secondary servers can absorb. When one server goes down, if the remaining servers don’t have enough capacity for 100% of the traffic, the cluster still goes down, despite the precaution.
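As a quick illustration of that capacity trap, here’s a small sketch (the numbers are assumptions, not any particular host’s configuration) that checks whether a cluster can survive losing one server:

```python
# Can a cluster of N identical servers absorb the loss of any one of them?
# 'utilization' is the fraction of each server's capacity used in normal operation.

def survives_one_failure(num_servers, utilization):
    """After one server fails, each survivor carries num_servers / (num_servers - 1)
    times its normal load, and must stay at or below 100% of its capacity."""
    if num_servers < 2:
        return False
    return utilization * num_servers / (num_servers - 1) <= 1.0

print(survives_one_failure(num_servers=3, utilization=0.60))  # True: survivors land at 90%
print(survives_one_failure(num_servers=3, utilization=0.80))  # False: survivors would need 120%
```

In other words, redundancy only helps if the surviving servers are kept lightly loaded enough to absorb the failed one’s traffic.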
But almost all the examples above were datacenter failures rather than single-server failures. To combat that, you need servers in entirely different datacenters, once again with sufficient capacity to handle 100% of the traffic alone, which means another 2x on top of a cost that was already 2-3x.
Avoiding all the issues above is at least 6x more expensive in hardware alone, not to mention significantly more human and administrative effort. Plus, ironically, every component you add as a redundant measure to prevent downtime also increases the likelihood that one of your system’s components will be having trouble at any given time.
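That irony is easy to quantify. Here’s a rough sketch (the per-component failure probability is made up for illustration):

```python
# With more components, the chance that *some* component is having trouble at any
# given moment goes up, even though good redundancy means a single failure no
# longer takes the whole system down.
# p is an illustrative per-component probability of being degraded at a random moment.

def prob_something_is_failing(num_components, p=0.001):
    return 1 - (1 - p) ** num_components

for n in (1, 6, 20):
    print(f"{n:>2} component(s): {prob_something_is_failing(n):.2%} chance something is degraded")

#  1 component(s): 0.10%
#  6 component(s): 0.60%
# 20 component(s): 1.98%
```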
In order to add redundant measures as a hosting provider, you have to add infrastructure. More infrastructure means more complexity, and adds more potential for trouble, which you then need to take steps to mitigate!
So what does all this mean? That we shouldn’t try? That we should just say “it’s hard, too bad” when things fail? That we shouldn’t continue investing in infrastructure and technology and techniques that our customers individually could never afford to pull off by themselves? Of course not, and in fact that’s exactly what WP Engine, and all the members of the hosting industry listed here and elsewhere, do. We’re always shooting for 100% uptime, and we always go into battle mode when there is the slightest blip of downtime.
It’s our responsibility to hold ourselves to ever-higher standards.
That’s why only 1% of our customers had trouble the other day, not the other 99%. That’s pretty good! But, we immediately began working to bring everyone back online, and then make improvements for the next time. If we don’t continuously improve, that 99%+ could slip. And next time it would be awesome if only 0.1% of our customers had trouble. The bar must always be moved higher.
But perfection is unattainable, for WP Engine and everyone else hosting WordPress, or anything else on the Internet. We can ask better questions than, “Which hosting has perfect, 100% uptime?”
Instead, we can ask:
- “What is the track record of a given host?”
- “How are incidents handled?”
- “How often do they happen?”
- “Are they for silly reasons or for understandable reasons?”
- “Is there enough staff to continuously improve and to handle events when they happen?”
- “Are there multiple datacenter options or just one?”
- …And so on.
Those are the questions that matter most when you evaluate a particular hosting platform and compare it to another. Every host is going to aim for as little downtime as possible, and each of the previous questions gets at the answer to the bigger one: “What are you doing to make sure you can mitigate this issue with zero downtime next time?”
How the company chooses to answer that question goes a long way toward letting you know that your websites, and therefore your business, are indeed in good hands.
For another perspective on this, Uri Budnik wrote a detailed post on the RightScale Blog titled “Lessons Learned from Recent Cloud Outages.”
Brian Krogsgard says
This is a good post for educational purposes.
I think the fatigue a lot of people feel (in general, not WP Engine specific) is that the reality of downtime doesn’t often mesh with the sales pitch.
Hosts sell uptime, and inevitably have downtime. Setting the expectation of “always up” and then failing is where disappointment for a customer sets in. How is a regular customer supposed to know it’s impossible? And at the same time, how can a host compete by saying, “We’ve got downtime, just like everybody else!” without losing business? Especially when the standard is to toss in all those 9s for % uptime.
Perhaps a little question mark to explain what 99.9% vs 99.99% uptime means (like hours down per month/year) would help, along with a link to an educational article like this, so that customers can be confident that their provider is a leader in the industry, but still fallible.
Without education, 99.9% and especially 99.99% looks like 100% to the average person. So therefore, even hitting a goal of 99.99% is a failure during the 0.01% in the customer’s mind.
I know it’s a catch-22, but maybe a nudge in the right direction, through education on the sales page, can help change expectations and result in happier customers.
Jon Brown says
Great post, and it doesn’t even cover a few other recent major outages at non-WordPress-specific hosts, like BlueHost, which just had a UPS failure take most sites offline for nearly an hour, and DreamHost, which IIRC had a name server fail recently.
All hosts will have downtime; for me, the service response is what makes the difference. Can the host effectively communicate to affected sites what is going on, why, and an ETA on a fix? Many hosts fail horribly at that moment of crisis (GD, NS & 1&1, I’m looking at you). BH and DH handled it extremely well the last round, with Twitter updates and a separate, easy-to-find-and-access status blog, as does WPE with the rare outage.
Brent Logan says
It’s important to distinguish between “uptime” and “availability” when comparing web hosts. When Apache crashes and my site goes down, but the server is still “up,” does that count as downtime? My understanding is that most (all?) hosts would say that is not downtime.
I suspect the same sort of factors and increasing costs come into play to increase the nines of availability.
Jason Cohen says
That’s a great point. We consider downtime as “the site doesn’t come up when you go to its domain.” Because of course that’s how your viewers would define it!
Of course, when it comes to *operational* metrics, it’s useful to separate things like hardware, specific services, clusters, the datacenter, the network outside the datacenter, etc., because that helps answer “what is the problem” or “where are the bottlenecks.”
Mark Garcia says
I don’t necessarily agree with the thoughts behind another datacenter handling 100% of the traffic and thus 2x the cost. Let me explain…
A common dilemma in the evolution of a startup’s success starts with the initial deployment and architecture of the infrastructure. The constraints are set by a budget, at a stage when popularity of the service does not yet dictate the need for a robust architecture. In other words, we all take a proof-of-concept idea and push it to a single datacenter, where services are made redundant within the confines of this singular world.
The problem stems from continuing to build within this singular world, where the application architecture is not yet designed to handle distributed traffic. The better plan is to not put all your eggs in one basket and instead devise an infrastructure that splits traffic between 3 or more datacenters, where each datacenter handles 33% or less of your overall traffic.
It is actually cheaper to handle 33% of your traffic in each of 3 datacenters than it is to provision for 200% of your traffic across 2 datacenters. If you lose 1 of 3 datacenters, then you are looking at a degraded web experience, where latency increases, but you are still ‘up’.
There are more sophisticated ways to splay traffic, especially when you are in the business of handling high-traffic sites. An alternative is to implement responsive routing, where traffic to a set of high-profile sites gets pinned to 1 of the 2 remaining datacenters that are up.
This comes back to the common dilemma, where the problem lies in the amount of work required to have an ecosystem operate cohesively across multiple datacenters. It’s a formidable challenge that takes a lot of time and money to pull off. I worked through these issues at 2 previous companies, and it is a much more sustainable model than replicating existing infrastructure that only gets used 0.01% of the time.
Jason Cohen says
That’s not quite right.
Suppose, as you say, you have three datacenters A, B, and C, each actively serving 1/3 of the traffic. Further suppose that the hardware allocation is at 80% capacity, and of course the hardware in each location is identical (because each is serving an identical, balanced traffic load).
Now suppose datacenter A goes down. B and C are each serving 50% of the traffic instead of the normal 33%. That’s 50% more traffic than they usually serve. Since they were already at 80% capacity, they are now at 120% of capacity, and they will fail.
*Not* higher latency, but outright out of capacity.
The solution there of course is you can’t run at 80% capacity. Supposing they were at 50% capacity in the first place, then a 1-center failure results in 75% capacity, which is OK.
But of course, to run at 50% capacity instead of 80% means adding more hardware to all three! Specifically it takes 60% more hardware.
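A quick sketch of that arithmetic (using the illustrative utilization targets from this thread):

```python
# Per-datacenter load after one of three datacenters fails and the other two
# split its traffic evenly.
def load_after_failure(num_datacenters, utilization):
    return utilization * num_datacenters / (num_datacenters - 1)

print(f"{load_after_failure(3, 0.80):.2f}")  # 1.20 -> survivors are over capacity and fail too
print(f"{load_after_failure(3, 0.50):.2f}")  # 0.75 -> survivors hold up

# Hardware scales with 1 / target utilization, so dropping the target from 80% to 50%
# means (1 / 0.50) / (1 / 0.80) = 1.6x the hardware in every datacenter: 60% more.
print(f"{(1 / 0.50) / (1 / 0.80):.1f}x")  # 1.6x
```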
Bottom line: I agree with your basic premise, which is that at large scale you might as well distribute across datacenters, and that’s better than “2x” or “3x” more. However, it’s still a lot more than it would otherwise be.
Furthermore, VERY FEW sites on Earth can afford to do three-center, multi-master development and hardware management. What about everyone else?
It is the “everyone else” whom I’m actually addressing in this blog post. In those cases, where the question is going from one server or one cluster to multiple datacenters, presumably you agree with this too.
Harsh Agrawal says
That’s a really detailed explanation, Jason. At one point I thought that, due to hardware issues or human errors, 100% uptime is impossible. Though with backup machines it could be very much achievable. The only thing is the price, which will surely add up for such a service.
Jason Cohen says
Exactly! This is an instance where “throwing more money at it” can add 9’s to the uptime.
Then the question simply is — what trade-off of money and 9’s is appropriate for your particular use case?
Of course we would argue that for many folks — certainly not all! — our service (and others!) are a smart trade-off.
Sushant says
Yeah, you are right, bro.
Zach Russell says
Jason,
I love the insight you provide in this article about uptime. I have used many other hosting providers, some shared but mostly dedicated servers, and have had real problems with downtime. Of course, it never seemed to be the host’s fault, but I wouldn’t expect them to take accountability for their service going down during times like the switch to IPv6.
For anyone who is wondering, I use WP Engine for ALL of my websites now, and have for several months. From a hosting perspective, I haven’t experienced any downtime. Their customer support is the best in the industry, and they are accountable for all websites hosted with them. If you’re on the fence, go with WP Engine.
DaveZ says
I recognize it’s not apples to apples, but Zippy charges $25/mo for 100,000 pageviews and Page.ly charges $64 for 200k while WPEngine is $100 for 100k. As a blogger on a budget, I have a hard time wrapping my mind around that.
Brian says
Great post! I couldn’t agree more with what you said, but as a hosting industry veteran, I would like to add that you would never want to combine the application and database tiers on a conventional server or single VM, no matter what the cost savings. This is an administrative nightmare and a recipe for disaster. All you need is some database-intensive websites or applications and the server becomes overwhelmed. At the very least, you would want an app server and a database server for each customer pod. I’m sure this is the “great debate” that rages each and every day at hosting companies around the world: do we add more hardware to address infrastructure issues, or more system admins and DBAs to address problems associated with overpopulated hardware? Thanks for allowing me to comment. Again, great post!
Kirby Prickett says
Thanks for your feedback Brian.
We do everything we can to not have overpopulated hardware, but you make a very valid point.
– Kirby
Tyler says
“Uptime is never 100%”
If this is true, how have LinkedIn and Twitter maintained 100% uptime for the last 1000 hours? (I started the monitors on June 7th)
I started monitoring the URLs of my profiles on those sites via UptimeRobot for study and as a point of reference, to compare against the monitoring of my personal websites and clients’ websites.
It will be interesting to see how long it takes before one of them kicks back anything other than “200 (OK)”.
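For anyone curious, the idea behind that kind of monitor is simple; here’s a bare-bones sketch (the URL and check interval are placeholders, and a service like UptimeRobot handles this for you):

```python
# Bare-bones uptime monitor: request a URL on a schedule, record whether it
# returned HTTP 200, and report the observed uptime so far.
import time
import urllib.error
import urllib.request

URL = "https://example.com/"   # placeholder: the page you want to watch
INTERVAL_SECONDS = 300         # placeholder: check every 5 minutes

def is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

checks = successes = 0
while True:
    checks += 1
    if is_up(URL):
        successes += 1
    print(f"observed uptime so far: {successes / checks:.3%}")
    time.sleep(INTERVAL_SECONDS)
```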
Tyler says
I was thinking about this further and re-read this article (as well as others), and I think I understand now. The “never” qualifier did specify “in a meaningful time-scale (like years)” so compared to my example which is roughly 1.5 months, that would not qualify as a significant sample, by this article’s definition. Fair enough.
What I’ve been pondering lately is how to determine whether a host will be an improvement over one’s current host, short of actually switching to them, crossing our fingers, and seeing how it goes.
My current host has more downtime than I’d like, and I’m not paying much so I’d be willing to pay more for less frequent downtime, but how can we reliably quantify beforehand what the average frequency and length of a given host’s downtime will be?
rahul says
In my opinion, it is not possible to get 100% uptime.
Satish Kumar Ithamsetty says
100% uptime… is it possible?
Jenny says
Good information.
Shanaya Sharma says
I have 2 websites with WP Engine, and most of the time I have received 100% uptime. This is far better than HostGator. I have one website on HostGator which has 99% uptime; I’m thinking of shifting it to WP Engine soon.
Emran says
WP Engine is the best hosting, and the uptime, yeah, as you said, 100 percent. But I am waiting to grow more traffic; after that I will shift 100 percent to WP Engine.