“Hey WP Engine, how does your uptime stack up against other WordPress hosts?”
“What about industry-wide?”
“Do you guys have the most uptime in the industry?”
Questions about uptime are some of the most commonly asked when folks evaluate WP Engine’s managed WordPress hosting platform, or any hosting solution for that matter. Nobody wants downtime, and customers who are paying for a hosting solution have a right to ask and be informed about a company’s history of downtime.
In particular, people want to know who has the least downtime, or whether any company has achieved 100% uptime. The reality is that 100% uptime, while the goal every company sets its sights on, is unattainable.
In the past two weeks, four well-known WordPress hosting providers all had similar amounts of downtime, each for different reasons and in different datacenters. A rash of downtime across four major WordPress hosting providers inside the same two-week period is uncommon, but it indirectly helps answer the question, “How does your uptime stack up against [insert other hosting provider]?”
Here’s what happened:
- On February 12th, ServerBeach physically cut a fiber line between their San Antonio datacenter and their Dallas POP. Severing this line caused a few hours of downtime for 1% of our customers, as well as brief downtime for millions of Automattic’s WordPress.com customers while they gracefully switched over to another datacenter.
- On February 2nd, Page.ly experienced just under 2 hours of downtime because of a hardware failure at FireHost’s Dallas datacenter. The downtime affected all of Page.ly’s customers in the FireHost datacenter.
- On February 16th, ZippyKid had several hours of downtime for hundreds of customers because of human error between ZippyKid and Rackspace.
- WordPress VIP had two bouts of downtime in the past week, affecting TechCrunch, which wrote about it (see link). The second bout of downtime was due to a code bug that was pushed into production (but swiftly remedied).
What can we learn from these situations?
Uptime is never 100%. A world of factors conspires against 100% uptime and can disrupt the flow of bits from the server to your browser. But despite the number of factors, most hosting companies are at or above 99.9% uptime.
There isn’t a single hosting provider with 100% uptime. Amazon AWS, one of the most robust operations, is (famously) not 100%. GMail isn’t. Facebook isn’t. Twitter definitely isn’t. Rackspace isn’t. ServerBeach isn’t. FireHost isn’t. We could keep naming folks, but no hosting provider, including WP Engine, has achieved 100% uptime over a meaningful time-scale (like years).
Are all these companies “stupid?” Is each of these companies unable to hire top system engineering talent? TechCrunch had some choice things to say about this in their post after 15 minutes of downtime (referenced above). Since none of these industry leaders achieves 100% uptime, does that mean they don’t care?
Of course not.
As we mentioned, most of these companies maintain an over-99% uptime rate. They often reach 99.99% uptime, and sometimes a bit more.
So what’s the difference, in terms of cost and technical complexity, between 99% and 99.9% uptime? What about the difference between 99.9% and 99.99%?
First off, 99.9% uptime sounds like a lot, but it’s nearly nine hours of downtime per year.
Can you imagine how you’d react if you had nine straight hours of downtime on your site? Not well.
99.99% uptime is still a non-trivial 50 minutes of downtime per year.
Every “9” you add to uptime (e.g. 99%, 99.9%, 99.99%) means an order of magnitude less downtime, and it’s often a multiple more complex and expensive to achieve. At some point, trying to eliminate a few minutes of downtime now and then means doubling or tripling the cost of the service.
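To make those numbers concrete, here’s a minimal sketch (in Python, purely for illustration) that converts an uptime percentage into the downtime it allows per year:

```python
# Convert an uptime percentage into the downtime it permits per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_per_year(uptime_percent):
    """Return the minutes of downtime allowed per year at a given uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR

for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year(nines)
    print(f"{nines}% uptime allows {minutes / 60:.1f} hours ({minutes:.0f} minutes) of downtime per year")

# 99%     -> ~87.6 hours per year
# 99.9%   -> ~8.8 hours per year
# 99.99%  -> ~53 minutes per year
# 99.999% -> ~5 minutes per year
```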
But let’s break that cost down for a moment.
For example, to avoid hardware and software downtime on a single server, you can run several other servers in a cluster that the first can fall back on. Running multiple servers instead of a single one multiplies the cost by the number of servers. Nearly any hosting company will have some quantity of redundant servers, but some providers pack in more customers than their secondary servers can absorb. When one server goes down, if the remaining servers don’t have enough capacity for 100% of the traffic, the cluster still goes down, despite the precaution.
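As a quick illustration of that capacity trap, here’s a small sketch (the numbers are assumptions, not any particular host’s configuration) that checks whether a cluster can survive losing one server:

```python
# Can a cluster of N identical servers absorb the loss of any one of them?
# 'utilization' is the fraction of each server's capacity used in normal operation.

def survives_one_failure(num_servers, utilization):
    """After one server fails, each survivor carries num_servers / (num_servers - 1)
    times its normal load, and must stay at or below 100% of its capacity."""
    if num_servers < 2:
        return False
    return utilization * num_servers / (num_servers - 1) <= 1.0

print(survives_one_failure(num_servers=3, utilization=0.60))  # True: survivors land at 90%
print(survives_one_failure(num_servers=3, utilization=0.80))  # False: survivors would need 120%
```

In other words, redundancy only helps if the surviving servers are kept lightly loaded enough to absorb the failed one’s traffic.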
But almost all the examples above were datacenter failures rather than single-server failures. To combat that, you need servers in entirely different datacenters, once again with sufficient capacity to handle 100% of the traffic alone, which means another 2x on top of a cost that was already 2-3x.
Avoiding all the issues above is at least 6x more expensive in hardware alone, not to mention significantly more human and administrative effort. Plus, ironically, every component you add as a redundant measure to prevent downtime also increases the likelihood that one of your system’s components will be having trouble at any given time.
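That irony is easy to quantify. Here’s a rough sketch (the per-component failure probability is made up for illustration):

```python
# With more components, the chance that *some* component is having trouble at any
# given moment goes up, even though good redundancy means a single failure no
# longer takes the whole system down.
# p is an illustrative per-component probability of being degraded at a random moment.

def prob_something_is_failing(num_components, p=0.001):
    return 1 - (1 - p) ** num_components

for n in (1, 6, 20):
    print(f"{n:>2} component(s): {prob_something_is_failing(n):.2%} chance something is degraded")

#  1 component(s): 0.10%
#  6 component(s): 0.60%
# 20 component(s): 1.98%
```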
In order to add redundant measures as a hosting provider, you have to add infrastructure. More infrastructure means more complexity, and adds more potential for trouble, which you then need to take steps to mitigate!
So what does all this mean? That we shouldn’t try? That we should just say “it’s hard, too bad” when things fail? That we shouldn’t continue investing in infrastructure and technology and techniques that our customers individually could never afford to pull off by themselves? Of course not, and in fact that’s exactly what WP Engine, and all the members of the hosting industry listed here and elsewhere, do. We’re always shooting for 100% uptime, and we always go into battle mode when there is the slightest blip of downtime.
It’s our responsibility to hold ourselves to ever-higher standards.
That’s why only 1% of our customers had trouble the other day, not the other 99%. That’s pretty good! But, we immediately began working to bring everyone back online, and then make improvements for the next time. If we don’t continuously improve, that 99%+ could slip. And next time it would be awesome if only 0.1% of our customers had trouble. The bar must always be moved higher.
But perfection is unattainable, for WP Engine and everyone else hosting WordPress, or anything else on the Internet. We can ask better questions than, “Which hosting has perfect, 100% uptime?”
Instead, we can ask:
- “What is the track record of a given host?”
- “How are incidents handled?”
- “How often do they happen?”
- “Are they for silly reasons or for understandable reasons?”
- “Is there enough staff to continuously improve and to handle events when they happen?”
- “Are there multiple datacenter options or just one?”
- …And so on.
Those are the questions that matter most when you evaluate a particular hosting platform and compare it to another. Every host is going to aim for as little downtime as possible, and each of the previous questions gets at the answer to the bigger one: “What are you doing to make sure you can mitigate this issue with zero downtime next time?”
How the company chooses to answer that question goes a long way toward letting you know that your websites, and therefore your business, are indeed in good hands.
For another perspective on this, Uri Budnik wrote a detailed post on the RightScale Blog titled “Lessons Learned from Recent Cloud Outages.”
Brian Krogsgard says
This is a good post for educational purposes.
I think the fatigue a lot of people feel (in general, not WP Engine specific) is that the reality of downtime doesn’t often mesh with the sales pitch.
Hosts sell uptime, and inevitably have downtime. Setting the expectation of “always up” and then failing is where disappointment for a customer sets in. How is a regular customer supposed to know it’s impossible? And at the same time, how can a host compete by saying, “We’ve got downtime, just like everybody else!” without losing business? Especially when the standard is to toss in all those 9s for % uptime.
Perhaps a little question mark to explain what 99.9% vs 99.99% uptime means (like hours down per month/year) would help, along with a link to an educational article like this, so that customers can be confident that their provider is a leader in the industry, but still fallible.
Without education, 99.9% and especially 99.99% looks like 100% to the average person. So therefore, even hitting a goal of 99.99% is a failure during the 0.01% in the customer’s mind.
I know it’s a catch-22, but maybe a nudge in the right direction, through education on the sales page, can help change expectations and result in happier customers.
Jon Brown says
Great post, and it doesn’t even cover a few other recent major outages at non-WordPress-specific hosts, like BlueHost, which just had a UPS failure take most sites offline for nearly an hour, and DreamHost, which IIRC had a name server fail recently.
All hosts will have downtime; for me, the service response is what makes the difference. Can the host effectively communicate to affected sites what is going on, why, and an ETA on a fix? Many hosts fail horribly at that moment of crisis (GD, NS & 1&1, I’m looking at you). BH and DH handled it extremely well the last round, with Twitter updates and a separate, easy-to-find-and-access status blog, as does WPE with the rare outage.
Brent Logan says
It’s important to distinguish between “uptime” and “availability” when comparing web hosts. When Apache crashes and my site goes down, but the server is still “up,” does that count as downtime? My understanding is that most (all?) hosts would say that is not downtime.
I suspect the same sort of factors and increasing costs come into play to increase the nines of availability.
Jason Cohen says
That’s a great point. We consider downtime as “the site doesn’t come up when you go to its domain.” Because of course that’s how your viewers would define it!
Of course, when it comes to *operational* metrics, it’s useful to separate things like hardware, specific services, clusters, the datacenter, the network outside the datacenter, etc., because that helps answer “what is the problem” or “where are the bottlenecks.”
Mark Garcia says
I don’t necessarily agree with the thoughts behind another datacenter handling 100% of the traffic and thus 2x the cost. Let me explain…
A common dilemma in the evolution of a startup’s success starts with the initial deployment and architecture of the infrastructure. The constraints are set by a budget, at a stage when popularity of the service does not yet dictate the need for a robust architecture. In other words, we all take a proof-of-concept idea and push it to a single datacenter, where services are made redundant within the confines of this singular world.
The problem stems from continuing to build within this singular world, where the application architecture is not yet designed to handle distributed traffic. The better plan is to not put all your eggs in one basket and instead devise an infrastructure that splits traffic between 3 or more datacenters, where each datacenter handles 33% or less of your overall traffic.
It is actually cheaper to handle 33% of your traffic in each of 3 datacenters than it is to provision for 200% of your traffic across 2 datacenters. If you lose 1 of 3 datacenters, then you are looking at a degraded web experience, where latency increases, but you are still ‘up’.
There are more sophisticated ways to splay traffic, especially when you are in the business of handling high-traffic sites. An alternative is to implement responsive routing, where traffic to a set of high-profile sites gets pinned to 1 of the 2 remaining datacenters that are up.
This comes back to the common dilemma, where the problem lies in the amount of work required to have an ecosystem operate cohesively across multiple datacenters. It’s a formidable challenge that takes a lot of time and money to pull off. I worked through these issues at 2 previous companies, and it is a much more sustainable model than replicating existing infrastructure that only gets used 0.01% of the time.
Jason Cohen says
That’s not quite right.
Suppose, as you say, you have three datacenters A, B, and C, each actively serving 1/3 of the traffic. Further suppose that the hardware allocation is at 80% capacity, and of course the hardware in each location is identical (because each is serving an identical, balanced traffic load).
Now suppose datacenter A goes down. B and C are each serving 50% of the traffic instead of the normal 33%. That’s 50% more traffic than they usually serve. Since they were already at 80% capacity, they are now at 120% of capacity, and they will fail.
*Not* higher latency, but outright out of capacity.
The solution there of course is you can’t run at 80% capacity. Supposing they were at 50% capacity in the first place, then a 1-center failure results in 75% capacity, which is OK.
But of course, to run at 50% capacity instead of 80% means adding more hardware to all three! Specifically it takes 60% more hardware.
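A quick sketch of that arithmetic (using the illustrative utilization targets from this thread):

```python
# Per-datacenter load after one of three datacenters fails and the other two
# split its traffic evenly.
def load_after_failure(num_datacenters, utilization):
    return utilization * num_datacenters / (num_datacenters - 1)

print(f"{load_after_failure(3, 0.80):.2f}")  # 1.20 -> survivors are over capacity and fail too
print(f"{load_after_failure(3, 0.50):.2f}")  # 0.75 -> survivors hold up

# Hardware scales with 1 / target utilization, so dropping the target from 80% to 50%
# means (1 / 0.50) / (1 / 0.80) = 1.6x the hardware in every datacenter: 60% more.
print(f"{(1 / 0.50) / (1 / 0.80):.1f}x")  # 1.6x
```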
Bottom line: I agree with your basic premise, which is that at large scale you might as well distribute across datacenters, and that’s better than “2x” or “3x” more. However, it’s still a lot more than it would otherwise be.
Furthermore, VERY FEW sites on Earth can afford to do three-center, multi-master development and hardware management. What about everyone else?
It is the “everyone else” whom I’m actually addressing in this blog post. In those cases, where the question is going from one server or one cluster to multiple datacenters, presumably you agree with this too.
Harsh Agrawal says
That’s a really detailed explanation, Jason. At one point I thought that, due to hardware issues or human errors, 100% uptime is impossible. Though with backup machines it could be very much achievable. The only thing is the price, which will surely add up for such a service.
Jason Cohen says
Exactly! This is an instance where “throwing more money at it” can add 9’s to the uptime.
Then the question simply is — what trade-off of money and 9’s is appropriate for your particular use case?
Of course we would argue that for many folks — certainly not all! — our service (and others!) are a smart trade-off.
Sushant says
Yeah, you are right, bro.
Zach Russell says
Jason,
I love the insight you provide in this article about uptime. I have used many other hosting providers, some shared but mostly dedicated servers, and have had real problems with downtime. Of course, it never seemed to be the host’s fault, but I wouldn’t expect them to take accountability for their service going down during times like the switch to IPv6.
For anyone who is wondering, I use WP Engine for ALL of my websites now, and have for several months. From a hosting perspective, I haven’t experienced any downtime. Their customer support is the best in the industry, and they are accountable for all websites hosted with them. If you’re on the fence, go with WP Engine.
DaveZ says
I recognize it’s not apples to apples, but Zippy charges $25/mo for 100,000 pageviews and Page.ly charges $64 for 200k while WPEngine is $100 for 100k. As a blogger on a budget, I have a hard time wrapping my mind around that.
Brian says
Great post! I couldn’t agree more with what you said, but as a hosting industry veteran, I would like to add that you would never want to combine the application and database tiers on a conventional server or single VM, no matter what the cost savings. This is an administrative nightmare and a recipe for disaster. All you need is some database-intensive websites or applications and the server becomes overwhelmed. At the very least, you would want an app server and a database server for each customer pod. I’m sure this is the “great debate” that rages each and every day at hosting companies around the world: do we add more hardware to address infrastructure issues, or more system admins and DBAs to address problems associated with overpopulated hardware? Thanks for allowing me to comment. Again, great post!
Kirby Prickett says
Thanks for your feedback Brian.
We do everything we can to not have overpopulated hardware, but you make a very valid point.
– Kirby
Tyler says
“Uptime is never 100%”
If this is true, how have LinkedIn and Twitter maintained 100% uptime for the last 1000 hours? (I started the monitors on June 7th)
I started monitoring the URLs of my profiles on those sites via UptimeRobot for study and as a point of reference, to compare against the monitoring of my personal websites and clients’ websites.
It will be interesting to see how long it takes before one of them kicks back anything other than “200 (OK)”.
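For anyone curious, the idea behind that kind of monitor is simple; here’s a bare-bones sketch (the URL and check interval are placeholders, and a service like UptimeRobot handles this for you):

```python
# Bare-bones uptime monitor: request a URL on a schedule, record whether it
# returned HTTP 200, and report the observed uptime so far.
import time
import urllib.error
import urllib.request

URL = "https://example.com/"   # placeholder: the page you want to watch
INTERVAL_SECONDS = 300         # placeholder: check every 5 minutes

def is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

checks = successes = 0
while True:
    checks += 1
    if is_up(URL):
        successes += 1
    print(f"observed uptime so far: {successes / checks:.3%}")
    time.sleep(INTERVAL_SECONDS)
```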
Tyler says
I was thinking about this further and re-read this article (as well as others), and I think I understand now. The “never” qualifier did specify “in a meaningful time-scale (like years)” so compared to my example which is roughly 1.5 months, that would not qualify as a significant sample, by this article’s definition. Fair enough.
What I’ve been pondering lately is how to determine whether a host will be an improvement over one’s current host, short of actually switching to them, crossing our fingers, and seeing how it goes.
My current host has more downtime than I’d like, and I’m not paying much so I’d be willing to pay more for less frequent downtime, but how can we reliably quantify beforehand what the average frequency and length of a given host’s downtime will be?
rahul says
In my opinion, it is not possible to get 100% uptime.
Satish Kumar Ithamsetty says
100% uptime… is it possible?
Jenny says
Good information.
Shanaya Sharma says
I have 2 websites with WP Engine, and most of the time I have received 100% uptime. This is far better than HostGator. I have one website on HostGator which has 99% uptime; I’m thinking of shifting it to WP Engine soon.
Emran says
WP Engine is the best hosting, and the uptime, yeah, as you said, 100 percent. But I am waiting to grow more traffic; after that I will shift 100 percent to WP Engine.