Maximizing Cloud Uptime

For enterprises, the cloud can be as much of a problem as an opportunity. If employees can’t access the cloud, or if the data centers and other cloud infrastructure suffer an outage, productivity and sales can grind to a halt. Wireless is the latest wild card: By 2016, 70 percent of cloud users will access cloud applications and services via wireless, Ericsson predicts. Wireless is even more unpredictable than fiber and copper, so how can enterprises ensure that wireless doesn’t jeopardize their cloud-based systems?

Bernard Golden — author of Virtualization for Dummies and CEO of HyperStratus, a cloud computing consultancy — recently spoke with Intelligence in Software about the top pitfalls and best practices that CIOs and IT managers need to consider when it comes to maximizing cloud uptime.


Q: What are the top causes of cloud service unreliability? What are the weak spots?


Bernard Golden: There are issues that are common when using any outside resource, and resulting questions you need to ask to identify the weak spots: Does the network go down between you and the provider? In terms of the external party’s infrastructure operations, how robust is their computing infrastructure environment? You might have questions about their operational practices in support: Do they get patches so things don’t crash?

If cloud computing is built on virtualization, and virtualization implies being abstracted from specific hardware dependence, have you designed your application so it’s robust in the face of underlying hardware failure? That’s more about whether you’ve done your proper application architecture design. Many people embrace cloud computing because of its ability to support scaling and elasticity. Have you designed your application to be robust in the face of highly variable user loads or traffic?
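One common way to make an application tolerate the loss of an underlying host is to retry a failed request and then fail over to a redundant instance. The Python sketch below illustrates that idea only; it is not a design Golden describes. The function name `call_with_failover`, its parameters, and the two stand-in hosts are hypothetical, and each "replica" is just a callable standing in for a real network request.

```python
import time


def call_with_failover(replicas, attempts_per_replica=2, base_delay=0.1):
    """Try each replica in turn, retrying with exponential backoff.

    `replicas` is a list of zero-argument callables, each standing in for a
    request to one redundant instance of the application (an assumption made
    for this sketch).
    """
    last_error = None
    for replica in replicas:
        for attempt in range(attempts_per_replica):
            try:
                return replica()
            except ConnectionError as err:
                last_error = err
                # Wait longer after each failed attempt on this replica.
                time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all replicas failed") from last_error


def failed_host():
    """Hypothetical instance whose underlying hardware has gone away."""
    raise ConnectionError("host unreachable")


def healthy_host():
    """Hypothetical instance that is still serving requests."""
    return "ok"


result = call_with_failover([failed_host, healthy_host], base_delay=0.01)
print(result)
```

The caller never learns that the first instance failed. That is the kind of robustness to hardware failure the question above is probing for, although a real application would also need health checks and replicated state.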


Q: What can enterprises do to mitigate those problems? For example, what should they specify in their service-level agreements (SLAs)?


B.G.: There’s a lot of discussion about SLAs from cloud providers, but really it’s every link in the chain that needs to be evaluated. Do you have a single point of failure? Maybe you need two different connectivity providers.

Some people put a lot of faith in SLAs. We tend to caution people: At the end of the day, SLAs are great. They’re sort of like law enforcement: It doesn’t prevent crime, but it responds to it. It’s not, ‘I’ve got an SLA, so my system will never go down.’ Rather, an SLA means that the vendor pledges to have a certain level of availability.

So you have to evaluate, what do I expect is the likelihood that they’re going to be able to accomplish that? You need to make a risk evaluation. For example, there was a big outage at Amazon in April 2011. Many early-stage startups use Amazon as their infrastructure, so a number of them went down until Amazon was able to fix that.

There were other companies that had evaluated the risk of something like that happening in designing their application architectures and their operational processes. They said, ‘The importance of this application is such that we need to take the extra time and care and investment to design our overall environment so that we’re robust in the event of failure.’

Whatever you get from the SLA will never make up for the amount of lost business in the case of a failure.
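A rough Python sketch makes the arithmetic behind that last point concrete. Every figure in it (the 99.95 percent availability target, the monthly fee, the 10 percent service credit and the hourly revenue) is a hypothetical assumption chosen for illustration, not a number from any provider's actual SLA.

```python
HOURS_PER_MONTH = 30 * 24  # simplifying assumption: a 30-day month


def allowed_downtime_hours(availability_pct, hours=HOURS_PER_MONTH):
    """Hours of downtime per month that still satisfy the availability target."""
    return hours * (1 - availability_pct / 100)


def sla_credit(monthly_fee, credit_pct):
    """Service credit paid back when the target is missed (hypothetical terms)."""
    return monthly_fee * credit_pct / 100


def lost_revenue(outage_hours, revenue_per_hour):
    """Business lost while the application is unavailable."""
    return outage_hours * revenue_per_hour


# All values below are illustrative assumptions.
budget = allowed_downtime_hours(99.95)
credit = sla_credit(monthly_fee=5_000, credit_pct=10)
loss = lost_revenue(outage_hours=8, revenue_per_hour=2_000)

print(f"Downtime allowed at 99.95%: {budget:.2f} hours per month")
print(f"SLA credit after a breach: ${credit:,.0f}")
print(f"Revenue lost in an 8-hour outage: ${loss:,.0f}")
```

Under these assumed numbers, the credit comes to $500 against $16,000 in lost revenue. The exact figures will differ for every business, but the gap between the two is what the risk evaluation has to weigh.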

Q: Sometimes there’s also a false sense of security, such as when an enterprise buys connectivity from two different providers to ensure redundant access to the cloud. But it could turn out that provider No. 1 is reselling provider No. 2’s network, and a single fiber or copper cut takes out both links.

B.G.: You get two apologies instead of one. That’s a really good point. You can characterize that as incomplete homework.
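The standard availability formulas show why the shared network matters. Components that must all work multiply their availabilities, while redundant components fail only when every one of them fails, provided their failures are independent. The Python sketch below applies both rules. The 99 percent figures are hypothetical, and the function names `serial` and `parallel` are chosen here for illustration.

```python
from math import prod


def serial(*availabilities):
    """Availability of components that must ALL work (links in a chain)."""
    return prod(availabilities)


def parallel(*availabilities):
    """Availability of redundant components where ANY one suffices.

    Only valid if the components fail independently of each other.
    """
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: each access link is up 99% of the time.
link_a = 0.99
link_b = 0.99

# Truly independent links: both must fail at once to lose access.
independent = parallel(link_a, link_b)

# Provider 1 resells provider 2's network: both contracts ride on one
# physical path (assumed 99% available). The two contracts are modeled as
# perfectly reliable (1.0) to isolate the effect of the shared path.
shared_path = 0.99
resold = serial(shared_path, parallel(1.0, 1.0))

print(f"Independent links: {independent:.4f}")
print(f"Resold network:    {resold:.4f}")
```

With independent links, the assumed availability rises from 99 percent to 99.99 percent. When both links share one physical path, it stays at 99 percent no matter how many contracts are signed.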

Q: Business users and consumers are increasingly using wireless to access cloud services. What can enterprises do to minimize the risk that wireless’ vagaries will disrupt that access?

B.G.: That strikes me as very challenging, depending on the type of wireless. For example, with an internal Wi-Fi network, you could mitigate those kinds of risks pretty well, and they’re probably not a lot worse than if you had wired Ethernet.

Out in the field, if you’re talking about somebody using a smartphone or tablet connected over 3G, I don’t know that there’s much a company can do about that. You could evaluate who has the best 3G network, but you’re always going to face the issue of overloads or dead spots.

Q: That goes back to your point about doing your homework. For example, an enterprise might choose to get wireless service from Sprint because it resells 4G WiMAX service from Clearwire. So if Clearwire’s network is unavailable in a particular market, the enterprise’s employees still can get cloud access over 3G, which is a completely separate, Sprint-owned network. The catch is that those options are pretty rare.

B.G.: They are, unfortunately. It would be great if there were more WiMAX.

Lots of times, people over-assess the risks of the cloud while under-assessing the risks of whatever the alternative might be. The fact is that most organizations don’t have redundant connectivity to their data center from two different providers entering from two different sides of the building. They’re not as careful with their own stuff as they insist someone else be.

Q: Or they’ll do it right for their headquarters, but then not be as diligent for their satellite offices.

B.G.: Absolutely. What happens a lot is that people make intuitive risk assessments. When it comes time to make that evaluation, it’s, “Well, we’ve got to support the headquarters, but we don’t have enough budget for those remote offices.” Now what they do is say, “If you’re in a remote office and it goes down, just go down to Starbucks.”

We always tell our clients that cloud providers, in terms of what they bring to the table, are probably going to be as good as best practices, or much better than what’s available, because those things are core competencies for them. Most IT organizations are cost centers, and everybody is always asking them: “How can we squeeze this? How can we put this off?”

Major cloud providers don’t have that option. They can’t say, ‘We didn’t upgrade to the latest Microsoft patch because that would require us to move to the newest service pack.’ They just can’t do that from a business perspective.

Photo Credit: ©iStockphoto.com/adventtr

by Tim Kridel