It’s common for hosting and cloud service providers to get support calls about the performance of a hosted application being ‘too slow’. It is in fact so common that many tenders for new hosted or SaaS applications ask detailed questions about the compute, storage, and network capabilities of the hosting platform.
I have seen RFIs that wanted to know specifically how long it took for a database query to be answered, or what the SLA was for a specific screen refresh while using the application. I also often see questions about network latency between the provider’s data centre and the customer location.
All of these questions are intended to try to get a feel for how long users will have to sit waiting for the system to complete a task. Given that those users might well be on the phone to a customer, you can certainly understand why it isn’t acceptable for them to be waiting ages for the system to respond; after all, we’ve all been on the receiving end of that problem at some stage.
So, what’s the issue?
The problem with trying to pin down the response time of a system and put the blame on the hosting provider is that the provider doesn’t control the whole process. While the best providers can, and do, do everything in their power to ensure that the systems they host applications on perform as well as possible, there are elements outside their control.
One issue is that the software developers and the infrastructure hosting specialists are different teams, whether they work for the same company or separate ones, so there may be a disconnect between the way the software is designed to be run and the way it has been implemented. I’m not the only person who has spent a considerable part of their career trying to minimise this sort of confusion through detailed consulting with both teams; it’s a major part of any cloud architect’s role.
The next challenge to providing a guarantee of performance is the connectivity between the hosting/cloud provider and the customer. If that connection is via the Internet, then all bets are off! There really isn’t any way to ensure that all of the multiple hops that make up any connection across a public network are performing optimally all the time.
What can be done?
Software should now be designed with this in mind, retrying communication requests rather than passively awaiting a response. However, some applications built on legacy technology stacks still haven’t been redesigned to cope with the unreliability of the network. If there is a direct connection between the data centre and the customer premises, the situation is definitely better, but of course that connectivity comes at an additional cost.
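To make the retry idea above concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not how any particular application does it: the function name `call_with_retries`, its parameters and the choice of exponential backoff with a little random jitter are all hypothetical, and `request` stands for whatever call the application makes across the network.

```python
import random
import time


def call_with_retries(request, attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    Hypothetical helper: `request` is any zero-argument callable that
    raises TimeoutError or ConnectionError when the network lets it down.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return request()
        except (TimeoutError, ConnectionError) as error:
            last_error = error
            if attempt == attempts - 1:
                break  # no attempts left, give up below
            # Wait longer after each failure (0.5s, 1s, 2s, ...), capped
            # at max_delay, plus jitter so clients don't retry in step.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise last_error
```

The important design point is that each individual request must have its own timeout, so that a lost response turns into an error the caller can retry rather than a screen that hangs indefinitely.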
If the hosting provider doesn’t provide the WAN links, then there’s inevitably potential for a ‘blame game’ between them and the carrier, with both sides likely to claim the poor performance is the other’s problem.
Most customers would think that once they have addressed the problems above, they are guaranteed to get acceptable performance from the hosted or SaaS application. Unfortunately, I’ve been involved with a number of support escalations where everything listed above was eliminated as the cause, but users were still seeing unacceptable system response times.
How can this be resolved?
In a couple of specific cases the customer was concerned enough about the system performance that they were talking about invoking non-performance contract termination clauses. In both these cases I went on-site and sat with users and timed how long it took for an application to perform some standard actions. I was able to confirm that the software (both were running the same solution, which was performing completely acceptably for dozens of other hosted customers) was significantly underperforming in both locations.
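I did the timing described above by hand, but the same idea can be scripted where a standard action can be triggered from code. The sketch below is an assumption-laden illustration: `time_action` and `summarise` are hypothetical names, and `action` stands for whatever repeatable step you want to measure.

```python
import statistics
import time


def time_action(action, repeats=5):
    """Run action() several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Report the typical and the worst time a user had to wait."""
    return {
        "median": statistics.median(durations),
        "worst": max(durations),
    }
```

Running the same measurement at a site that is performing well and at one that is not gives you figures to compare, rather than a general impression that the system is ‘too slow’.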
Following this, the hosting company and software provider committed to investigating the whole service ‘from data centre to desktop’, using WAN monitoring tools, performance monitoring, a review of the database and (a master stroke, as it happens) a Viavi Observer GigaStor, which would be put on the LAN to see whether any local problems could be identified.
If you haven’t come across the GigaStor, it is effectively a Sky+ box for your network; it captures every packet that crosses the network and stores it for later analysis. The agreement was that the cost of the investigation would be shared between the software provider and the hosts, unless it turned out that the problem wasn’t related to the software or service, in which case the customer would pay for it. This in-depth investigation and monitoring would run for a full month to ensure that any peaks in load were taken into account.
What was the outcome?
It probably doesn’t come as any surprise to learn that in both cases the LAN was identified as the problem. After all, both the software and infrastructure support teams had spent a lot of time investigating the issue before the agreement to investigate the whole chain of communication was reached, so both teams were reasonably confident the problem wasn’t in their respective areas.
The root causes were different (a NIC on a local server ‘chattering’ in one case and a problem with a network switch in the other), but the interesting thing was that once the problems were identified and fixed, users on both sites started to report that they were suddenly getting better performance from other systems as well. It was a shame that the reports of other systems being slow weren’t recognised as potentially linked before this point, as that would have saved everyone time, but sometimes it is hard to get an overview of how complex systems and services interact.
What can be learnt from this?
This really emphasises the point that the next time you are buying business software and considering whether a cloud solution will perform as required, you need to consider the whole solution, from desktop to data centre. Advanced can help you build the right environment for your needs and manage the process end-to-end, freeing up your resources. Learn more about our services here or contact firstname.lastname@example.org to get started.