Sunday, 14 September 2014

Clouds for performance testing

Cloud computing has been a buzz word in IT few years ago, and now it is rapidly becoming industry standard rather than some new thing. Clouds have matured quite a lot since they got public. Now, I do not know the full history of clouds, but in the last few months I had an opportunity to work with few of them and to assess them for my very particular purpose: performance testing in the cloud. What you need for performance testing is a consistent performance (CPU, memory, disks, network) and if that's a cloud - also an opportunity to quickly and easily bring environment up and down.  

These are the clouds I tried out:
  • HP Cloud
  • Rackspace
  • AWS
  • MS Azure
Without going much into details (to avoid breaking any NDAs), lets just say, that in each cloud I deployed a multi-tiered web application using either puppet master or (in case of Azure where I only really looked at the vanilla Cassandra database, see explanation below) internal cloud tooling. Then I loaded the solution and monitored resource utilisation, response times, throughput, and I noted down any problems that got in the way.

And here’s what I’ve got.

HP Cloud. The worst cloud I’ve seen. The biggest problems I’ve encountered are the following:
  • Unstable VM hosts: two times a VM we used suddenly lost the ability to attach disks, which practically caused DB server to die, and us to lose extra day on creating and configuring a new DB server.
  • Unstable network: ping time between VMs inside the cloud would occasionally jump from 1ms to 16-30ms.
  • High steal CPU time - which means that VM would not get the requested CPU time from the host. During testing it got as high as 80% on the load generating nodes, 69% at the database server, 15% on the application servers.
There were also minor inconveniences such as:
  • It is impossible to resize a live VM: you’ll need to destroy it, and then to recreate, if you need to add RAM or CPU.
  • There is no option to get dedicated resources for a VM.
  • Latest OS versions were not available in the library of images, which means that if you need a new OS version, you’ll have to install it manually, create a customised VM image, and pay separately for each license.
  • Sometimes HP Cloud would have a maintenance that puts VMs offline for several hours.

HP Cloud was the starting point of my cloud investigation, and it was obvious we cannot use it for performance testing. So next I moved to Rackspace - another Openstack provider, more mature and powerful than HP Cloud. More expensive, as well. In Rackspace I didn’t have any problems with steal CPU time, nor with resizing VMs on the fly. It was a stable environment allowing to do benchmarking and load testing. However, it also had a bunch of problems:
  • Sometimes a newly provisioned VM wouldn’t have any network connectivity but through Rackspace web console. Far more often a new VM wouldn’t have network connectivity for a limited amount of time (2-5 minutes) after the provisioning, which caused our Puppet scripts to fail and thus caused a lot of trouble in provisioning test environments. Rackspace tech support has been aware of the issue, but they weren’t able to fix it in the time I was on a project (if they fixed it later, I wouldn’t know).
  • There were occasional spikes in the ping times up to 32 ms.
  • Hardware in Rackspace wasn’t up to our standards: CPU we got didn’t have a lot of cache, so our application would stress out CPU much more than on the hardware we used “at home”. That practically meant that to get the performance we wanted we’d need at least twice as much hardware, which was quite expensive.

After Rackspace we moved onto AWS (my colleagues did more stuff on AWS, than me, thus “we”), and we were amazed at how good it was. In AWS we didn’t have any of the problems we had in Openstack. AWS runs on good hardware (including SSD disks), allows to pay for dedicated resources (but we didn’t have to do it, because even non-dedicated VMs gave consistent results with zero steal time!), shows consistent small ping times between VMs, has a quite cool RDS service for running Amazon-managed easy-to-control relational database servers.

Yet, AWS is not cheap. So we thought we'd quickly try MS Azure to see if it can provide comparable results for a lower price. Because I was to compare Azure vs AWS in few specific performance-related areas (mostly I was interested to see CPU steal times, disk and network performance), I ran few scalability tests for the Cassandra database. Cassandra is a noSQL database, that is quite easy to install and start using. What was cool for my purposes, it has a built in performance measuring tool named cassandra-stress. It's a fast to setup and extremely easy to run test, and also Puppet just wouldn't work with Azure, so instead of the multi-tiered web application I went with Cassandra scalability test.

MS Azure wasn’t actually that bad, but it is nowhere near AWS as an environment for running high loads:
  • The biggest problem seemed to be network latency. Where AWS was doing perfectly fine, Azure had about 40% failures on timeouts on high loads. Ping times between nodes during tests were as high as 74 ms at times (compared to 0.3 ms in AWS under similar load). From time to time my SSH connection to this or that VM would break for no apparent reason.
  • Concurrently provisioning VMs from the same image is tricky: part of the resources is actually locked during VM creation, and no other thread can use it. That caused few "The operation cannot be performed at this time because a conflicting operation is underway. Please retry later.” errors when I was creating my environment.
  • Unlike AWS, Azure doesn’t allow you to use SSD, which means a lower disk IO performance. Also in Azure there are limitations on the number of IOPS you can have per storage account (though to be fair there is no practical limitation on how many storage accounts you can have in your environment). Even using RAID-0 of 8 disks didn’t allow me to reach the performance we easily had in AWS without a RAID.
  • For some reason (I am not entirely sure it was MS Azure fault) CPU usage was very uneven between the Cassandra nodes, even though the load on each node was pretty much the same.
  • I was not able to use Puppet because the special Puppet module for MS was out of sync with the Azure API.
This being said, Azure is somewhere near Rackspace (if not better) in terms of performance, and is quite easy to use. For a non-technical person who wants a VM in the cloud for personal use I’d recommend Azure.

For running performance testing in the cloud, AWS is so far the best I've seen. I also went through few of Amazon courses, and it looks to me like the best way to utilise AWS powers is to write an application that would use AWS services (such as queues and messages) for communicating between nodes.

As a summary: from my experience I would recommend to stay away from the HP Cloud, to use MS Azure for simple tasks, to use AWS for complicated time-critical tasks. And if you are a fan of Openstack - Internet says Rackspace is considered to be the most mature of the Openstack providers and to run the best hardware.