Monday, 13 July 2015

Security ramble

All the recent hacks where mass amounts of personal data has been exposed made me wonder, whether in time public perception of privacy and data security will change.
What I mean is people nowadays seem very much surprised and distressed whenever their data gets stolen, be it photos from iCloud, or SSN and address info from governmental databases, or PHIs from health and insurance providers. It's almost like your average Joe or Jane do not expect it to ever happen... but data gets stolen all the time. And I don't see any reason for these hacks to stop in the near future.

Wouldn't it be more reasonable to assume that every information storage system will be hacked, and any data will be stolen? This assumption will give you state of mind and tools to concentrate on active monitoring and mitigation plan, whereas nowadays it looks like people mostly concentrate on preventing the hack (some big hacks went unnoticed for many months!).

I would much prefer if people responsible for the systems where my personal data is stored:

  • Assumed they are gonna be hacked.
  • Made sure when it happens they will notice (automated smart monitoring systems).
  • Made sure it is complicated and/or expensive to use stolen data to harm me (block bank accounts, make it possible to cancel ID easily, make it hard to make sense of my PHI without some key that is also easy to cancel/revoke, make sure devices that can physically harm me have inbuilt protection against that physical harm - e.g. e.g. it shouldn't be possible to program heart pacifiers to murder its carrier).
  • Worked on making the attack expensive (we are gonna be hacked, but it will be annoying, frustrating and expensive process for a hacker) and long (store unrelated data in different disconnected places, so you have to do a separate hack for each of the pieces).
And I myself am assuming my data can be stolen at any point, so I am trying to behave with that assumption in mind:
  • There are no private emails or photos that, if made public, will harm me - I do not put stuff that can harm me in the internet. I don't say shitty things about people behind their backs. I do not lie. Not that I naturally feel the need to do all that stuff, but assuming you can get exposed at any moment does provide additional motivation to withhold from being a dick.
  • My money are stored in different places, and my cards are not connected to my savings.
  • My most important email account is behind a 2fa authentication, and it is connected to my phone, so if it is compromised, I will notice, and I can block it fast.
  • And last but not the least, I am mentally prepared it can all fail me. If that happens it will mess me up a bit and create some hassle to block/change/restore cards, accounts and IDs, but it will not be the end of the world.

Thursday, 9 April 2015

Oracle troubles and findings: tuning experience

I am not an Oracle DBA. But with performance testing I more often than not end up creating the whole environment, which means setting up Oracle server as well. Since I am not testing Oracle server specifically, I usually only tune it enough for it to not be the bottleneck. Lucky for me Oracle server is actually pretty cool compared to applications I test, and only gets to be a bottleneck in scalability testing where it serves multiple application servers. Still... recently I've bumped into a set of correlated problems that led me to tuning effort on the Oracle server itself.
Sponsored by Internet - meaning that all that I've done was googled in the internet and applied with fingers crossed.

So, here it goes.

First problem was the following: when I went from 4 application server nodes to 6 application server nodes, response times high rocketed. It was obviously an infrastructural problem, and after a while Oracle was the only culprit. Unlike usually, CPU usage wasn't that high on the oracle, so I had to dig a bit deeper, and guess what I found: concurrency issues such as "cursor pin S on X"!

To avoid doing hard parses, oracle puts any new query and it's execution plan into a shared cursors tree. Access to that tree is controlled by mutex pins algorithm. Only one session can grab a mutex pin for a specific cursor at a time. Also, similar queries are being put as leaves with a common root, and my understanding is the whole root is being pinned during any updates in the tree...

Anyway, it was happening for two reasons:
1. Application under test was using queries with literals where it should've been using prepared statements.
2. My oracle version ( had known issues around shared cursors tree.

So I've updated to and set CURSOR_SHARING=FORCE. What this option does is it replaces all literals in all queries by system variables, which effectively means that all the queries that only differ in literals are now treated as the same query, they have the same cursor, and cursors tree doesn't need to be constantly updated. This took care of concurrency issues. It also created another problem.

Suddenly one of my other queries which was never a problem before, a very simple and well behaved query, became a huge bottleneck. It would take thousands of CPU cycles to execute where before it was tens of cycles! This one took days of my time, numerous experiments that slightly improved the situation but didn't solve the main issue, and in the end I had to go to DBAs for help.

Turned out that innocent "1=2" in that query (which was there because the query was dynamically generated with optional conditions) was replaced by something like ":SYS_0=:SYS_1", and that meant Oracle was grabbing those variables and evaluating the clause again and again for each row in a huge table (I would think it would do it once, understand it's FALSE and leave it at it - but no).
This was of course the result of CURSOR_SHARING=FALSE. I'll say in advance, that I got exactly the same behaviour with CURSOR_SHARING=SIMILAR.

The suggested fix in my case was either to switch to prepared statements everywhere so that we don't need to use CURSOR_SHARING=FALSE/SIMILAR, or to remove "1=2" from the query that suffered from that setting. Can't have it both.

Other useful tuning:

  • Increasing shared_pool_size.
  • Increasing session_cached_cursors.
  • Weirdly enough, locking statistics on selected columns helped.

Sunday, 1 February 2015

HAProxy balancing https backends

Recently I needed to configure load balancing in my environment, where I needed to balance between few https servers with sticky sessions enabled. I looked in the haproxy manual, I googled, I asked - and for days there was no making it work.

Most of the haproxy configuration examples out there are for the case when client connects to haproxy via https, and then haproxy decrypts it and balances requests between http backends. Few examples around https backends assumed that no sticky sessions are needed, so they all sit on top of tcp. To this day I have not found a guide or an example of how to configure what I need, so once I figured out how to do that, I thought I'd share.

So the way you do it is:
0) You need haproxy 1.5+. haproxy before that did not support https on its own.
1) A client connects to haproxy via https. There need to be a certificate+private key combination (that client would trust) on the haproxy server.
2) HAProxy decrypts the traffic and attaches a session cookie. If the cookie is already there, it knows where to send the request further.
3) HAProxy encrypts the traffic again before sending it to backend (where backend can decrypt it).
4) and the other way around.

And the configuration for that is:

  • For both backend and frontend you should have mode http.
  • In the bind line you need to add ssl cert <path to haproxy certificate + private key file>.
  • In the backend section you need to set load balancing algorithm - e.g. roundrobin or leastconn.
  • In the backend section you also need to set a cookie - e.g. cookie JSESSIONID insert indirect no cache.
  • For each server you need to say "ssl" after the ip, and then also set a cookie.
For me the one part I couldn't find in any guides was to put "ssl" in the server line (as well as in the bind line). I might have missed it somewhere in the not-so-helpful haproxy manual.

One thing I didn't go into was setting up a proper certificate on a backend servers in my environment, because of course in test environment they are self signed and all that. In order to work around it, just add another global setting to the haproxy settings: ssl-server-verify none.

And here's the example of the config file frontend & backend sections to make it work:

frontend  main
    mode http
    bind :443 ssl crt /etc/haproxy/cert.pem
    default_backend app

# round robin balancing between the various backends
backend app
    mode http
    balance     roundrobin
    option httpchk GET /concerto/Ping
    cookie JSESSIONID insert indirect nocache
    server  app1 ssl check cookie app1
    server  app2 ssl check cookie app2
    server  app3 ssl check cookie app3
    server  app4 ssl check cookie app4

Monday, 15 December 2014

We suck in security

And by “we” I mean humanity, at least the part of humanity that uses computers to create, store and share information. This is my main take from this year’s kiwicon 8. And this is the story of how I got there.
Disclaimer: I’m just gonna assume that presenters knew what they were talking about and use what they said shamelessly, and then give details in the further blog posts (or for now you can probably look at the posts of other attendees).
According to Rich Smith, Computer Security appears where Technology intersects with People. Makes sense. So lets look at the technology and at people separately.

1. Technology.
Just at this two-days conference vulnerabilities have been demonstrated in the very software that we rely on in keeping us safe: turns out well established Cisco firewall and most of the anti viruses are pretty easy to hack (for people who do that professionally). That’s like a wall with many holes. You get relaxed because wow wall, and next thing you know is all your sheep are stolen.

Firewalls and anti viruses are software, but the problem runs deeper. Protocols and languages! Internet has not been designed for security, and looks like all the crutches we built for it since don’t help as much as you would hope. JavaScript with its modern capabilities is always a ticking bomb in your living room, and now there is also WebRTC that was designed for peer-to-peer browser communications, and that helps tools like BeEF hide themselves. BeEF is stealth as it is, so maybe it doesn’t make an awful lot of difference, but you can see the potential: where earlier BeEF server would control a bunch of browsers directly, now it can also make browsers control other browsers. Thank you, WebRTC. You would think that technologies arising these days would try to be secure by design, but oh well…

Wanna go even deeper? Ian “MCP” Latter presented proof of concept for new protocol that allows to transfer information through screen and through programmable keyboard. He also demonstrated how exploitation framework built on top of these protocols allows perpetrator to steal information bypassing all the secure infrastructure around the target. The idea is you are not passing files, you are showing temporary pictures on the monitor screen, you capture this stream of pictures, and you decipher information from those pictures. Sounds like something out of “Chuck”, yet this is a very real technology. As its creator said, “By the nature of the component protocols, TCXf remains undetected and unmitigated by existing enterprise security architectures.”
Then there are also internet-spread vulnerabilities that got known in the last year, that for me as a bystander sound mostly like: a lot of people build their products on top of some third-party libraries. When those libraries get compromised, half of the internet is compromised. And they do get compromised.
Then there is also the encryption problem, where random numbers aren’t really as random as people using them think. But compared to all above it sounds like the least of our problems.
Okay, so technology isn’t as secure as we want, what about people?

2. People
People are the weakest link. Forget technology, even if we were to make it perfect, people would still get security compromised. And according to many speakers on the kiwicon, so far security area sucks in dealing with people. There is wide-spread default blame culture: when someone falls a victim to social engineering, they are getting blamed and fired. That is hardly how people learn, but that is exactly how you create atmosphere in which no one would go to security team when in doubt because of the fear of getting fired. Moreover, we don’t test people. We don’t measure their “security”, and we don’t know how to train them so that training would stick - because we don’t know what works, and what doesn’t.
So, we have problems with technology and with people. What else is bad?

There are plenty of potential attackers out there. Governments, enforcement agencies, corporations, individuals with various goals from getting money to getting information to personal revenge… they have motivation, they have skills and tools, and it is so much cheaper to attack than it is to defend (so called “Asymmetric defence”).
To make it even easier for attackers, targets don’t talk to each other. They don’t share information when they were attacked, to report such a thing is seen as to compromise yourself. And even when information is willingly shared, we don’t have good mechanisms to do that, so we do it the slowest way: manually. It might be easy enough in simple cases, but as @hypatia and @hashoctothorpe said, complex systems often mean complex problems, which would make them hard to describe and to share.

So, we suck in security. This is quite depressing. To make it a bit less depressing, lets talk about solutions that were also presented on the kiwicon (in some cases).

Most solutions were for the “People” part of the problem. Not one but three speakers talked about that.
The short answer is (in words of Etsy’s Rich Smith): ComSec should be Enabling, Transparent and Blameless.
The slightly longer answer is:
  • Build culture that encourages people to seek assistance from the Security specialists and to report breaches (don’t blame people, don’t try to fix people - fix the system).
  • Share information between departments and between organizations.
  • Proactive reach: for security team to reach to development and help them develop secure products.
  • Build trust.
  • Recognise that complex system will have complex problems.
  • Do realistic drills and training, measure the impact of training and adjust it.
People are reward driven and trustful by default. What makes it a problem is that people are thus highly susceptible to social engineering methods which are many. This can’t be fixed (do we even want it fixed?), but at least we can make it super easy to ask professionals for help without feeling threatened.

Okay, so situation in the People area can be improved (significantly if everyone were to follow Etsy’s culture guidelines) - at least for some organizations. What about the Technology area? Well… this is what I found in the presentations:
  • Use good random numbers.
  • Compartmentalize (don’t keep all eggs in one basket, don’t use flat networks, don’t give one user permissions to all servers, etc.).
  • Make it as expensive as possible for attackers to hack you: anti-kaizen for attackers, put bumps and huge rolling stones in their way, make it not worth the effort.
  • Know what you are doing (e.g. don’t just use third-party libraries for your product without verifying how secure they are).
  • …?
This is depressing, okay. In fact, I’m gonna stop here and let you feel how depressing it is. And then in the next posts I’ll write about more cheerful things. Kiwicon was really a lot of fun and epicness (I was in a room full of my childhood heroes, yeeey!). And there was a DeLorean. Doesn’t get much more fun than that. :-D

Wednesday, 19 November 2014

R for processing JMeter output CSV files

So, I'm working as a performance engineer, and I run a lot of tests in JMeter. Of course most of those tests I run in non-GUI mode. To get results out of non-GUI mode there are two basic ways:
  1. Configure Aggregate report to save results to a file. Afterwards load that file to a GUI JMeter to see aggregated results.
  2. Use a special JMeter plugin to save aggregated results to a database.
I already wrote about the second way, so today I'll write about the first one. There are probably better ways to do it, but until recently this was how I processed results:
  1. Run a test, have it save results to a file.
  2. Open that file in Agregate Report component in GUI JMeter to get aggregated results.
  3. Click "Save Table Data" to get new csv with aggregated results.
  4. Edit that new CSV to get rid of samplers I am not interested in (mostly the ones that I didn't bother to name - e.g. separate URLs that compose a page), and to also get rid of the columns I am not interested in.
  5. Sort the data in the CSV by Sampler - this is because I run many tests, and I need to compare results between runs. For that reason I create a spreadsheet and copy response times and throughput data to that spreadsheet, adding more and more columns for the table with rows labeled as samplers. Whatever, works for me.
  6. Copy results from csv to a big spreadsheet and graph the results.
At some point I used macros and regexps in Notepad++ to do stage 4. Then my laptop died and I lost it, couldn't be bothered to write it again, even though it was big help. Still, even with the macro there were a lot of manual steps just to get to meaningful results.

But hey, guess what, I've been learning stuff recently - in particular Data Science and programming in R. So I used little I know and created this little script in R to do steps 2-5 above for me.

Now all I have to do is to place JMeter output files in a folder, start R Studio (which is a free tool, and I have it anyway) (you can probably do it with pure R, no need in R Studio even), set working directory to the folder with files and run the script. Script goes through all csv files in the folder (or you can setup whatever filenames list you want in the script), and for each file:
  • Calculates Median, 90% Line and Throughput per minute for each sampler
  • Removes the samplers starting with "/" - i.e. samplers I didn't bother to give proper names, so I am probably not very interested in their individual results.
  • Removes delays (that's the thing with our scripts - we use Debug sampler usually named as "User Delay" to set up a realistic load model).
  • Orders results by sample name.
  • Saves results to a separate file.
One button and voilĂ  - all the processing done. Now I only need to copy data to my big spreadsheet and graph it as I choose.

Script is in githubfeel free to grab and use.

Sunday, 14 September 2014

Clouds for performance testing

Cloud computing has been a buzz word in IT few years ago, and now it is rapidly becoming industry standard rather than some new thing. Clouds have matured quite a lot since they got public. Now, I do not know the full history of clouds, but in the last few months I had an opportunity to work with few of them and to assess them for my very particular purpose: performance testing in the cloud. What you need for performance testing is a consistent performance (CPU, memory, disks, network) and if that's a cloud - also an opportunity to quickly and easily bring environment up and down.  

These are the clouds I tried out:
  • HP Cloud
  • Rackspace
  • AWS
  • MS Azure
Without going much into details (to avoid breaking any NDAs), lets just say, that in each cloud I deployed a multi-tiered web application using either puppet master or (in case of Azure where I only really looked at the vanilla Cassandra database, see explanation below) internal cloud tooling. Then I loaded the solution and monitored resource utilisation, response times, throughput, and I noted down any problems that got in the way.

And here’s what I’ve got.

HP Cloud. The worst cloud I’ve seen. The biggest problems I’ve encountered are the following:
  • Unstable VM hosts: two times a VM we used suddenly lost the ability to attach disks, which practically caused DB server to die, and us to lose extra day on creating and configuring a new DB server.
  • Unstable network: ping time between VMs inside the cloud would occasionally jump from 1ms to 16-30ms.
  • High steal CPU time - which means that VM would not get the requested CPU time from the host. During testing it got as high as 80% on the load generating nodes, 69% at the database server, 15% on the application servers.
There were also minor inconveniences such as:
  • It is impossible to resize a live VM: you’ll need to destroy it, and then to recreate, if you need to add RAM or CPU.
  • There is no option to get dedicated resources for a VM.
  • Latest OS versions were not available in the library of images, which means that if you need a new OS version, you’ll have to install it manually, create a customised VM image, and pay separately for each license.
  • Sometimes HP Cloud would have a maintenance that puts VMs offline for several hours.

HP Cloud was the starting point of my cloud investigation, and it was obvious we cannot use it for performance testing. So next I moved to Rackspace - another Openstack provider, more mature and powerful than HP Cloud. More expensive, as well. In Rackspace I didn’t have any problems with steal CPU time, nor with resizing VMs on the fly. It was a stable environment allowing to do benchmarking and load testing. However, it also had a bunch of problems:
  • Sometimes a newly provisioned VM wouldn’t have any network connectivity but through Rackspace web console. Far more often a new VM wouldn’t have network connectivity for a limited amount of time (2-5 minutes) after the provisioning, which caused our Puppet scripts to fail and thus caused a lot of trouble in provisioning test environments. Rackspace tech support has been aware of the issue, but they weren’t able to fix it in the time I was on a project (if they fixed it later, I wouldn’t know).
  • There were occasional spikes in the ping times up to 32 ms.
  • Hardware in Rackspace wasn’t up to our standards: CPU we got didn’t have a lot of cache, so our application would stress out CPU much more than on the hardware we used “at home”. That practically meant that to get the performance we wanted we’d need at least twice as much hardware, which was quite expensive.

After Rackspace we moved onto AWS (my colleagues did more stuff on AWS, than me, thus “we”), and we were amazed at how good it was. In AWS we didn’t have any of the problems we had in Openstack. AWS runs on good hardware (including SSD disks), allows to pay for dedicated resources (but we didn’t have to do it, because even non-dedicated VMs gave consistent results with zero steal time!), shows consistent small ping times between VMs, has a quite cool RDS service for running Amazon-managed easy-to-control relational database servers.

Yet, AWS is not cheap. So we thought we'd quickly try MS Azure to see if it can provide comparable results for a lower price. Because I was to compare Azure vs AWS in few specific performance-related areas (mostly I was interested to see CPU steal times, disk and network performance), I ran few scalability tests for the Cassandra database. Cassandra is a noSQL database, that is quite easy to install and start using. What was cool for my purposes, it has a built in performance measuring tool named cassandra-stress. It's a fast to setup and extremely easy to run test, and also Puppet just wouldn't work with Azure, so instead of the multi-tiered web application I went with Cassandra scalability test.

MS Azure wasn’t actually that bad, but it is nowhere near AWS as an environment for running high loads:
  • The biggest problem seemed to be network latency. Where AWS was doing perfectly fine, Azure had about 40% failures on timeouts on high loads. Ping times between nodes during tests were as high as 74 ms at times (compared to 0.3 ms in AWS under similar load). From time to time my SSH connection to this or that VM would break for no apparent reason.
  • Concurrently provisioning VMs from the same image is tricky: part of the resources is actually locked during VM creation, and no other thread can use it. That caused few "The operation cannot be performed at this time because a conflicting operation is underway. Please retry later.” errors when I was creating my environment.
  • Unlike AWS, Azure doesn’t allow you to use SSD, which means a lower disk IO performance. Also in Azure there are limitations on the number of IOPS you can have per storage account (though to be fair there is no practical limitation on how many storage accounts you can have in your environment). Even using RAID-0 of 8 disks didn’t allow me to reach the performance we easily had in AWS without a RAID.
  • For some reason (I am not entirely sure it was MS Azure fault) CPU usage was very uneven between the Cassandra nodes, even though the load on each node was pretty much the same.
  • I was not able to use Puppet because the special Puppet module for MS was out of sync with the Azure API.
This being said, Azure is somewhere near Rackspace (if not better) in terms of performance, and is quite easy to use. For a non-technical person who wants a VM in the cloud for personal use I’d recommend Azure.

For running performance testing in the cloud, AWS is so far the best I've seen. I also went through few of Amazon courses, and it looks to me like the best way to utilise AWS powers is to write an application that would use AWS services (such as queues and messages) for communicating between nodes.

As a summary: from my experience I would recommend to stay away from the HP Cloud, to use MS Azure for simple tasks, to use AWS for complicated time-critical tasks. And if you are a fan of Openstack - Internet says Rackspace is considered to be the most mature of the Openstack providers and to run the best hardware.

Sunday, 20 July 2014

Introvertic ramble on the trap of openspaces and office spaces in general

Hi, my name is Viktoriia, and I’m an introvert.

The weekend before last I spent two awesome days socializing with some of the best testers in New Zealand. After that I spent another three days trying to recover from all the joy. I was exhausted emotionally and physically, and had to spend full Sunday being sick and miserable because that’s how my body reacts to over-socialization - it goes to hibernate. Humans are not built to spend time in hibernate. That got me thinking…

Every day working in the office I get a bit more socialization that I would voluntarily choose to. And then when I get one little spike (like a testing conference), it becomes a butterfly that broke the cammel’s back.

Don’t get me wrong, co-location is awesome and critical for agile teams and all that. But there are also problems that come from the way we implement it (by placing everyone into these huge openspaces), and not only problems relevant to introverts exclusively:
  • The constant humming noise. Even if we forget about people who talk loud because that’s how they talk - the typing, and moving, and clicking, and talking, and whatelse is always there. Noise is stress. We even had it as a topic in school and university in Russia: even though human brain is pretty good with filtering out non-changing signals, human-produced complicated noise still makes it to do a lot of work to maintain those filters. Nervous system is always working extra hard just to save you the ability to concentrate. 
  • The cold going round. When someone is sick, everyone is sick. Someone is always sick. Sneezing and coughing never really stops. It’s like a kindergarden for IT - if you don’t have iron-made immune system, you are bound to go in and out of colds non-stop. Nothing serious, but pretty annoying. 
  • The temperature. Since we are all sharing the same space, we cannot possibly set temperature so that it’s good for everyone. For me it’s always freezing in the office. Judging from the number of people in jackets around, I guess I’m not the only one. 
  • The socialization itself. For introverts like me it’s additional stress just to be around this many people all the time. It makes it harder to concentrate, and it means that I’m always under just a little extra bit of stress. Immune system works badly when you are under stress, so that feeds into constantly being in and out of sickbay, which feeds into concentration problems again.
  • Commuting. This one applies to working from office in general, not just to openspaces. Every day so much time is being lost on getting from home to office and back. This makes roads overloaded, makes air worse, makes us all spend our precious time doing what really isn't necessary. Would be cool to free up roads for people who actually do have a good reason for being there. In IT in many cases it can be avoided - we have enough collaboration tools to go from 5 days a week working side by side to 1 day when everyone's physically in the office to align their actions and adjust plans as necessary and 4 days when everyone is where they choose to be, being online and connected via internet.
  • Multitasking. There have actually been research done* about the efficiency of office workers in different settings. It was shown that even extraverts work more efficiently and more creatively when they have a little bit of privacy (even if that’s a cubicle or a smaller room with just your team - but not the openspace). We also all know that exploratory testing recommends uninterrupted test sessions. The thing is, humans suck in multitasking. We can only really do one thing at a time. We can switch between tasks fast, that’s true, but imagine the overhead! When part of your resources is spent on ignoring the noise (I guess, headphones somewhat help, but in my experience you just get touched a lot when people want to talk to you), part on fighting the cold and part on switching between different tasks (passersby wanting to chat, for example) - you cannot possibly work at your fullest.
While having separate rooms for each team instead of openspace would make things much better, I personally would still prefer to have a choice to work from home. I found that few days when I was sick and worked from home turned out to be no less productive than an average day in the office, and most times even more productive. Always more comfortable.

It would be awesome to have an oportunity to work from home and be judged by results, not by hours in the chair. Especially since many IT companies seem to be already evaluating performance by results. Company I currently work for has a thorough system of logging and evaluating successes and results, and no one really sticks for hours as far as I know. Yet it is not a common practice to allow employees to work from home, aside from emergencies and special cases. I wish it was. One of the reasons I want to go to contracting in few years is to have an opportunity to live out of Auckland in a nice house with good internet and do all the work from there. In my book it beats both living in the center of Auckland to be near office and living outside of Auckland and spending few hours every work day on commuting. I'd rather work 9 hours from home than work 8 hours in the office and spend another hour on getting there and back.

*about research and more, there is an awesome book “Quiet: The Power of Introverts” by Susan Cain. It quotes and references quite a lot of scientific research in the area. I highly recommend it to anyone who’s interested in how people work.