Sponsored by Internet - meaning that all that I've done was googled in the internet and applied with fingers crossed.
So, here it goes.
First problem was the following: when I went from 4 application server nodes to 6 application server nodes, response times high rocketed. It was obviously an infrastructural problem, and after a while Oracle was the only culprit. Unlike usually, CPU usage wasn't that high on the oracle, so I had to dig a bit deeper, and guess what I found: concurrency issues such as "cursor pin S on X"!
To avoid doing hard parses, oracle puts any new query and it's execution plan into a shared cursors tree. Access to that tree is controlled by mutex pins algorithm. Only one session can grab a mutex pin for a specific cursor at a time. Also, similar queries are being put as leaves with a common root, and my understanding is the whole root is being pinned during any updates in the tree...
Anyway, it was happening for two reasons:
1. Application under test was using queries with literals where it should've been using prepared statements.
2. My oracle version (11.2.0.1) had known issues around shared cursors tree.
So I've updated to 11.2.0.4 and set CURSOR_SHARING=FORCE. What this option does is it replaces all literals in all queries by system variables, which effectively means that all the queries that only differ in literals are now treated as the same query, they have the same cursor, and cursors tree doesn't need to be constantly updated. This took care of concurrency issues. It also created another problem.
Suddenly one of my other queries which was never a problem before, a very simple and well behaved query, became a huge bottleneck. It would take thousands of CPU cycles to execute where before it was tens of cycles! This one took days of my time, numerous experiments that slightly improved the situation but didn't solve the main issue, and in the end I had to go to DBAs for help.
Turned out that innocent "1=2" in that query (which was there because the query was dynamically generated with optional conditions) was replaced by something like ":SYS_0=:SYS_1", and that meant Oracle was grabbing those variables and evaluating the clause again and again for each row in a huge table (I would think it would do it once, understand it's FALSE and leave it at it - but no).
This was of course the result of CURSOR_SHARING=FALSE. I'll say in advance, that I got exactly the same behaviour with CURSOR_SHARING=SIMILAR.
The suggested fix in my case was either to switch to prepared statements everywhere so that we don't need to use CURSOR_SHARING=FALSE/SIMILAR, or to remove "1=2" from the query that suffered from that setting. Can't have it both.
Other useful tuning:
- Increasing shared_pool_size.
- Increasing session_cached_cursors.
- Weirdly enough, locking statistics on selected columns helped.