
Oracle Diagnostic Tools – Some Random Thoughts (Hammer To Fall) February 26, 2008

Posted by Richard Foote in Oracle Cost Based Optimizer, Oracle General, Performance Tuning, Richard's Musings.

There’s been a bit of discussion lately and some interesting opinions aired, such as those by Daniel Fink, Doug Burns, and Alex Gorbachev, regarding the whole issue of the usefulness or otherwise of various Oracle diagnostic tools (such as Statspack, AWR, ADDM, Extended Tracing, etc).

Are they really useful in diagnosing problems or do they simply provide a whole bunch of numbers that most folk can’t readily decipher ?

Boy, I could go on and on with this one …

IMHO, there’s no doubt that many (most even ?) people use many of these “diagnostic tools” incorrectly and use their view and perception of the data as some kind of justification for what really amounts to nothing more than a guess when attempting to determine the root cause and solution of a performance issue.

Look at all that time spent performing sequential reads, boy you need faster disks. Buffer Busy Waits are in your top 5 wait events, you obviously need to add more freelists somewhere. Your session’s running slow eh, well Statspack is showing enqueue wait times are really high, it’s probably that. If not, it might have something to do with those log file sync times (whatever they are).

I have no doubt about Mogens Norgaard’s assertion (as referred to by Daniel Fink) that if you take 2 people of the same skill in separate rooms and arm them with the same Statspack report, they would both come up with different suggestions on what needs to be done. I’ve lost count of the number of threads in forums that start with someone complaining about “performance”, posting a statspack report and ending up with 10 different people making 10 different suggestions.

I’ll digress somewhat, but this is a story I use to try and explain to my kids what it is I actually do.

Mum says she’s heading off to the local shops to get some milk for our breakfast cereal. The shops are close, a few minutes’ drive away, and I expect she’ll be back in 10 minutes or so. 2 hours later, she finally gets home, drops off the milk and leaves again in a huff before we can ask what happened. Kids, my job is to find out what took her so long.

It’s at this point the kids say that we don’t normally have cereal, but toast anyways and if I’m angry at mum for taking so long I should have gone and got the milk myself, but I try to get the discussion back on track …

I know there was plenty of fuel in the car, the weather was fine and clear and that the traffic generally wasn’t too busy at the time. There’s also usually plenty of milk at the shop. Now I’ve been listening to the radio and have a pretty good idea of the overall traffic and weather conditions. There was a truck that overturned nearby, resulting in some pretty major traffic delays, causing lots of folk to run late.

So I “guess” it must have been the truck that likely caused the delay, right ?

The kids suggest it was probably someone she met and she just got caught up having a long chat. Then again, maybe the car broke down or maybe she didn’t have enough money and had to go to town to get to the bank or maybe …

The point of course is that we simply don’t know. Any diagnosis based on the data at hand would be nothing but a guess. Yes, there may be some additional historical reference we could use (like she sometimes gets caught up with someone for a chat) but it would still be a guess to assume it was the issue this time.

There may be some “global” events taking place that could be a factor (such as the overturned truck, or really bad weather) but again, there’s nothing to directly suggest it’s the actual issue.

Yet how often do people, when trying to determine the cause of specific performance issues, turn to global, generalised, averaged to death statistics, charts and reports to diagnose the problem ? How often do people “hope” that because the weather looks bad, it’s a likely cause of the problem ? Or that because the amount of traffic is more than a certain threshold or slower than a certain ratio, maybe that’s the cause of the issue ?

How often do people make these (sometimes educated) guesses, get it wrong, apply a solution that makes no difference, take another stab, wrong again, yet another guess and yet another guess until eventually either they hit the right diagnosis and solution or the problem goes away (only to come back at some other point in time) …

A key issue is that many of these fancy diagnosis tools, reports and lists of statistics are used inappropriately. Each is a potential diagnostic tool but often the wrong tool for the problem at hand. A really really sharp and shiny and fancy looking saw when the problem is a “protruding nail” that needs to be hit with a hammer.

Another problem is that there’s generally no process or methodology associated with looking at these reports and charts and database wide statistics. As such they’re open to interpretation and differing views. Most databases can be improved somehow in all sorts of different ways but what is the precise reason for a specific (or general) database performance issue ? Everyone is slightly sick or imperfect in some way or another (with the possible exception of David Bowie), but what exactly is it that’s killing us …

However the key problem is that often, very often, most of these diagnostic tools, reports and flashing screens don’t actually provide the necessary information to make an educated diagnosis. The answer to the problem is simply not to be found in statspack, or in the v$ views or on the radio or in the weather report. 10 people can look at a statspack report and 10 different solutions can improve the database in 10 different ways but none of them may necessarily solve the specific database performance issue that’s causing stress to the business.

The actual issue is buried and hidden and drowned out in a sea of numbers, averages, statistics and people who are all predominately able to get their milk in a timely manner.
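To put some rough numbers on how an average can drown out the one trip that matters, here is a small Python sketch. The figures are made up purely for illustration: 99 households get their milk in the usual 10 minutes and one takes 2 hours.

```python
from statistics import mean

# Hypothetical numbers: 99 milk runs take the usual 10 minutes, one takes 120.
trip_minutes = [10] * 99 + [120]

average = mean(trip_minutes)  # the kind of figure a system-wide report shows
worst = max(trip_minutes)     # what the one unhappy household experienced

print(f"average trip: {average:.1f} min, worst trip: {worst} min")
```

The average barely moves off 10 minutes, so a report built on it would suggest all is well, even though one trip took twelve times longer than expected.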

Not that a “saw” is a totally useless tool. Jonathan Lewis recently referenced a rather nice article by Connie Green on how a saw can be used effectively for slicing through some issues. The traffic report can be a useful source of information.

However, the Oracle marketing machine has certainly been promoting many of these “shiny saws” to the point where many see them as being the tools of choice, no matter the performance issue at hand. Oracle says they provide great, “useful” information so they must be really really useful in solving my performance issues. The answer to my problems has got to be in here somewhere right, if I just know where to look ?

The problem is not necessarily with the diagnostic tools themselves but in the manner in which they’re often used and attempted to be applied to specific issues. A saw is not particularly useful at driving in a nail. You need a hammer for that …

Back to my little story.

Kids, imagine if we had a camera on mum and saw exactly what she was doing for the entire time she was away getting the milk. We could see where all the time was spent and account for every moment, including the 1 hour and 50 minutes of “lost” time.

Then we could see that in fact, the car worked fine, she took another route to the local shops bypassing the truck, got the milk at the counter straightaway. However, when she got back to the car she had a problem unlocking the car door as the key was quite bent and got to the point where she just couldn’t open the door.

In fact, out of the 2 hours, 1 hour and 50 minutes was spent frustratingly trying to open the car door.

So it was a “locking” problem all along 😉

No guesses. No assumptions. No ifs and maybes. We know exactly the root cause of the problem.

Therefore no wasted effort and time filling the car up with petrol, no need to drive the longer way around to miss that interchange, no need to demand she stop chatting at the shops (thank goodness), none of which if applied would have actually resolved the issue, none of which would have prevented the same problem from recurring next time …

And that of course is the information Extended Tracing can provide. IMHO, if only this hammer were used more often, if this tool were considered more often to knock in that “protruding nail”, if people posted an extended trace file more frequently, then it would be a big step in the right general direction in correctly diagnosing problems.

Is a 10046 event, DBMS_SUPPORT, DBMS_MONITOR, etc. perfect ? No, of course not. Although there are constant improvements with most releases, it can be difficult to set up in some environments and configurations, it can be difficult to interpret and piece together, it can tell you what the issue might be without telling you why, it may only tell you that the problem isn’t Oracle related (although that in itself can be useful), it generally requires the issue to be repeatable, it has overheads, etc. etc.

However, in most scenarios, when applied appropriately, it can provide the necessary information to diagnose the exact cause of performance issues. In most scenarios, it can take the guesswork out of the equation and save time by driving one directly to the correct diagnosis, first time.
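As a rough illustration of what “seeing where the time went” can look like, here is a minimal Python sketch that adds up the elapsed time per wait event from the WAIT lines of an extended trace. It rests on a few assumptions: the sample lines are made up, they follow the 10g-style layout (`WAIT #n: nam='...' ela= ...`, with `ela` in microseconds), and `summarize_waits` is just a hypothetical helper name. Real trace files vary between releases, so treat this as a sketch rather than a proper trace profiler.

```python
import re
from collections import defaultdict

# Matches the wait lines of an extended SQL trace, for example:
# WAIT #3: nam='db file sequential read' ela= 512 file#=4 ...
# Assumption: 10g-style layout, with ela reported in microseconds.
WAIT_LINE = re.compile(r"^WAIT #\d+: nam='(?P<event>[^']+)' ela=\s*(?P<ela>\d+)")


def summarize_waits(lines):
    """Return {event name: total elapsed microseconds} over all WAIT lines."""
    totals = defaultdict(int)
    for line in lines:
        match = WAIT_LINE.match(line)
        if match:  # PARSE, EXEC, FETCH and other lines are simply skipped
            totals[match.group("event")] += int(match.group("ela"))
    return dict(totals)


# Made-up sample lines, for illustration only.
sample_trace = [
    "PARSE #3:c=0,e=120,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1000",
    "WAIT #3: nam='db file sequential read' ela= 512 file#=4 block#=1234 blocks=1 obj#=5001 tim=2000",
    "WAIT #3: nam='enq: TX - row lock contention' ela= 2929687 name|mode=1415053318 usn<<16 | slot=131075 sequence=812 obj#=5001 tim=3000",
    "WAIT #3: nam='db file sequential read' ela= 488 file#=4 block#=1235 blocks=1 obj#=5001 tim=4000",
]

totals = summarize_waits(sample_trace)
for event, micros in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:35s} {micros / 1_000_000:10.6f} s")
```

In this made-up session nearly all of the time goes on a single row lock wait, with the disk reads a rounding error beside it. That is the camera footage rather than the traffic report.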

I’ll add this point in as well. Most people working on other software solutions trying to resolve performance issues would faint with disbelief at the level of instrumentation available in Oracle. No, it’s not perfect, but boy, things could be a lot lot worse.

My general recommendation is this. When you want to determine and diagnose the cause of specific or general database performance issues, consider extended tracing as the first tool within the toolkit. You want to know why the milk took so long, ensure you have a camera available and record what happens.

If you want to be on the lookout for low hanging fruit, if you want to have a global view of the general road conditions, of the weather, of the fuel left in the tank, and proactively see what areas in a database may benefit from some attention, then look at using “saws” such as Statspack, ADDM, etc.

IMHO, if the Oracle diagnostic tools were used more appropriately, if more people read Connie Green’s article, if more people investigated and applied the use of extended tracing, then I’m sure the perception of their usefulness would increase as well.

Meanwhile, I’m going to use a hammer to see if I can get this damn key straightened out …

BTW, the kids think I’m some kind of private investigator who follows and monitors people all day long so I guess I need to try again in explaining what it is I actually do …