jump to navigation

Oracle Diagnostic Tools – Some Random Thoughts (Hammer To Fall) February 26, 2008

Posted by Richard Foote in Oracle Cost Based Optimizer, Oracle General, Performance Tuning, Richard's Musings.
trackback

There’s been a bit of discussion lately and some interesting opinions aired, such as those by Daniel Fink , Doug Burns, and Alex Gorgachev regarding the whole issue of the usefulness or otherwise of various Oracle diagnostic tools (such as Statspack, AWR, ADDM, Extended Tracing, etc).

Are they really useful in diagnosing problems or do they simply provide a whole bunch of numbers that most folk can’t readily decipher ?

Boy, I could go on and on with this one …

IMHO, there’s no doubt that many (most even ?) people use many of these “diagnostic tools” incorrectly and use their view and perception of the data as some kind of justification for what really amounts to nothing more than a guess when attempting to determine the root cause and solution of a performance issue.

Look at all that time spent performing sequential reads, boy you need faster disks. Buffer Busy Waits are in your top 5 wait events, obviously need to add more freelists somewhere. Your session’s running slow eh, well statspack is showing enqueue waits times are really high, it’s probably that. If not, it might have something to do with those log file sync times (whatever they are).

I have no doubt about Mogens Norgaard’s assertion (as referred to by Daniel Fink) that if you take 2 people of the same skill in separate rooms and arm them with the same Statspack report, they would both come up with different suggestions on what needs to be done. I’ve lost count of the number of threads in forums that start with someone complaining about “performance”, posting a statspack report and ending up with 10 different people making 10 different suggestions.

I’ll digress somewhat, but this is a story I use to try and explain to my kids what it is I actually do.

Mum says she’s heading off to the local shops to get some milk for our breakfast cereal. The shops are close, a few minutes drive away and I expect she’ll be back in 10 minutes or so. 2 hours later, she finally gets home, drops off the milk and leaves again in a huff before we can ask what happened. Kids, my job is to find out what took her so long.

It’s at this point the kids say that we don’t normally have cereal, but toast anyways and if I’m angry at mum for taking so long I should have gone and got the milk myself, but I try to get the discussion back on track …

I know there was plenty of fuel in the car, the weather was fine and clear and that the traffic generally wasn’t too busy at the time. There’s also usually plenty of milk at the shop. Now I’ve been listening to the radio and have a pretty good idea of the overall traffic and weather conditions. There was a truck that overturned nearby, resulting in some pretty major traffic delays, causing lots of folk to run late.

So I “guess” it must have been the truck that likely caused the delay, right ?

The kids suggest it was probably someone she met and she just got caught up having a long chat. Then again, maybe the car broke down or maybe she didn’t have enough money and had to go to town to get to the bank or maybe …

The point of course is that we simply don’t know. Any diagnosis based on the data at hand would be nothing but a guess. Yes there maybe some historical additional reference we could use (like she sometimes gets caught up with someone for a chat) but it would still be a guess to assume it was the issue this time.

There maybe some “global” events taking place that could be a factor (such as the overturned truck, or really bad weather) but again, there’s nothing to directly suggest it’s the actual issue. 

Yet how often do people when trying to determine the cause of specific performance issues turn to global, generalised, averaged to death statistics, charts and reports to diagnose the problem ? How often do people “hope” that because the weather looks bad, that it’s a likely cause of the problem. Or because the amount of traffic is more than a certain threshold or slower than a certain ratio that maybe that’s the cause of the issue.

How often do people make these (sometimes educated) guesses, get it wrong, apply a solution that makes no difference, take another stab, wrong again, yet another guess and yet another guess until eventually either they hit the right diagnosis and solution or the problem goes away (only to come back at some other point in time) …

A key issue is that many of these fancy diagnosis tools, reports and lists of statistics are used inappropriately. It’s a potential diagnostic tool but often the wrong tool for the problem at hand. A really really sharp and shiny and fancy looking saw when the problem is a “protruding nail” that needs to be hit with a hammer.

Another problem is that there’s generally no process or methodology associated with looking at these reports and charts and database wide statistics. As such they’re open to interpretation and differing views. Most databases can be improved somehow in all sorts of different ways but what is the precise reason for a specific (or general) database performance issue. Everyone is slightly sick or imperfect in some way or another (with the possible exception of David Bowie), but what exactly is it that’s killing us …

However the key problem is that often, very often, most of these diagnostic tools, reports and flashing screens don’t actually provide the necessary information to make an educated diagnosis. The answer to the problem is simply not to be found in statspack, or in the v$ views or on the radio or in the weather report. 10 people can look at a statspack report and 10 different solutions can improve the database in 10 different ways but none of them may necessarily solve the specific database performance issue that’s causing stress to the business.

The actual issue is buried and hidden and drowned out in a sea of numbers, averages, statistics and people who are all predominately able to get their milk in a timely manner.

Not that a “saw” is a totally useless tool. Jonathan Lewis recently referenced a rather nice article by Connie Green on how a saw can be used effectively for slicing through some issues. The traffic report can be a useful source of information.

However, the Oracle marketing machine has certainly been promoting many of these “shiny saws” to the point where many see them as being the tools of choice, no matter the performance issue at hand. Oracle says they provide great, “useful” information so they must be really really useful in solving my performance issues. The answer to my problems has got to be in here somewhere right, if I just know where to look ?

The problem is not necessarily with the diagnostic tools themselves but in the manner in which they’re often used and attempted to be applied to specific issues. A saw is not particularly useful at driving in a nail. You need a hammer for that …

Back to my little story.

Kids, imagine if we had a camera on mum and saw exactly what she was doing for the entire time she was away getting the milk. We could actually see where all the time was spent, see exactly what happened in the 2 hours she was away. We can account for every moment and see exactly what happened during the 1 hour and 50 minutes of “lost” time.

Then we could see that in fact, the car worked fine, she took another route to the local shops bypassing the truck, got the milk at the counter straightaway. However, when she got back to the car she had a problem unlocking the car door as the key was quite bent and got to the point where she just couldn’t open the door.

In fact, out of the 2 hours, 1 hour and 50 minutes was spent frustratingly trying to open the car door.

So it was a “locking” problem all along ;)

No guesses. No assumptions. No ifs and maybes. We know exactly the root cause of the problem.

Therefore no wasted effort and time filling the car up with petrol, no need to drive the longer way around to miss that interchange, no need to demand she stop chatting at the shops (thank goodness), none of which if applied would have actually resolved the issue, none of which would have prevented the same problem from reoccurring again next time …

And that of course is the information Extended Tracing can provide. IMHO, if only this hammer were used more often, if this tool was considered more often to knock in that “protruding nail”, if people posted an extended trace file more frequently, then it would be a big step in the right general direction in correctly diagnosing problems.

Is a 10046 event, DBMS_SUPPORT, DBMS_MONITOR, etc. perfect ? No, of course not. Although there are constant improvements with most releases, it can be difficult to setup in some environments and configurations, it can be difficult to interpret and piece together, it can tell you what the issue might be without telling why, it may only tell you that the problem isn’t Oracle related (although that in itself can be useful), it requires the issue to generally be repeatable, it has overheads, etc. etc.

However, in most scenarios, when applied appropriately, it can provide the necessary information to diagnose the exact cause of performance issues. In most scenarios, it can take the guess work out of the equation and save time by driving one directly to the correct diagnosis, first time.

I’ll add this point in as well. Most people working on other software solutions trying to resolve performance issues, would faint with disbelief at the level of instrumentation available in Oracle. No it’s not perfect, but boy, things could be a lot lot worse.

My general recommendation is this. When you want to determine and diagnose the cause of specific or general database performance issues, consider extended tracing as the first tool within the toolkit. You want to know why the milk took so long, ensure you have a camera available and record what happens.

If you want to be on the lookout for low hanging fruit, if you want to have a global view of the general road conditions, of the weather, of the fuel left in the tank, and proactively see what areas in a database may benefit from some attention, then look at using “saws” such as Statspack, ADDM, etc.

IMHO, if the Oracle diagnostic tools were used more appropriately, if more people read Connie Green’s article, if more people investigated and applied the use of extended tracing, then I’m sure the perception of their usefulness would increase as well.

Meanwhile, I’m going to use a hammer to see if I can get this damn key straightened out …

BTW, the kids think I’m some kind of private investigator who follows and monitors people all day long so I guess I need to try again in explaining what it is I actually do …

About these ads

Comments»

1. Donald K. Burleson - February 26, 2008

Hi Richard,

>> I have no doubt about Mogens Norgaard’s assertion (as referred to by Daniel Fink) that if you take 2 people of the same skill in separate rooms and arm them with the same Statspack report, they would both come up with different suggestions on what needs to be done.

I disagree! Granted, no single report provides enough information, but well-trained, degreed engineers who take multiple SP reports rarely miss the root cause of a performance problem.

2. Richard Foote - February 26, 2008

Hi Don

What if the root cause isn’t listed in any of the multiple SP reports ?

What if the specific issue that’s impacting key users in the finance area isn’t one of the top x SQL statements, isn’t a result of one of the top 10 database wait events, doesn’t cause a significant percentage of any latch or sort or LIO.

Don, what’s causing the finance area to have such slow performance ?

3. joel garry - February 26, 2008

It is quite common for car keys to have embedded microchips these days, watch it with that hammer! Kinda like MMON bugs hammering the system…

The finance area is performing poorly because they are using T-SQL through ODBC instead of Oracle analytics, of course. :-)

Last night’s Terminator: Sarah Connor Chronicles had mum smashing the monitoring camera with a piece of the hospital bed…

ObBowie: We are the goon squad and we’re coming to town, beep-beep!

4. Marco Gralike - February 26, 2008

Extended Tracing is also just a tool.

As long as the “human” factor remains, you are bound for errors.

5. Tony - February 26, 2008

Having just finished a three day HOTSOS seminar dedicated to extended trace file contents, I found this posting especially good. Trying to infer issues from aggregate data like Stats Pack is misleading and using ADDM or AWR misses data ( for some reason some of the important waits are considered Idle events). The use of extended trace data allows for consistent, repeatable, trainable and targeted issue resolution. If this isn’t the goal of problem resolution shouldn’t it be?

6. Gary - February 26, 2008

Extended tracing has drawbacks too. Mostly, you need to be able to put it in place BEFORE the problem occurs, but you don’t want it on constantly. So you can only use it when you have a situation that you can repeat the problem.
If you are in the throes of a problem, especially an occasional one that happens at unpredicatable times, then the V$ views may be a better first call. And after the problem has died down, Statspack is often the only ‘forensic’ evidence of what happened.
The first talent is being able to recognise which tool is appropriate for the situation at hand. The screw requires a screwdriver, the nail needs a hammer and sometimes you get to use the saw.

7. Noons - February 27, 2008

“The screw requires a screwdriver, the nail needs a hammer and sometimes you get to use the saw.”

Absolutely. And first, we need to make sure what we’re gawkin at is a screw, not a nail.

“Bowie” is IMHO absolutely right in his assertion that we need to first find out WTH is it that we’re dealing with!

As I pointed out in Doug’s blog, it’s easy to provide an example of a common, day to day bad performance instance that is as simple as pie to fix, yet can sidetrack all these you-beaut tools for days on end.

All because folks will insist on putting on the goggles of such tools instead of stepping back and finding out what is really happening!

10046 traces are great, but first let’s make sure if the problem isn’t just bad timing in the execution of a statement.

8. Strumenti di analisi delle prestazioni in Oracle « Oracle and other - February 27, 2008

[…] innescata da un post di Daniel Fink cui ha replicato Alex Gorbachev, poi Doug Burns e poi Richard Foote che poi è quello da cui ho in realtà iniziato la […]

9. Chris Adkin - February 27, 2008

Nice Posting Richard,

I’ve come across the lots of postings from the expert Oracle community before I came across your article. “The Problem is Skew” from Cary Millsap, another article from Guy Harrison also presented at OOW2007. Although AWR ADDM are not perfect, the problem with externded tracing is its usefulness after the performance issue has ocurred, whereas ADDM and AWR are always on, unless you don’t have the diagnostic and tuning options. You could counter this by saying that the problem should be reproduced on a copy of a production system. However, there are lots of production databases out there for which there is insufficient tin at a lot of IT shops to do this. Also, if Oracle produced flawless dianostic tools, the output of which could be understood without any ambiguity and required little intelligence or effort to interpret and make recommendations from:-

a) A lot of tuning specialists would be out of work

b) DBAing would be a really boring job

So is the fact that ADDM and AWR has it’s limitations is not infallible does have it’s upsides.

10. cristiancudizio - February 27, 2008

it’s all correct, but remember one thing: on my 10.2.0.3 i count 878 wait events but they are not enought. I would say that maybe always a problem hidden an in such case what remains is “guesswork”

regards,
Cristian

11. Richard Foote - February 27, 2008

Hi Joel

Bugger, I knew I was in trouble when the little battery fell out ;)

12. Richard Foote - February 27, 2008

Hi Marco

Absolutely, we humans make errors all the time.

However, let’s give ourselves a bit of a chance and begin to use some of the tools appropriately.

13. Richard Foote - February 27, 2008

Hi Tony

Exactly !!

A little less guessing and a little more accuracy.

14. Richard Foote - February 27, 2008

Hi Gary

Absolutely and I made mention of some of these drawbacks. No, extended tracing isn’t perfect, far from it, but in many many scenarios, simply tracing a session would yield better, more accurate diagnostic information than the more commonly used diagnostic tools than many many people would ordinarily attempt to use.

That’s my basic point.

15. Richard Foote - February 27, 2008

Hi Noons

I was going to mention a screwdriver, but I thought I had the tool metaphor pretty well covered ;)

Yes, the good old explain plan is often all that’s needed. However, sometimes you need to trace the session to cature the SQL statements that are ultimately causing the issue.

Again, the problem with something like statspack for example, is that how do you determine what the problematic statements are if they use relatively few resources in the database wide scheme of things.

How do you guarantee that the user complaining on the phone is impacted by any of the database wide metrics laid out before you.

However, your point is a good one. Sit back, deep breath, gulp of coffee, what the heck should I do … ;)

16. Richard Foote - February 27, 2008

Thanks Chris.

I agree, if it all got too easy and simple, it would all be just too boring and unchallenging !!

It’ll be intertesting to know what percentage of information in (say) a 10g statspack report an average DBA would actually understand …

17. Richard Foote - February 27, 2008

Hi Critsian

I agree. However, I would suggest that the average “guess hit ratio” would be significantly higher with an extended trace than with say an ADDM report.

Not 100% but much much better on average than without it.

18. Doug Burns - February 28, 2008

There’s so much to agree with here so, as usual, I’ll try to stick to a couple of things I differ on.

Tony said …

ADDM or AWR misses data ( for some reason some of the important waits are considered Idle events).

I think it’s fair to say that Extended Tracing misses data too. Two small examples :-

a) Uninstrumented code (and there is some)
b) Unaccounted-for time as processes spend time in the run queue. I could easily show you a trace file where the majority of the time is unaccounted-for.

Now that’s not to say I don’t use tracing frequently, I have praised it to the heavens at times and “Optimizing Oracle Performance” discusses these issues, but I worry when anyone makes a method sound *almost* infallible. I think the most important thing that extended tracing gives us above Statspack is the session focus over a system-wide focus but ASH gives us that, too, and I don’t need to be able to decide the session in advance. It’s just a different angle on the same approach and I’d be wary of discounting it too quickly. It has strengths that Extended Tracing doesn’t, quite a few of them mentioned in other comments here.

Oh, and a trace file can be a much better source of information about what the problem is than why it’s occurring, but that’s true of all these tools. However, to give one example, can you be sure that although you can see why the key session is waiting that the session which is the source of the bottleneck isn’t one that you’re not tracing? Or are you going to trace every session?

What about the diagnosing the problem after it’s gone? Are you going to trace every session until the problem re-occurs? Statspack and AWR have their place and I get very worried when people discount them.

Seriously, I’ve probably used tracing more than any other tool, but I’d hate to be unquestioning about it.

Richard said …

It’ll be intertesting to know what percentage of information in (say) a 10g statspack report an average DBA would actually understand …

Ha-ha! Great question, great posting and comment thread. I’ve posted a link to it in the comments on my blog.

Like others here, I suppose I think all tools have their place and it’s about picking the most appropriate tool for the situation and then applying some skill and common sense. Sadly – no short-cuts.

Cheers. Doug

19. Doug Burns - February 28, 2008

I knew I’d forgotten something ….

Richard said …

if you take 2 people of the same skill in separate rooms and arm them with the same Statspack report, they would both come up with different suggestions on what needs to be done.

Don’t you think that’s true of extended trace files too? There would possibly be less variation, but I’m convinced it would still exist.

20. Richard Foote - February 28, 2008

Hi Doug

Thanks for all your considered comments.

If I was to summarise my thoughts on this I would say that in approximately 90% of situations, an appropriately collected trace would be both feasible and would provide the information to both determine and resolve a performance issue at the “first” attempt.

I would suggest the same would be true in about approximately 10% of situations relying on statspack or AWR.

However, I would say 90% attempts at diagnosing a problem would typically rely on the use of statspack / AWR etc.

Note in many many cases, using say statspack to diagnose a problem after it’s gone is going to be just as useless as not having a trace available because the information required is simply “hidden” in the report. If 99% of processes are fine, determining what’s causing the 1% of problems is not going to be possible if you’re looking at 100% of database statistics over that period.

Sure the percentages I quote may not be spot on, but I just don’t believe extended tracing is utilised in proportion to how useful it is as the primary basis of diagnosing performance issues.

I would suggest again, if you had 2 people of the same skill with the same statspack report, in 90%+ of cases they would come up with differing opinions.

Take 2 people with the same skills with the same extended trace file, in 90%+ of cases they would come up the the same opinions.

Take 2 people, 1 with statspack, 1 with the extended trace file and in 90%+ of cases, the person with the extended trace file would come up with the correct diagnosis faster and more effectively than the other.

21. Daniel Fink - February 28, 2008

Great posting and comment thread…

There is no doubt that the “perfect” diagnostic tool does not exist….and likely never will. Every single tool, method has flaws and it is absolutely critical that we know the flaws. Knowing that Statspack can see some conditions, but not others and Extended SQL Trace can see other conditions, but not a 3rd set enables us to use the tools (and our knowledge of the system/application) together to narrow our focus.

I have been very critical of Oracle tools/methods of late…mainly to generate talk about our current state of diagnosis and optimization. I hope that this leads to better tools, more complete/repeatable methods and overall better systems/applications/databases.

Criticism aside…at least we have some data, some tools, some knowledge at the Oracle level. How many other systems/applications do we all deal with that have almost no useful instrumentation?

The current data/tools/instrumentation are the worst thing around…except for not having any!

22. Richard Foote - February 29, 2008

Hi Daniel

I totally agree with you. Hopefully discussions such as these will help some people to consider these flaws more carefully when diagnosing an issue.

To use another favourite analogy of mine, the life of a DBA is not unlike that of a pilot. Both are generally highly trained professionals. Both spend a lot of time letting the plane / database look after itself, as it does what it does on “auto pilot” without any problems day after day after day.

Then occasionally, something goes wrong, sometimes things go really wrong with “alarms” flashing and with the plane / database in serious trouble.

It’s at those crtitical times when an airline / organisation, when the passengers / end business users really appreciate (or curse) the true quality of their pilot / DBA.

Of course, the DBA has it much harder than the pilot as the DBA is expected to not only fly the plane, but tune the engines and ensure the plane is in satisfactory working condition as well.

And we generally don’t get the fancy uniforms either …

23. Doug Burns - February 29, 2008

we generally don’t get the fancy uniforms either …

The campaign starts here!

24. Richard Foote - February 29, 2008

Ha ha !!

I wanted to be a pilot as a kid but my fear of heights and flying in general made such ambitions difficult to achieve :)

25. Doug Burns - March 1, 2008

I grew up on RAF camps so had my own fighter pilot fantasies. Sadly, leaving school at 16 is not the recommended route.

Still, imagine if I’d been a jet pilot and had missed out on being a DBA? What a tragedy that would have been. That’s why I’m also glad I wasn’t lured into being a famous rock star.

26. Richard Foote - March 4, 2008

Hi Doug

This is sad but true. My full name is Richard Anthony Foote. With those initials, guess where my father worked when I was born ;)

27. Robert - March 4, 2008

How true! However, unfortunately I seem to always end up working in environments which have a single property that makes tracing very hard: connection pooling. :-)

As Millsap & Holt say, the crucial bit is to get the scoping for trace data collection right – which is extremely hard with connection pooling. Did anybody find a better solution than to trace all active sessions for a while? Even then you might risk that the statement you want to look at goes to a newly created connection. So you would have to create a logon trigger plus enable tracing of all open sessions of a user, revert this after the test and sift through tons of trace data. Is there a better way?

28. Graham - March 5, 2008

Well I rarely respond to blogs, and I can’t even remember how I got to this one, but the subject matter seemed worthy of comment.
To start with Richard’s story. So mum was locked out of the car. Well assuming that we were monitoring she was doing what would we see? We would see her waiting for a lock (or actually an unlock). With extended SQL trace, we would see that wait. We would not however see why she was locked out. Richard stated that the key was bent, which was why she was delayed. However from the trace data, that would be a guess, we now that she waited for an unlock but not why. Other, equally valid guesses would be that she dropped the keys in the store and spent the time searching for them, her bag containing the keys may have been stolen and she was lucky enough that the thief was caught by the police and she got the bag back. We wouldn’t be able to tell the difference between these different reasons for being locked, or what all of the other parties did in the intervening time.
As Doug has mentioned above the ASH data that we have in 10g is very powerful. As it is on all of the time, for all of the sessions you have the data to diagnose performance problems the very first time they happen, even if you don’t happen to be watching the system at the time. Locks are actually a very powerful use case for ASH. When a session is blocked we record the blocking session , and of course we have a record of what that other session, or sessions are doing as well.
When I’m waiting for a lock, is it help by a batch job that has been busily processing while I have been waiting, or is it help by a user session that has been inactive for the last hour (lunchtime perhaps). These are two very different cases and the action of the DBA is probably different. With SQL trace, even if we happen to be tracing the blocked session at the correct time we can’t tell who the blocker is, or what they are doing.
It is true that ASH does not record every operation. But so far I have yet to see an issue that it couldn’t get me to at least the same place that SQL trace would have reached when analyzed by a competent practitioner.
And as for the analysis of AWR data, well ADDM does it in a completely repeatable manner and reaches the same conclusions every time. It isn’t just doing the analysis of the instance level data, it is also looking at the ASH data to give detailed analysis and often recommendations for solution based on the actual issue being seen. Not just a simple ‘add more freelists for buffer busy waits’ but recommendations based on what SQL statements (insert,update) and whether the contention is on the same or different blocks. The recommendations may be simple, but the diagnostic method is often far from trivial.
So if Statspack is a saw and Extended SQL trace is a hammer, the diagnostic data that is in 10g is a Swiss Army knife, and we are still learning to use the other 40 of its 42 uses.

29. Doug Burns - March 5, 2008

To Robert …

Is there a better way?

Two alternative answers

a) No, I’m not sure there is a better way. If there is, I’d like to know about it too because dealing with connection pooling is a constant pain these days; or

b) ASH. I can confirm that it’s ideal for this type of tricky situation but, as always, it won’t record every single event as extended tracing would.

To Graham …

I hope you don’t think that leaving your surname off your comment is going to obscure your intimate relationship to the design of ASH and ADDM? LOL

At least no-one can accuse you of being an Oracle marketeer, determined to shove 10g ‘Automatic Tuning’ features down our throats ;-)

30. Richard Foote - March 5, 2008

Robert, thanks, good points which I kinda referred to in my post. The problem of scope and making it repeatable are the two main issues. The 90-10 mileage may vary in different environments.

Graham, thanks very much for your input, please post more often :)

I agree with you that in my example, ASH may have been useful but I was determined to put my pun in the example !! The key point I was trying to make though is that the “Swiss Army Knife” that is the diagnostic data in Oracle has “42” differing little devices (btw, I never did work out what the pointed one with the little hole was for) that has to be sorted through, redundant information ignored where possible, missing information determined where possible and the “specific” issue, specifically diagnosed. These fancy knives can also be quite expensive too although that’s a separate argument entirely.

Yes tracing may only show up as just an enqueue wait in my example, but at least you know it’s definitely a locking issue and although as I mentioned it might be a case of “telling you what’s at issue without the why”, one’s knowledge of the application, the environment, etc. can fill in missing details in some cases, if necessary.

30 comments and still going strong. At least its got people thinking about it all, which is always a good thing.

31. Robert - March 5, 2008

Richard,

btw, I never did work out what the pointed one with the little hole was for

Do you mean the awl? http://en.wikipedia.org/wiki/Stitching_awl

Since it’s an army knife you can repair your boots with it – or so… :-)

Note: it’s a myth that an awl will also work for boot records.

32. Graham - March 6, 2008

Richard,
The data is the Swiss Army Knife, but the only tools that you need are ADDM at the instance level, and the the ASH report at the session level cf a Statspack deterministic analysis and tkprof/Hotsos profiler output.
Yes, there is a cost, but they are good value ;-).
And back to your analogy of the bent key, one common way of avoiding having to watch mum yourself all the time is to use a security camera, and guess what, it is sampling data. Movies at 30 frames a seconds are sampling data. Does it matter? Not to my eyes. If something happens once and last less than a second we may not sample it, but then again I have never been asked to tune something that happened one and took less than a second (or if I was I politely declined) If it is short duration and happens a lot we get plenty of samples. If it is long duration we get plenty of samples. I fact if it takes a sizable amount of DB time in executing we may be interested in tuning it, and we get plenty of samples. If it doesn’t take much DB time, we don’t have collection and storage overhead. How convenient. But DB time based tuning is another topic.

33. Richard Foote - March 6, 2008

Hi Robert, I just knew someone would tell me if I asked ;)

Hi Graham, thanks again for your perspective.

34. Paul - March 9, 2008

Along with the other tools mentioned here, I am a fan of the 10g OEM Top Activity screen, which I think is renamed “Average Active Sessions” in 11g. It’s a very useful feature, lots of good drilldown, great for drilling down on CPU users/abusers, good for viewing changes over time. etc. But for really high-execute count sessions, in particular sessions that execute many different sql statements many thousands of times in a short while ( complex pl/sql block usually ), Top Activity almost always shows the wrong sql statement as being the statement that is being executed. I say “wrong” because if I trace one of these sessions for 10 or 15 minutes then sort the tkprof by fchela exeela prsela, the statement that is visible in the Top Activity screen is almost never in the top 3 or 4 sal statements. I say “almost always” because I suppose it’s possible there might be cases where the correct statement is shown for one of the really busy sessions I observe from time to time. “select user from dual” is a query that I see with some frequency from certain kinds of sessions in Top Activity, however that statement never appears anywhere near the top of the sorted tkprof report.

A couple of thoughts – 1) I wonder if Oracle really bothers updating gv$session for every single statement from such a session – maybe it’s not that expensive to update that one row in gv$session and 2) if OEM Top Activity comes from v$active_session_history, why doesn’t it show a more realistic mix of sql for high-execute-count sessions? I’m no longer disturbed by the inaccuracy, I now know that I have to look at execute count for a problem session, if executions are happening at a high rate, then I might as well just go ahead and trace the session because I won’t be able to find the problem queries using OEM at the session level.

On the other hand from OEM I’ll get a nice graphic session timeline that I assume is useful in that I’m not aware of any distortions re session events and CPU, and drilldown on session statistics provides some feedback that is not present in the trace file. It’s particularly nice to see the stacked graphs of cpu use showing the top CPU consuming sessions, which might if you’re lucky show up in AWR reports and definitely will not show up in 10046 traces.

35. Alex Gorbachev - March 24, 2008

Richard,

Excellent points in the original post as well as comments and follow-ups. Thanks for accommodating it on your blog. :)

The life is full of tradeoffs. Extended tracing has predefined granularity based on what code is instrumented. It might be too much of overhead to keep it on all the time so ASH seems to be a great compromise.

But what about moving further down in terms of granularity of of collected data? Extended tracing doesn’t get to the details of LIO. There is an event to trace consistent reads but it’s not completely LIO and it’s missing the most important component – LIO elapsed time. Why extended trace doesn’t provide LIO tracing as well as other CPU intensive code paths – it’s probably too expensive for now. Again – tradeoff.

Going further, we have identified that db file sequential read is slow (well, we kind of know that 30ms is too long). What can we do? We can try to analyze database-wide IO load to see if we are killing the disks. We can ask storage admin about physical layout and what else can impact us but this is already an area of guesswork. Why would we go there? There is simply no way to trace this 8K block request from Oracle instance through OS layers, FC controllers, FC switch, front-end and back-end processors on SAN and then to physical spindle. Wouldn’t it be nice? I bet it would be unbelievably cool if we could have done it and with low overhead.

Unfortunately, it’s easy to say – no, it’s too expensive to trace physical IO this way until one bright mind come up with an idea how to find a compromise and collect this data with certain precision.

Oh! I’m dreaming of being able to turn IO tracing on just like 10046 and then maybe even have IOH (IO history) like ASH in Oracle.

36. Richard Foote - March 24, 2008

Hi Alex

Wouldn’t it be nice indeed !!

37. The Business of Business is Business « Julian Dontcheff's Database Blog - July 12, 2011

[…] etc. but one small bug can cause a so severe performance problem that all your efforts and diagnostic tools count for […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,913 other followers

%d bloggers like this: