So, you want to give Phat4Oracle a try ? How to use it ? What’s the best start ? I’m always following the same pattern in the search for the root cause of a performance problem.
Overview & drilldown
I’m looking at overviews and drilldowns in alternating sequence. Overview is always a chart, drilldown is a result set of a query/view most of the time. Let’s explain a little more:
First you start with an overview of your data. I always start with a chart of PHAT$CPUWAIT: column snap_time on the x-axis, and cpu_s and wait_s on the y-axis. This will not only give a lot of information, but you starting to get a ‘feeling’ of when and what is going on in your database:
* Can recognize a day/night pattern ?
* Can you spot the weekends ?
* Is there a weekly pattern ?
* How much CPU is spend in one hour on average ?
* Are those peaks equal to the maximum number of CPU seconds available ? = number of CPUs x 3600 s
* How much time is waited in one hour ?
* Where are the peaks in database usage ?
* And most important: are there exceptions to be seen in a repetitive pattern ? This might be the performance problem you are looking for. With a little luck you will be able to analyze & solve performace problems when getting no more information than “sometimes the performance sucks”.
So what you need is finding snapshots which show anomalities. What if you can’t find any ? You can’t always be lucky, you have to ask the users who are complaining about performance, when were there performance problems ? Find the corresponding snapshot id’s in PHAT$SNAPSHOT, and you can continue to the next step.
Ok, you know the ID’s of one or a few suspicious snapshots. Now it’s time to do a drilldown. My second drilldown will always be on PHAT$CPUWAIT_DETAIL. Interpreting this is nothing more than the good old YAPP. The views shows the amount of CPU and WAITS per event, you know what to do if you have studied YAPP. The first few items on top are the things to investigate, and don’t forget: one event might be the result of another one. It’s quite difficult to find the real root cause sometimes.
3: Overview again, verify
So, you have found a wait event, or maybe a lot of CPU usage at the top of your drilldown view. Now it’s time to verify if you have found something exceptional. So the next step will be one of the next two views most of the time: a CPU breakdown using PHAT$CPU, or PHAT$SYSTEM_EVENT. Make a graph of this. Does this show peaks at the moments of when performance problems were encountered ? Yes ? You are one the right track, continue to step 4. If not ? Mabe return to step 2 or 1:
* Find another snapshot to do a drilldown.
* Find another event to drilldown on.
* Nothing else to find ? Continue to step 4 anyway.
4: Drilldown again
What to drilldown to in this step depends a lot on what you have found in steps 2 and 3. Candidates can be:
* PHAT$SYSTEM_EVENT, if your instinct tells you so.
* PHAT$WAITSTAT, if you have found ‘buffer busy waits’ in step 3.
* PHAT$SEG_STAT, to find the segment having lots of i/o.
* PHAT$FILESTATXS, to find out which datafile gets to most i/o.
* PHAT$TEMPSTATXS, if you suspect a lot of TEMP usage.
* PHAT$REDOSTAT, to check out the performance on redolog writes.
* PHAT$LATCH, for investigation of latching.
* PHAT$SQL_SUMMARY_EX, search for the rotten SQL.
The last one is my #4 step most of the time. It is very important to choose a descending sort order depending on what is found in step 2. Lot of CPU usage ? Sort on CPU_TIME_MICRO. Lot of scatterd read ? Maybe sort on DISK_READS.
5: Verify again
This is a second verification. Is the suspicious segment/query/datafile/latch you have found in step 4 the correct one ? Make a chart of it. See how much time this item ‘costs’ compared to the overview from step 1.
Most of the time you keep on drilling down and verifying until come to the root cause of your performance problem:
* One or more queries having a design flaw.
* Queries running unnecessary and too often
* Queries which are suddenly using another execution plan (investigate using STATS$SQL_PLAN_USAGE)
* Not using bind variables.* Init parameter which needs a little adjustment.
* Problems in I/O, disk system is not performing as it should.
Most of the time my last step ends with PHAT$SQL_SUMMARY_EX, using both for drilldown (WHERE SNAP_ID=) and overview (WHERE HASH_VALUE=), and STATS$SQLTEXT to retrieve the SQL. After that some good old fashioned SQL tuning and analysis starts, mostly ending with a walk to the developer, grabbing a piece of wood underway.
So, it’s nothing more than alternating overview & drilldown. Use a graph for overview, run a query on a PHAT view for drilldown. Keep on repeating this until you hit the root cause. No success ? Take a step back. Keep your Oracle manual at hand for reference of wait events, system statistics, explanation of columns of V$ views which correspondent with the STATS$ and PHAT$ columns, etc. Don’t forget YAPP, have a non technical method at hand to avoid tuning until eternity.
I hope to make some examples of this all: ‘The Phat cases !’.