O'Reilly Book Excerpts: Optimizing Oracle Performance
Optimizing Oracle Performance: A Case Study
Editor's note: Cary Millsap states in his book that the real goal he had for writing Optimizing Oracle Performance was to show Oracle DBAs and developers how to "optimize the project that makes an Oracle system go faster." In selecting this first case study from Chapter 12 ("Case Studies") to excerpt here, he shows why he's made good on this goal. Here Cary explains how Method R, which he defines as a means for making any system faster, works in real-world ways to improve the overall value of your system to your business.
If you've read the whole book to this point, then, conceptually, you're ready to do performance improvement projects in a whole new way. However, being "conceptually ready" to do something and actually committing yourself to the experience of doing it are two completely different things. Because the new Method R contradicts convention, it can be especially difficult to commit to using it. Virtually every book, paper, newsletter, web page, software tool, consultant, colleague, and friend who has ever occupied your attention before this book has given you advice about "tuning" that radically contradicts what I'm telling you to do here. In an attempt to convince you to try Method R, this chapter contains some examples to show you how it works in practice. My hope is that by showing you how my colleagues and I respond to some patterns that we commonly see in our optimization work, you'll imagine more vividly how Method R can work for you.
Think of learning Method R in the same way that you learned to speak your native language. You began by observing other people performing the act you were trying to learn. Specifically, you did not learn how to speak as a child by studying syntax diagrams and declension tables. If you studied those things at all, you did so after you had gained significant familiarity with the language. Your teachers probably forced you to study them in school. Your educational system's motive for exposing you to the underlying rules of the language was to enable you to subject your use of language to a more formal analysis that would help you communicate more effectively later in life.
I want to help you learn Method R in a similar way. In the following sections, I present a few examples of common Oracle performance improvement projects that you can follow from beginning to end. My plan is that by studying these examples, you'll quickly notice the emergence of key behavior patterns that you'll be able to copy in situations that differ from my examples. If you evolve more deeply into Oracle performance improvement as a specialty, then you'll need the syntax rules and declension tables that I provide in Part II.
The case studies in this chapter derive from a variety of sources, including:
Hotsos Profiler customers who have submitted trace files to hotsos.com for analysis
Hotsos Clinic students who have brought a trace file to class for live, in-class analysis
Client visits performed by Hotsos performance specialists
Questions and answers appearing on public news groups
All the examples are real. I do not reveal the source of each case, nor do I identify the people involved, but every case described in this chapter came to us as a legitimate pain that had, until submission to Method R, evaded solution. As you will see as you gain experience with the method, correct use of Method R leads inevitably to one of two conclusions:
You find the root cause of a performance problem, and you're able to determine how much performance improvement you should expect.
Or, you become able to prove that improving performance for the user action under analysis is not economically justifiable.
Case 1: Misled by System-Wide Data
After several months of trying to arrange a sales call into a local account in Dallas, our hotsos.com sales staff finally got a phone call on a Monday. The people with whom we'd been trying to arrange a visit had finally reached their wits' end with a performance problem, and they were going to give us a shot at fixing it. So on Wednesday, we got our chance.
The scenario was similar to ones we had seen before. The company had been fighting a performance problem with a particular program's response time for several months. All of the in-house experts had given the problem their shot over those months. None of their attempts had resulted in any appreciable performance gains. They had finally reached the level of frustration with the problem that management had decided to invest a large chunk of cash into fixing the problem. So on the weekend prior to our phone call, the company upgraded their system's CPUs. The upgrade process went successfully, and of course everyone was excited--perhaps a bit nervous--to see the improvement on Monday.
To their horror, the performance of the slow program was actually worse after the very expensive upgrade. Not just "seemed worse," was worse; measurably worse. So, on Monday we got a phone call. Our invitation said, "Come in and show us what you can do." Two days later, we got in our car and drove across town. This is what we found:
The company used the Oracle Payroll product. It was configured in a conventional way, with batch jobs running on the database server, and dozens of browser-based users scattered throughout the building on a local area network (LAN).
PYUGENprogram performance had been hurting the business for several months. When we arrived, the
PYUGENprogram--a batch job--was able to process about 27 assignments per minute. Targeted performance was twice this throughput.
The customer used an internationally renowned, full-featured performance monitoring tool that queried data from fixed views like
V$LATCH. This tool showed that the system's bottleneck was waits for the Oracle
latch freewait event. The vast majority of
latch freewaits were waits for cache buffers chains latches.
The customer understood correctly that contention for cache buffers chains latches was a likely indication of inefficient (that is, high-LIO) SQL. However, the customer's application developers had analyzed the
PYUGENprogram and found no way to reduce the SQL's LIO count.
A recent upgrade of all twelve of the system's CPUs from 700 MHz to 1 GHz had made
PYUGENperformance measurably worse. Failure of the CPU upgrade to improve performance was the "final straw" motivating the customer to invite hotsos.com staff to come onsite.
By the time we arrived onsite, the customer had already
completed the user action targeting step by identifying
PYUGEN as the system's more important performance problem. Thus, our first step with the customer was to begin the collection of
properly scoped diagnostic data. At this customer, collecting Oracle extended
SQL trace data was straightforward because user action response time consisted
PYUGEN program execution. We used
our free tool, called Sparky, to manage the extended SQL trace activation and deactivation.
The execution that we traced consumed slightly more than half an hour, producing roughly 70 MB of extended trace data. After the program completed, we executed the Hotsos Profiler upon the data, producing the resource profile shown in Example 12-1.
Example 12-1. Resource profile for an Oracle Payroll program
Response Time Component Duration # Calls Dur/Call ----------------------------- ----------------- -------------- ------------ SQL*Net message from client 984.0s 49.6% 95,161 0.010340s SQL*Net more data from client 418.8s 21.1% 3,345 0.125208s db file sequential read 279.3s 14.1% 45,084 0.006196s CPU service 248.7s 12.5% 222,760 0.001116s unaccounted-for 27.9s 1.4% latch free 23.7s 1.2% 34,695 0.000683s log file sync 1.1s 0.1% 506 0.002154s SQL*Net more data to client 0.8s 0.0% 15,982 0.000052s log file switch completion 0.3s 0.0% 3 0.093333s enqueue 0.3s 0.0% 106 0.002358s buffer busy waits 0.2s 0.0% 67 0.003284s SQL*Net message to client 0.2s 0.0% 95,161 0.000003s db file scattered read 0.0s 0.0% 2 0.005000s SQL*Net break/reset to client 0.0s 0.0% 2 0.000000s ----------------------------- ----------------- -------------- ------------ Total 1,985.3s 100.0%
The data in the resource profile came as a surprise to everyone
who had worked on the project for the past few months. From the resource
profile alone, we could already determine beyond the shadow of a doubt that
latch free were virtually irrelevant in
their influence over total
PYUGEN response time. If
the company had been completely successful in eliminating
latch free waits from their system, it would have made only about a 1% difference in the runtime of this program.
TIP: This kind of thing happens frequently in our field work: you can not detect many types of user action performance problems by examining system-wide data. The data from
V$SYSTEM_EVENT was true; it was just irrelevant to the problem at hand. You cannot extrapolate detail for a specific session from aggregated system-wide data.
V$SYSTEM_EVENT view had indicated clearly that the top wait event was
message from client, but of course every good Oracle performance analyst knows that you have to discard all the SQL*Net events because they are "idle" events.
Roughly 50% of the total
PYUGEN response time was consumed by executions of
system calls to the SQL*Net mechanism. The occurrence of
events at the top of the resource profile motivated a quick re-check of the
collected data to ensure that the prominence of this between-calls event was
not the result of data collection error. It wasn't. The
SQL*Net message from client events and their durations were distributed uniformly throughout the trace file. These wait events were
the results of thousands of database calls. When you add in the effect of the
other SQL*Net event,
SQL*Net more data from client,
we had discovered the cause for over 70% of
PYUGEN's total response time.
Diagnosis and Response
You of course cannot ignore 70% of a program's response time,
even if people do call the motivating events "idle." Idle or not, this time
was part of someone's response time, so we needed to deal with it. If we
hadn't collected our statistics so carefully (with proper time scope and
proper program scope), then we would have seen probably much more
SQL*Net message from client time in our data. If you make that particular collection error, then you must
disregard the so-called idle wait time.
The top line of the resource profile was naturally the symptom
we investigated first. Because this was a prepackaged application, we expected
that the number of database calls would be difficult for us to manipulate, so
we let our attention wander to the duration-per-call column. Here, we found a
number that looked suspiciously LAN-like (on the order of 0.010 seconds, as
described in Chapter 10), not IPC-like (on the order of 0.001 seconds or
less). So we reconfirmed that the
program had indeed run on the database server host (with the
PYUGEN process's corresponding Oracle kernel process) by checking the
V$SESSION data collected automatically
by Sparky upon collection activation (Example 12-2).
Example 12-2. The data that Sparky obtained from
V$SESSION verified that the
PYUGEN process was in fact running on the same host as its Oracle kernel process.
Oracle instance = prod (188.8.131.52.0) host = dalunix150.xyz.com (OSF1 V5.1) program = PYUGEN@dalunix150.xyz.com (TNS V1-V2) (session 611) trace file = /prod/u001/app/oracle/admin/prod/udump/ora_922341.trc line count = 1,760,351 (0 ignored, 0 oracle error) t0 = Wed Sep 12 2001 14:10:27 (388941433) t1 = Wed Sep 12 2001 14:43:32 (389139973) interval duration = 1,985.40s transactions = 672 (672 commits, 0 rollbacks)
Sure enough, the hostname reported to the right of the
@ character in
exactly matched the
Node name reported in the
preamble of the SQL trace file.
definitely run on the same host as the database server. So why would the
PYUGEN program suffer from such large
SQL*Net message from client latencies? We examined the
system's tnsnames.ora file to find out. It turns out that to conserve
system administration effort, the system's managers had decided to use a
single TNS alias system-wide. The batch jobs were using the same TCP/IP
protocol adapter as the system's browser clients were using.
It was easy to devise a strategy that was perfectly acceptable
in terms of system administrative overhead. We could add a second alias to the
existing tnsnames.ora file. The second alias would be identical to the
existing alias except that it would have a different name, and it would use
(PROTOCOL=BEQ) instead of
(PROTOCOL=TCP). The customer would shut down the Oracle
Applications Concurrent Manager and restart it, specifying the new alias that
used the bequeath protocol adapter. The new
tnsnames.ora file could be pushed out to everyone on the system without
side effect. Everyone except for the person who started the Concurrent Manager
would use the same TNS alias as before.
Before implementing this change, the customer ran a simple test.
He executed and timed a
SELECT statement that would
require a few thousand database calls from a SQL*Plus session executed on the
database server itself. He ran it once through a session established with the
old alias that used the TCP/IP protocol adapter. He then ran the statement
again through a session established with the new alias that used the bequeath protocol adapter. The test showed that using the bequeath protocol adapter reduced
SQL*Net message from client latencies to less than 0.001 seconds. We could expect to eliminate at least 40% of the program's total
response time by executing this one change alone, as shown in Figure 12-1.
Figure 12-1. We could expect that reducing the per-call latency of SQL*Net message from client events from 0.010 seconds to 0.001 seconds would eliminate more than 40% of PYUGEN's response time
We actually had reason to expect better than a 40% improvement.
The second most important contributor to
response time was another SQL*Net event called
SQL*Net more data from client. The cause of this event was a sequence of parse calls that passed excessively long SQL text strings through SQL*Net from the
client to the server (instead of using stored procedure calls to accomplish
the same thing). The long SQL text strings wouldn't fit into a single SQL*Net
packet, so the Oracle kernel spent a considerable amount of time awaiting
second and subsequent SQL*Net packets during parse calls. Of course, because
Oracle Payroll was a prepackaged application, our short-term hopes for
reducing the number of executions of this event were dim. However, we had
reason to believe that because some of the
SQL*Net more data
from client latency was network
transport, the protocol adapter change would improve the performance of this
event's executions as well.
The bottom-line results were excellent. Payroll processing
performance improved from executing 27 assignments per minute to 61
assignments per minute. The proposed tnsnames.ora change took 15
minutes to test and about a week to navigate through change control. Our whole
engagement at the client lasted less than four hours. Of this time, two hours
were consumed installing Sparky (which required a Perl upgrade on the database
server host), and a little more than half an hour was consumed by letting the
PYUGEN program run with a level-8 extended SQL
trace turned on. The remaining hour and a half contained the whole meeting,
greeting, analysis, testing, and recap activities.
Oh yes... Why did the Payroll program get slower after the CPU upgrade? Not much Payroll program
time was spent consuming CPU service, so the upgrade had very little direct
positive effect upon
PYUGEN. Most of the program's
time was spent queueing for the network. Other programs ran at the same time
as this Payroll job. The CPU upgrade made them faster,
which intensified their number of network calls (which
remained unchanged after the upgrade) into a smaller time window. The result
was increased competition for the network during the Payroll run. Therefore,
every network I/O call the Payroll program made was a little slower than
before the CPU upgrade. The degradation in network response time overwhelmed
the small direct improvement of the CPU time reduction, resulting in a net
degradation of Payroll performance...not a good thing, because this Payroll
program had a higher business priority than everything else on the system.
This case is a classic illustration of the following important points:
Don't let your
V$data tell you what your system's problem is. Your business should decide that. The real performance problem in your system is whatever it is that is causing response time problems for your business's most important user action.
You can't extrapolate detail from an aggregate. You cannot necessarily determine what's wrong with an individual program by examining only the system-wide statistics for an instance.
Capacity upgrades are a riskier performance improvement activity than many people think. Not only can they waste a lot of money by being ineffective, they can actually degrade performance for the very programs you're trying to improve.
It's nearly impossible to find and repair your performance problems by executing the old trial-and-error approach. There are just too many things that might be your performance problem. Instead of checking everything that might be causing your performance problem, it's easy enough to simply ask your targeted user actions what is causing your performance problem.
Cary Millsap is the founder and president of Method R Corporation (http://method-r.com), a company devoted to genuinely satisfying software performance. He is the former Vice President of Oracle's System Performance Group and the cofounder of Hotsos, a company dedicated to Oracle system performance.
Return to the O'Reilly Network.