Bentley's Rules
Jon Bentley, a
distinguished computer scientist known for many publications,
published Writing Efficient Programs. In this out-of-print
classic, Bentley provides a unified, pragmatic treatment of program
efficiency, independent of language and host platform.
For ease of presentation, Bentley codified his methods as a set of
terse rules. The following is the list of Bentley's rules, edited from
Appendix C of Writing Efficient Programs and adapted to suit the
special needs and opportunities of the Origin 2000 architecture, SGI
version 7.x compilers, and the MIPS instruction set.
The time required for common operations on data can often be reduced
by augmenting the structure with additional information, or by changing
the information within the structure so it can be accessed more easily.
Examples:
-
Reference counters facilitate garbage collection.
-
A "hint" containing a likely, but possibly incorrect, answer.
For the Origin 2000, with a 100:1 ratio between instruction time and
memory-fetch time, it is more important to minimize secondary cache
misses than to minimize instruction execution. Special applications of
this rule include:
-
Reorganize data structures to make sure that the most-used fields fit
in a single, 128-byte, cache line, and less-used fields fall into
separate cache lines. (This of course presumes that each structure is
128-byte aligned in memory.)
In one example cited in
a paper
given at Supercomputing '96, most of the cache misses of one
program were caused by "a search loop that accesses a four-byte
integer field in each element of an array of structures...By storing
the integer items in a separate, shadow array, the secondary cache miss
rate of the entire application is reduced."
-
Reorganize structures so that a structure (or cache line) that can be
updated is used by at most one CPU at a time. The alternative, when
multiple CPUs share a cache line that can be updated, requires you to
use mutual exclusion, but in addition, when one CPU updates a cache
line, all the others must reload the data from memory.
The cost of recomputing an expensive function can be reduced by
computing the function only once and storing the results. Subsequent
requests for the function are handled by table lookup.
Precomputation is almost certainly a win when the replaced computation
involves disk access, or an IRIX system call, or even a substantial
libc function. Before precomputing an ordinary calculation, consider
that if retrieval of the precomputed result can lead to an additional
cache miss, the replaced computation must cost at least 100
instructions to break even.
Data that is accessed most often should be the cheapest to access.
However, caching can backfire when locality is not in fact present.
Examples:
-
The move-to-front heuristic for sequential list management is the
classic example.
-
In implementing a dictionary, keep the most common words in memory.
-
Database systems cache the most-recently used tuples as a matter of
course.
The MIPS R10000, Origin 2000 node, and IRIX kernel already cache at a
number of levels:
-
Primary (on-chip) caches for instructions and data.
-
Secondary (on-board) cache for recently-used memory lines of 128 bytes.
-
Virtual memory system keeps most-recently used pages of 16KB and up,
and also migrates pages to the nodes where they are most often
referenced.
-
Filesystems (XFS and NFS independently) cache megabytes of
recently-used records.
However, all these standard caches are shared by multiple processes
(from the standpoint of your program, the other processes are
"diluting" the cache with their data, driving out your data).
Also, system cache management knows nothing about your algorithms.
Application-dependent caching can yield huge payoffs, especially when
it avoids disk access.
Never evaluate an item until it is needed. Examples:
-
In building a table of Fibonacci numbers (or any other function), only
populate the table with numbers actually requested, as they are
requested.
-
Brian Kernighan reduced the run time of [the original troff] by twenty
percent by calculating the width of the current line when needed,
rather than after each character.
Dense storage representations can decrease storage costs by
increasing the time to store and retrieve data. Typical examples
include packing multiple binary integers into 32- or 64-bit words.
The memory complement of a typical Origin 2000 server is large enough
that you might think such bit-squeezing techniques should be relics of
the past. However, is file access time a significant factor in your
program's speed? (Use prof(1) to find out.) If the file format
can be squeezed 50%, sequential access time is cut by that much, and
direct access time is reduced due to shorter seek distances. Also, if a
key dataset does not fit into cache, and by packing a data structure
you can get all of it into cache, the extra instruction cycles to
unpack each item are repaid many times over.
The space required to represent a program can often be decreased by
the use of interpreters in which common sequences of operations are
represented compactly. A typical example is the use of a
finite-state machine to encode a complex protocol or lexical format
into a small table.
Finite-state machines, decision tables, and token interpreters can all
yield elegant, maintainable algorithm designs. However, keep in mind
that the coding techniques you use to implement an interpreter, in
particular computed gotos and vector tables, tend to defeat the
compiler's code manipulations. Don't expect the compiler to optimize or
parallelize the interpreter's central loop.
Many of these loop-transformation rules have been incorporated into the
SGI 7.x compilers. This makes it easier to apply them in controlled
ways using compiler options.
An invariant expression (one whose value does not depend on the loop
variable) should be calculated once, outside the loop, rather than
iteratively. But keep in mind:
-
Generally, the compiler is good at recognizing invariant expressions
that are evident in the code. Place expressions where it is most
natural to read or write them, and let the compiler move them for you.
Example:
for (i=0;i<Z;++i) { if (x[i]<(fact/div))...; }
At nonzero optimization levels the compiler will recognize fact/div
as invariant, and move it to a generated temporary in the loop
prologue. You should leave the code in this naive but readable form,
and not clutter it with yet another scratch variable.
-
Generally the compiler cannot recognize when a call to a
user function is invariant.
for (i=0;i<Z;++i) { if (x[i]<func(fact,div))...; }
When func(fact,div) does not depend on i, you
should move the call out of the loop:
for (i=0,t=func(fact,div);i<Z;++i) { if (x[i]<t)...; }
An exception is when you request function inlining. After the compiler
inlines the function body, it may then recognize that some or all of
the inlined code is invariant, and move those parts out of the loop.
An efficient inner loop should contain as few tests as possible, and
preferably only one. Try to combine exit conditions. Examples:
-
Add a sentinel value to the last item of an unsorted vector to avoid
the test for last index.
-
As a special case, use a sentinel byte value to signal the end of a
string of bytes, avoiding the test for last-byte-index.
This optimization can be especially helpful with the R10000. The R10000
can pipeline a single conditional branch, executing speculatively along
the most likely arm. However, two conditional branches in quick
succession can interfere with the pipeline and cause the CPU to skip
cycles until the branch condition values are available.
A large cost of some short loops is in modifying loop indexes. That
cost can often be reduced by unrolling the loop.
This optimization can bring large benefits, especially in a superscalar
CPU like the R10000. However, loop unrolling is a touchy, error-prone
operation to carry out by hand, and the resulting code is hard to read
and to maintain. Fortunately, the 7.x compilers have extensive support
for loop unrolling. You can control the amount of unrolling either with
a compiler option or loop-by-loop with directives. The compiler takes
care of the niggling details and the bookkeeping involved in handling
end-of-loop, while the source code remains readable.
A benefit of compiler-controlled loop unrollling is that the compiler
applies the optimizations of common subexpression elimination, constant
propogation, code motion out of loops, and function call inlining both
before and after unrolling the loop. The optimizations work together
for a synergistic effect. The software-pipelining optimization
(rearranging the order of the generated machine instructions in order
to maximize use of the R10000 execution units) also becomes more
effective when it has a longer loop body to work on.
When a large cost of an inner loop is devoted to trivial assignments
(other than maintenance of loop indices), the assignments can be
reduced by duplicating the loop body and renaming the variables in the
second copy.
As with simple loop unrolling, this important optimization is
well-handled by the compiler. Leave your code simple and readable and
let the compiler do this.
A fast loop should have no unconditional branches. An unconditional
branch at the end of a loop can be handled by "rotating" the
loop to put a conditional branch at the bottom.
This optimization might be needed in assembly-language code, but no
modern compiler should generate code for a loop with the test at the
top and an unconditional branch at the bottom.
If two nearby loops operate on the same set of elements, combine
their operational parts and use only one set of loop-control operations.
Example: convert
for (i=0;i<n;i++)
for (j=0;j<n;j++)
b[i][j] = a[i][j];
for (i=0;i<n;i++)
for (j=0;j<n;j++)
c[i][j] = b[i][j];
into the single loop
for (i=0;i<n;i++)
for (j=0;j<n;j++)
b[i][j] = a[i][j];
c[i][j] = b[i][j];
This technique can produce important benefits in the Origin 2000
because, first, the contents of array b are only passed through the
cache once, and second, the instructions to load each element of b
(from the second loop) are eliminated. However, it is not necessary for
you to perform this optimization manually. The 7.x compilers perform
loop fusion automatically at certain optimization levels. The compilers
can also perform the following optimizations:
-
Loop interchange -- exchanging the induction variables of inner and
outer nested loops so as to pass over memory consecutively.
-
Cache blocking -- breaking a single loop nest into a sequence of
smaller nests that operate on blocks of data that each fit in secondary
cache.
-
Loop fission -- when two inner loops are controlled by a single outer
loop, duplicating the outer loop and creating two simple loop nests
that can then be interchanged, blocked, or unrolled.
Each of these changes, if done manually, complicates the source code
with many temporary variables and insanely complicated loop control. In
general, write loop code in the simplest, most portable and
maintainable fashion, and then let the compiler tangle it behind the
scenes.
In a conditional expression, replace a costly expression with an
algebraically equivalent expression that is cheaper to evaluate.
Examples:
-
Not (sqr(X)>0) but (X!=0). If this also
allows you to avoid calculating sqr(X) unless it is
needed, it is an instance of Lazy Evaluation.
-
Not (not(A) .and. not(B)) but not(A .or. B)
(DeMorgan's Law).
Any such replacements that require insight are worth doing. Do not
spend time replacing simple manifest expressions with constants; the
compiler is good at that, and good at recognizing common subexpressions.
When a monotone function of several variables is to be tested for a
threshhold value, break off the test as soon as the result is known.
Short-circuit evaluation of a Boolean expression is a well-known case
(handled automatically by the compiler). Bentley gives a more
interesting example. Given a function to find the point in a list that
is nearest to a given point:
typedef struct point_s { double X, Y; } point_t;
#include <math.h>
int nearest_XY(int n, point_t given, point_t points[])
{
int j, close_n=-1;
double thisDist, closeDist=MAXFLOAT;
for (j=0; j<n; ++j)
{
thisDist = sqr(points[j].X-given.X) + sqr(points[j].Y-given.Y);
if (thisDist < closeDist)
{
closeDist = thisDist;
close_n = j;
}
}
return close_n;
}
This function as it stands short-circuits the distance computation (for
purposes of comparing two distances, it is sufficient to compare the
sum of squares and skip doing the square root). However, in many cases
the X-distance alone is enough to eliminate a point, which avoids the
need for the second multiply and an addition.
int nearest_XZ(int n, point_t given, point_t points[])
{
int j, close_n=-1;
double thisDist, closeDist=MAXFLOAT;
for (j=0; j<n; ++j)
{
thisDist = sqr(points[j].X-given.X);
if (thisDist < closeDist) /* possible */
{
thisDist += sqr(points[j].Y-given.Y);
if (thisDist < closeDist)
{
closeDist = thisDist;
close_n = j;
}
}
}
return close_n;
}
Logical tests should be arranged such that cheap tests precede
expensive ones, and so that ones most likely to succeed precede ones
more likely to fail.
In the absence of other information, the SGI compilers assume that the then
part of an if is more likely than the else, so it is
good practice to put the most likely code in the then part.
It is possible to run the program under speedshop (experiment
type ideal), producing a file of exact execution counts. This
file can be filtered through prof (format -feedback) and
the resulting file can be fed back into the next compilation. The
compiler will take branch execution counts from the feedback file, and
create an instruction schedule that reflects the exact probabilities of
each branch.
This is a convenient feature that can have good results on a
poorly-written program. However, the compiler does not take into
account the cost of a test that involves a function call, or a test
that can touch an unused cache line or page. Your insight is still
needed to order tests so as to avoid expensive operations when possible.
A logical function over a small, finite domain can be replaced by a
table lookup. Examples include using a table to classify characters
in lexical classes, and implementing an interpreter using a branch
table indexed by instruction code.
When the table is small and frequently-used, this technique is almost
always helpful. However, when the table is sparse, or much larger than
the executable code that it replaces, the effect of the table on the
cache could swamp the benefits.
Assignment to a Boolean variable and its later test can be replaced
by an if statement on the Boolean value. Generalizing,
assignment to any control variable over a small, finite domain can be
replaced by a case statement on the value that would have been
assigned.
Application of this rule saves three things: declaration of a variable
to hold a control value; assignment into the variable; and fetching
from the variable when the value is tested. The rule says that instead,
you should modify the code so the control value is used to direct
execution as soon as the value is calculated. In some cases you have to
duplicate code in order to apply the rule.
The 7.x compilers are good at reusing the stack space of local
variables after their last use, and also good at saving calculated
values in machine registers between their assignment and a nearby use.
When the sole use of a local variable is to hold a value between its
expression and a following test, the compiler (at -O2 and higher) is
quite capable of eliminating the variable and using only a register
temp.
Hence the only time you should distort your program's logic to apply
this rule is when the assignment is separated from its use by many
statements, or by a non-inlined procedure call, (A procedure call
usually makes the compiler spill register values to memory.)
The run time of a set of procedures that (nonrecursively) call
themselves can often be reduced by rewriting the procedures inline and
binding what were passed arguments.
This technique is extremely useful, but difficult to apply to an
existing program and still leave the program readable and maintainable.
Fortunately, the 7.x compilers contain extensive support for
interprocedural analysis (IPA) and automatic inlining. Given that, it
is best to write the program in a plain, naive way, not fretting about
frequent calls to small procedures, and to let the compiler inline the
small procedures.
Although inlining eliminates the instruction overhead of a function
call, it inflates the size of the object code. At a certain level of
inlining, the effect of the inflated, low-density code on cache and
virtual memory wastes as much time as it saves.
Keep in mind the maintenance implications of inlining procedures in
code you will distribute to other users. A library procedure that has
been inlined even once can never be field-replaced. You can send out a
revised version of the library, but programs that inlined the function
will never call the library version.
A procedure should be organized to handle the commonest cases most
efficiently, and all remaining cases correctly (if more slowly). When
designing the application, attempt to arrange the input conditions so
that the efficient cases are the commonest.
The Caching rule can be seen as an extension
of this rule, in which "the commonest case" is "whatever
case was most recently requested." Bentley cites a Julian-date
conversion function in which it was observed that 90% of all calls
presented the same date as the previous call. Saving the last result,
and returning it without recomputing when appropriate, saved time.
One way to apply the rule is to have a function test for the common
cases and return their results quickly, falling through to the more
involved code of the hard cases. However, such a function is likely to
be a bulky one to inline.
Instead, write two functions.
The public function is short, and efficiently handles the most
common cases.
When it is called with a case it cannot handle, it calls the
longer, private function that deals with hard cases.
The point of this design is that the public function is short
enough that you can let the compiler inline it automatically.
The common cases will be handled by the inlined code.
The compiler should not inline the less-used, longer function.
A multiphase algorithm in which the phases are linked by temporary
files (or arrays) can be reduced to a single-pass algorithm using
coroutines.
Coroutines are defined and described in detail in Knuth (Volume I) and
most other modern books on algorithmics. Under IRIX, you can write
coroutines in a natural way using any one of three models:
-
The UNIX model of forked processes that communicate using pipes.
-
The POSIX threads model using POSIX message queues to communicate.
-
The MPI (or PVM) model of cooperating processes exchanging messages.
A simple, readable solution expressed as a recursion can be
transformed into a more efficient iterative solution. You can:
-
Code an explicit recursion using a stack variable.
-
When the final action of a procedure is to call itself ("tail
recursion"), replace the call with a goto to the top of
the procedure. (Then rewrite the iteration as a loop.)
-
Cut off the recursion early for small arguments (for example using a
table lookup), instead of recurring down to the case of size 0.
The 7.x compilers recognize simple cases of tail recursion and compile
a goto to the top of the function, avoiding stack use. More complicated
recursion defeats most of the compiler's ability to modify the source:
recursive procedures are not inlined; a loop containing a recursion is
not unrolled; and a loop with a recursion is not executed in parallel.
Hence there is good reason to remove recursion, after it has served its
purpose as a design tool.
Structure the program to exploit the parallelism available in the
underlying hardware.
This rule is a no-brainer for the Origin 2000; parallel execution is
what we do best! However, you can apply parallelism using a variety of
programming models and at a wide range of scales from small to large.
Two examples among many:
-
Using the POSIX-compliant asynchronous I/O library, you can schedule
one or an entire list of file accesses to be performed in a parallel
thread, with all the file-synchronous waits being taken by a different
process.
-
The native integer in the R10000 has 64 bits (under either -n32 or -64
compiler option). You can pack many smaller units into long long
(or INT*8) variables and operate on them in groups.
Parallel algorithm design is a very active area in computer science
research. Execution of the following query:
+parallel +algorith* +(research | development | library)
on the Alta Vista search
engine produced about 6000 hits, including two libraries of
parallel algorithms at CMU, one for irregular problems
and another
one, and the ZRAM
library at ETH in Switzerland.
As many variables (including arrays and tables) should be
initialized prior to program execution.
This rule encompasses the use of compile-time initialization of
variables -- in C, by use of initializing expressions in declarations,
in Fortran by use of the PARAMETER clause -- and also the preparation
of large arrays and tables.
Compile-time initialization with manifest constants allows the compiler
greater scope. The compiler precalculates subexpressions that have all
constant arguments. When IPA is used, the compiler recognizes when a
procedure argument is always a given constant, and propogates the
constant into the procedure body. When functions are inlined, constants
are propogated into the inlined code, producing more constant
expressions.
Some programs spend significant time initializing large arrays, tables,
and other data structures. There are two common cases: clearing to
zero, and processing initial data from one or more files.
Initializing to Zero
A for-loop that does nothing but store zero into an array is a waste of
CPU cycles. Worse, in NUMA system like the Origin 2000, all memory is
by default allocated to the node from which it is first touched, which
means that the zero array will be allocated in the node where the
for-loop runs. So at the least, make sure that each parallel thread of
the program initializes the array elements that it uses, and no others.
In a C program, it is much more efficient to use calloc(3) to
allocate memory filled with zero. The page frames are defined but the
actual pages are not created and not filled. Zero-filled pages are
created as the program faults them in. A similar alternative is to use mmap(2)
to map an extent of the pseudo-device /dev/zero. The mapped
extent is added to the program's address space, but again, zero pages
are created and allocated as they are faulted.
In a Fortran program, it is somewhat more efficient to call the C
library function bzero() than to code a DO loop storing zero
throughout an array.
Initializing From Files
The time to read and process file data is essentially wasted, since it
does not contribute to the solution. Bentley recommends writing a
separate program that reads the input data and processes it into an
intermediate form that can be read and stored more quickly. For
example, if the initializing data is in ASCII form, a side program can
preprocess it to binary. When the side program uses the same data
structures and array dimensions as the main program, the main program
basically reads the file directly into the array space.
IRIX permits you to extend this idea. You can use mmap(2) to map
any file into memory. You can take all the initialization code and
isolate it in a separate program. This program maps a large block of
memory to a new disk file. It reads the input file data, processes it
in any way required, and builds a complete set of data structures and
arrays of binary data in the mapped memory; then unmaps it. This writes
an image of the processed data into the file.
When the main program starts up, it maps the initialized file into
memory, and its initialization is instantly complete! Pages of memory
are faulted in from the disk file as the program refers to them,
complete with the binary content as left by the initializing program.
Replace the evaluation of a costly expression by one that is
algebraically equivalent but less costly. Example: not ln(A)+ln(B)
but ln(A*B) (give a passing thought to the chance that A*B
might overflow).
Bentley mentions that the frequent need to test whether an integer J is
contained in the inclusive range (I,K), naively coded ((I<=J)&&(J<=K)),
can be implemented as ((J-I)<(K-I)). The value (K-I) can
be precomputed. (When I and K are constants or propogated from constant
expressions, the compiler precomputes it automatically.)
Algebraic identities are at the heart of strength reduction in loops.
The compiler can recognize most opportunities for strength reduction in
an expression that includes the loop induction variable as an operand.
However, strength reduction is a general principle, and the compiler
cannot recognize every possibility. Examine every expression in a loop
to see if the same value could not be calculated more cheaply by an
incremental change in the same value from the previous iteration of the
loop.
When the same expression is evaluated twice with no assignments to
its variable operands, save the first evaluation and reuse it.
This rule, application of which is almost second nature to experienced
programmers, is in fact applied well by the 7.x compilers. The
compilers look for common subexpressions before and after constant
propogation and inlining, and can often find subexpressions that are
not superficially apparent in the source code. The expression value is
held for its second use in a machine register if possible, or the
compiler may create a temporary stack variable to hold it.
When similar expressions are evaluated together, consider making a
new function that combines the evaluation of both.
Bentley cites Knuth as reporting that both sine and cosine of an angle
can be computed as a pair for about 1.5 times the cost of computing one
of them alone. The maximum and minimum values in a vector can be found
in one pass for about 1.5 times the cost of finding either.
When it is not practical for the function to return both values, it can
at least cache the paired answer (space-time rule
3) to be returned on request. In the case of sine/cosine, each of
the functions can test to see if the argument angle is the same as the
previous angle passed to either function. When it is not, call the
function that calculates and caches both sine and cosine. Return the
requested result.
In this regard, note that the 7.x compilers do recognize the dual
nature of integer division and remainder, so that the sequence {div
= a/b;rem = a%b;} compiles to a single integer divide.
Use the full word width of the underlying computer architecture to
evaluate binary values in parallel.
This is a special case exploiting parallelism.
The native integer in all current MIPS processors is 64 bits. The
standard Fortran Boolean functions (IOR, IAND, etc.) handle INTEGER*8
values.
|