 Dr.Bob's First Book: Borland Pascal Efficiency
See Also: Dr.Bob's Programming Books for all programming books by Bob Swart

This (online) book is about Turbo and Borland Pascal programs and Efficiency.

Prologue
As some of you might already know, for quite some time now I've been working on a book about "Borland Pascal Performance Optimization". This book was due in early 1994, but fate decided otherwise. Well-known publishers all over the world think the Pascal book market is currently not a place where money can be made, especially with a specialised book like mine.
Then Chris Frizelle came by with his idea about The Pascal Magazine, and offered me his help in publishing the book, in return for my help with this magazine. The regular column on efficiency was part of my contribution. The Pascal Magazine was published for eight issues, before it was completely replaced by Chris Frizelle's second magazine called The Delphi Magazine. And here is the electronic on-line version of Borland Pascal Efficiency.

Performance Optimisation
Let's define what we're talking about, first. Borland Pascal Performance Optimisation can generally be divided into four steps:
bottle-necks
algorithms & datastructures
language constructs
assembly

Step one consists of finding the bottle-necks of our application. In order to optimise a program, we must first find its bottle-necks. This seems obvious, but we will see enough examples where the bottle-necks are in unexpected places, and optimisation efforts have been applied to the wrong part of the code. One golden optimisation rule: we only work on one bottle-neck at a time!
The second step, once we have found a performance bottle-neck, is to examine the algorithms and datastructures used. Again, it might seem obvious, but a well-tuned algorithm or efficient datastructure can make a difference of night and day compared to an inefficient one. Of course, the effect will only be noticeable if we substitute an efficient algorithm for an inefficient one where a bottle-neck was found in that particular part of the program. There is no sense in writing a super-fast hashing algorithm to find the command-line arguments of a program when we only have to access them once.
The third step is typically applied only after the second step has failed to obtain the desired efficiency. We must make sure, however, that the bottle-neck is still in the same place. If so, we examine the code at the bottle-neck in order to find language constructs that can be exchanged for more efficient ones. Examples include disk I/O, memory management and object-oriented issues like constructors, destructors and virtual functions.
The last step is again only applied after the previous step has failed. In step four we examine the bottle-neck code even more closely, and generate a mixed Pascal/Assembly listing in order to take a look at the generated code. During this last step, we try to rewrite the assembly code either as a special BASM routine or as an InLine macro.
During all steps, it is important to be able to measure the results. Only if we can show that the enhanced version is faster than the previous version do we know we have really optimised our program!

Anyway, enough intro for now. Sit back and enjoy, while we delve into the fine art of profiling and optimisation, as it originally appeared in the issues of The Pascal Magazine.


Borland Pascal Efficiency (1) - Profiling Techniques
When optimising applications, the most important task is to find the bottle-neck in our code. We must never try to guess where performance bottlenecks are, but should always measure with some kind of tool. We could accidentally guess right, but using the profiling tools and techniques described in this chapter we'll be sure to have found all bottle-necks in our code!

Having the time of my App
Finding bottle-necks in your code can be done using profiling tools like Turbo Profiler. We'll also implement two other timing methods in Turbo Pascal: the System Clock and the Reps Timer (counting the number of executions per clock-tick), and look at the TPTimer unit from TurboPower. Later in this book, we'll also examine the three profiling tools that come with Turbo Analyst.

Turbo Profiler
Several techniques must be applied in order to get the best out of Turbo Profiler. A top-down approach, from application to units to routines to lines, will be most beneficial in finding the biggest bottle-necks. In order to see the mixed Pascal-Assembly Language source code in the Turbo Profiler Source Window, we must compile the program with {$D+,L+} at the top of the source files of those units for which we want debugging information. Also, we should generate a detailed .MAP file when compiling, by setting the Options | Linker | Detailed option on, and finally, we must turn stand-alone debugging on (in order to get the debugging information in the .EXE file). We can also specify these options for the TPC and BPC command-line compilers with -$D+ -$L+ -V -GD.
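For example, the top of each unit we want to profile could look like this (a minimal sketch; the unit name is hypothetical):

```pascal
{$D+,L+}        { debug information plus local symbol information }
unit MyUnit;    { hypothetical unit name }
interface
...
```

From the command line, the equivalent compile would be bpc -$D+ -$L+ -V -GD swart1.pas, after which Turbo Profiler can load the resulting .EXE with full symbol information.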

System Clock
Any given routine or piece of code can be timed using a small Turbo Pascal function which yields the current value of the system clock (at BIOS location $40:$6C). This information is useful to determine some general bottle-necks or time critical parts of our application.

 var
   StartTick,StopTick: Word;
 begin
   StartTick := MemW[Seg0040:$6C] { Tick };
   while StartTick = MemW[Seg0040:$6C] do
       { wait for end of TimerTick };
   StartTick := MemW[Seg0040:$6C] { Tick };
   { call function }
   StopTick := MemW[Seg0040:$6C] { Tick };
   writeln('Time: ',(StopTick-StartTick)
                    / 18.2:1:2,' seconds.')
 end.
Now, we can measure how much time general algorithms, routines and operations take. Note that we have only a little overhead in calling MemW[Seg0040:$6C] and assigning the result to StartTick and StopTick. We must make sure that the operations between the StartTick and StopTick assignments take at least several clock ticks to complete, since our measurements have a precision of one clock tick, and the resolution is only 18.2 ticks per second.

Reps Timer
Sometimes a given routine or piece of code is just too fast to be timed by the System Clock technique. Can this piece of code still be a bottle-neck? Yes it can, for example when it is called several thousand or million times. In these cases, even the smallest change counts. To measure these kinds of fast routines or statements we cannot use the system clock timer, because its resolution isn't good enough.
The Reps Timer is a small routine which doesn't measure the time of a piece of code, but simply counts the number of times this piece can be executed in one (or more) clock ticks. While this number doesn't say much about a given piece of code on its own, it can be used to compare several blocks of code against each other:

 var
   StartTick: Word;
   Reps: LongInt;
 begin
   Reps := 0;
   StartTick := MemW[Seg0040:$6C] { Tick };
   while StartTick = MemW[Seg0040:$6C] do
       { wait for end of TimerTick };
   StartTick := MemW[Seg0040:$6C] { Tick };
   repeat
     { call function }
     Inc(Reps);
   until StartTick <> MemW[Seg0040:$6C];
   writeln(Reps);
 end.
We can measure the number of executions of a function in one (or more) clock ticks, with a precision of one call. Consequently, we have to make sure the number of calls (Reps) is large.
We have some overhead from incrementing Reps every time in the loop, and from checking whether StartTick is equal to MemW[Seg0040:$6C]. As a consequence, any code that does twice as many reps is probably more than twice as fast.
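To see why, write the cost of one loop iteration as the routine's own time t plus a fixed loop overhead c. With a tick of length T and N reps we have t + c = T/N, so for two routines measured at N and 2N reps:

```
t1 = T/N  - c
t2 = T/2N - c

t1/t2 = (T/N - c) / (T/2N - c) > 2   whenever c > 0
```

The overhead c is paid once per rep, so the faster routine is penalised relatively more, and the ratio of reps understates its true advantage.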

TurboPower's TPTimer
As you've read before, sometimes a given routine or piece of code is just too fast to be timed by the System Clock technique. TurboPower has implemented a TPTimer unit, which is able to time events with microsecond accuracy.

 uses
   TPTimer;
 var
   Start,Stop: LongInt;
   Time: Real;
 begin
   Start := ReadTimer;
   { call function }
   Stop := ReadTimer;
   Time := ElapsedTime(Start, Stop);
   writeln('Execution took ',Time:1:2,' ms.');
 end.
The TPTimer unit provides us with a convenient means of timing events with microsecond resolution. It does this by reprogramming the timer chips, but these details are hidden in the implementation section of the unit.
Personally, I have made two small modifications to this unit over the years. First, I removed ElapsedTimeString from the unit, because I didn't use it. Second, I moved the RestoreTimer code into the ExitProc routine, and the InitializeTimer code into the initialisation section of the unit, eliminating a function call in each case. This leaves the unit with only two interface routines: ReadTimer and ElapsedTime.

TPTimer Limitations
We have to keep some limitations of the TPTimer unit in mind. First, TPTimer cannot be used to time events longer than about 59.9 minutes, since LongInts are used to represent the time. Second, the process of reading the time using ReadTimer takes some time itself, so the result of timing very short events will be off by the overhead introduced by reading the timer. As of version 2.0 the TPTimer unit compensates for this overhead as much as possible, but we should nevertheless expect an error of about 1-4 microseconds.
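Both the 59.9-minute limit and the 838 nanosecond resolution mentioned later in this book follow from the frequency of the PC's hardware timer, assuming TPTimer reprograms the standard 8253/8254 timer chip running at 1,193,180 Hz:

```
1 tick     =  1 / 1193180 Hz           ~ 838 nanoseconds
2^32 ticks =  4294967296 / 1193180 Hz  ~ 3600 seconds ~ 59.9 minutes
```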
Nevertheless, TPTimer is a powerful profiling and timing tool, indispensable for my daily work.

Stop this Watch
We have examined some Public Domain timing tools for everyday use in profiling Turbo Pascal applications. The Clock Tick and Reps Timer can be used for normal applications or functions, while TPTimer is very useful for very fast functions or very accurate timings.
Later in this book, we'll examine in detail the special profiling tools PET, MON and PEP that are part of the Turbo Analyst toolbox.


Borland Pascal Efficiency (2) - Performance Optimisation
In the first chapter, we examined profiling techniques - a very important step in Borland Pascal Performance Optimization. This time, we'll use these techniques on an example program, and apply all the other steps along the way as well.

Most programs consist of two different parts: a user interface and an engine. Often, it is the engine where the performance bottle-necks are found. Therefore, we will examine a special example program to illustrate the concepts of Borland Pascal Performance Optimization using almost no user interface code. The program itself can be incorporated into a Turbo Vision or OWL user interface, without much difference in performance.
The example program has three main features: it copies a file (the arguments are the input and output filenames), it upcases all characters while copying (from input to output), and it counts words. All tests were performed on an 80386, 33MHz, 4MB machine, with and without smartdrive, on a 210MB harddisk. All programs were compiled using Borland Pascal 7.01. The input file used was WINDOWS.H, a file of 158482 bytes found in the Borland C++ 4.0 include directory.

 program swart1;
 var
   TimerTick: ^Word;
   Ticks: Word;
 const
    words: LongInt = 0;
 var
   infile,outfile: Text;
   inword: Boolean;
   ch: Char;

 begin
   TimerTick := Ptr(Seg0040,$006C);
   Assign(infile,ParamStr(1));
   reset(infile);
   Assign(outfile,ParamStr(2));
   rewrite(outfile);

   Ticks := TimerTick^;
   while not eof(infile) do
   begin
     inword := False;
     while not eoln(infile) do
     begin
       read(infile,ch);
       if (ch <= ' ') then inword := False
       else
       begin
         if not inword then
         begin
           inword := True;
           Inc(words)
         end
       end;
       write(outfile,UpCase(ch))
     end;
     readln(infile);
     writeln(outfile)
   end;
   Ticks := TimerTick^ - Ticks;

   close(infile);
   close(outfile);
   writeln('Swart1: ',words,
           ' - ',Ticks:4,' ticks.')
 end.
Note that only the main while loop is included in the performance measurements. The first version of the example program runs in about 10 seconds. Our goal will be to write a program that is at least 10 times faster (i.e. able to run within one second).

Algorithm & Datastructures
The example program looks just fine, but is really very slow, because each character that is read from infile and written to outfile needs a disk access (or at least a call to MS-DOS). We do not even need a profiling tool to prove this, because we can speed this program up by an initial factor of three simply by switching to a readln and writeln algorithm, with String as the underlying datastructure.
Note that this will only work if we are sure that the lines in the input file are no longer than 255 characters, because that is the maximum length of a Borland Pascal String.

 var
   Str: String;
   ...
   while not eof(infile) do
   begin
     readln(infile,Str);
     Inc(words,WordCount(Str));
     writeln(outfile,UpperStr(Str))
   end {eof};
   ...
Changing the algorithm from char-by-char to line-by-line has its drawbacks, in that we now also have to write the functions WordCount and UpperStr:
 function WordCount(Str: String): Word;
 var
   inword: Boolean;
   words,i: Word;
 begin
   inword := False;
   words := 0;
   for i := 1 to Length(Str) do
   begin
      if (Str[i] <= ' ') then
       inword := False
     else
     begin
       if not inword then
        { new word found }
       begin
         inword := True;
         Inc(words)
       end
     end
   end;
   WordCount := Words
 end {WordCount};
The actual implementation of these two functions does not differ much from the code used in the first example program.
 function UpperStr(Str: String): String;
 var
   tmp: String;
   i: Integer;
 begin
   tmp[0] := Str[0];
   for i:=1 to Length(Str) do
     tmp[i] := UpCase(Str[i]);
   UpperStr := tmp
 end {UpperStr};
However, when we finally run the two programs against each other, we see that it's been worth it: performance has increased by a factor of three (when using smartdrive):

                  swart1  swart2
 with smartdrive     118      43
 w/o smartdrive      341     289

From now on, most performance measurements will be done using smartdrive, since smartdrive is installed on the vast majority of machines. We must always remember that it is important to measure several times, in order to be able to ignore possible incorrect timing results, and to average the measurements.
Note that it is very important to use a configuration as clean as possible when doing performance measurements, to avoid "noise". This means no TSRs and no device drivers unless they are absolutely necessary!

Turbo Profiler
Now, it is time to fire up Turbo Profiler, in order to find the next bottle-necks of the example program swart2.pas. Before we start to profile the program, we must specify the areas we want to profile. An area is a location in a program where statistics are collected; it can be a single line or an entire routine. By default, Turbo Profiler marks every line in small programs, but we want only the lines in the main program to be tagged (we don't want timings within routines, just totals per routine). Then, we select the Execution Profile Window, and set Display to Both (time and calls). Finally, we specify the program arguments and run the program.
Using Print | Options | File | swart2.out (statistics) and swart2.mod (module), we can print the results to file:


Execution Profile
Total time: 2.5589 sec   % of total: 100%
Run: 1 of 1              Filter: All
Show: Time and counts    Sort: Frequency
Pass count: +++          Time: ***

SWART2.67  5390 19% |+++++++++++++++  1.4233 sec 55% |**********************************************
SWART2.65  5390 19% |+++++++++++++++  0.6650 sec 25% |*********************
SWART2.66  5390 19% |+++++++++++++++  0.3258 sec 12% |**********
SWART2.63  5391 19% |+++++++++++++++  0.0757 sec  2% |**
SWART2.68  5390 19% |+++++++++++++++  0.0468 sec  1% |*

  Time  Counts
0.0757    5391   63  while not eof(infile) do
                 64  begin
0.6650    5390   65    readln(infile,Str);
0.3258    5390   66    Inc(words,WordCount(Str));
1.4233    5390   67    writeln(outfile,UpperStr(Str))
0.0468    5390   68  end;

The statistics show us which areas are responsible for 95% of the execution time. Together with the output from the module, this shows us exactly which lines correspond to the bottle-necks.

Observations
We see immediately that 55% of the time is spent in the UpperStr and writeln statement. Another 25% of the time is spent in the readln statement. We can conclude that I/O is responsible for the biggest bottle-neck, immediately followed by the WordCount (12%) and UpperStr (part of the 55%) functions.

Language Constructs
So far, we have changed the char-by-char algorithm to a line-by-line algorithm and obtained a threefold performance gain. Still, we found the I/O to be the major bottle-neck of our program. Time to look for more efficient I/O language constructs.

Disk I/O
We can speed this example program up by using the Borland Pascal language construct SetTextBuf after the reset and rewrite calls. The default buffer of a text file is 128 bytes. If we increase this buffer, performance increases too, because disk access will be postponed until the InBuf is empty or the OutBuf is full. We can keep using the same readln algorithm, and only have to increase the buffer:

 const
   BufSize = 4096;
 var
   InBuf,OutBuf: Array[1..BufSize] of Char;

 begin
   Assign(infile,ParamStr(1));
   reset(infile);
   SetTextBuf(infile,InBuf);

   Assign(outfile,ParamStr(2));
   rewrite(outfile);
   SetTextBuf(outfile,OutBuf);
   ...
What buffer size will be best? This is difficult to tell, and depends on the type and size of the harddisk used. Typically, a harddisk has an allocation unit which is a multiple of 2Kbytes (often 4Kbytes). We can show this by varying the size of the input and output buffers and measuring the performance of swart3.pas:

 BufferSize  ClockTicks 
128  43 
256  37 
512  34 
1024  31 
2048  29 
4096  28-29 
8192  28 
16384  28 
32768  27-28 

A buffer of 4 Kbytes seems to be sufficient for the moment, yielding a performance gain of about 35%.

WordCount
Now, it's time to examine the language constructs used in the other two bottle-necks. First, let's look at the WordCount function. The function itself is clear, and looks fine as is. However, the function header does not look good (see listing 3). We supply the input Str as a value parameter, which means that a local copy of Str is made for every call to WordCount. Since we don't change Str anyway, there is no need to waste time making a copy on every call. Instead, we should pass a pointer to Str, use a var parameter, or - better yet - a const parameter:

 function WordCount(Const Str: String): Word;
Now, we actually pass a pointer to Str to the function WordCount. Within WordCount, we use the actual value of Str, but we may not change it. This leads to enormous performance gains, as can be seen in the next table (results are using the reps timer, as described in the first chapter, on WordCnt.PAS):

 old WordCount  new WordCount
           143            278

UpperStr
Finally, we consider the UpperStr function. String functions are always something special to consider. String procedures are faster than functions, because functions must return string results via the slow stack. However, string procedures tend to be harder to read than their corresponding string functions.
The most obvious enhancement of the original UpperStr function, is the usage of a const string parameter, where a value parameter was used. Constant parameters are more efficient than value parameters because the compiler doesn't have to generate copies of the actual parameters upon entry to the function itself. For string parameters, this obviously saves a lot of time.
Also, we do not need a temporary variable tmp to store the string result (which must be copied to the function result on exit); we can "write" directly into UpperStr, as if it were a local variable of type String:

 function UpperStr(Const Str: String): String;
 var
   i: Integer;
 begin
   UpperStr[0] := Str[0];
   for i:=1 to Length(Str) do
     UpperStr[i] := UpCase(Str[i]);
 end {UpperStr};
Another way to optimize the performance of UpperStr, is by making it a procedure instead of a function. This way, we avoid copying the result at all, since we modify the actual parameter itself. We don't even have to copy the length byte and don't waste stack space.
We must, however, change the main program, since we now must separate the UpperStr procedure call from the writeln output statement.
 procedure UpperStr(var Str: String);
 var
   i: Integer;
 begin
   for i:=1 to Length(Str) do
     Str[i] := UpCase(Str[i])
 end {UpperStr};
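With the procedure version, the main loop of the program changes accordingly; a sketch (the file handling around the loop stays the same as before):

```pascal
while not eof(infile) do
begin
  readln(infile,Str);
  Inc(words,WordCount(Str));
  UpperStr(Str);            { upcase Str in place }
  writeln(outfile,Str)
end;
```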
The two different enhancements of UpperStr can be timed against the previous function, yielding the following table (using the reps timer described in the first chapter):

 UpperStr  New UpperStr  UpperStr Proc
       73           144            218

Incorporating procedure UpperStr, function WordCount and SetTextBuf techniques in swart3.pas yields the following timing results on our 155Kbytes input file WINDOWS.H:

                  swart1  swart2  swart3
 with smartdrive     118      43      26
 w/o smartdrive      341     289     N/A

The report from Turbo Profiler on swart3.pas is as follows:


  Time  Counts
0.1097    5391  while not eof(infile) do
                begin
0.3403    5390    readln(infile,Str);
0.3130    5390    Inc(words,WordCount(Str));
0.5843    5390    UpperStr(Str);
0.3038    5390    writeln(outfile,Str)
0.0045    5390  end;

The module statistics show us which areas are responsible for the major bottle-necks.

Observations
We see immediately that most of the time is spent in the UpperStr statement. Together, however, the readln and writeln statements take even more time! Again, we can conclude that I/O is responsible for the biggest bottle-neck, immediately followed by the UpperStr function.

Final I/O optimizations, BASM and InLines
The last step, after we have selected the most appropriate datastructure, algorithm and language construct, is to rewrite part of the Pascal code using BASM or InLine macros.

BlockRead and Write
Two other very useful Borland Pascal procedures in this particular case are BlockRead and BlockWrite. Instead of reading and writing a line (or a character) at a time, we should read and write a whole block at a time. And instead of using an InBuf and an OutBuf, we can use one single Buffer.
We can speed this up some more by using a dynamically allocated buffer sized a multiple of 4096 bytes (which is the typical allocation unit on most harddisks), with a maximum of 60Kbytes (or MaxAvail). With the dynamic allocation of the buffer we use for BlockRead and BlockWrite we lose some time in GetMem and FreeMem. However, we gain a lot of speed and save a lot of stack space doing so.

 const
   BufSize: Word = 8 * 4096;

 var
   Buffer: PChar;

    ...
    { infile and outfile are now untyped Files, opened with
      reset(infile,1) and rewrite(outfile,1), since BlockRead
      and BlockWrite do not work on Text files }
    GetMem(Buffer,BufSize);

   Ticks := TimerTick^;
   repeat
     BlockRead(infile,Buffer^,BufSize,Size);
     Inc(Words,WordCount(Buffer,Size));
     UpperFast(Buffer,Size); {listing 9}
     BlockWrite(outfile,Buffer^,Size)
   until Size < BufSize;
   Ticks := TimerTick^ - Ticks;

   FreeMem(Buffer,BufSize);
   ...
Timing on our 155Kbytes input file WINDOWS.H using clock-ticks yields the following table:

 swart1  118 
 swart2  43 
 swart3  36 
 swart4  20 

Note that we had to rewrite WordCount and UpperFast to work on PChars rather than Strings.
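A PChar version of WordCount might look like this; a sketch using the same scanning logic as the String version. A forward loop is used here, and inword is kept outside the function, so that the carried-over state remains correct when a word spans two consecutive BlockRead buffers:

```pascal
var
  inword: Boolean;  { global: a word may span two BlockRead buffers }

function WordCount(var Str: PChar; Size: Word): LongInt;
var
  words: LongInt;
  i: Word;
begin
  words := 0;
  for i := 0 to Size-1 do
    if (Str[i] <= ' ') then inword := False
    else
      if not inword then
      begin
        inword := True;
        Inc(words)
      end;
  WordCount := words
end {WordCount};
```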

UpperStr
The last step in performance optimization is rewriting part of the code using BASM code or InLine macros. BASM stands for Built-in ASseMbler, and is rather easy to incorporate into a Borland Pascal program. InLine statements are no longer strictly necessary since the availability of BASM, but InLine macros can still be very useful, as will be shown.
The function UpperStr can be rewritten as a BASM procedure to gain speed. However, even more speed can be gained by rewriting the BASM procedure as an InLine macro. InLine macros are faster than BASM routines, as they eliminate the (far or near) call. However, InLine macros expand the code size of your application, since for every InLine "call" the actual code is placed directly into the executable. One of the most important guidelines for InLine macros is: don't make them too long, or don't call them often. Since we call UpperFast only once in the example program, we can make it as big as we would like:

 procedure UpperFast(Str: PChar; Size: Word);
 InLine(
   $8C/$DA/        {  mov   DX,DS          }
   $BB/Ord('a')/Ord('z')-Ord('a')/
                   {  mov   BX,'z'-'a'/'a' }
   $59/            {  pop   CX             }
   $5E/            {  pop   SI             }
   $1F/            {  pop   DS             }
   $FC/            {  cld                  }
   $D1/$E9/        {  shr   CX,1           }
   $73/$0B/        {  jnc   @Part1         }
   $AC/            {  lodsb                }
   $28/$D8/        {  sub   AL,BL          }
   $38/$F8/        {  cmp   AL,BH          }
   $77/$04/        {  ja    @Part1         }
   $80/$6C/$FF/
    Ord('a')-Ord('A')/
      {@Loop: sub   Byte Ptr[SI-1],'a'-'A' }
   $E3/$14/   {@Part1:jcxz  @Exit          }
   $AD/            {  lodsw                }
   $28/$D8/        {  sub   AL,BL          }
   $38/$F8/        {  cmp   AL,BH          }
   $77/$04/        {  ja    @Part2         }
   $80/$6C/$FE/Ord('a')-Ord('A')/
           {  sub   Byte Ptr[SI-2],'a'-'A' }
   $49/            {@Part2:dec   CX        }
   $28/$DC/        {  sub   AH,BL          }
   $38/$FC/        {  cmp   AH,BH          }
   $77/$EC/        {  ja    @Part1         }
   $EB/$E6/        {  jmp   @Loop          }
   $8E/$DA    {@Exit: mov   DS,DX          }
 ) {UpperFast};
Several techniques have been used here to make the InLine macro as fast as possible. A fast register-register compare with BX is used instead of a register-value compare. Also, to save some more clock cycles, we save and restore DS to and from DX (instead of PUSHing and POPping DS).
Timings of the new swart4.pas on our 155Kbytes input file WINDOWS.H, using clock-ticks yields the following table:

 swart1  118 
 swart2  43 
 swart3  26 
 swart4  11 

Note that the program is almost twice as fast as the previous version, because the UpperStr routine was the biggest bottle-neck. This again shows that we should only work on the biggest bottle-necks, and not waste development time on minor bottle-necks (or on pieces of code that are no bottle-neck at all).
All in all, program swart4 is about 10 times as fast as swart1, just as we intended!

Conclusions
Borland Pascal Performance Optimization has been divided into four steps: finding bottle-necks, selecting algorithms & datastructures, using efficient language constructs, and writing BASM or InLine code. Each of these four steps has been illustrated with the example program (which was optimized by a factor of 10).
Next time, we'll examine a DPMI and Windows version of this program, and take a look at the DPMI-capable profiling tools from the excellent Turbo Analyst toolbox made by TurboPower Software.


Borland Pascal Efficiency (3) - (16-bits) DPMI
Borland Pascal Performance Optimisation techniques include profiling code, selecting appropriate datastructures and algorithms, using efficient language constructs, and writing BASM code or InLine macros. DPMI, the DOS Protected Mode Interface, adds special considerations to this list of topics.

Last Time on Efficiency
In chapter 2, we examined a real-world DOS application, and enhanced its performance by a factor of 10.
To summarise the previous chapters: we wrote a small application to copy a file, upcase all characters while copying, and count the number of words copied. Not a typical application, that's for sure, but it does contain a lot of elements that are often used in real-world applications. The fourth optimised version of this program was indeed 10 times as fast as the original, using a BlockRead/BlockWrite loop, a WordCount function, and a fast InLine macro UpperFast.
The final timings of the programs on our 155Kbytes input file WINDOWS.H, using clock-ticks gave the following table:

 swart1  118 
 swart2  43 
 swart3  26 
 swart4  11 

This time, we'll convert the program swart4 from a DOS real-mode to a DOS protected mode application. Doing that, we will experience some new problems and learn some DPMI-specific programming considerations.

Enter DPMI
Borland Pascal offers the ability to generate real-mode DOS, protected-mode DOS (DPMI) and Windows applications. Borland Pascal provides DPMI support with two files: DPMI16BI.OVL is the 16-bit DPMI 0.9 server, and RTM.EXE (the run-time manager) is the 16-bit DOS extender. Borland Pascal DPMI programs, or clients, can access the server and run-time manager through the WinAPI unit.
Timing of the real-mode and DPMI version of the example programs on our 155Kbytes input file WINDOWS.H, yields the following table:

 program  real-mode  DPMI 
 swart1  118  179 
 swart2  43  65 
 swart3  26  41 
 swart4  11  27 

Why is the DPMI version always so much slower? In order to answer this question, we can do two things: profile the code (the most desirable option) or try to guess what went wrong, and fix that.
Remember that we've already sped up the I/O operations and made a fast InLine macro UpperFast in our last real-mode example. We should expect the WordCount function to be a possible bottle-neck now.

DPMI & Profiling
Although Borland Pascal includes TDX.EXE to debug a protected mode program, there is no way to profile a DPMI program. The 'normal' Turbo Profiler will try to load the real-mode stub (that loads the DPMI Server), while the Turbo Profiler for Windows is unable to load a DPMI program at all.
So, we're out of luck? Not really, because we take a little side-step in this chapter, and examine some DPMI-able Profilers from the Turbo Analyst toolbox by TurboPower Software from Colorado Springs, USA (I'm sure you must have heard of them at least once before).
Turbo Analyst contains - among many, many other useful things - three different profilers: PEP, PET and MON. They might not be as 'visual' as the Turbo Profiler, but they certainly get the job done, and two of them can be used to profile a protected mode application!

MON
One way of profiling is counting how many times a certain piece of code is executed. MON does exactly that: it counts the number of times each line of code in our program is executed. Besides information on how many times each line is executed, MON also shows us which lines are not executed at all! This is a very important side-effect of MON, that can help you develop new test and debugging strategies by detecting and eliminating obsolete or dead code.
A sample output of MON (on a part of a program only) is as follows:


* Hit Count for All Statements *

Statistics
================================================================================
Total statements:      35    Statements in range:      35  100.00%
Total hits:        489951    Hits in range:        489951  100.00%
Highest:           181454    High in range:        181454  100.00%
Not executed:           1    Number in range:           1  100.00%

Hit Count   F:\TANALYST\PROFILER\SWART5.PAS (SubRoutine: WordCount)
================================================================================
            function WordCount(var Str: PChar; Size: Word): LongInt;
            var
              words: LongInt;
              i: Word;
[      6]   begin
[      6]     words := 0;
[      6]     for i:=Size-1 downto 0 do
            begin
[ 181454]     if (Str[i] <= ' ') then
                inword := False
              else
              begin
[  91754]       if not inword then
                begin
[  17605]         inword := True;
[  17605]         Inc(words)
                end
              end
[ 181454]   end;
[      6]   WordCount := words
[      6]   end {WordCount};

The use of MON is very easy: we only have to include the unit MON in our uses list, compile with {$D+} for debug information, and generate a .MAP file. The unit MON compiles for real-mode (no overlays) and DPMI targets, but not for Windows. MON gathers its information during execution of the program, using INT 1 (single-step mode) and INT 3 (breakpoint). After execution, we have to use another program, MONITOR, to view or print the gathered statistics (see listing 1). Since MON has to keep track of the execution of every statement in a program, this is a time-consuming process. Therefore, we should use MON only on small programs (the information tables are limited to 16K statements anyway).
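As a minimal sketch of what such an instrumented program looks like (the program itself is hypothetical; only the use of the MON unit and the {$D+} directive are as described above):

 {$D+}   { debug information, needed for the .MAP file }
 program MonDemo;
 uses
   MON;  { gathers a hit count for every statement }
 var
   i,n: Word;
 begin
   n := 0;
   for i:=1 to 100 do
     Inc(n,i)  { MONITOR should report 100 hits for this line }
 end.

After running MonDemo, the MONITOR program reads the gathered statistics and matches them against the .MAP file to produce a report like the one in listing 1.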

PET
Another type of information that is often needed when looking for the bottle-necks of a program is the amount of time spent in a certain piece of code. PET is just the tool for this, as it measures the actual amount of time spent in each procedure and function with great accuracy (a resolution of 838 nanoseconds). Using PET is just like using MON: simply include the unit PET in the uses list of the program. The accumulated results are, however, printed immediately after completion of execution. Like MON, PET can be compiled for DOS and DPMI, but not for Windows.
Let's examine the real-mode version of the final swart4.pas program from the previous chapter. But before we can do this, we have to make sure PET can measure every routine in the program. We have an InLine macro UpperFast which PET is unable to measure, because InLine macros are expanded and not called. Therefore, we have to write a little wrapper function UpperAnalyze, just for measuring purposes:

 procedure UpperAnalyze(Str: PChar; Size: Word);
 begin
   UpperFast(Str, Size)
 end {UpperAnalyze};
Note that we will now not only measure the time spent in the InLine macro UpperFast, but also the call overhead of UpperAnalyze itself:


 Data from    : F:\TANALYST\PROFILER\SWART4.PMF
 Generated    : Friday 8/12/94 at 19:04
 Sorted by    : Percent
 OverheadI    : 0.032 ms/call
 OverheadO    : 0.030 ms/call
 Command line : " d:windows.h output"

 Address   Length Nest Accum Time Time/Call Calls Percent Name
 --------- ------ ---- ---------- --------- ----- ------- -----------------
                    0     581.424   581.424     1   52.3%  MAIN
 0000:0000    112   0     423.537    84.707     5   38.1%  WordCount
 0000:0070     71   0     106.527    21.305     5    9.6%  UpperAnalyze
                       ----------            -----
                         1111.488 ms            11 calls
                         1112.170 ms computed with overhead
                         1112.116 ms actual running time

As we can see in listing 2, PET lists the total amount of time spent in each routine (left column), the average time spent per call (second column) and the number of calls made to the particular routine (third column). Using this information, we can decide which part of the program to examine further.
The major bottle-neck in this case, the MAIN routine, consists of everything but WordCount and UpperAnalyze, mainly allocating memory for the file buffers, and the BlockRead/BlockWrite loop.
Let's compare this result with a DPMI version of swart4.pas (compiled for a DPMI target):


 Data from    : F:\TANALYST\PROFILER\SWART4D.PMF
 Generated    : Friday 8/12/94 at 19:04
 Sorted by    : Percent
 OverheadI    : 0.215 ms/call
 OverheadO    : 0.219 ms/call
 Command line : " d:windows.h output"

 Address   Length Nest Accum Time Time/Call Calls Percent Name
 --------- ------ ---- ---------- --------- ----- ------- -----------------
                    0     937.743   937.743     1   60.3%  MAIN
 0001:0002    112   0     519.602   103.920     5   33.4%  WordCount
 0001:0072     71   0      98.401    19.680     5    6.3%  UpperAnalyze
                       ----------            -----
                         1555.746 ms            11 calls
                         1560.513 ms computed with overhead
                         1560.134 ms actual running time

Close examination of these two results in listings 2 and 3 leads to the conclusion that especially the time spent in MAIN has increased a lot. This leads to the suspicion that the bottle-neck may well be the I/O-operations, again!
Note that our earlier thought about the bottle-neck - the function WordCount - would have been completely wrong, especially for the DPMI version of swart4.pas! This again shows that the only way we can be sure of the place of the bottle-necks is to measure, measure and measure again!

PEP
A third profiling tool from the Turbo Analyst toolbox is PEP, which uses a statistical sampling technique to determine the use of program areas as well as memory areas (below 1Mb). By taking over the timer interrupt, PEP gains control over the machine at certain intervals, and just checks to see what section of code and memory was active at the time of the interrupt. The timer interrupt is tweaked so PEP gains control somewhere between 40 and 30,000 times a second (rather than 18.2).
The methodology used by the PEP profiler doesn't fit into the protected mode environment, primarily because of the arbitrary connection between protected mode selectors and physical memory addresses, so TurboPower put more effort into upgrading PET and MON to DPMI, and PEP only compiles for a DOS real-mode target.
However, for DOS real-mode it is a great tool that can give us information not only about the activity of the program itself, but also the amount of activity in the Turbo Pascal RTL, in DOS and the BIOS, as these memory areas are all included in the statistics.

DPMI & Efficiency
Back to our problem at hand: Why is the DPMI version always so much slower? As our measurements using PET have shown, it seems like the I/O-operations are again our major bottle-neck!
DOS and the BIOS are designed for real mode, and cannot run in protected mode. When a DPMI client calls DOS, the DOS extender handles the call, switches the processor to real mode, copies any buffers from protected mode to real mode, and relays the call to DOS. When the call is complete, the DOS extender switches the processor back to protected mode, copies the real mode buffers back to protected mode, and returns control to the DPMI client again.
On 80286 processors, the switch to real mode requires the processor to reset and takes a lot of time. Even on 80386+ machines the switches are noticeable. The first rule-of-thumb, in order to minimise mode switching overhead is: when you must switch modes, do as much as possible before returning!

For our example program, this means that we must do as much as possible when we're reading from or writing to files using BlockRead and BlockWrite (since these operations require real-mode DOS to read or write, and therefore require a mode switch of the processor). The only thing we can do about this is use a buffer as large as possible. This conclusion is not new; we already saw the effect when we varied the sizes of the input and output buffers in the third version of the example program.
The DOS extender copies any protected mode buffers to real mode buffers. This also takes time - a lot of time for large buffers! The second rule-of-thumb, to minimise copying overhead, is: use transfer buffers allocated in conventional memory!
The DOS extender takes care of copying any buffers in memory above 1 Mbyte to buffers below 1 Mbyte, which allows the real mode interrupt to access them. We can save time by allocating a buffer within conventional memory with GlobalDosAlloc, accessible from both real and protected mode code. The function GlobalDosAlloc, a Windows API available in DPMI from the WinAPI unit, allocates memory that is fixed and must be below 1 Mbyte so that real mode DOS can also access it. GlobalDosAlloc returns two different address bases: the LoWord is the protected mode selector, while the HiWord is the real mode segment, as can be seen in listing 10, using the LongRec datastructure. The DOS extender no longer needs to copy the buffer to and from conventional memory (it's already there), and we can even call interrupts the DOS extender doesn't support directly - all without the overhead of copying buffers. The protected mode version of our example program with a dynamic real mode buffer runs about as fast as the real mode version, with or without SmartDrive.

 {$IFNDEF MSDOS}
 uses
   WinAPI;

 type
   LongRec = record
               Selector,Segment: Word;
             end {LongRec};

   function XGlobalDosAlloc(Size: longint): pointer;
   var
     Long: LongInt;
   begin
     Long := GlobalDosAlloc(Size);
     XGlobalDosAlloc := Ptr(LongRec(Long).Selector, 0)
   end {XGlobalDosAlloc};
 {$ENDIF MSDOS}

 begin
 {$IFDEF MSDOS}
   GetMem(Buffer,BufSize);
 {$ELSE}
   Buffer := XGlobalDosAlloc(BufSize);
             { buffer below 1Mbyte }
 {$ENDIF MSDOS}

   { main while loop }

 {$IFDEF MSDOS}
   FreeMem(Buffer,BufSize);
 {$ELSE}
   GlobalDosFree(LongRec(Buffer).Selector);
   { free buffer }
 {$ENDIF}
 end.
We used GlobalDosAlloc and GlobalDosFree for this purpose. A large real mode buffer will yield higher performance, but leaves less available real mode memory for other purposes. That's why we use a dynamic real mode buffer, which is freed immediately after the results have been processed.
Timing on our input file of 155Kbytes yields the following results:

             real-mode   DPMI   DPMI real-mode buffer
  swart5        11        27             15

Note that the DPMI version using a real-mode buffer is now almost as fast as the real-mode version!

DPMI & Memory Management
The last topic of this chapter is about customising the Borland Pascal DPMI Memory Manager and Suballocator. The procedures New, Dispose, GetMem and FreeMem all make use of this heap manager and suballocator. The heap manager allocates a global heap, and uses a segment suballocator algorithm to enhance performance and to allow more than 8192 blocks to be allocated.

SideBar: Suballocation
Suballocation is the process of allocating memory from the global heap and then managing a heap within the block. A suballocator allocates multiple blocks from the global heap as needed and then manages so-called local heaps within them. When a block is allocated with a size smaller than HeapLimit, a suballocation takes place. Otherwise, a new block of the exact size is allocated (that's also the main reason why setting HeapLimit to zero is such a great way of debugging protected mode and Windows applications).
When the local heaps contain no free space of the desired size, a new block of size HeapBlock is allocated, and a new local heap is started.  

HeapLimit
Performance can be enhanced if several allocations are made of variables larger than the default HeapLimit (say 4096 bytes each). In this case, HeapLimit should be made equal to the largest blocksize.
We must also make sure that HeapBlock remains at least four times the size of HeapLimit, as stated in the Borland Pascal Language Guide.
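As a sketch of such a tuning (the block size of 6000 bytes is purely illustrative), we could set both variables before the first allocation takes place:

 const
   OurBlockSize = 6000;  { largest block we allocate frequently }
 var
   p: pointer;
 begin
   HeapLimit := OurBlockSize;    { suballocate blocks up to this size }
   HeapBlock := 4*OurBlockSize;  { at least four times HeapLimit }
   GetMem(p,OurBlockSize);       { now served by the suballocator }
   FreeMem(p,OurBlockSize)
 end.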

HeapBlock
A bigger HeapBlock will increase performance since fewer GlobalAllocs are needed, but may decrease the amount of available memory and increase the memory usage overhead.

Customising the Suballocator
Normally, the Borland Pascal memory management suballocator works just fine. In a few cases though, there may be reasons to experiment and adjust the default settings of the suballocator, using the predefined control variables HeapLimit and HeapBlock.
The program below can be used to experiment with new settings of HeapLimit and HeapBlock. By compiling with {$DEFINE NoSubAllocator} the suballocator is disabled, resulting in bad performance. Note that only very small blocks of memory (between 1 and 256 bytes) will be allocated, to avoid a possible memory overflow when testing:

 {$IFDEF WINDOWS}
 uses
   WinCrt;
 {$ENDIF WINDOWS}

 {$DEFINE NoSubAllocator}
 const
   Blocks = 2048; { Number of blocks }

 var
   TimerTick: ^Word;
   StartTick,Ticks: Word;
   Block: array[1..Blocks] of PChar;
   i: Word;

 begin
   TimerTick := Ptr(Seg0040,$006C);
   {$IFDEF NoSubAllocator}
     HeapLimit := 0;
   {$ENDIF}
   writeln('MemAvail: ',MemAvail);

   StartTick := TimerTick^;
   repeat until StartTick <> TimerTick^;
   StartTick := TimerTick^;
   for i:=1 to Blocks do GetMem(Block[i],(i AND $FF)+1);
   Ticks := TimerTick^ - StartTick;
   writeln('Getmem uses ',Ticks);
   writeln('MemAvail: ',MemAvail);

   StartTick := TimerTick^;
   repeat until StartTick <> TimerTick^;
   StartTick := TimerTick^;
   for i:=1 to Blocks do FreeMem(Block[i],(i AND $FF)+1);
   Ticks := TimerTick^ - StartTick;
   writeln('FreeMem uses ',Ticks);
   writeln('MemAvail: ',MemAvail)
 end.
Without memory suballocator (HeapLimit := 0):

 Windows  DOS (DPMI16BI) 
 GetMem  90  14 
 FreeMem  5  4 

With memory suballocator:

 Windows  DOS (DPMI16BI) 
 GetMem  2  1
 FreeMem  0  0 

The numbers indicate the clockticks needed to perform the specific action; a bigger number indicates worse performance. We can conclude from these figures that, although a value of zero for HeapLimit is a great debugging tool, it is also a great way to slow our programs down. Therefore, we should only set HeapLimit to zero when we want to test and debug our programs!


Borland Pascal Efficiency (4) - Speed vs. Size
The first thing that comes to mind when we talk about optimisation is execution speed. There is another way of optimising programs, however, which is optimising for size. In this chapter, we'll focus on the speed vs. size dilemma, with the emphasis on size optimisations.

Solutions that are good for execution speed often have a negative effect on the program size, and vice versa. This is mainly caused by the fact that a general (small) solution to a problem will run slower than a more specific (and hence often larger) solution to the problem.

Size Optimisation
First of all, optimising for size does not only mean .EXE size, but code size, data (segment) size, stack usage, dynamic memory usage and finally overall .EXE size. A program that minimises its use of stack and heap memory may run on a machine where it otherwise would not even have started.
If you want to see the actual Code Size, Data Size and Heap and Stack limits of a Turbo or Borland Pascal program, just enter the IDE, build the entire program and select Compile | Information. BPW then gives information about the Code size, Data size, Stack size and Local heap size. For each of these issues (code size, data size, heap size, stack size and .EXE size) we will shortly describe what it consists of, what the limits are, and how we can decrease (optimise) the use of it.

Code Size
As I've said in the introduction: a general solution often uses less code than a more specific solution. Since the specific solution will be faster in execution speed compared to the general solution (at least, that's why we need specific solutions), this is our first introduction to the size vs. speed issue.

In theory, the total code size of an application is limited only by the resources of the machine. In practice this means 640K for real-mode DOS applications, and only several megabytes for DPMI and Windows applications. While the program itself may be rather big, each unit, on the other hand, has its own rather limited Code Segment of 64Kbytes.

There are several ways to reduce code size:

  1. Put groups of statements that are executed often in a local procedure or function. We will no longer have the entire group of statements every time, but just a call to the procedure (and the actual contents of the procedure only once of course).
    Note that this will also hold when you replace only one RTL statement that is expanded to several statements (like the WRITELN statement):
     function WriteLine(Const Str: String): Integer;
     begin
       writeln(Str);
       WriteLine := IOResult
     end {WriteLine};
    
    A program that contains 100 calls to WriteLine has a code size of 2880 bytes, and with 200 calls 3680 bytes, which makes an average of exactly 8 bytes per call. The same program with 100 calls to WriteLn (even without calling IOResult) is 4336 bytes long, and with 200 calls even 6624 bytes, which makes an average of almost 23 bytes per call! The reason for this little example is that the Turbo Pascal WRITELN routine is actually expanded by the compiler itself into several different internal calls. This is also the main reason why WRITELN can have a variable number of arguments, while we cannot write such a routine ourselves in Pascal.

  2. Avoid or minimise the use of big InLine macros. As I've said before: InLine statements are now obsolete and should be replaced by BASM code. InLine macros are still very useful and often very fast (since they don't require a call and return), but they increase the code size, as they are expanded every time the InLine macro is called. As you see, the use of InLine macros has the opposite effect of grouping code in procedures as recommended in point 1.
    Note that for very small InLine macros there can actually be a code decrease when using them, as they might be even smaller than the sum of the CALL statements + the routine itself. This will only be the case for very small (maximum about 10 bytes) InLine macros, however.

  3. Avoid empty "begin end." pairs in the unit initialisation, but write only an "end." statement. This will save a few bytes (up to 12 when using the {$F+,S+} compiler directives). This might seem only a 'little' effective, but if you have 20 units with empty initialisations, it will quickly become about a quarter K of unnecessary (but linked and executed!) code.
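    As a sketch, a unit without an initialisation part then simply ends like this (the unit itself is hypothetical):

     unit NoInit;
     interface
       procedure Hello;
     implementation
       procedure Hello;
       begin
         writeln('Hello')
       end {Hello};
     end.  { just "end." - no empty "begin end." initialisation part }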

  4. Avoid coding itself by using lookup tables or conversion types. An example of a conversion type is the following TWord2Bytes (could also be used for TLongInt2Words of course):
     type
       TWord2Bytes = record
                       case Boolean of
                          True: (W: Word);
                         False: (Lo,Hi: Byte)
                       end;
    
     var
       Num,NumMod256: Word;
       Number: TWord2Bytes;
     begin
       Num := 298;
       NumMod256 := TWord2Bytes(Num).Lo; { automagically mod 256 }
       Number.Lo := $CA;
       Number.Hi := $FE; { conversion of 2 bytes to word }
     end.
    
    Using the fact that 2 bytes and one word can take up the same amount of memory, we can have an automagic MOD 256 operator, or convert two Bytes to a Word without a sweat. Of course, this still needs an assignment, but no shifts, multiplies or additions anymore.

    Lookup tables can be used to avoid calculations, like a lookup table for sine values. Personally, I use lookup tables more often for country-specific character or string uppercasing.

     var
       Table: Array[Char] of Char;
    
     function UpCase(Ch: Char): Char;
     begin
       UpCase := Table[Ch]
     end {UpCase};
    
    Note that a lookup table will always need initialisation code to fill it with the appropriate values. This might increase execution time and code size if the actual function (UpCase in this example) is only called a few times.
    Also, while a lookup table will decrease both code size and execution time, it certainly isn't a space enhancement: the lookup table will have to be stored somewhere, which oftentimes increases the data size...
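    Note that the initialisation code for the Table above could, as a sketch, look like this (the country-specific entries are only indicated, as they depend on the character set used):

     var
       Ch: Char;
     begin
       for Ch := #0 to #255 do
         Table[Ch] := Ch;  { identity mapping by default }
       for Ch := 'a' to 'z' do
         Table[Ch] := Chr(Ord(Ch)-32);  { plain ASCII uppercase }
       { country-specific entries can be patched in here }
     end.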

Data Size
While each unit of a Borland Pascal program can have a Code Segment of 64Kbytes, we have only one Data Segment of 64Kbytes. And the RTL already uses a few Kbs of this space. This is the reason why compiler error 49 (Data Segment too large) will be seen more frequently than the compiler error for (unit) Code Segment too large.

    There are several ways to minimise the use of the Data Segment if the limit is almost reached (or broken).

  1. First of all, don't use lookup tables, but calculate the values each time you need them. This uses more code space and execution time, but illustrates the trade-off and dilemma the programmer often has to face when optimising programs. Make sure you know what your bottle-necks are: is it the execution speed, the code size, or perhaps the data size?

  2. Another way to minimise the Data Size is to group declarations in blocks. The Turbo and Borland Pascal compiler is smart enough not to link in any data (variables) that is not used, but only if everything in the same block is unused. A declaration block in this sense stands for:
    • a group of variables declared in the same "var" block
    • a group of typed constants declared in the same "const" block
    • a VMT for one object type
    So, even if you use one variable (like FileDate) from a unit, you don't necessarily get the large variable FileBuffer as well. Not unless it's in the same var block that is. So, put your things that belong together in their own variable and const blocks, to avoid linking them in if you might use only part of it. Of course, this is generally more useful for units than regular programs.
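    As a sketch, the FileDate and FileBuffer example could be declared as follows, so that a program using only FileDate never links in FileBuffer (the sizes are illustrative):

     var { small, frequently used }
       FileDate: LongInt;

     var { large, only linked in when actually used }
       FileBuffer: array[1..16384] of Byte;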

  3. The final real-mode way to minimise your Data Size is to allocate static data (i.e. global variables and typed constants) on the heap and make them dynamic. This will often decrease execution speed as well, as you need to allocate and deallocate the memory for the variables. Also, when referencing or writing to these dynamic variables, you need another layer of indirection (a pointer dereference), which also takes some extra code space and execution time.
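    As a sketch (the buffer type and size are illustrative), a 16 Kbytes static buffer can be moved from the Data Segment to the heap as follows:

     type
       TBuffer = array[1..16384] of Byte;
     var
       Buffer: ^TBuffer;  { only 4 bytes in the Data Segment }
     begin
       New(Buffer);
       Buffer^[1] := 0;   { note the extra pointer dereference }
       Dispose(Buffer)
     end.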

  4. Windows programs can eliminate the Data Size used up by PChar literals by using String Tables. This also enables the translation (internationalisation!) of the application itself.

  5. Generally, a Windows program written in Borland Pascal does not use the local heap. Borland Pascal allocates memory from the global heap and uses a suballocation scheme to fulfil New and GetMem requests (see also the previous chapter). The space specified for the local heap is then used by Windows controls (like edit boxes) only. Windows itself is capable of expanding the local heap if needed.

    SideBar:
    Some time ago, Jeff Woods discovered on the CompuServe BPASCAL forum that the Data Segment of Windows DLLs compiled with Borland Pascal for Windows was allocated as FIXED blocks and only in conventional memory (i.e. below 1 Mbyte). Kurt Barthelmess and Pat Ritchey explained that the problem with DLLs is that they can have ISR stuff embedded within them, so the Data Segment has to be available at all times. Only if this isn't the case can you add two lines of code to make the data segment moveable and in high memory, which will decrease the burden on lower memory:
     GlobalPageUnLock(DSEG);
     GlobalReAlloc(DSEG, 0, GMEM_MODIFY or GMEM_MOVEABLE);
    
    Pat also added that the space reserved for the Local Heap was ALSO reserved as a FIXED block in lower memory, and since Pascal programs for Windows generally do not use the local heap themselves, you should be able to specify a 0 local heap for your Windows applications! For Windows DLLs (that do not even need stack space since they use the caller's stack) you can always specify {$M 0,0}.  

    Heap Size
    If your heap is overflowing, there are generally two main things to do: try to minimise heap use, or try to find a way to expand your heap. I will not go deeper into these solutions at this time (perhaps later if there should be a huge demand).
    The first solution (trying to minimise heap use) involves re-designing the data structure and algorithms used in this particular program, or using swap files or data files to temporarily store data. This often slows down a program considerably!
    If you're looking for ways to expand your heap you can try to use XMS or EMS memory (several public domain or commercial units are available, for example EZXMS.ZIP by Julian Bucknall from TurboPower). Other possibilities include the use of Overlays or porting the application to protected mode (see previous chapters).

    Stack Size
    The stack is often a soft spot of many applications. Generally, it's best to specify a large stack size (like 32 Kbytes) unless you have specific reasons not to do so. The most annoying thing about the stack is that it's often hard to see how much is actually used by a specific "run" of your program (and hence, how much was still free). The following unit can be used to check for the minimum value of the Stack Pointer. Note that the logic depends on the timer interrupt, which is only called 18.2 times a second. Therefore, you won't get much detailed information unless you run the program long enough (or use a way to increase the system clock).

     {$A+,B-,D-,E-,F-,G-,I-,L-,N-,O-,P-,Q-,R-,S-,T-,V-,X-}
     unit stack;
     interface
     const
       MinStack: Word = $FFFF;
       NumProbe: Word = $0000;
    
       procedure StackProbe;
    
     implementation
     uses Dos;
    
       procedure StackProbe; Assembler;
       ASM
             inc   NumProbe { only count REAL probes }
             cmp   SP,MinStack
             jnb   @Exit
             mov   MinStack,SP
      @Exit:
       end {StackProbe};
    
     const
       OurOwnSS: Word = 0;
       SaveExit: pointer = nil;
       OldInt08: pointer = nil;
    
       procedure NewInt08; interrupt; Assembler;
       ASM
             mov   AX,SS
             cmp   OurOwnSS,AX
             jne   @Exit
             inc   NumProbe { only count REAL probes }
             cmp   SP,MinStack
             jnb   @Exit
             mov   MinStack,SP
      @Exit:
             pushF
             call  DWORD PTR [OldInt08] { goto original int $08 handler }
        end {NewInt08};
    
       procedure NewExit; far;
       begin
         ExitProc := SaveExit;
          SetIntVec($08, OldInt08) { restore old int $08 handler }
       end {NewExit};
    
     begin
       ASM
             mov   AX,SS
             mov   OurOwnSS,AX
        end { OurOwnSS := SSeg };
       SaveExit := ExitProc;
       GetIntVec($08, OldInt08);  {store old int in code segment}
       SetIntVec($08, @NewInt08); {set int $08 to our ISR}
       ExitProc := @NewExit
     end.
    
    Stack space is used by calls to routines, by their local variables, and by the value parameters passed to procedures and functions. If you want to preserve stack space, you should take care when using recursive calls. While recursion may be a very elegant way to implement a solution, it is also dangerous, as it eats your stack. Make sure you always know the maximum depth of a recursive call!

    Furthermore, you should try to avoid passing large structures (like Strings) as Value parameters to procedures and functions. This will use a lot of stack space, and will slow down your program as the variable must be copied onto the stack (and removed from the stack again). Since version 7.0, we can pass arguments not only by value (which uses a lot of stack space) and by reference (var parameter) but also by 'constant value', which actually is by reference but means that you get an error if you try to modify the contents.

    Functions returning strings are especially space wasters. For example, suppose you have a

     function UpCaseStr(Str: String): String;
     begin
       UpCaseStr := ...
     end;
    
      If you're implementing it in plain Pascal, you'll need 1024 bytes of data at a minimum:
    • 256 bytes are allocated for "Str", the formal parameter
    • 256 bytes for a local copy of "Str" since it was passed as a value parameter
    • 256 bytes for a local variable of the type String, working storage to build the function result
    • 256 bytes for assigning the result to the function result (as in: "UpCaseStr := Result").
    You can cut this figure by 50% by changing the parameter header to "function UpCaseStr(Const Str: String): String". Provided you don't change "Str", no local copy of the string will be created. An alternative could be to implement the routine in BASM, as BASM routines always pass String arguments by reference only.

    Stack Usage & Protected Mode
    If you compile a DPMI or Windows program with Stack Checking enabled, the system unit maintains a "low water" mark for your stack. Three words at the base of your data segment are used to maintain the stack top, bottom and minimum. You can use this undocumented trick of Windows (and the Borland compiler, as can be seen in the SE.ASM and ERRC.ASM Borland Pascal RTL files) to check the minimum stack value without needing the probe example given earlier:

     unit StackMin;
     {$S+}
     interface
     var
       _StackMin: ^Word;
    
     implementation
     const { from RTl\INC\SE.ASM }
       pStackMin = $0E;
    
     begin
       _StackMin := Ptr(SSEG,pStackMin);
     end.
    

    Borland Pascal Efficiency (5) - (16-bits) Windows
    The subject of this chapter is 'Pascal Performance & 16-bit Windows'. Compared to real-mode programming, Windows introduces a lot of extra parameters to take care of when trying to gain efficiency.
    If we simplify a bit, we can consider Windows to consist of three major parts:

    • a 16-bit DPMI Server
    • using a non-preemptive message/event handler
    • and a graphical user interface (with limited resources)
    If we consider Windows applications to be "users" or "clients" of these three parts (servers), we can identify ways to improve the efficiency of our applications, without disturbing any other applications or Windows itself too much.

    DPMI Services
    The DPMI Server stuff was already handled in the third chapter, so I'll only repeat the two rules from there. As Windows (a protected mode environment) uses real-mode DOS for operating system tasks (file I/O and such), we'll meet processor mode switches a lot. This is the reason why the I/O-operations of a DPMI or Windows program are often much slower than their real-mode counterparts. For every I/O operation, your Windows (protected mode) program must ask (real-mode) DOS to perform the actual I/O operation. This requires a mode-switch by the processor (very costly, especially on 80286 CPUs) *and* copying of the resulting data from real-mode buffers to DPMI buffers.
    The first rule is to minimise mode switches: if we have to switch, do as much as possible before returning. Since DOS file I/O uses real-mode buffers, we can gain efficiency if these buffers are addressable from DPMI as well (so we don't have to copy the data after it's read). The second rule is therefore to use transfer buffers allocated in conventional memory, using GlobalDosAlloc for this purpose.

    Not much difference from a DPMI program, eh? If we consider the file copying, word counting etc. program from chapter 3, we can indeed port it fairly easily to Windows (we have to change the uses clause to include WinTypes and WinProcs instead of WinAPI, and must use the GetTickCount API instead of reading the system clock directly). It takes only 6,9 seconds to copy a 4 Mbytes file and count the words while copying. This doesn't sound too bad, considering the fact that the plain DPMI version (without Windows) only takes 6,7 seconds for the same file.

     Buffer := XGlobalDosAlloc(BufSize); { buffer below 1Mbyte }
     writeln('Buffer size of: ',BufSize,' bytes.');
    
     Ticks := GetTickCount;
     repeat
       BlockRead(infile,Buffer^,BufSize,Size);
       Inc(Words,WordCount(Buffer,Size)); { see chapter 3}
       UpperFast(Buffer,Size); { see chapter 3 }
       BlockWrite(outfile,Buffer^,Size);
     until Size < BufSize;
     Ticks := GetTickCount - Ticks;
    
     GlobalDosFree(LongRec(Buffer).Selector); { free buffer }
    
     close(infile);
     close(outfile);
     writeln('Swart5: ',words,' - ',Ticks:4,' msec.');
    
    We have to remember, however, that we run in a multi-tasking environment, so we must ensure to release the conventional memory as soon as possible (it might not even be there - or not enough), since we are not the only one on the machine anymore!
    Another bad thing is that our program seems to "hog" Windows while executing: the mouse cursor changes into an hourglass and we cannot do anything until the 6,9 seconds are over! While this may seem like no big deal for 6,9 seconds, it is generally considered bad practice and very impolite to do so! An explanation (and solution) must be found in the non-preemptive multitasking behaviour of Windows.

    Non-preemptive Multi-tasking
    Windows 3.1 is a non-preemptive multi-tasking message/event handling system, which means that the application that is currently running must willingly give up the CPU to allow another application to get the CPU and run for a while. This non-preemptive characteristic of Windows 3.1 means that our applications must BEHAVE themselves when running under Windows (otherwise, the performance of other applications will suffer - which ultimately might influence our own performance if everybody else misbehaves as well).
    Way back in chapter 1, we saw a function to Yield to Windows, allowing messages to and from other applications to be processed (and hence other applications to run):

     function Yield: Boolean;
     var
       msg: TMsg;
     begin
       while PeekMessage(msg, 0, 0, 0, PM_REMOVE) do
       begin
         if msg.message = WM_QUIT then
         begin
           PostQuitMessage(msg.wParam);
           Yield := TRUE;
           EXIT
         end
         else
         begin
           TranslateMessage(msg);
           DispatchMessage(msg)
         end
       end;
       Yield := FALSE
     end {Yield};
    
    When calling Yield from listing #2 right after BlockRead and BlockWrite from listing #1, we see that we no longer get the hourglass as a mouse cursor and are still able to do other things while our application is running. The execution time increases from 6,9 to 7,6 seconds (about a 10% increase), but if we really want to play ball, we have to play by the rules!
    Windows95 (and OS/2 for that matter) is a 32-bit preemptive operating system, where we do not have to worry about "conventional memory", mode switches or Yielding. Since there is no BP for OS/2, and Windows95 is still Not There, I'm afraid we'll have to pay attention to the above issues for quite a while. I would love to take a look at an OS/2 Warp implementation of Pascal, though!

    Windows Timings
    Another thing this example illustrates is the need for an accurate execution timing and measuring tool. The GetTickCount API gives us the number of milliseconds since Windows was started. As the result is stored in a LongInt, its value will overflow after about 49 days, although I've never been able to keep Windows up and running that long to verify this.
    From MMSYSTEM.DLL we can use the function timeGetTime (at index 607), which returns a LongInt. This API is accurate to one millisecond, much better than the 55 ms accuracy of GetTickCount (although the SDK seems to disagree).
    These two APIs only show the total amount of time that has elapsed during a certain operation (i.e. from one part of your program to another, without taking into account the time that other applications consumed during this period). Therefore, for truly insightful measuring, we need the TOOLHELP.DLL function TimerCount, which at least tells us the time spent in our own virtual machine only.

     uses
       WinCrt, WinTypes, ToolHelp;
    
     type
       PTimerInfo = ^TTimerInfo;
       TTimerInfo = record
                      dwSize: Longint;
                      dwmsSinceStart: LongInt;
                      dwmsThisVM: LongInt
                    end;
    
     { function TimerCount(lpTimer: PTimerInfo): Bool;
    
       The TimerCount function fills the specified structure with the execution times
       of the current task and VM (virtual machine).
     }
     var
       Timer: TTimerInfo;
     begin
       Timer.dwSize := SizeOf(Timer);
       repeat
         TimerCount(@Timer);
         WriteLn('Time = ',Timer.dwmsSinceStart:8,' ',Timer.dwmsThisVM:8)
       until keyPressed
     end.
    
    If we use the TimerCount information on listing #1 both with and without Yield, we see that the actual time spent in the current task and VM (Timer.dwmsThisVM) is the same: about 6,9 seconds!

    GUI - USER and GDI
    Finally, we come to the Graphical User Interface part, which consists of the USER and GDI parts. The most general rule of thumb you must always remember is that painting takes time! We must always try to paint as little as possible: no redraws of whole screens, and certainly no complex calculations in paint routines.
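    As an illustration of this rule (a generic sketch, not one of the chapter's listings; the procedure name RepaintCell is made up): instead of forcing a repaint of the whole client area, invalidate only the rectangle that actually changed, so the WM_PAINT handler has as few pixels to touch as possible.

```pascal
uses
  WinTypes, WinProcs;

{ Sketch: repaint only the changed cell, not the whole window. }
procedure RepaintCell(Wnd: HWnd; const Cell: TRect);
begin
  InvalidateRect(Wnd, @Cell, False); { False: don't erase background, less flicker }
  UpdateWindow(Wnd)                  { deliver WM_PAINT for just this rectangle }
end {RepaintCell};
```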

    Resources
    The previous chapter contains several tips & tricks on how to minimise code and data size of applications. One thing we always have to take care of is Windows GDI and USER resources; things like Fonts, Brushes, Window handles, etc. If we don't clean up and delete these resources after we've used them, they will not be deleted by Windows, and could ultimately result in a system crash or "Not enough Memory" message. Note that this message may come fairly easily if you don't free up resources at all!
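    The safe pattern for GDI objects (again a generic sketch, not from the chapter's listings) is: create, select while remembering the old object, then deselect before deleting, so nothing is left selected in the DC when we free it:

```pascal
uses
  WinTypes, WinProcs;

{ Sketch: create, use and properly delete a GDI pen. }
procedure DrawRedLine(DC: HDC);
var
  Pen, OldPen: HPen;
begin
  Pen := CreatePen(ps_Solid, 1, $0000FF); { $0000FF = red in COLORREF format }
  OldPen := SelectObject(DC, Pen);        { remember what was selected before }
  MoveTo(DC, 0, 0);
  LineTo(DC, 100, 100);
  SelectObject(DC, OldPen);               { deselect our pen before deleting! }
  DeleteObject(Pen)                       { otherwise the pen leaks GDI space }
end {DrawRedLine};
```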

    One thing that does get freed automagically is the "normal memory" that is not allocated with GMEM_SHARE. Normal GetMem and New style allocations (but not resources like the above) do not have to be freed. This might actually save your application a little bit of time, as a complex datastructure doesn't have to be walked to free the allocated space. Windows will automatically free this space itself, sometime after the application has ended. Note that sometime is almost always immediately, but we have no guarantee for this. Windows might wait until it feels like it.

     uses
       WinCrt;
    
     const
       MaxP = 1024; { allocate about 32M, until 2Mb is left (else GPF)... }
    
     var
       p: Array[1..MaxP] of PChar;
       size,i: Word;
    
       function min(x,y: LongInt): Word;
       begin
         if x < y then min := x
                  else min := y
       end {min};
    
     var
       StartMem,StartMax: LongInt;
    
     begin
       i := 0;
       StartMem := MemAvail;
       StartMax := MaxAvail;
    
       repeat
         Inc(i);
         write(i:5,' ':2,MemAvail:10,' ':2,MaxAvail:10);
         size := min(32768,MaxAvail);
         writeln('---> ',size);
         GetMem(p[i],size);
       until (MaxAvail <= LongInt(2048) * 1024) or (i >= MaxP);
    
       writeln;
       writeln('Start: ',StartMem:10,' ':4,StartMax:10);
       writeln(' Stop: ',MemAvail:10,' ':4,MaxAvail:10);
       readln;
     end.
    
    The program above illustrates this by allocating all available memory but the last 2 Mbytes (I found the system to be unstable and crashing if I allocated these last few bytes too), without explicitly freeing it at the end. However, if we run the program again, or check the Windows | Help information, we note that the memory is immediately freed by Windows itself! A nice feature which just one day might come in handy if you don't have the time to walk your own structure.

    WinG
    Writers of high-speed graphics applications for DOS have always been in need of the highest possible performance. When running under DOS, this is possible by writing directly to the video frame buffer and using page flipping for fast screen updates. Under Windows, however, the programmer is shielded from this part of the video architecture and cannot use these techniques, but has to resort to the Windows GDI APIs. This is the main reason why regular graphics under Windows is so notoriously slow!
    Microsoft and Intel created DCI, a device driver interface that allows display graphics vendors to let Windows take advantage of the video hardware. DCI was developed for use by video device drivers only, however, and not for application programs.
    WinG is the application interface (API) that programmers should use to gain the performance that DCI offers. WinG's philosophy is to provide a facility that allows a programmer to create a fast DC which has a Device Independent Bitmap (DIB) as its drawing surface. This allows the programmer to write specialised, fast assembler routines to manipulate images directly without using the slow GDI.
    WinG uses the best possible technique for a particular platform to perform screen updates from the DIB by running special profiling code when a WinG application first starts. This profiling code uses the MMSYSTEM timer drivers, by the way, to test the various screen update techniques, and selects the one that performs best for all screen update operations. If WinG "knows" about your video hardware, it will obtain a pointer from DVA.386 to its graphics video memory (usually at $A000) and will use this pointer to write to. So, in a typical WinG application, you would create a single WinG DC and use it as your offscreen buffer to paint on.
    WinG version 1.0 provides fast DIB-to-screen blts under Windows 3.1, Windows for Workgroups 3.11, Windows 95 and Windows NT version 3.5. WinG will not run under Windows NT version 3.1 or on earlier versions of Windows. The free WinG.DLL can be found in the WINMM forum, library 12, file WINGFL.ZIP or from Microsoft's site.
    More information on actually using WinG can be found in an article by Mike Scott in a back issue of The Pascal Magazine. Mike also made a great, ready-to-use Pascal import unit for WinG.DLL!


    Borland Pascal Efficiency (6) - (16-bits) DLLs
    So far, we've explored how to profile code, we've seen DOS real-mode & DPMI efficiency techniques, focused on optimisations for speed vs. size and - in the previous chapter - entered the world of Windows efficiency. I'm planning to change the contents of the remaining chapters a little bit. From now on, each chapter will focus on one aspect of Pascal Efficiency - this time on DLLs - and will also include a short review of a tool that I personally find very helpful with regard to that specific aspect.

    Introduction DLLs
    A DLL is a library of routines and resources that is linked into your application at run-time instead of compile time. The run-time linking allows a DLL to be shared by multiple applications, thus saving on memory and resource usage and load-time (of the DLL).

    DPMI/Windows?
    If you compile a DLL for a Windows target, then it can be used by a DPMI application as well. Not the other way around, so effectively there is no reason why you should ever want a DLL to be compiled for a DPMI target (there is no substantial gain in doing so). While I develop applications for both Windows and DPMI, my DLLs are always pure Windows targets, and can be re-used by both DPMI and Windows applications. The only thing I have to avoid are calls to USER's User Interface or GDI's Graphics APIs, which leaves enough 'engine' DLLs for me to work on.

    Init & Exit
    Each DLL has its own Init (loading) and Exit (unloading) code. Note that while a DLL can be used by many applications at the same time, the init & exit are only executed once (and not for each application). In Pascal, we can express this as follows:

     {$A+,B-,D-,F-,G+,I-,K-,L-,N-,P-,Q-,R-,S+,T-,V-,W-,X+,Y-}
     Library DLL;
    
       function Func: Integer; export;
       begin
         Func := 6;
       end {Func};
    
     exports
       Func index 1;
    
     var
       SaveExitProc: pointer;
    
       procedure NewExitProc; far;
       begin
         ExitProc := SaveExitProc
       end {NewExitProc};
    
     begin
       SaveExitProc := ExitProc;
       ExitProc := @NewExitProc
     end.
    
    As a DLL is shared by a lot of applications, you can even enhance performance by placing shared code or resources in a DLL. Since the DLL won't have to be loaded by the second application, this saves on start-up time. Of course, loading each DLL, especially if we've got more than one DLL, takes more time than loading only one stand-alone EXE file would.

    Loading & Unloading
    Loading a DLL can be done in two ways: automatically with an import unit, or by hand with LoadLibrary. The first way is very easy, but can introduce a few problems when the DLL is not available, such as an error messagebox from Windows saying that the DLL wasn't found. An example import unit for the little DLL given above might be as follows:

     {$A+,B-,D-,F-,G+,I-,K-,L-,N-,P-,Q-,R-,S+,T-,V-,W-,X+,Y-}
     unit Static;
     interface
     function Func: integer; far;
    
     implementation
     function Func; external 'DLL' index 1;
     end.
    
    If we link this import unit to our main application, then the function Func will be automatically dynamically linked with our application; we won't have to do anything more. However, when we run our main application and the DLL is not present in the current directory, the Windows directory or the Windows\System directory, then Windows will complain about this with an error message. It is possible to check for the presence of a DLL before attempting to use it. You'll have to change your interface unit, turning all of the procedure definitions into procedural variable declarations. By eliminating all of the "external 'DLLNAME' index IndexNum;" clauses you eliminate the implicit reference to (and the automatic attempt to load) the DLL. To make "DLL" (from listing 2) a DLL that can be conditionally loaded, the interface unit might look something like:
     {$A+,B-,D-,F-,G+,I-,K-,L-,N-,P-,Q-,R-,S+,T-,V-,W-,X+,Y-}
     unit Dynamic;
     interface
     uses
       Wintypes, Winprocs, Win31;
    
     const
       DLLPresent: Boolean = False;
    
     var
       Func: function: Integer;
    
     implementation
    
     var
       SaveExitProc: Pointer;
       DLLHandle: THandle;
    
       procedure NewExitProc; far;
       begin
         if DLLHandle >= 32 then FreeLibrary(DLLHandle);
         ExitProc := SaveExitProc
       end {NewExitProc};
    
     begin
       SetErrorMode(SEM_NOOPENFILEERRORBOX);
     { this will avoid the Windows ErrorBox }
       DLLHandle := LoadLibrary('DLL.DLL');
       if DLLHandle >= 32 then
       begin
         SaveExitProc := ExitProc;
         ExitProc := @NewExitProc;
          DLLPresent := True; { you can use Func now }
         @Func := GetProcAddress(DLLHandle,'FUNC');
       end
     end.
    
    This second interface unit causes DLL.DLL to be conditionally loaded. Note that we use the SetErrorMode Windows API to stop Windows from displaying the error box when the DLL could not be found. Hence, we have to check the return value of LoadLibrary ourselves. If it's bigger than 31, it's a valid handle to the DLL we've just loaded. We can then use GetProcAddress to actually "dynamically link" the functions we want.
    The variable DLLPresent can be tested in the program that uses this unit and call "Func" only when DLLPresent is true:
     {$A+,B-,D-,F-,G+,I-,K-,L-,N-,P-,Q-,R-,S+,T-,V-,W-,X+,Y-}
     {$M 1024,0}
     uses
       Dynamic, WinCrt;
    
     begin
       if Dynamic.DLLPresent then
         writeln(Dynamic.Func);
     end.
    
    Unloading the DLL can also be done in two ways: either automatically or by hand with FreeLibrary. The latter option even enables us to write a short program to unload 'dangling' DLLs from the system.
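    Such an unloader could be as simple as the following sketch (the module name 'DLL' matches the library name from listing 2; GetModuleHandle returns a non-zero handle only if the DLL is still loaded):

```pascal
uses
  WinCrt, WinTypes, WinProcs;

var
  H: THandle;
begin
  H := GetModuleHandle('DLL');  { the example DLL from listing 2 }
  if H <> 0 then
  begin
    FreeLibrary(H);             { decrease the usage count by one }
    writeln('Freed one reference to DLL.DLL')
  end
  else writeln('DLL.DLL is not loaded')
end.
```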

    Using a DLL
    There's no practical limit on the number of functions, but there is a limit on the size of your code segment - 64KB. Eventually you may need to move some of your functions into units which are then directly linked to the DLL. That would allow you to have multiple code segments. The data segment of your DLL is shared by all calling applications, and is also limited to 64K, minus the local heap in this case. A DLL has no stack of its own, but uses the stack of the calling application. Therefore, your DLL should be conservative about stack space; allocating 8K or so of local variables on the caller's stack would be a bit "rude", and can lead to a GPF real fast. Finally, you might want to arrange the functions into logical groups and (using the Unit idea above) mark the code segments as "loadoncall and discardable". If they are marked as "preload", Windows is going to try to drag them all into memory at startup, which could take a long time (more on this later).
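    In Borland Pascal 7 the code segment attributes can be set per unit with the {$C} directive. A sketch of such a unit (the unit and function names are made up) whose code segment is only dragged into memory when one of its functions is actually called:

```pascal
{$C MOVEABLE DEMANDLOAD DISCARDABLE} { "loadoncall and discardable" code segment }
unit Group2; { one logical group of DLL functions }
interface
function Rarely: Integer;
implementation
function Rarely: Integer;
begin
  Rarely := 42 { placeholder for the real work }
end {Rarely};
end.
```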

    Profiling a DLL
    We can profile a DLL with Turbo Profiler for Windows. When you load a program that uses DLLs in TPROFW, TPROFW checks for symbol tables (just as for the main .EXE) and automatically loads the symbol table and source of every linked DLL. If your DLL is loaded automatically, you can set profile areas right from the start by picking the DLL as Module to view. If the DLL is loaded by hand with LoadLibrary, then you must set a Stop area marker right after the LoadLibrary call, in order to get to the DLL Module to set the profile areas. This kludge is actually another reason why I generally prefer import libraries that automatically load DLLs over using LoadLibrary. Note that you'll get a separate .TFS and .TFA file for each DLL that is loaded by your program (since each DLL also has its own .MAP file and symbol table).
    Tip: If you've installed TDDEBUG.386 (in SYSTEM.INI as device=) then you're able to interrupt a profiling (or debugging) session with the key-combination Ctrl+Alt+SysReq. This has the same effect as Ctrl+Break in Turbo Profiler for DOS.

    Another way to profile the functions in a DLL is to use timer code from MMSYSTEM or TOOLHELP (see previous chapter), and log them on disk. With special {$IFDEF DEBUG} compiler directives, you can easily switch between debug and production versions of your DLL. A typical DLL with one BigFunction might look like this:

     {$A+,B-,D-,F-,G+,I-,K-,L-,N-,P-,Q-,R-,S+,T-,V-,W-,X+,Y-}
     {$M 1024,0}
     Library Profile;
     {$DEFINE DEBUG}
     {$IFDEF DEBUG}
     const
       LogFile = 'PROFILE.LOG';
       MakeLog: Boolean = TRUE;
    
     var
       Log: Text;
    
       function timeGetTime: LongInt; far; external 'MMSYSTEM' index 607;
     {$ENDIF}
    
       function BigFunction: Integer; export;
       {$IFDEF DEBUG}
       var
         Start,Stop: LongInt;
       {$ENDIF DEBUG}
       begin
         {$IFDEF DEBUG}
         Start := timeGetTime;
         {$ENDIF DEBUG}
         BigFunction := 0; { or some other heavy stuff  }
         {$IFDEF DEBUG}
         Stop := timeGetTime;
         if MakeLog then writeln(Log,'BigFunction: ',Stop-Start);
         {$ENDIF DEBUG}
       end {BigFunction};
    
     exports
       BigFunction index 1;
    
     var
       SaveExitProc: pointer;
    
       procedure NewExitProc; far;
       begin
          ExitProc := SaveExitProc; { restore saved exitproc }
         {$IFDEF DEBUG}
         if MakeLog then close(Log)
         {$ENDIF}
       end {NewExitProc};
    
     begin
       SaveExitProc := ExitProc;
       {$IFDEF DEBUG}
       Assign(Log,LogFile);
       rewrite(Log);
       {$ENDIF}
       if (IOResult = 0) then ExitProc := @NewExitProc
                         else MakeLog := False;
     end.
    
    Using the framework from listing 5, you can embed profiling code in your own DLLs, and while you're at it, you can even write debugging statements as well. I found this simple technique very useful for both debugging and profiling parts of DLLs.
    It appears that you can get in trouble when you call the WinAPI SetHandleCount from within an application that uses the DLL above for profiling. SetHandleCount, a handy API to extend the number of files you can open from a Windows application, appears to 'clear' the already existing file handles, including our 'Log', which may lead to serious trouble in the example code above...

    Loadtime Efficiency
    Having split your Windows application into several DLLs to support your main application obviously introduces another problem: loadtime. Each DLL will have to be loaded, whether dynamically or automatically, and this takes time. You cannot expect each DLL to be already loaded by another application. The loadtime of a DLL depends on two factors: the number of functions inside the DLL that have to be 'imported' and assigned to your import library functions, and the total code size of the DLL. Optimising the individual functions for speed and size is something we've done before. Optimising the total DLL codesize seems like something we cannot do much about, or can we?

    Enter EXECHAIN
    In order to decrease the file size of a Windows executable or DLL, I've experimented with a nice program called EXECHAIN, versions 2.1 and 2.2, written by Frank E. Haggar. The main purpose of EXECHAIN is to shrink the file size of Windows 3.x New Executable (NE) programs (such as EXE, DLL, DRV, SCR, CPL, and others), in order to speed up program loading and startup. I've found a typical result of about 5% to 15% reduction in the file size of an NE program built by Borland Pascal 7.0 or BC++ 4.5:
    EXECHAIN reads in New Executable (NE) files and processes the relocation records (fixups) looking for duplicates or chains. The program can list a summary of this information on a segment-by-segment basis. It will also produce overall totals for all relocation records, chained records, and duplicate records (see figure 1).
    EXECHAIN is based on the information about relocation record chaining which appeared in the July 1993 issue of Microsoft Systems Journal. The title of the article is: "Liposuction Your Corpulent Executables and Remove Excess Fat", by Matt Pietrek (Volume 7, 1993). In his discussion of reducing the size of executables, Matt describes the Windows loader's support for chaining relocation records together by utilizing the target address of the head record as a link to another offset requiring the identical fixup value. The chain is terminated by the occurrence of 0xFFFF in the segment data at the fixup's target offset. Frank Haggar wrote EXECHAIN based on this concept, since not all linkers utilize this feature of the NE file format when creating executables. It can produce detailed lists on both the "before" and "after" files, and reports the overall savings in file size.
    Simply provide the name of an NE file on the command line, or use the "Open" menu command to load any NE-formatted executable file. EXECHAIN will report an error if other types of files are loaded. Another easy way to select a file is by using the new "Directory List" command, which scans a directory and lists only NE files. This list also shows the savings per file that EXECHAIN can provide! I've actually saved many megabytes of disk space (and probably quite some loadtime) by applying EXECHAIN to my Windows directory-structure!


    Borland Pascal Efficiency (7) - Some Assembly Required
    In the past few chapters we've seen Pascal Efficiency in many ways: finding bottle-necks, changing algorithms and datastructures and working with special language constructs. This time, we'll descend to the lowest level of program optimisation: assembly.

    If all else fails, we can only buy a bigger machine or make sure we get more out of the existing machine. And the only way to get more out of the existing machine is to get down to the bare metal.
    Borland Pascal offers four ways to include assembly code with our programs: we have BASM (Built-in ASseMbler code), InLine code, InLine macros and external Assembly. Except for InLine code, which is redundant since BASM was introduced with Turbo Pascal 6.0, each has its own strengths and special uses.

    BASM
    Between the keywords "asm" and "end" we can embed assembly statements. In fact, Borland Pascal offers a great deal of support for this kind of assembly, as we do not have to worry about the stack or setting up our own stack frame. This is all done for us. Also, we can refer to all local variables by name. All we need to take care of is not to trash the cs, ss and bp registers, and to save & restore the ds register at all times. Other than that, it's plain easy and very powerful!

    InLine Code
    Before Turbo Pascal 6.0, there was no BASM, and we had to use External Assembly or InLine code. InLine code consists of hexadecimal codes, which are the 'compiled' equivalent (opcodes and operands) of the assembly statements. Using DEBUG we had to get the hex version of the assembly statements. Luckily, these days are over now.

    InLine Macros
    Like InLine Code, InLine Macros consist of hex statements. Unlike InLine Code, however, InLine Macros still have a reason to exist. InLine Macros are not routines, and have no real call/return statements; they are macros in the sense that their definition is substituted for each pseudo-call. Since we have no call/return overhead, InLine macros can offer a big speed improvement compared to regular assembly code.
    There's only one downside of InLine macros: they can bloat your code size as they are expanded for every pseudo-call! This means that the recommendation is simple: don't make them big and/or don't call them often!

    External Assembly
    Using external assembly, we write an .ASM file that compiles to an .OBJ file using Turbo Assembler (a separate tool). With the {$L FILE.OBJ} directive we link the external .OBJ file with our Pascal program. We could even link other external .OBJ files in this way, for example from C or FORTRAN, as long as they conform to certain rules (like no calls to the C runtime library).
    The latter is almost never necessary, and requires the external Turbo Assembler program. InLine code is replaced by BASM (which is much easier to read and write than plain hex-codes anyway), which leaves BASM and InLine macros to study further.

    BASM Example
    Borland Pascal's built-in assembler BASM allows us to write 8086 or 80286 code (depending on the state of the $G compiler option). No 32-bit code is possible at this time (32-bit BASM will probably be introduced with the 32-bit version of Delphi). We can write simple assembler statements between "asm" and "end", or truly make use of the power of assembly by writing assembler routines using the "assembler" keyword. Straight assembler code will compile directly into machine code, while a single Pascal statement may often compile to several assembler statements, generating bigger and slower machine code. That's why - especially in tight loops - assembly can make a difference!
    The best way to learn BASM is to start small, with easy little examples. Alternatively, you could compile little pieces of Pascal code and compare the generated dis-assembly code in the Turbo Debugger (or Profiler), or generate a mixed Pascal/Assembly source code listing with DUMPPROG 2.10; a free and very handy tool written by Duncan Murdoch, with help from William Peavy and Jeroen Pluimers.
    I suggest you also read the Borland Pascal Language Guide (chapter 24) for more background information. A lot of BASM examples can be found in the Borland RTL code. Take for example the Turbo Vision APP unit in the APP.PAS file, where a BASM implementation of ISQR (integer square root) can be found:

     function ISqr(X: Integer): Integer; assembler;
     asm
          MOV   CX,X   { CX := X;          }
          MOV   BX,0   { BX := 0;          }
                       { repeat            }
     @@1: INC   BX     {   BX := BX + 1;   }
          MOV   AX,BX  {   AX := BX;       }
          IMUL  AX     {   AX := AX * AX;  }
          CMP   AX,CX
          JLE   @@1    { until AX > CX;    }
          MOV   AX,BX  { AX := BX;         }
          DEC   AX     { Result := AX - 1; }
     end;
    
    As you can see from the comments (written by me; the Borland Pascal RTL seldom includes comments), the algorithm is very straightforward: you just start with 1 and keep adding 1 until the square is bigger than the number you need the integer square root of. The answer is just one before the current number.

    BASM Strings
    Another example where BASM can give you a true performance boost is String procedures and functions. Compare the following two String functions that return a string of a specific length, filled with a single character. The functionality is the same, and so is the algorithm: first set the length Byte, then fill the string with the required character in a little loop. The only difference is that we use a high-level for-loop in the Pascal version, while we can use a special REP STOSB loop in the BASM version.

     function MakeString1(Ch: Char; Len: Word): String;
     var
       DI: Word;
     begin
       MakeString1[0] := Chr(Len);
       for DI:=1 to Len do MakeString1[DI] := Ch
      end {MakeString1};
    
     function MakeString2(Ch: Char; Len: Word): String; Assembler;
     asm
          mov   AL,Ch       { AL := Char   }
          mov   CX,Len      { CX := Length }
          les   DI,@Result  { point to str }
          cld               { move forward }
          mov   ES:[DI],CL  { store length }
          inc   DI          { first char   }
          repz  STOSB       { fill string  }
     end;
    
    The BASM version is more than twice as fast (993 versus 2251 reps; see makestri.out, or run makestri.exe).

    InLine Examples
    As I've explained earlier, InLine macros differ from BASM code in two ways. First of all, they're macros, and not code that is called (as in the case of a procedure). Second, and often a bigger problem: they're written as hex codes! Using DUMPPROG, however, we get a mixed Pascal/Assembly listing with the hex codes next to the assembly statements. Now, all we need to remember is that the arguments are PUSHed on the stack, so we have to clear the stack ourselves. Let's look at the first few examples with only one simple argument.
    The unit isCmacro contains several C-like character handling InLine macros, such as: IsAlNum, IsAlpha, IsAscii, IsCntrl, IsDigit, IsGraph, IsLower, IsPrint, IsPunct, IsSpace, IsUpper and IsXDigit. C programmers have them available in the runtime library; now you have them in a fast InLine macro unit as well! IsAlpha (19 bytes) returns True if the argument given is a letter of the alphabet (A..Z, a..z), IsAscii (only 7 bytes) returns True if the argument given is a lower ASCII character (ord < 128):

     function IsAlpha(C: Char): Boolean;
     InLine(
       $58/          {      pop   AX        }
       $3C/Ord('A')/ {      cmp   AL,'A'    }
       $72/$0C/      {      jb    @17       }
       $3C/Ord('Z')/ {      cmp   AL,'Z'    }
       $76/$0A/      {      jbe   @19       }
       $3C/Ord('a')/ {      cmp   AL,'a'    }
       $72/$04/      {      jb    @17       }
       $3C/Ord('z')/ {      cmp   AL,'z'    }
       $76/$02/      {      jbe   @19       }
        $30/$C0       { @17: xor   AL,AL    } { AL := False }
       );            { @19: ELSE  True      }
    
     function IsAscii(C: Char): Boolean;
     InLine(
       $5B/          {      POP     BX      }
       $31/$C0/      {      XOR     AX,AX   }
       $D0/$E3/      {      SHL     BL,1    }
       $1C/$FF);     {      SBB     AL,$FF  }
    
    As you can see, the first thing we needed to do was to POP the argument from the stack (so we won't leave the stack in a bad state when we return).

    PutPixel
    Another example that really shows the speed difference of an InLine macro and plain Pascal is a little piece of code that may be of use in a mode $13 graphics application: PutPixel. A general plain Pascal way to "put" a pixel of a certain color on a 320x200x256 mode $13 screen is as follows:

     procedure PutPixel(x,y: Integer; c: Byte);
     begin
       Mem[$A000:Word(320*y)+x] := c;
     { is this as fast as we can do? }
     end;
    
    Again, we can speed this up by several factors if we use a small InLine macro PutPixel18 (only 18 bytes):
     procedure PutPixel18(c: Byte; x,y: Integer);
     Inline(
       $B8/$00/$A0/      {  mov   AX,$A000   }
       $8E/$C0/          {  mov   ES,AX      }
       $5B/              {  pop   BX         }
       $88/$DC/          {  mov   AH,BL      }
       $5F/              {  pop   DI         }
       $01/$C7/          {  add   DI,AX      }
      {$IFOPT G+}
       $C1/$E8/$02/      {  shr   AX,2       }
      {$ELSE}
       $D1/$E8/          {  shr   AX,1       }
       $D1/$E8/          {  shr   AX,1       }
      {$ENDIF}
       $01/$C7/          {  add   DI,AX      }
       $58/              {  pop   AX         }
       $AA);             {  stosb            }
    
    Of course, if you want to put several pixels on a line for example, you shouldn't put them one by one on the screen, but you should use a loop similar to the BASM String function we saw earlier. This example just shows that an InLine macro can even beat tight Pascal very easily.

    Set Macros
    Turbo Pascal has a built-in datatype Set, which is often a very elegant way to express or solve problems. However, Turbo Pascal Sets are limited to 256 elements, and the implementation of sets up to Borland Pascal version 7.0 was not really fast. In fact, sets have always been so slow in the past that I often told my fellow programmers to avoid using them. In order to understand the limits of the Turbo Pascal Sets, we must take a look at the internal datastructure of a set. A set is just a row of bytes, in which each element of the set occupies one bit:

       Byte 0             Byte 1                 Byte 31
      [_|_|_|_|_|_|_|_]  [_|_|_|_|_|_|_|_]  ... [_|_|_|_|_|_|_|_]
       0 1 2 3 4 5 6 7    0 1 2 3 4 5 6 7   ...  0 1 2 3 4 5 6 7
    
    The bit is set if the element is 'in' the set, and the bit is zero if the element is not in the set. So, in order to see whether a particular element is in a set, we must check the bit at position Ord(element). In pure Turbo Pascal code, this looks as follows:
     function InSet(var _Set; OrdL:Byte):Boolean;
     var
       SetAr: Array[0..31] of Byte absolute _Set;
       ByteNr, BitNr: Byte;
     begin
       ByteNr := OrdL div 8;
       BitNr := OrdL mod 8;
       { Now the Byte and Bit numbers are known }
       if ((SetAr[ByteNr] SHR BitNr) AND $01)=$01
       then InSet := True  { bit is set }
       else InSet := False { bit is zero }
     end {InSet};
    
    If we examine the technique used for InSet, we can see why a set is limited to 256 elements. Since the indexing of sets is done by Bytes, we are limited to indexes 0 through 255. If indexing were done by Words instead of Bytes, sets could have as many as 65536 elements, and each set could be as big as 8192 bytes. The set-operations themselves would not suffer much from this increase in size, because the general algorithm (dividing by 8 to get the byte-number, while the remainder is the bit-number) remains the same. Copying or comparing these Big Sets would be real performance killers, but that's an optimising story for another issue of The Pascal Magazine.
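    To make the Big Set idea concrete, here is a hedged sketch (my illustration, not part of any actual unit) of what a Word-indexed membership test could look like in plain Pascal:

     ```pascal
      function InBigSet(var _Set; OrdL: Word): Boolean;
      { sketch: the same divide-by-8 algorithm as InSet above, but with
        a Word index, allowing up to 65536 elements in 8192 bytes }
      var
        SetAr: Array[0..8191] of Byte absolute _Set;
      begin
        InBigSet := ((SetAr[OrdL div 8] shr (OrdL mod 8)) and $01) = $01
      end {InBigSet};
     ```

    Note that only the array bound and the index type change; the byte-number/bit-number calculation is identical to the 256-element case.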
    The Pascal code of function InSet can be optimised by implementing the same functionality as an InLine macro:
     function InSet(var _Set; OrdL: Byte): Boolean;
     InLine(
       $58/         { pop   AX                  }
       $31/$DB/     { xor   BX,BX               }
       $88/$C3/     { mov   BL,AL               }
     {$IFOPT G+}
       $C1/$EB/$03/ { shr   BX,$03              }
     {$ELSE}
       $B1/$03/     { mov   CL,3                }
       $D3/$EB/     { shr   BX,CL               }
     {$ENDIF}
       $24/$07/     { and   AL,$07              }
       $B1/$01/     { mov   CL,1                }
       $91/         { xchg  AX,CX               }
       $D2/$E0/     { shl   AL,CL               }
       $5F/         { pop   DI                  }
       $07/         { pop   ES                  }
       $26/         { ES:                       }
       $22/$01      { and   AL,BYTE PTR [DI+BX] }
     {$IFNDEF TRUEOPT}
      /$D2/$E8      { shr   AL,CL               }
     {$ENDIF}
     ) {InSet};
    
    Note that we can only use the "SHR BX,$03", to divide BX by 8 and get the byte-number, when compiling with {$G+}. The routine InSet is 22 bytes long when compiled for 80286+ computers, and one byte longer when compiled for generic 8088 computers and up.
    If we use an undocumented feature of Turbo Pascal, namely the fact that FALSE is equal to 0, and TRUE is equal to anything but 0, we can even lose the last two bytes. This option is enabled with my personal TRUEOPT compiler directive in unit FASTSET.PAS.
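    With TRUEOPT defined, the value left in AL is the masked bit itself, which can be any non-zero value. A hedged usage sketch (my example, not from the original unit) shows the consequence: test the function result directly, and never compare it to the constant True:

     ```pascal
      var
        S: Set of Char;
        Ch: Char;
      begin
        S := ['A'..'Z'];
        Ch := 'Q';
        if InSet(S, Ord(Ch)) then  { fine: any non-zero AL is "true" }
          WriteLn(Ch, ' is in the set');
        { but do NOT write "if InSet(S, Ord(Ch)) = True then": AL may
          hold e.g. $02, which the CPU treats as true, yet $02 is not
          equal to Ord(True) = 1 }
      end.
     ```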

    Pentium Optimisations
    Dr. Bob got himself a Pentium, so let's look at a little Pentium-specific optimisation! Pentium optimisation often involves instruction breakdown and reordering. I've done the latter, in order to enable the Pentium architecture to execute some instructions at the same time. To achieve this, I had to make sure that two 'parallel' instructions were both simple and did not interfere with each other (i.e. did not access the same register or memory address). Obviously, the Pentium can execute 80286 code, so I took the {$G+} version and rearranged the instruction sequence until I came up with the following:

     function InSetP(var _Set; OrdL: Byte): Boolean;
     InLine(
       $58/         { pop   AX                  1}
       $31/$DB/     { xor   BX,BX               1}
       $5F/         { pop   DI                  2}
       $88/$C3/     { mov   BL,AL               2}
       $C1/$EB/$03/ { shr   BX,$03              3}
       $24/$07/     { and   AL,$07              3}
       $B1/$01/     { mov   CL,1                4}
       $07/         { pop   ES                  4}
       $91/         { xchg  AX,CX               5}
       $D2/$E0/     { shl   AL,CL               6}
       $26/         { ES:                       7}
       $22/$01      { and   AL,BYTE PTR [DI+BX] 7}
     {$IFNDEF TRUEOPT}
      /$D2/$E8      { shr   AL,CL               8}
     {$ENDIF}
     ) {InSetP};
    
    Rearranging the instructions this way has no effect on the performance of InSetP compared to InSet on 80286 - 80486 machines. Only on a Pentium can we expect an increase in efficiency (as we'll see shortly).

    Comparisons
    After making these routines, we have to compare them to the set-routines from Turbo Pascal. I used Borland Pascal 7.01 to compile and compare the routines, counting the number of executions per clock-tick on an 80486 DX2 66MHz machine and a 100MHz Pentium. Timings were done using the so-called Reps Timer technique, where we count the number of executions of a certain piece of code in a certain time frame (typically one or two clock ticks). The results are exactly as I want them (but you should test it for yourself, of course):

              80486   Pentium
     IN       47513   137220
     InSet    72224   165397 (121%)
     InSetP   72224   171525 (125%)

    We see that on an 80486 (and generally on any non-Pentium machine) the Pentium version of the InLine macro performs no differently from its non-Pentium counterpart. On a Pentium, however, the rearranged macro gives an additional 4% increase in performance. So by simply rearranging a few instructions I've managed to squeeze an additional 4% of speed out of a Pentium!!
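    The Reps Timer technique itself can be sketched in plain Pascal as follows. This is a simplified illustration of the idea, not the exact timing framework I used: count the executions of the code under test during exactly one BIOS clock tick (about 55 milliseconds):

     ```pascal
      function Reps: LongInt;
      { sketch: count executions of the code under test during one
        BIOS clock tick (roughly 18.2 ticks per second) }
      var
        Tick: LongInt absolute $40:$6C; { BIOS timer tick counter }
        Start, Count: LongInt;
      begin
        Count := 0;
        Start := Tick;
        while Tick = Start do ; { align on a fresh tick boundary }
        Start := Tick;
        while Tick = Start do
        begin
          { ...place the code under test here... }
          Inc(Count)
        end;
        Reps := Count
      end {Reps};
     ```

    A higher Reps value means a faster routine, which makes the numbers in the table above directly comparable.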

    InLine Code Size
    InLine macros are faster than BASM, because they eliminate the (far or near) call. However, InLine macros can expand the code size of your application, because they are expanded each time they are called. Let's see if this is true for our relatively small InLine macro InSet. Using DUMPPROG again, we can see how much code the regular Borland IN operator takes:

     SET.PAS#1:  var S: Set of Char;
     SET.PAS#2:      t: Char;
     SET.PAS#3:  begin
     SET.PAS#4:    if t in S
       0000:000F A07000      MOV   AL,[0070]
       0000:0012 B401        MOV   AH,+01
       0000:0014 BA2000      MOV   DX,0020
       0000:0017 9A4E050600  CALL  FAR 0006:054E
       0000:001C 8BFA        MOV   DI,DX
       0000:001E 84855000    TEST  [DI+0050],AL
       0000:0022 7415        JZ    0039
     SET.PAS#7:  end.
    
    Counting bytes, we see that the Borland version of IN takes 19 bytes. However, these 19 bytes include a far call to the RunTime Library, which accounts for the time difference between this routine and the InSet macro. The macro InSet is 4 bytes longer (3 bytes if compiled with $G+). It is my belief, however, that this increase in speed outweighs the increase in code size, especially in applications where each (micro-)second counts.

    SpeedPack II
    We've seen that BASM code or InLine macros can really make a difference in application speed. We've seen many examples, but writing them by hand still remains a complex job. Fortunately, Eagle Performance Software is selling a commercial library of very highly optimised 32-bit system routines (system units) and string routines for Turbo and Borland Pascal in real-mode, DPMI and Windows.
    Because optimised versions of the system unit are supplied, you benefit from the speed from the ground up! Your programs will run faster by simply recompiling them with the SpeedPack II system units. I've seen big differences in many routines, and they've included a list of expected efficiency increases and benchmarks. The SpeedPack II manuals are disk-based and are available through the Windows 3.1 WINHELP.EXE program.

    Windows is not required to use SpeedPack II, but if you install under Windows, version 3.1 is required. The installation is intended for 32-bit computers, but it will also install on 16-bit 286 systems; it will not install completely on 8088/86 systems. After installation, the SpeedPack II files are located in the SP2 subdirectory under your general Pascal directory, such as C:\BP\SP2. The string unit files are in the STRG subdirectory, such as C:\BP\STRG.
    The installation program changes the Run-time library (RTL) to the "TPL and TPU" method for the fastest response for system unit switching. This means that the System units have been removed from the TPL files and the new ones placed in the UNITS directory. An RTL Manager is provided to switch between the three different system units.

    STRG
    Apart from a super-fast system unit, SpeedPack II comes with a super-fast STRG unit with lots of String routines. The best thing about it is not even its speed, but its accessibility: with some 216 (!) functions and procedures inside, you quickly forget what's in and what's not (or the name of that particular routine you need right now). The makers of STRG have given this some deep thought, and came up with a very simple outline for every routine name: target, process, type, direction and case. Each of these five parts consists of one to five characters that together make up the function name. Looking for a routine to find the position of the Nth occurrence of a certain case-insensitive character from left to right in a string? Try Chr-Pos-N-LR-I, or ChrPosNLRI for the complete name. It also works the other way around: given the name, you can quickly identify the purpose of a string function. Once you get the hang of it, you won't need any other String unit, believe me (especially if you consider the raw speed: all routines are hand-written and optimised assembler code!).

    TurboPower OPSTRING
    But what if I already use a good string unit, for example OPSTRING from TurboPower, you might ask? Well, don't worry, because TurboPower has given Eagle special permission to distribute optimised versions of their OPSTRING unit. Eagle has supplied this unit as a high performance replacement for those who have used it and want a modification-free boost in performance. The documentation for these optimised routines are of course to be found in the TurboPower manuals and are not supplied by Eagle Performance Software. You must be licensed with TurboPower to use this unit since all of the source code is only available through them!


    Borland Pascal Efficiency (8) - Algorithm Analysis
    In the past few chapters we've focused on programming issues concerned with efficiency. We've seen general topics, like profiling code, finding bottle-necks, and special language features of Pascal to get the best performance. We've also explored specific topics such as DOS, DPMI or Windows optimisation techniques. Last time, we even explored assembler code and the bare metal of the machine itself. This time, however, we'll examine a less machine-specific but even more important aspect of performance optimisation: algorithm analysis.

    Imagine that Batman's Submarine, the bat-sub, slowly floats to the entry of the bat-cave. Batman is tired, and he sets the bat-sub to the automatic bat-pilot. The bat-sub is 20 meters from its slot at the bat-cave, going a steady 2.4 meters per second. A sub-sonic radar (sonar) bat-pulse is sent out by the bat-sub as the bat-pilot engages. The bat-pulse is targeted straight for the bat-cave at a speed of 100 meters per second. As it hits the designated slot in the bat-cave, it bounces back to the bat-sub. Back at the bat-sub it again bounces back to the bat-cave and so on, until the bat-sub is at its designated slot in the bat-cave. The bat-pilot of the bat-sub uses the time interval between two bounces of the sonar pulse to determine when to decrease speed in order to make a smooth and complete stop when the sub is at the slot in the bat-cave. Exactly 17 seconds after Batman engaged the automatic bat-pilot, the bat-sub comes to a smooth stop in the bat-cave. Batman gets out, and wonders how much distance the sonar pulse had to travel for this stop (remember that a bat-pulse needs to be aligned every 10,000 miles!).

    Question: design a fast algorithm to calculate the distance (in meters, accurate to a centimetre) the sonar pulse had to travel for this trip. If you wish, you can safely ignore the relative speed of the bat-pulse to the bat-sub (just assume the bat-pulse hits the bat-cave at a speed of 100 m/s).

    This article is about algorithm analysis, and you can try to figure out a fast algorithm for the batman problem during the time you read the article. At the end, I'll give you my solution, and by that time I hope you'll understand the importance of algorithm analysis and design to performance and efficiency.

    Bottle-necks
    With every application that needs a performance boost, we first need to find the bottle-neck. You don't just start optimising a program at random; you need to focus your attention on a specific target: the bottle-neck in speed. Once you've found the bottle-neck, the first and foremost thing that needs to be done is to check the algorithm and datastructure that are being used at that specific point.
    But how do you check a specific algorithm? What can be done to ensure that a datastructure or algorithm is in fact a good solution or can be improved upon? For this, we need to explore a rather theoretical field called algorithm analysis and algorithm design.

    Algorithm Analysis
    Algorithm analysis focuses on the run-time behaviour of the algorithm in space and time efficiency. We often use the Big-Oh notation to express the running-time order of an algorithm. Given a set of N input arguments (like a list of N elements), an algorithm is said to be O(N) if the running-time of the algorithm is a linear function of the number of elements N. Searching for an item in an unsorted list is an example of an algorithm that is O(N). Likewise, an algorithm is said to be O(log N) if it takes only "log N" steps to perform the task at hand; like searching for a certain item in an ordered list using binary search (by-the-way: tell me honestly, have you ever used binary search when looking for a person's phone number in the phone book?). We can distinguish the following important O(N) running-time behaviours:

     O(N)      behaviour
     c         Constant
     log N     Logarithmic
     N         Linear
     N log N   Linear-logarithmic
     N^2       Quadratic
     N^3       Cubic
     2^N       Exponential

    In practice, we'll mostly meet algorithms with a Logarithmic, Linear, Linear-Logarithmic or Quadratic behaviour. An algorithm with a Big-Oh slower than Quadratic is often not very practical (even quadratic is often not usable for large input sets).
    Using this Big-Oh information, we can predict that an algorithm will be slower or faster than another algorithm. A Bubble Sort of O(N^2), for example, is much slower than a Quick Sort, which has an average case of O(N log N). Just fill in N=1000 for a list of 1000 elements that needs to be sorted, and you'll see that Bubble Sort will need O(1,000,000) operations, while Quick Sort can do it in O(10,000) - the difference between one second and two minutes! For operations, we can substitute a certain amount of work that needs to be done in the so-called inner loop of the algorithm.
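    As an illustration of such an O(log N) algorithm (my sketch, not a listing from the article), here is the binary search mentioned earlier, which halves the search interval at every step of its inner loop:

     ```pascal
      const
        Max = 1000;
      type
        TList = Array[1..Max] of Integer;

      function BinSearch(var A: TList; N, Value: Integer): Integer;
      { sketch: O(log N) search in a sorted (ascending) array;
        returns the index of Value, or 0 when Value is not present }
      var
        Lo, Hi, Mid: Integer;
      begin
        BinSearch := 0;
        Lo := 1;
        Hi := N;
        while Lo <= Hi do
        begin
          Mid := (Lo + Hi) div 2;
          if A[Mid] = Value then
          begin
            BinSearch := Mid;
            Lo := Hi + 1 { found: leave the loop }
          end
          else
            if A[Mid] < Value
            then Lo := Mid + 1
            else Hi := Mid - 1
        end
      end {BinSearch};
     ```

    For N=1000 this takes at most about 10 comparisons, against up to 1000 for a linear scan of an unsorted list.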

    Algorithm Example
    To illustrate the use of algorithm analysis in writing more efficient algorithms, let's consider a simple example of a function that raises a positive integer to a positive integer power (see listing 1):

     function Power(X, N: Word): LongInt;
      { raise X to the power N by repeated multiplication }
     var
       i: Word;
       Result: LongInt;
     begin
        Result := 1;
       for i:=1 to N do { inner loop O(N) }
         Result := Result * X;
       Power := Result
     end {Power};
    
    We can see from the code of listing 1 that the algorithm indeed takes N multiplication operations (plus a constant number of assignments, which are ignored in the Big-Oh behaviour description). Therefore, this algorithm is said to be O(N). Now, O(N) might not sound too bad, but what if the power function itself is called many times, and is in fact part of the inner loop of another algorithm? Then, by optimising the Big-Oh or running-time behaviour of the function Power, we also optimise all algorithms that use Power.

    Let's see if we can do better, by incorporating the mathematical fact that X^N = X^(N/2) * X^(N/2) if N is even, and X^N = X^((N-1)/2) * X^((N-1)/2) * X if N is odd. Now, we can re-write the inner loop of the power function as follows:

     if Even(N) then
       Result := Power(X * X, N div 2)
     else
       Result := Power(X * X, N div 2) * X;
    
    Which will result in a recursive, but otherwise O(log N) algorithm. The number of multiplications (and divides) is at most 3 log N, since two multiplications and one "div" are needed for the Odd(N) case which then halves the problem (hence the log).
    Make sure, by-the-way, to include a first line that stops the recursive calls to Power in case N finally gets to 0. Otherwise, the algorithm can go on forever.

    Will this new algorithm be faster in practice? Yes it will, because our algorithm analysis has already shown its Big-Oh behaviour to be significantly better than that of the old version. Running the new power function will prove this to you (in case you need to see it for yourself). And while this is only an example of a small function that needed to be optimised, imagine how much work you can save yourself by theoretically proving that one algorithm is faster than another, instead of finding out in practice, after you've built it, that the Big-Oh behaviour is still the same, or worse!

    Show Me...
    Just in case you do need to see the results for yourself: I've already done that, and I've included a third power function that eliminates the recursive calls and solves them in place. This third function is even faster than the second one: it has the same Big-Oh behaviour, but saves on the (recursive) function calls themselves (remember that a call and return take time). So we've now used a language feature, the next step in performance optimisation, to make sure our final routine is faster than the original one. The code for the second power function (named WeissPower, read on!) and the SwartPower function can be seen in the following listing:

     function WeissPower(X: LongInt; N: Word): LongInt;
     begin
       if N = 0 then WeissPower := 1
       else
         if N = 1 then WeissPower := X
         else
           if Odd(N) then
             WeissPower := WeissPower(X * X, N div 2) * X { recursion }
           else
             WeissPower := WeissPower(X * X, N div 2) {recursion }
     end {WeissPower};
    
     function SwartPower(X: LongInt; N: Word): LongInt;
     var Result: LongInt;
     begin
       if N = 0 then SwartPower := 1
       else
       begin
         if N = 1 then SwartPower := X
         else
         begin
           if Odd(N) then
           begin
             Result := X; { prevent first multiplication of 1 with X }
             Dec(N)
           end
           else Result := 1;
    
           while N > 1 do { inner loop O(log N) }
           begin
             if Odd(N) then Result := Result * X;
             X := X * X;
             N := N div 2
           end;
    
           if Result > 1 then
             SwartPower := Result * X
           else { prevent last multiplication with 1 }
             SwartPower := X
         end
       end
     end {SwartPower};
    
    I've tested all three routines against each other by raising a given prime number (3 in this case) to the 0th through 19th power, and measuring the number of executions per clocktick (also called the "reps"-method). The results can be seen below:

    Note that the Weiss method is faster than the general Power function, of course, and that the Swart power function is fastest of all.

    Algorithm Design
    Let's go back to performance optimisation again. Suppose we've found a bottle-neck in our code, and we can identify the algorithm that's being used at that particular place (say, a simple bubble sort to sort the items in a pick list). Now, it's time to see what other algorithm might be used in its place. Often, we can simply pick up a book and read what other algorithms exist to perform a certain task (like sorting). If that's the case, then we simply have to pick the algorithm that looks most promising for the given situation, also taking into account the time it'll take you to actually implement it. If you cannot find a replacement algorithm, you'll have to design one yourself. In that case, you'll know beforehand what Big-Oh behaviour you'll have to beat. So, on to the drawing board, to design a truly unique algorithm for your particular case that'll run in, say, O(log N log N).

    Book
    A great book on data structures and algorithm analysis is the second edition of a book with that same title by Mark Weiss ("Data Structures and Algorithm Analysis", by Mark Allen Weiss, The Benjamin/Cummings Publishing Company, 1995, ISBN 0-8053-9057-X). Using Pascal, Mark illustrates the power of datastructures and algorithms and the use of algorithm analysis for efficiency and performance. Actually, the Weiss power function was taken from this book: a great algorithmic improvement over the general power function, which could only be sped up further by using faster language constructs (i.e. eliminating the recursive calls).

    Riddle Me This...
    In case you haven't figured out the distance the sonar pulse travelled during the lifetime of the automatic pilot, remember that the pulse was only active for 17 seconds (the time it took the autopilot to stop the bat-sub). In those 17 seconds, the pulse must have had a most bouncing trip, but we're not interested in where it went, just how far it went. And at a speed of 100 meters per second, it used those 17 seconds to travel exactly 1,700 meters!

    I guess this shows that the most obvious (brute force) way to solve a problem is not always the best one. Often, it's better to look at a problem from several different viewpoints in order to grasp it completely, and to make sure you're not forgetting a faster (or simpler) way to solve it...


    References
    [Borland] Borland Pascal 7.0 with Objects - Language Guide 1992.
    [Borland] Open Architecture Handbook for Pascal - 1992.
    [Swart, R.E.] Undocumented DOS Interrupts from DPMI - TUG Lines #57, 1994.
    [Swart, R.E.] Borland Pascal Performance Optimization - Borland Developers Conference '94
    [Williams, A.] DOS and Windows Protected Mode - Addison-Wesley 1993.


    This webpage © 2001-2017 by Bob Swart (aka Dr.Bob - www.drbob42.com). All Rights Reserved.