TCJ Issue #40

PROGRAMMING FOR PERFORMANCE, PART 2

by Lee A. Hart

Know The Hardware

Truly efficient software has an intimate, almost incestuous relationship with its hardware. The two merge so thoroughly as to become inseparable; neither makes any sense without the other.

This requires that you, the programmer, TOTALLY understand the hardware. I cannot stress this point too strongly. The strengths and weaknesses of the hardware influence program structure at every level, not just the low-level drivers. A system with weak disk I/O will be slow and unresponsive if your program relies on overlays. A shallow keyboard buffer requires frequent checks to avoid missing keys. The characteristics of the console device determine your whole approach to data displays. If you try to hide from these limitations in a high-level language, your program will work as if it were written in BASIC 101.

Let's consider some actual case histories of what can be gained by paying attention to the hardware.

CASE #1

A customer needed a faster way to transfer data between two computers. He had been using a serial port at 9600 baud, but complained that it was too slow and tied up the computer's serial port. Hardware mods were ruled out.

After study, I found that each computer had unused handshake lines in its RS-232 port. A special "Y" cable was built to cross-connect two of these lines, providing one bit of serial I/O in each direction. A "software UART" program was then written to transfer data between the two machines. This worked to about 30K bits per second before timing dither (due to interrupts, memory refresh, etc.) caused errors.

The serial port's UART could be programmed to generate an interrupt when the handshake line went low. Therefore, an interrupt-driven protocol with handshaking was devised. A '0' was sent by pulling the output low until the other computer echoed the low on its output.
A '1' was sent by pulsing the output low and immediately back high, then waiting until the other system echoed it. The data rate increased to over 100K bits per second, and transfers were now unaffected by disk I/O, keyboard activity, etc.

CASE #2

The firmware for a CRT terminal was to be upgraded to run at 38400 bits per second without handshaking. Now, 38400 bps is fast: only 260 microseconds per character (about 75 instructions for a 3 MHz Z80).

The slowest routines need the most attention. For example, clear-line was accomplished by moving the stack pointer to the end of the line and executing 36 PUSH HL instructions. The interrupt handler needed a 4-level stack, so the last 8 bytes were cleared normally. Clear-screen used 25 iterations of clear-line.

Even this isn't fast enough to complete every ESC sequence before the next one is received, which calls for an interrupt-driven system. Each character received generates an interrupt. The interrupt handler pushes the character into one end of a FIFO (First-In, First-Out) buffer in memory. The main program pops characters out the other end and processes them. The FIFO fills while we process slow commands like clear-screen, and empties back out during fast commands.

But what if some idiot sends a long string of slow commands (like 100 clear-screens in a row)? The FIFO would eventually overflow, and data would be lost. I prevented this with "look-ahead" logic. When the interrupt handler spots a clear-screen command, it sets a flag so MAIN expects it. MAIN can then ignore unnecessary commands (there's no sense altering a screen that's about to be cleared).

Scrolling is one of the most difficult actions. The obvious algorithm is to block-move lines 2-24 up one, then clear line 24. But that's what IBM did on the PC, and we all know how well that worked. So examine the 6845 CRT controller. The Start-Address register holds the address of the first character on the screen, the one displayed in the top left corner.
If we add 80 to it, line 2 instantly becomes the top line, and we've scrolled the whole screen up a line. All that remains is to clear the 80 bytes that form the new 24th line, for which we already have a fast routine.

Each scroll moves the start address up another 80 bytes. This obviously can't go on indefinitely, so the original program spent a great deal of time checking for overflow outside its 2K block of screen RAM (F800-FFFF). For instance, the old code read:

    ld   (hl),a   ; put character on screen
    inc  hl       ; advance to next
    ld   a,h      ; get new address
    or   0F8h     ; if overflow to 0000,
    ld   h,a      ;   force it to F800-FFFF

But is this really necessary? The schematic revealed that the 2K RAM was partially decoded, and actually occupied 16K in the Z80's address space (C000-FFFF). It's far easier to ensure that an address lies within this range:

    ld   (hl),a   ; put character on screen
    res  6,h      ; ensure we don't wrap to 0000
    inc  hl       ; advance to next

CASE #3

Fast disk I/O. Way back in 8 B.C. (eight years Before Clones) I had an S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM spread over half a dozen furnace boards. Two Shugart SA-801R single-sided 8" drives provided disk storage, with CP/M 1.4 tying it all together. That old war horse and I fought many battles together, until it finally died the Death-of-1000-Intermittents.

Many of its "features" I'd rather forget, but it had one outstanding attribute: the fastest floppies I've ever seen. Warm boots were done before your fingers were off the keys; Wordstar loaded in under a second; PIP copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored interrupts, or even a disk controller IC. The "controller" was just a bunch of TTL chips implementing a parallel port, an 8-bit shift register, and a CRC checkcode generator. The real work was done by the CPU, byte-banging out the IBM 3740 SD/DD format in software.

How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec.
Each track held 6.5K (26 double-density sectors of 256 bytes each), which makes the theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually achieved 50% of this, or 20K bytes/sec.

Few modern micros come anywhere near this level of performance. The Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC clone is closer, at 12K/sec. The problem is that the CPU spends most of its time in wait loops: waiting for the drive motor to start, for the head to load, for an index hole, for a certain sector to come around on the disk. The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy disk controllers are thrown away by crude software.

The CPU has better things to do. If the disk isn't ready when an application program needs it, the BIOS should start the task, save the data in a buffer, and set up an interrupt to finish the task later, when the disk is REALLY ready. The time lost to wait loops is thus reclaimed to run your application programs.

That's how my antique worked. The BIOS maintained a track buffer in RAM. The first read from a particular track moved the head to the desired track and read the whole thing into the buffer. Further reads from that track simply came from RAM, taking virtually no time at all.

Similarly, writes to a sector on the current track just put data in the buffer and marked it as changed. The actual write was performed later, when a new track was selected for read/write, or just before the drive timed out from lack of disk activity.

Physical track reads/writes were fast as well. The key was to simply begin wherever the head was. After seeking to the desired track, the code read the ID# of each sector as it came by, and transferred it to/from the appropriate place in the RAM buffer. There was no need to find the index hole, wait for a particular sector ID#, or worry about interleave; one revolution got it all.

Such a system must be implemented carefully.
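The track-buffer scheme can be sketched in outline. This is C for clarity, not the original 8080 BIOS code; the routine names, and the in-memory array standing in for the physical drive, are illustrative assumptions:

```c
#include <stdint.h>
#include <string.h>

#define TRACKS            77
#define SECTORS_PER_TRACK 26
#define SECTOR_SIZE       256

/* Stand-in for the physical drive; a real BIOS would seek, read
   sectors in rotational order, and run the deferred-write timeout. */
static uint8_t disk[TRACKS][SECTORS_PER_TRACK * SECTOR_SIZE];
static int physical_reads = 0, physical_writes = 0;

static void read_physical_track(int t, uint8_t *buf) {
    memcpy(buf, disk[t], sizeof disk[t]);
    physical_reads++;
}

static void write_physical_track(int t, const uint8_t *buf) {
    memcpy(disk[t], buf, sizeof disk[t]);
    physical_writes++;
}

static uint8_t track_buf[SECTORS_PER_TRACK * SECTOR_SIZE];
static int buffered_track = -1;   /* no track cached yet */
static int dirty = 0;             /* buffer changed since last read? */

/* Switch tracks, flushing any deferred writes first. */
static void select_track(int t) {
    if (t == buffered_track) return;
    if (dirty) write_physical_track(buffered_track, track_buf);
    read_physical_track(t, track_buf);
    buffered_track = t;
    dirty = 0;
}

void read_sector(int t, int s, uint8_t *out) {
    select_track(t);    /* first access pulls in the whole track */
    memcpy(out, &track_buf[s * SECTOR_SIZE], SECTOR_SIZE);
}

void write_sector(int t, int s, const uint8_t *in) {
    select_track(t);
    memcpy(&track_buf[s * SECTOR_SIZE], in, SECTOR_SIZE);
    dirty = 1;          /* physical write deferred until track change */
}
```

Note that a write costs nothing at the time it is requested; the physical write happens only when the head must move anyway (or, in the real system, when the drive times out).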
CP/M does not expect delayed error messages, which can produce some odd results. For instance, a BDOS read error might be reported when the real cause was a write error in flushing the previous track buffer to disk. Also, modern drives do not have door locks to prevent disk removal while unwritten data remains in the track buffer.

The main factors limiting my S-100 system's performance were the slow CPU and lack of DMA. A double-density 8" disk has a peak data transfer rate of 500K bits/sec, which allows only 16 microseconds between bytes. This required polled I/O, with the CPU 100% devoted to the disk during actual reads/writes.

5-1/4" disks have a slower maximum transfer rate, but modern hardware has advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm, or 5 revs/sec. Assuming 9 sectors of 512 bytes per track, the maximum transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or 32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely) handle on an interrupt basis. Here's an interrupt handler to read 256 bytes from a disk controller chip at 32 microseconds max. per byte:

    T-states
      23                         ; time to finish longest instruction
      13                         ; Z80 interrupt mode 0 or 1 response
      11   int:  push af         ; save registers used
      11         in   a,(data)   ; read data byte from disk controller
      13   next: ld   (buffer),a ; store it in buffer (a variable)
      13         ld   a,(next+1) ; get buffer address
       4         inc  a          ; increment
      13         ld   (next+1),a ; save for next time
       7         jr   z,done     ; if end of page, done
      10         pop  af         ; else restore registers
      10         ret             ; and return
    ----
     128 T-states max = 32 microseconds with a 4 MHz Z80

But this routine barely squeaks by. It can't use interrupt mode 2 (which adds 6 T-states to the response time), or signal Z80 peripherals that the interrupt is complete with an RETI (which adds 4 T-states). It's limited to a 256-byte sector. Worse, some disk controller chips need processing time of their own.
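The microsecond budgets quoted throughout this case are simple arithmetic: a 4 MHz Z80 spends 0.25 microseconds per T-state, and a disk data stream (8 data bits per byte, no start/stop bits) delivers a byte every 8 bit-times. A quick sanity check, with C used purely as a calculator:

```c
/* Time taken by a given number of Z80 T-states, in microseconds. */
static double t_states_to_us(int t_states, double clock_mhz)
{
    return t_states / clock_mhz;
}

/* Interval between bytes of a disk data stream, in microseconds
   (8 data bits per byte, no start/stop bits). */
static double byte_interval_us(double bits_per_sec)
{
    return 8.0 / bits_per_sec * 1.0e6;
}
```

At 4 MHz, the 128 T-state handler above takes exactly 32 microseconds, the full byte interval of a 250K bit/sec stream; the 500K bit/sec 8" format leaves only 16 microseconds, which is why the old S-100 system had to fall back on dedicated polled I/O.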
The popular Western Digital FD179x series only allows 27.5 microseconds for each byte.

So we have to get clever again. The following example reads pairs of bytes, the first on an interrupt and the second by polled I/O. This improves performance enough to allow interrupt mode 2, larger sector sizes, and the slow response time of a FD179x chip:

    T-states
       23                          ; time to finish longest instruction
       19                          ; Z80 interrupt mode 2 response time
       11    int:   push af          ; save A and flags
       11           in   a,(data)    ; read 1st byte from disk controller
       11           push hl          ; save HL
       10    next:  ld   hl,buffer   ; get buffer address (a variable)
        7           ld   (hl),a      ; store byte in buffer
        6           inc  hl          ; advance buffer pointer
        6           inc  hl          ;   for next interrupt
       16           ld   (next+1),hl ; & store it
        6           dec  hl          ; point to current address
    11+11    check: in   a,(status)  ; check disk controller status
     4+ 4           rra              ; if not busy (bit 0=0),
     7+ 7           jr   nc,done     ;   then we're done
     4+ 4           rra              ; if next byte not ready (bit 1=0),
    12+ 7           jr   nc,check    ;   then loop until it is
       11           in   a,(data)    ; get 2nd byte from disk controller
        7           ld   (hl),a      ; & store it in buffer
       10           pop  hl          ; restore registers
       10           pop  af
       14           reti             ; return
     ----
     188 or 226 T-states (for 1 or 2 passes through the status loop)

This routine reads bytes from the controller chip within 17.75 microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80, leaving 20% for main program execution. The peculiar way of incrementing the address pointer minimizes the worst-case delay from an interrupt or status flag change until the byte is read. We want to maximize the chance that the second character is ready the first time the status is checked.

Why improve your disk system? Because, as a practical matter, there's more to be gained by improving it than by any other change you could make. It's disk I/O that sets the pace, not CPU speed or memory size.
Users almost never wait on CPU speed; it's the disk that keeps you twiddling your thumbs, with the keyboard ignored, the screen frozen, and the disk drive emitting Bronx cheers. Put a Commodore 64's tinkertoy disk system in an AT clone, and you'd have high-priced junk that only a masochist would use. Conversely, the AT's DMA-based disk I/O would transform a C64 into a fire-breathing dragon that would eat its competition alive.

Algorithms

When a hardware engineer sits down to design a circuit, he doesn't begin with a blank sheet of paper. He has a vast library of textbooks, data sheets, and catalogs of standard circuits to choose from. Most of the task is simply connecting off-the-shelf components in one of these standard configurations, modifying them as necessary to satisfy any unique requirements.

Algorithms are to programmers what IC chips are to hardware designers. Just as the engineer builds a library of standard parts and circuits, every programmer must continually build his own algorithm collection. Whether it's a shoebox full of magazine clippings or a carefully indexed series of notebooks, start NOW.

Programming textbooks tend to concentrate on traditional computer algorithms for floating-point math, transcendental functions, and sorting routines. The old standby is Knuth's "The Art of Computer Programming". Hamming's "Numerical Methods for Scientists and Engineers" explains the basics of iterative calculations. "Digital Computation and Numerical Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample code as well.

Magazines are a great source, and tend to be more down-to-earth and closer to the state of the art. Read carefully!
Good algorithms may be expressed in BASIC listings, assembly code for some obscure processor, pocket-calculator key sequences, or even disguised as circuit diagrams. Professional journals like EDN or Computer Design are often better than the popular magazines, which have pretty much abandoned education in favor of marketing. Especially check out back issues: the cruder the hardware, the trickier the algorithms had to be to make up for it.

Manufacturers' technical literature is a gold mine. Get the manufacturers' own manuals, not some boiled-down paperback from the bookstore. They won't be models of clarity, but they are full of hidden gold. Read everything: hardware and software manuals, data sheets, application notes, etc.

User groups are the traditional source of solutions to specific problems. Even better, they provide actual implementations in printed listings, on disk, or even by modem.

Don't waste time reinventing the wheel. Learn from others what works, and what doesn't. Some of the best (and worst) algorithms I know were found by disassembling existing programs. And once you find a good algorithm, recycle it. That clever sort routine for an antique 8008 may be the foundation of the fastest '386 sort yet!

Conclusion

These techniques are not new; in fact, old-timers will recognize many of them from the early days of computing, when hardware limitations were more severe. However, they have fallen into disuse. A whole generation of programmers has been taught that such techniques have no place in modern structured programming.

The theory goes something like this: programs written in a high-level language are faster and easier to write, debug, document, and maintain. Memory and speed are viewed as infinite resources, so the performance loss is unimportant. Programs should be totally generic; it is the compiler's or run-time library's job to worry about the hardware interface.
These rules make sense in a mainframe environment, where the hardware resources are truly awesome and teams of programmers spend years working on one application. But they impose severe penalties on a microcomputer system. The user must pay for the programmer's luxuries with higher hardware cost and lackluster performance.

It's easy to forget that "microcomputer" literally means "one millionth of a computer". Microprocessors make abysmally bad CPUs. Build a computer with one, and you'll wind up needing $5000 worth of memory and peripherals to support a $5 CPU chip.

But micros make superlative controllers. That's what they were designed for, and what they do best. A single microcomputer can replace dozens of boards and hundreds of ICs with as little as a single chip. That's why 90% of all microprocessors go into non-computer uses: calculators, auto emission controls, home entertainment equipment, industrial controls, and the like. Of the 30 million Z80s sold last year, fewer than 1 million went into computers.

Programming a controller is different from programming a computer. Most applications demand real-time multi-tasking capabilities, and there is never enough speed or memory. Inputs and outputs are physical hardware devices, not abstract data structures, so the code must inevitably be hardware-dependent. Computer languages are just not cut out for this sort of thing.

The question is not, "How do I write a computer program to handle this data?" Instead, you should ask yourself, "How must I manipulate this hardware to do the job?" The techniques in this article may be out of place in the bureaucracy of a large computer, but they are right at home in the wild-west world of a microcomputer.

Lest you think this has nothing to do with a "real" computer like your PC clone, consider this. Instead of a '286 with 1 meg of memory, suppose it contained ten Z80 controller boards, each with 64K of memory and a fast network to tie them together.
Each Z80 handles a different device: keyboard, screen, printer, modem, and one for each disk drive. The rest are free to run application programs, several at a time!

Suppose you're doing word processing on this system. The keyboard Z80 does spelling correction on data entry. The screen Z80 displays your text in bit-mapped fonts to match the printer's, whose Z80 is simultaneously printing a file. The Z80 running the word processor itself suffers no annoying pauses or hesitations, since disk I/O is handled instantaneously via each drive's Z80 track buffer. Meanwhile, the modem's Z80 is downloading a file while another assembles a program. Pop-up utilities sit ready and waiting in still other Z80s, in case they're needed.

Such a system would have half the hardware cost of a PC, yet would outperform it by a wide margin. True multi-tasking becomes child's play with multiple processors. More processors can readily be added for even higher performance, or removed to save cost (or to continue operation while waiting for a replacement).

If the computer scientists really want to further the micro-revolution, they should stop trying to force antiquated mainframe languages onto micros, and concentrate instead on developing tools that maximize the use of micros as they are!