TCJ Issue #40

PROGRAMMING FOR PERFORMANCE, PART 2

by Lee A. Hart

Know The Hardware

Truly efficient software has an intimate, almost incestuous relationship with its hardware. The two merge so thoroughly as to become inseparable; neither makes any sense without the other.

This requires that you, the programmer, TOTALLY understand the hardware. I cannot stress this point too strongly. The strengths and weaknesses of the hardware influence program structure at every level, not just the low-level drivers. A system with weak disk I/O will be slow and unresponsive if your program relies on overlays. A shallow keyboard buffer requires frequent checks to avoid missing keys. The characteristics of the console device determine your whole approach to data displays. If you try to hide from these limitations in a high-level language, your program will work as if it were written in BASIC 101.

Let's consider some actual case histories of what can be gained by paying attention to the hardware.

CASE #1

A customer needed a faster way to transfer data between two computers. He had been using a serial port at 9600 baud, but complained that it was too slow and tied up the computer's serial port. Hardware mods were ruled out.

After study, I found that each computer had unused handshake lines in its RS-232 port. A special "Y" cable was built to cross-connect two of these lines, providing one bit of serial I/O in each direction. A "software UART" program was then written to transfer data between the two machines. This worked to about 30K bits per second before timing dither (due to interrupts, memory refresh, etc.) caused errors.

The serial port's UART could be programmed to generate an interrupt when the handshake line went low. Therefore, an interrupt-driven protocol with handshaking was devised. A '0' was sent by pulling the output low until the other computer echoed the low on its output.
A '1' was sent by pulsing the output low and immediately back high, then waiting until the other system echoed it. The data rate increased to over 100K bits per second, and transfers were now unaffected by disk I/O, keyboard activity, etc.

CASE #2

The firmware for a CRT terminal was to be upgraded to run at 38400 bits per second without handshaking. Now, 38400 bps is fast: only 260 microseconds per character (about 75 instructions for a 3 MHz Z80).

The slowest routines need the most attention. For example, clear-line was accomplished by moving the stack pointer to the end of the line and executing 36 PUSH HL instructions. The interrupt handler needed a 4-level stack, so the last 8 bytes were cleared normally. Clear-screen used 25 iterations of clear-line.

Even this isn't fast enough to complete every ESC sequence before the next one is received, which calls for an interrupt-driven system. Each character received generates an interrupt. The interrupt handler pushes the character into one end of a FIFO (First-In, First-Out) buffer in memory. The main program pops characters out the other end and processes them. The FIFO fills while we process slow commands like clear-screen, and empties back out during fast commands.

But what if some idiot sends a long string of slow commands (like 100 clear-screens in a row)? The FIFO would eventually overflow, and data would be lost. I prevented this with "look-ahead" logic. When the interrupt handler spots a clear-screen command, it sets a flag so MAIN expects it. MAIN can then ignore unnecessary commands (there's no sense altering a screen that's about to be cleared).

Scrolling is one of the most difficult actions. The obvious algorithm is to block-move lines 2-24 up one, then clear line 24. But that's what IBM did on the PC, and we all know how well that worked. So examine the 6845 CRT controller. The Start-Address register holds the address of the first character on the screen, the one displayed in the top left corner.
If we add 80 to it, line 2 instantly becomes the top line, and we've scrolled the whole screen up a line. All that remains is to clear the 80 bytes that form the new 24th line, for which we already have a fast routine.

Each scroll moves the start address up another 80 bytes. This obviously can't go on indefinitely, so the original program spent a great deal of time checking for overflow outside its 2K block of screen RAM (F800-FFFF). For instance, the old code read:

    ld   (hl),a   ; put character on screen
    inc  hl       ; advance to next
    ld   a,h      ; get new address
    or   0F8h     ; if overflow to 0000,
    ld   h,a      ;   force it to F800-FFFF

But is this really necessary? The schematic revealed that the 2K RAM was partially decoded, and actually occupied 16K in the Z80's address space (C000-FFFF). It's far easier to ensure that an address lies within this range:

    ld   (hl),a   ; put character on screen
    res  6,h      ; ensure we don't wrap to 0000
    inc  hl       ; advance to next

CASE #3

Fast disk I/O. Way back in 8 B.C. (eight years Before Clones) I had an S-100 system. Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM spread over half a dozen furnace boards. Two Shugart SA-801R single-sided 8" drives provided disk storage, with CP/M 1.4 tying it all together. That old war horse and I fought many battles together, until it finally died the Death-of-1000-Intermittents.

Many of its "features" I'd rather forget, but it had one outstanding attribute: the fastest floppies I've ever seen. Warm boots were done before your fingers were off the keys; Wordstar loaded in under a second; PIP copied files at 10K bytes/sec. All without a fast CPU, DMA, vectored interrupts, or even a disk controller IC. The "controller" was just a bunch of TTL chips implementing a parallel port, an 8-bit shift register, and a CRC checkcode generator. The real work was done by the CPU, byte-banging out the IBM 3740 SD/DD format in software.

How good was it? An 8" disk spins at 360 rpm, or 6 revs/sec.
Each track held 6.5K (26 double-density sectors of 256 bytes each), which makes the theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec. It actually achieved 50% of this, or 20K bytes/sec.

Few modern micros come anywhere near this level of performance. The Kaypro I wrote this article on creeps through the disk at 4K/sec. My PC clone is closer, at 12K/sec. The problem is that the CPU spends most of its time in wait loops: waiting for the drive motor to start, for the head to load, for an index hole, for a certain sector to come around on the disk. The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy disk controllers are thrown away by crude software.

The CPU has better things to do. If the disk isn't ready when an application program needs it, the BIOS should start the task, save the data in a buffer, and set up an interrupt to finish the task later, when the disk is REALLY ready. The time lost to wait loops is thus reclaimed to run your application programs.

That's how my antique worked. The BIOS maintained a track buffer in RAM. The first read from a particular track moved the head to the desired track and read the whole thing into the buffer. Further reads from that track simply came from RAM, taking virtually no time at all.

Similarly, writes to a sector on the current track just put data in the buffer and marked it as changed. The actual write was performed later, when a new track was selected for read/write, or just before the drive timed out from lack of disk activity.

Physical track reads/writes were fast as well. The key was to simply begin wherever the head was. After seeking to the desired track, the code read the ID# of each sector as it came by, and transferred it to/from the appropriate place in the RAM buffer. There was no need to find the index hole, wait for a particular sector ID#, or worry about interleave; one revolution got it all.

Such a system must be implemented carefully.
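The track-buffer scheme can be sketched in outline. This is C for clarity, not the original 8080 BIOS code; the routine names, and the in-memory array standing in for the physical drive, are illustrative assumptions:

```c
#include <stdint.h>
#include <string.h>

#define TRACKS            77
#define SECTORS_PER_TRACK 26
#define SECTOR_SIZE       256

/* Stand-in for the physical drive; a real BIOS would seek, read
   sectors in rotational order, and run the deferred-write timeout. */
static uint8_t disk[TRACKS][SECTORS_PER_TRACK * SECTOR_SIZE];
static int physical_reads = 0, physical_writes = 0;

static void read_physical_track(int t, uint8_t *buf) {
    memcpy(buf, disk[t], sizeof disk[t]);
    physical_reads++;
}

static void write_physical_track(int t, const uint8_t *buf) {
    memcpy(disk[t], buf, sizeof disk[t]);
    physical_writes++;
}

static uint8_t track_buf[SECTORS_PER_TRACK * SECTOR_SIZE];
static int buffered_track = -1;   /* no track cached yet */
static int dirty = 0;             /* buffer changed since last read? */

/* Switch tracks, flushing any deferred writes first. */
static void select_track(int t) {
    if (t == buffered_track) return;
    if (dirty) write_physical_track(buffered_track, track_buf);
    read_physical_track(t, track_buf);
    buffered_track = t;
    dirty = 0;
}

void read_sector(int t, int s, uint8_t *out) {
    select_track(t);    /* first access pulls in the whole track */
    memcpy(out, &track_buf[s * SECTOR_SIZE], SECTOR_SIZE);
}

void write_sector(int t, int s, const uint8_t *in) {
    select_track(t);
    memcpy(&track_buf[s * SECTOR_SIZE], in, SECTOR_SIZE);
    dirty = 1;          /* physical write deferred until track change */
}
```

Note that a write costs nothing at the time it is requested; the physical write happens only when the head must move anyway (or, in the real system, when the drive times out).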
CP/M does not expect delayed error messages, which can produce some odd results. For instance, a BDOS read error might be reported when the real cause was a write error in flushing the previous track buffer to disk. Also, modern drives do not have door locks to prevent disk removal while unwritten data remains in the track buffer.

The main factors limiting my S-100 system's performance were the slow CPU and lack of DMA. A double-density 8" disk has a peak data transfer rate of 500K bits/sec, which allows only 16 microseconds between bytes. This required polled I/O, with the CPU 100% devoted to the disk during actual reads/writes.

5-1/4" disks have a slower maximum transfer rate, but modern hardware has advantages that can make up for it. A normal 5-1/4" disk spins at 300 rpm, or 5 revs/sec. Assuming 9 sectors of 512 bytes per track, the maximum transfer rate is 22.5K bytes/sec. The peak data rate is 250K bits/sec, or 32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely) handle on an interrupt basis. Here's an interrupt handler to read 256 bytes from a disk controller chip at 32 microseconds max. per byte:

    T-states
      23                         ; time to finish longest instruction
      13                         ; Z80 interrupt mode 0 or 1 response
      11   int:  push af         ; save registers used
      11         in   a,(data)   ; read data byte from disk controller
      13   next: ld   (buffer),a ; store it in buffer (a variable)
      13         ld   a,(next+1) ; get buffer address
       4         inc  a          ; increment
      13         ld   (next+1),a ; save for next time
       7         jr   z,done     ; if end of page, done
      10         pop  af         ; else restore registers
      10         ret             ; and return
    ----
     128 T-states max = 32 microseconds with a 4 MHz Z80

But this routine barely squeaks by. It can't use interrupt mode 2 (which adds 6 T-states to the response time), or signal Z80 peripherals that the interrupt is complete with an RETI (which adds 4 T-states). It's limited to a 256-byte sector. Worse, some disk controller chips need processing time of their own.
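The microsecond budgets quoted throughout this case are simple arithmetic: a 4 MHz Z80 spends 0.25 microseconds per T-state, and a disk data stream (8 data bits per byte, no start/stop bits) delivers a byte every 8 bit-times. A quick sanity check, with C used purely as a calculator:

```c
/* Time taken by a given number of Z80 T-states, in microseconds. */
static double t_states_to_us(int t_states, double clock_mhz)
{
    return t_states / clock_mhz;
}

/* Interval between bytes of a disk data stream, in microseconds
   (8 data bits per byte, no start/stop bits). */
static double byte_interval_us(double bits_per_sec)
{
    return 8.0 / bits_per_sec * 1.0e6;
}
```

At 4 MHz, the 128 T-state handler above takes exactly 32 microseconds, the full byte interval of a 250K bit/sec stream; the 500K bit/sec 8" format leaves only 16 microseconds, which is why the old S-100 system had to fall back on dedicated polled I/O.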
The popular Western Digital FD179x series only allows 27.5 microseconds for each byte.

So we have to get clever again. The following example reads pairs of bytes, the first on an interrupt and the second by polled I/O. This improves performance enough to allow interrupt mode 2, larger sector sizes, and the slow response time of a FD179x chip:

    T-states
       23                          ; time to finish longest instruction
       19                          ; Z80 interrupt mode 2 response time
       11    int:   push af          ; save A and flags
       11           in   a,(data)    ; read 1st byte from disk controller
       11           push hl          ; save HL
       10    next:  ld   hl,buffer   ; get buffer address (a variable)
        7           ld   (hl),a      ; store byte in buffer
        6           inc  hl          ; advance buffer pointer
        6           inc  hl          ;   for next interrupt
       16           ld   (next+1),hl ; & store it
        6           dec  hl          ; point to current address
    11+11    check: in   a,(status)  ; check disk controller status
     4+ 4           rra              ; if not busy (bit 0=0),
     7+ 7           jr   nc,done     ;   then we're done
     4+ 4           rra              ; if next byte not ready (bit 1=0),
    12+ 7           jr   nc,check    ;   then loop until it is
       11           in   a,(data)    ; get 2nd byte from disk controller
        7           ld   (hl),a      ; & store it in buffer
       10           pop  hl          ; restore registers
       10           pop  af
       14           reti             ; return
     ----
     188 or 226 T-states (for 1 or 2 passes through the status loop)

This routine reads bytes from the controller chip within 17.75 microseconds worst-case. Interrupt overhead averages 80% for a 4 MHz Z80, leaving 20% for main program execution. The peculiar way of incrementing the address pointer minimizes the worst-case delay from an interrupt or status flag change until the byte is read. We want to maximize the chance that the second character is ready the first time the status is checked.

Why improve your disk system? Because, as a practical matter, there's more to be gained by improving it than by any other change you could make. It's disk I/O that sets the pace, not CPU speed or memory size.
Users almost never wait on CPU speed; it's the disk that keeps you twiddling your thumbs, with the keyboard ignored, the screen frozen, and the disk drive emitting Bronx cheers. Put a Commodore 64's tinkertoy disk system in an AT clone, and you'd have high-priced junk that only a masochist would use. Conversely, the AT's DMA-based disk I/O would transform a C64 into a fire-breathing dragon that would eat its competition alive.

Algorithms

When a hardware engineer sits down to design a circuit, he doesn't begin with a blank sheet of paper. He has a vast library of textbooks, data sheets, and catalogs of standard circuits to choose from. Most of the task is simply connecting off-the-shelf components in one of these standard configurations, modifying them as necessary to satisfy any unique requirements.

Algorithms are to programmers what IC chips are to hardware designers. Just as the engineer builds a library of standard parts and circuits, every programmer must continually build his own algorithm collection. Whether it's a shoebox full of magazine clippings or a carefully indexed series of notebooks, start NOW.

Programming textbooks tend to concentrate on traditional computer algorithms for floating-point math, transcendental functions, and sorting routines. The old standby is Knuth's "The Art of Computer Programming". Hamming's "Numerical Methods for Scientists and Engineers" explains the basics of iterative calculations. "Digital Computation and Numerical Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample code as well.

Magazines are a great source, and tend to be more down-to-earth and closer to the state of the art. Read carefully!
Good algorithms may be expressed in BASIC listings, assembly code for some obscure processor, pocket-calculator key sequences, or even disguised as circuit diagrams. Professional journals like EDN or Computer Design are often better than the popular magazines, which have pretty much abandoned education in favor of marketing. Especially check out back issues: the cruder the hardware, the trickier the algorithms had to be to make up for it.

Manufacturers' technical literature is a gold mine. Get the manufacturers' own manuals, not some boiled-down paperback from the bookstore. They won't be models of clarity, but they are full of hidden gold. Read everything: hardware and software manuals, data sheets, application notes, etc.

User groups are the traditional source of solutions to specific problems. Even better, they provide actual implementations in printed listings, on disk, or even by modem.

Don't waste time reinventing the wheel. Learn from others what works, and what doesn't. Some of the best (and worst) algorithms I know were found by disassembling existing programs. And once you find a good algorithm, recycle it. That clever sort routine for an antique 8008 may be the foundation of the fastest '386 sort yet!

Conclusion

These techniques are not new; in fact, old-timers will recognize many of them from the early days of computing, when hardware limitations were more severe. However, they have fallen into disuse. A whole generation of programmers has been taught that such techniques have no place in modern structured programming.

The theory goes something like this: programs written in a high-level language are faster and easier to write, debug, document, and maintain. Memory and speed are viewed as infinite resources, so the performance loss is unimportant. Programs should be totally generic; it is the compiler's or run-time library's job to worry about the hardware interface.
These rules make sense in a mainframe environment, where the hardware resources are truly awesome and teams of programmers spend years working on one application. But they impose severe penalties on a microcomputer system. The user must pay for the programmer's luxuries with higher hardware cost and lackluster performance.

It's easy to forget that "microcomputer" literally means "one millionth of a computer". Microprocessors make abysmally bad CPUs. Build a computer with one, and you'll wind up needing $5000 worth of memory and peripherals to support a $5 CPU chip.

But micros make superlative controllers. That's what they were designed for, and what they do best. A single microcomputer can replace dozens of boards and hundreds of ICs with as little as a single chip. That's why 90% of all microprocessors go into non-computer uses: calculators, auto emission controls, home entertainment equipment, industrial controls, and the like. Of the 30 million Z80s sold last year, fewer than 1 million went into computers.

Programming a controller is different from programming a computer. Most applications demand real-time multi-tasking capabilities, and there is never enough speed or memory. Inputs and outputs are physical hardware devices, not abstract data structures, so the code must inevitably be hardware-dependent. Computer languages are just not cut out for this sort of thing.

The question is not, "How do I write a computer program to handle this data?" Instead, you should ask yourself, "How must I manipulate this hardware to do the job?" The techniques in this article may be out of place in the bureaucracy of a large computer, but they are right at home in the wild-west world of a microcomputer.

Lest you think this has nothing to do with a "real" computer like your PC clone, consider this. Instead of a '286 with 1 meg of memory, suppose it contained ten Z80 controller boards, each with 64K of memory and a fast network to tie them together.
Each Z80 handles a different device: keyboard, screen, printer, modem, and one for each disk drive. The rest are free to run application programs, several at a time!

Suppose you're doing word processing on this system. The keyboard Z80 does spelling correction on data entry. The screen Z80 displays your text in bit-mapped fonts to match the printer's, whose Z80 is simultaneously printing a file. The Z80 running the word processor itself suffers no annoying pauses or hesitations, since disk I/O is handled instantaneously via each drive's Z80 track buffer. Meanwhile, the modem's Z80 is downloading a file while another assembles a program. Pop-up utilities sit ready and waiting in still other Z80s, in case they're needed.

Such a system would have half the hardware cost of a PC, yet would outperform it by a wide margin. True multi-tasking becomes child's play with multiple processors. More processors can readily be added for even higher performance, or removed to save cost (or to continue operation while waiting for a replacement).

If the computer scientists really want to further the micro-revolution, they should stop trying to force antiquated mainframe languages onto micros, and concentrate instead on developing tools that maximize the use of micros as they are!