Analysis of the RISC System/6000 Processor Architecture
As far as Reduced Instruction Set Computer (RISC) systems go, IBM has played an enormous role in both their development and their success. IBM pioneered the path to RISC architecture and created the first RISC system, the 801, which led directly to the creation of the RS/6000. Had it not been for the groundbreaking work done at IBM’s Thomas J. Watson Research Center in the mid-1970s, RISC architecture as we know it today would simply not exist.
RISC architecture was a major breakthrough in the world of computing and perhaps one of the greatest innovations in computer science in the second half of the 20th century. It allowed for the design of CPUs with a simpler set of instructions and a simpler set of goals. The main concept behind RISC architecture was to simplify the instructions in order to reduce the cycles per instruction, which in turn would increase machine efficiency. Before RISC architecture, and quite possibly as the inspiration that led to its discovery, researchers began to notice that the majority of the orthogonal addressing modes (an aspect of the instruction set architecture that defines how machine language instructions identify the operands of each instruction) were being completely ignored.1 They also discovered that this was degrading performance between the processor and main memory. The decrease in performance was also a result of the increasing use of compilers to create programs, as opposed to writing them directly in assembly language, which had been common practice up to that point. To solve this problem, researchers came up with the idea of making the instructions in the instruction set architecture as simple and, therefore, as fast as possible. This concept eventually evolved into RISC, in which the goal was to create instructions so simple that each one could be executed in a single clock cycle.2
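The trade-off described above can be illustrated with the standard execution-time identity (time = instruction count x cycles per instruction x cycle time). The sketch below is only an illustration: the instruction counts, CPI values, and cycle time are made-up numbers chosen for the example, not measurements of the 801, the RS/6000, or any other machine.

```python
def execution_time_ns(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Standard performance identity: time = IC x CPI x cycle time."""
    return instruction_count * cycles_per_instruction * cycle_time_ns

# Hypothetical numbers, chosen only to illustrate the trade-off.
# A complex instruction set needs fewer instructions but more cycles for each.
complex_isa = execution_time_ns(1_000_000, 4.0, 40)
# A simple instruction set needs more instructions but far fewer cycles for each.
simple_isa = execution_time_ns(1_300_000, 1.2, 40)

speedup = complex_isa / simple_isa
print(f"speedup of the simple instruction set: {speedup:.2f}x")
```

With these assumed values the simpler instruction set wins even though it executes 30% more instructions, because the drop in cycles per instruction outweighs the growth in instruction count.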
IBM’s drive toward RISC architecture can be attributed to many factors. The main catalyst for this advance in CPU architecture, however, was the increasing problem researchers were noticing with computer performance. Processors, even before the birth of RISC, were becoming faster at an exponential rate, while the advances in memory access were much less dramatic. This increased the need for researchers to find different methods to boost performance and overcome the gap that existed between the processor and memory. The goal IBM had for the original RS/6000, released in February 1990, was to develop a method to conduct very complex calculations necessary for scientific and engineering research. At the time, the supercomputers responsible for carrying out these massive calculations were extremely expensive. IBM’s main goal in the superscalar RS/6000 was to increase the performance of the CPU, while making the price of the unit more affordable.
The main functional units of the RS/6000 are covered in this paper, beginning with the branch processor. The branch processor of the RS/6000 differs somewhat from those of comparable processors. Like most of them, it can handle one branch every cycle. It also uses a simple form of branch prediction: any unresolved conditional branch is predicted not taken. What the branch processor does next is where it really differs. The branch unit fetches both the branch-not-taken path and the branch-taken path into the instruction buffers.4 However, it dispatches only the branch-not-taken path for execution. Once the conditional branch has been resolved, the instructions from the incorrect path are flushed from the instruction buffers. At that point, any of those instructions that have already begun executing are cancelled in their respective functional units. The benefit of this method of branch prediction is that, at worst, the misprediction penalty is a mere 3 cycles.5 Moreover, this penalty can be eliminated if there are independent instructions between the comparison and the branch. Another advantage of the branch processor is its ability to restore state when it encounters an exception, which it does by maintaining a Program Counter stack. This method allows exceptions to be handled without any interruption in performance. Adding to the uniqueness and quality of the branch processor, it implements a special branch-and-count instruction that decrements a counter register and then conditionally branches on the result, all in a single cycle.6
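The behavior of the branch-and-count instruction can be sketched in a few lines. This is a simplified model, not the hardware's actual logic: it assumes the branch is taken while the decremented counter is nonzero and that instructions are 4 bytes long, and the function names and addresses are hypothetical.

```python
def branch_and_count(count_register, pc, target):
    """Model one branch-and-count step.

    Returns (new_count, next_pc). Assumptions: the branch is taken while the
    decremented counter is nonzero, and each instruction is 4 bytes long.
    """
    new_count = count_register - 1
    next_pc = target if new_count != 0 else pc + 4
    return new_count, next_pc

def run_loop(iterations, loop_top=0x100, branch_pc=0x120):
    """Count how many times a loop body runs for a given initial counter."""
    count, body_runs = iterations, 0
    while True:
        body_runs += 1  # the loop body executes once per pass
        count, next_pc = branch_and_count(count, branch_pc, loop_top)
        if next_pc != loop_top:  # fell through: the loop is finished
            return body_runs

print(run_loop(5))
```

Because the decrement, the test, and the branch are a single instruction, a counted loop needs no separate compare or subtract instruction to close it.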
The Data Cache (D-cache) in the RS/6000 is a four way set associative cache with 64KB of total capacity, so each memory address has at most four locations in the cache where it can be stored.7 The D-cache has a line size of 128 bytes and is split into four separate D-cache units of equal size, giving each unit 16KB.8 The D-cache communicates with main memory using a 4-word interface, with the Floating Point unit using a 2-word interface, with the Fixed Point unit using a 1-word interface, and with the I-cache unit using a 2-word interface. One of the unique features of the D-cache is its store-back buffer, which is also 128 bytes wide.9 This improves performance by cutting down on traffic over the memory bus. Because of the store-back buffer, data stored in the D-cache does not need to be sent directly to main memory. Instead, the data is written to memory only when its line is replaced as the result of a cache miss. The replaced line can wait in the buffer while the new line enters the cache, so it need not be written to memory before the new line is brought in. Adding to the advantage of this implementation, the D-cache is not kept busy during the store-back process.10 This comes at no cost to the efficiency of the D-cache, and therefore to the performance of the processor, because of the way the line is loaded into the store-back buffer: the D-cache loads the line in parallel, over a total of two cycles.11 The other main feature of the D-cache is its use of cache-reload buffers to hold a line arriving from memory after a miss. As a result, the processor need not wait for the entire line to be fetched from main memory before it can access the cache arrays. Similar to the store-back buffer, the cache-reload buffer does not tie up the cache, so there is no performance penalty for loading the cache-reload buffer.12
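The D-cache figures given above (64KB, four way set associative, 128-byte lines) determine how many sets the cache has and how an address would be divided into tag, set index, and line offset. The sketch below works this out. The split itself is the conventional textbook one and is an assumption here, since the exact bit assignment used by the hardware is not described in this paper.

```python
CACHE_BYTES = 64 * 1024   # total D-cache capacity (from the text)
WAYS = 4                  # four way set associative (from the text)
LINE_BYTES = 128          # line size (from the text)

NUM_LINES = CACHE_BYTES // LINE_BYTES        # 512 lines in total
NUM_SETS = NUM_LINES // WAYS                 # 128 sets of four lines each
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 7 bits select a byte in a line
INDEX_BITS = NUM_SETS.bit_length() - 1       # 7 bits select a set

def split_address(addr):
    """Split an address into (tag, set_index, offset).

    Assumption: the set index is taken from the bits directly above the
    line offset, as in a conventional set associative cache.
    """
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset

print(NUM_SETS, split_address(0x12345))
```

Under this assumed split, two addresses that are 16KB apart fall into the same set with different tags, and up to four such lines can be resident at once before one must be replaced.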
A standard configuration of the main memory within an RS/6000 consists of two memory cards, although the system can hold up to eight. Each memory card can output two words of data to the D-cache thanks to a four way interleaved design.13 This means that the system must have at least two memory cards in order to fill the four word interface with the D-cache. The standard configuration of the RS/6000, with 2 memory cards, can store up to four instructions and sixteen words of data. The four way interleaving is implemented by two data-multiplexing chips and one control chip. Each memory card holds between 8MB and 32MB, so two to eight cards give the RS/6000 anywhere from 16MB to 256MB of total memory. This total can be doubled, however, by using the 4MB DRAM supported by the memory cards, which raises the capacity of each card to 64MB and allows for a total memory of half a gigabyte.14
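The memory capacities quoted above follow from simple multiplication, which the short sketch below makes explicit. The card counts and per-card sizes are the ones given in the text; the constant and function names are illustrative only.

```python
MIN_CARDS, MAX_CARDS = 2, 8        # memory cards per system (from the text)
MIN_CARD_MB, MAX_CARD_MB = 8, 32   # capacity per card in MB (from the text)
DENSER_CARD_MB = 64                # per-card capacity with denser DRAM (from the text)

def total_memory_mb(cards, card_mb):
    """Total system memory for a given number of equally sized cards."""
    return cards * card_mb

smallest = total_memory_mb(MIN_CARDS, MIN_CARD_MB)    # 16 MB
largest = total_memory_mb(MAX_CARDS, MAX_CARD_MB)     # 256 MB
doubled = total_memory_mb(MAX_CARDS, DENSER_CARD_MB)  # 512 MB, half a gigabyte

print(smallest, largest, doubled)
```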
The instruction cache (I-cache) in the RS/6000 is a two way set associative 8KB cache with a line size of 64 bytes, or 16 instructions.15 It delivers four instructions per cycle to dispatch, continuously re-aligning them so that the leftmost instruction is valid.16 The I-cache unit is responsible for dispatching instructions to their respective units. The dispatcher selects, from the available instructions, the first branch instruction, the first condition instruction, and two fixed or floating point instructions. The branch unit and the condition unit execute their instructions immediately after receiving them from the dispatcher. The two fixed or floating point instructions, however, are sent to the instruction buffers to await execution.17 In the RS/6000 processor, the fixed and floating point units are not affected at all by the result of a branch instruction.18 In most cases, they receive an uninterrupted stream of instructions, resulting in a zero cycle branch.19 The dispatch logic of the RS/6000 even allows the fifth instruction in the I-cache to be executed by the branch unit. This allows the branch unit to completely overlap the fixed and floating point units in a situation such as a loop where there are an equal number of fixed and floating point instructions to be executed.20
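The dispatcher's selection rule can be sketched as a small function over a window of available instructions. This is a deliberately simplified model: the instruction names and the three kind labels are hypothetical, and the dependency and ordering checks that real dispatch logic would also have to perform are left out.

```python
def select_for_dispatch(window):
    """Pick instructions from the window of available instructions.

    `window` is a list of (name, kind) pairs in program order, where kind is
    "branch", "condition", or "arith" (a fixed or floating point instruction).
    These labels are hypothetical and used only for this sketch.
    Returns (first branch or None, first condition or None, up to two arith).
    """
    def first(kind):
        return next((ins for ins in window if ins[1] == kind), None)

    arithmetic = [ins for ins in window if ins[1] == "arith"][:2]
    return first("branch"), first("condition"), arithmetic

# A hypothetical four-instruction window from the body of a loop.
window = [
    ("load", "arith"),
    ("multiply_add", "arith"),
    ("store", "arith"),
    ("loop_branch", "branch"),
]
print(select_for_dispatch(window))
```

In this example the loop-closing branch is picked up in the same cycle as the first two arithmetic instructions, which is the kind of overlap that lets the fixed and floating point units see an uninterrupted instruction stream.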
All of the fixed point instructions in the RS/6000 processor are decoded and executed by the Fixed Point Unit (FXU). The FXU also decodes and executes floating point load and store instructions, because these are in effect fixed point operations. Because it controls floating point loads and stores, the FXU drives the movement of data between the Floating Point Unit (FPU) and the D-cache. The FXU has undergone little change from the RS/6000’s RISC predecessor, the 801. It contains the Arithmetic Logic Unit, which handles the arithmetic instructions sent to the FXU. It also allows independent instructions to execute while load and store instructions are in progress, a feature made possible through the use of register tagging. The FXU uses thirty-two 32-bit registers, a feature that remains the same as in the original 801. Almost all instructions carried out by the FXU complete within one cycle. One change from the original 801 is a dedicated fixed point multiply and divide unit. With it, multiply instructions take from 3 to 5 cycles. In contrast, any given divide instruction takes from 19 to 20 cycles, which makes division a relative weak point of the unit.21 To handle address translation, data locking, and page protection, the FXU contains a 128 entry two way set associative Data-Translation Look-aside Buffer (D-TLB). Additionally, page table look-ups for Instruction-Translation Look-aside Buffer (I-TLB) and D-TLB reloads are performed by the FXU. The FXU also contains the D-cache directories and controls; therefore, address generation and D-cache control for both fixed and floating point load/store operations are performed by the FXU.22 The main reason that the FXU handles address generation is to accommodate the newly implemented multiply-add instructions that are executed by the Floating Point Unit. For the multiply-add instructions to be worthwhile, data must be exchanged between the Floating Point Unit and the D-cache at a rate no slower than the multiply-add execution time.23 So that the processor can wait for the right time to write to the D-cache, the data and address of one fixed point store instruction can be held in the store buffers within the FXU. This means that both the FXU and the FPU receive their data in a timelier manner, because fixed and floating point load instructions are allowed to pass the fixed point store instructions that were ahead of them in line.24
One of the biggest accomplishments in the RS/6000 architecture is its innovative implementation of the Floating Point Unit. The Floating Point Unit (FPU) in the RS/6000 contains thirty-two 64-bit registers and has, in addition to these main registers, six rename registers for register renaming and two divide registers for floating point divide instructions. The FPU is fully pipelined, which allows one instruction to begin execution at the start of every cycle. One of the most advantageous results of this implementation is that every instruction, except for divide types, has a result latency of only two cycles in the FPU.25 The FPU handles all of the floating point multiply, divide, add, and subtract operations. Additionally, it computes a standard set of move, negate, and absolute-value operations.26 It generates one double precision result every cycle regardless of the mix of instructions in the buffer, and data held in the floating point registers is always represented in double-precision format.27 One of the highlights of the RS/6000’s FPU is its ability to execute a multiply-add instruction. What makes this feature key is that these instructions, of the form (A x B) + C, are executed with the same delay as a single multiply or add instruction. The multiply-add instruction increases the performance of the RS/6000 substantially by combining two instructions into one, which reduces the number of instructions that need to be executed in a given program. This is most apparent in scientific and graphical applications that rely heavily on matrix operations, which can make good use of the multiply-add instruction. This instruction is also more accurate than previous implementations, because the result of the multiply is not rounded before the addition takes place. Therefore, no accuracy is lost in the intermediate product during multiply-add operations. In the FPU, the two word interface with the D-cache unit provides the bandwidth required by every floating point instruction. Another prime feature of the FPU is its ability to fully overlap load and store instructions with the execution of arithmetic operations. It is able to do so through the use of register renaming. Renaming allows floating point loads to be executed independently of floating point arithmetic operations, enabling the FXU to perform floating point loads without having to wait for previous floating point arithmetic operations to complete.28
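The accuracy benefit of not rounding the product before the addition can be demonstrated numerically. The sketch below emulates a single-rounding multiply-add in software by using exact rational arithmetic and rounding once at the end. It is an emulation of the idea in double precision, not a model of the RS/6000 hardware, and the function names and operand values are chosen only for illustration.

```python
from fractions import Fraction

def multiply_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Emulated multiply-add: exact product and sum, rounded only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the single rounding step

# Operands chosen so that the exact product, 1 - 2**-60, cannot be
# represented in double precision and rounds to exactly 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(multiply_then_add(a, b, c))   # rounded product cancels completely
print(fused_multiply_add(a, b, c))  # the small true result survives
```

With these operands the separate multiply and add returns zero, because the information below the rounding point of the product is discarded before the subtraction, while the single-rounding version returns the true result.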
In the RS/6000, input/output operations are handled through an I/O channel controller, which undertakes the task of moving data between the main system memory of the computer and I/O devices (e.g. disk).29 The I/O channel controller creates a Micro Channel interface, which is a bus architecture that defines how peripheral devices and internal components communicate across the CPU’s expansion bus. The I/O unit connects to system memory through a two word interface with the SIO bus. The Micro Channel interface establishes a one word address bus and a one word data bus out of the two word interface to the SIO.30 The I/O controller architecture focuses on better performance and error handling in the RS/6000. The I/O controller’s main task is to handle the exchange of information between the system memory and the Micro Channel interface. The processor can transfer data to and from the Micro Channel interface through the use of I/O load and store operations. The Micro Channel interface, on the other hand, exchanges data with the system memory through DMA (Direct Memory Access) channels.31 DMA controllers transfer data from system memory directly to the Micro Channel interface without bogging down the processor. For data security, the I/O unit uses address protection mechanisms that protect the exchange of information in all data transfers. The unit’s I/O channel controller supports up to 15 DMA channels for improved performance over the original 801 architecture. A notable feature of the Micro Channel interface is the streaming data function. This feature enables more than one packet of data to be sent over the SIO bus within a single bus envelope. The Micro Channel architecture accomplishes this by sending a starting address followed by a single block of data that contains multiple separate packets of data. This is a powerful attribute of the I/O unit, providing a significant performance boost because it can double bandwidth on large transfers of data.32
The RS/6000 is a fully pipelined processor with several stages that execute in parallel across multiple functional units to handle the execution of various instructions. The first stage in the pipeline of the RS/6000 is the instruction fetch (IF) cycle. In this stage, four instructions are fetched from the cache arrays within the I-cache and placed into the instruction buffers. The second stage in the pipeline is the Disp/BRE cycle, in which a total of four instructions are analyzed for dispatching. Also in this stage, the branch and condition instructions are executed, the target addresses for the branch instructions are generated, and the two fixed point or floating point instructions are sent to their respective units to await execution.
The third stage in the pipeline is the FXD cycle. In this stage, the FXU decodes the instructions stored in its instruction buffer and obtains the operands from the register file. The next stage is the FXE cycle, in which the FXU executes its instructions. In this stage the D-TLBs and D-cache directories are searched on behalf of load and store instructions. The fifth stage in the pipeline is the C cycle, in which the D-cache arrays are accessed.34 The WB cycle, which is the next stage in the pipeline, is responsible for writing fixed point instruction results to the register file. The PD cycle is the next stage in the pipeline, in which the FPU pre-decodes its instructions. Following the PD cycle is the Remap stage, in which the floating point instruction registers are mapped to the physical registers. In the next stage, the FPD cycle, the FPU actually decodes its instructions. The next two stages, the FPE1 cycle and the FPE2 cycle, are where multiply-add instructions are executed. The last stage in the pipeline is the FPWB cycle. In this stage, the results from floating point operations, except for load/store types, are written to the register file.35
Going through a cycle-by-cycle analysis of the pipeline, we start with cycle 1, in which the first four instructions are fetched from the I-cache. In cycle 2, the instructions are analyzed for dispatch, branch and condition instructions are executed, target addresses are generated, and the fixed point and floating point instructions are sent to their respective units to be executed. Also in this cycle, the next four instructions in line are fetched from the I-cache. The third cycle sends the next two fixed point or floating point instructions to their respective units, while the FXU is decoding the first floating point instruction and the FPU is pre-decoding its first instruction. Also in this cycle, the FXU will execute the floating point load instruction and the FPU will send the first two instructions to the PD stage for renaming.36 Additionally, another four instructions are fetched from the I-cache. In the third cycle, however, the four instructions fetched from the I-cache include the BCT, a special loop-closing branch instruction, followed by the next three instructions in line. In the fourth cycle, the FXU generates the address for the first floating point load instruction. The third instruction pair is dispatched to the FXU and FPU respectively. The FPU renames the load instruction and the multiply-add. Also, the second fixed point or floating point instruction pair is in the FXD and FPD cycles. In the fourth cycle the address of the BCT is also generated. In cycle 5 the next four instructions are fetched from the I-cache, while the fourth pair of fixed point or floating point instructions is sent to the respective functional units. In this cycle the BCT is executed while the first FMA (floating point multiply-add) instruction is being processed in FPD. Meanwhile, the first floating point load is accessing the D-cache, and at the end of this cycle the FMA instruction will enter the FPE1 stage. In cycle 6, the second floating point load instruction accesses the D-cache while the second FMA instruction is decoded by the FPU. Also in this cycle, the FXU is generating the address for the first store instruction, which will be placed in the store buffer at the end of the cycle.37
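The timing of individual instructions in the walkthrough above can be checked with a small stage-occupancy sketch. It assumes an ideal, stall-free pipeline in which an instruction advances one stage per cycle, with the stage order taken from the stage descriptions earlier in this paper. The path lists and function names are illustrative, and floating point loads are assumed to follow the FXU path because the FXU executes them.

```python
# Stage order for an instruction handled by the FXU (including loads).
FXU_PATH = ["IF", "Disp/BRE", "FXD", "FXE", "C", "WB"]
# Stage order for a floating point arithmetic instruction such as an FMA.
FPU_PATH = ["IF", "Disp/BRE", "PD", "Remap", "FPD", "FPE1", "FPE2", "FPWB"]

def stage_in_cycle(path, fetch_cycle, cycle):
    """Return the stage an instruction occupies in `cycle`, or None.

    Assumption: no stalls, so the instruction moves one stage per cycle
    starting from the cycle in which it is fetched.
    """
    index = cycle - fetch_cycle
    if 0 <= index < len(path):
        return path[index]
    return None

def timeline(path, fetch_cycle, last_cycle):
    """List the stage occupied in each of cycles 1..last_cycle ('-' if none)."""
    return [stage_in_cycle(path, fetch_cycle, c) or "-"
            for c in range(1, last_cycle + 1)]

print("first load:", timeline(FXU_PATH, 1, 6))
print("first FMA: ", timeline(FPU_PATH, 1, 6))
```

Under these assumptions, a load fetched in cycle 1 reaches the D-cache access stage in cycle 5 and an FMA fetched in cycle 1 is in FPD in cycle 5 and in FPE1 in cycle 6, which matches the walkthrough.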
The advantage that the RS/6000 architecture gives to the pipeline is that the loop-closing branch instruction does not affect its performance whatsoever. In fact, the FXU and FPU operate completely independently of the branch instructions. This allows the floating point pipeline to remain busy throughout the different stages, allowing for two floating point results in every cycle.38
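The figure of two floating point results per cycle translates directly into a peak floating point rate once a clock frequency is chosen. In the sketch below, the reading that the two results come from counting a multiply-add as one multiply plus one add is an interpretation, and the 25 MHz clock is purely a hypothetical value for the example, since this paper does not state a clock rate.

```python
def peak_mflops(clock_mhz, results_per_cycle=2):
    """Peak rate in millions of floating point operations per second.

    Assumption: the pipeline sustains `results_per_cycle` results every
    cycle, which is only possible when it never stalls.
    """
    return clock_mhz * results_per_cycle

# Hypothetical clock frequency, not taken from this paper.
example_clock_mhz = 25
print(peak_mflops(example_clock_mhz))
```

This is an upper bound rather than an expected rate; real programs fall short of it whenever the loop cannot keep a multiply-add issuing every cycle.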
The RS/6000 has many architectural advantages over other comparable processors. Perhaps its biggest advantage is its Floating Point Unit, which greatly increases the processor’s performance through features such as the multiply-add instruction, register renaming, and its lightning fast two cycle pipeline. Another aspect that makes the RS/6000 desirable over other architectures is its simplicity of design. The computer designers at IBM left its organization open to adapt to the constant advancement in processor architecture. An equally important advantage of the RS/6000 processor is its ability to process zero cycle branches, which cut down on execution time and processor load immensely. Lastly, the processor’s very fast exception handling and recovery is a huge advantage to the system.
The disadvantages of the RS/6000, although fewer in number than its advantages, can prove to be a performance bottleneck in some cases. The biggest disadvantage of the system is its inability to accommodate general out-of-order execution. The processor does process operations out of order, but only those involving accesses to the D-cache. Needless to say, overall performance would be better if the system were able to execute instructions out of order. However, the cost of doing so in the RS/6000 may outweigh the advantage. Implementing out-of-order execution in the processor would require more flexibility in register renaming and a larger amount of logic, which, in the long run, may not be worth the added performance.