Hardware generated from a high-level language can be optimized in a number of ways. Many high-level optimizations on the source language will translate to optimized hardware, but some high-level optimizations may be hardware-neutral or even generate less-efficient hardware. Jump optimizations, for example, are hardware-neutral because jump instructions do not create hardware at the synthesis step: they merely control the naming and routing of execution bits. Similarly, optimizations designed to ease register pressure or other register-renaming optimizations will have no effect once the code has been translated to SSA form. Common subexpression elimination, on the other hand, has the same effect on hardware as it does on the source language: it reduces the number of computations necessary to achieve a result.
Certain optimizations may be done at either a high or low level. Source-language common subexpression elimination may alternatively be performed as part of a logic-reuse algorithm at a low level, and dead-code elimination can be done by pruning unused logic. In fact, the low-level optimizations may be able to take advantage of bit-level structure or the logic implementation of certain operators to optimize details invisible to a high-level analysis. However, the low-level optimizations will invariably be much slower than their high-level equivalents, due to the larger dataset they must process and the often-exponential running time of logic optimization algorithms. Hence, we want to perform our optimizations at a high level whenever possible. We also hope to take advantage of source-language information which may be lost in translation; induction-variable transformations, for example, may be very difficult to perform at a logic level.
As mentioned in section 4, the SUIF optimizer porky was used to perform a variety of general-purpose high-level optimizations on our input source. We were also interested in examining high-level hardware-specific optimizations, and in particular whether some of the disadvantages of C as a hardware description language could be overcome by intelligent optimization.
The bitsize module implemented in this work was designed to address C's inflexible type system. In hardware design, the exact bitwidth of various datapaths is often known exactly, and inefficiencies are introduced when this datapath is expressed using one of C's fixed-width integer types. The bitsize module attempts to recreate the bitwidth information using the arithmetic properties of the computation performed. All constants are given exact bitwidths, and the result of combining variables and constants with a given operator can often be assigned a maximum bitwidth as well. The SUIF type information for all variables was then modified to represent the newly computed restricted bitwidths.
The current implementation does not take advantage of all possible bit-width information; it implements only a def-use analysis on the SSA-format input. Further optimization is often possible by examining the control flow graph. For example, analysis of the code in figure 6 could determine that i2 was always 3 bits wide if the condition at the bottom of the loop was true and i2 was an unsigned type. Since i0 is known to be a 1 bit type, this could be used to assign i1 a width of 3 bits as well. i2 would be pushed up to 4 significant bits by the increment statement, but its value at L1 would remain at 3 bits because of the loop condition. This optimization requires splitting variables at branches to differentiate between the width of i2 at L1 (three bits) and its width at L2 (four bits). If the width of i0 was unknown, the loop may be rewritten to allow the branch information to narrow the variable width in all but the first iteration. This and other sophisticated width optimizations are possible improvements that were not implemented in the current codebase.
In an attempt to evaluate the bitsize module on a realistic hardware description, it was fed the main routine of a PDP-8 simulator written for an earlier project. The PDP-8 has a twelve bit datapath, and various methods were used in the code to truncate results to obtain an accurate simulation. The unmodified code also contains string operations to output the machine state to the console during simulation; it is to be expected that these operations would not benefit from the bitsize module, but they would also not typically be synthesized into hardware.
The bitsize module managed to trim 315 of the 1,466 compiler variables in the ExecuteNextInstruction function; this corresponded to reducing the summed datapath width of all variables from 21,715 to 30,056 bits (eliminating 28% of the bit paths). Elimination of the string manipulation code changed these numbers to 261 of 1,129 variables, for a new sum datapath width of 16,932 bits from 23,808 bits (eliminating 41% of the bits). The 12-bit datapath was defined using C 32-bit int types, so a 'perfect' reduction would have eliminated 60% of the datapath bits. See figure 7, and note that the actual machine state variables were outside the analyzed routine, so they propagated their full width through the calculations. By adding analysis of static variables, it is expected that compiler bitwidth optimization can approach the ideal closely.
|