Implementation

The C hardware compiler discussed in this paper was implemented using the SUIF (Stanford University Intermediate Format) compiler system [13]. Translation from C to structural VHDL takes place in a six stage pipeline, using two code libraries to extend the base SUIF system. Except for the first and last, every module reads and writes the SUIF intermediate representation, annotated with the various pieces of information collected at each stage. Figure 3 shows the pipeline schematically.

**Figure 3:** Hardware compiler pipeline stages
$\includegraphics[angle=270,width=3in]{pipeline.eps}$

The standard SUIF front-end, scc, is used to generate a structured intermediate-format file from C source code. The output from this stage is given an .spd suffix.

The next module is named exit1. Its job is to create single-entry, single-exit functions from the general SUIF code output by scc. It creates a new label and a return opcode at the end of each function, using a new ``return value'' variable. It then replaces return statements through the code with cpy and jmp opcodes. This code restructuring makes it easier for the VHDL-generation stage to synthesize the function output port. The output of exit1 is given an .sp1 extension.

The SUIF optimizer, porky, is then used to convert the .sp1 file into low-SUIF by breaking down high-level for, loop, if, block, and mbr structures. This simplifies the VHDL generation phase by reducing the number of different opcodes it must translate. We also perform constant propagation, common subexpression elimination, and dead-code elimination to reduce the amount of generated logic. These optimizations could all be performed by a aggressize logic optimization after synthesis, but at a greatly increased cost. Finally, porky does jump optimization to remove any inefficiencies introduced by the simple-minded exit1 pass. The optimized low-SUIF output is given an .spx extension.

The ssa module then translates the intermediate representation into Static Single Assignment (SSA) form [3]. A control flow graph of basic blocks is generated from the input, using a version of the cfg library from the machsuif package highly modified to handle low-SUIF instead of machine-SUIF. The dominator tree and dominance frontier are then computed and used to place phi functions on the control flow graph according to [1]. The variables in the program are renamed and the phi function information written out as annotations to the SUIF instruction objects. The phi library manages the phi function information, which is reused in all following stages. The output of this stage is given the extension .ssa.

Modules to optimize the intermediate representation to generate more efficient hardware fit into the pipeline after the conversion to SSA form. One such module was implemented, which attempts to narrow the bit-widths of variables in order to generate more efficient datapaths. This bitsize module will be discussed in section 5.

The final module, vgen, generates structural VHDL code from an SSA-annotated input file. It is worth noting that the structural VHDL produced is merely a convenient notation for describing logic at a module and netlist level; VHDL is a language much more capable than the use to which we are putting it.

**Figure 4:** Hardware generated for straight-line code.
$\includegraphics[width=3in]{straight.eps}$

The rules for generating logic from straight-line SSA form input are illustrated in figure 4. Each basic three-register instruction maps directly to a hardware functional unit; its source and destination registers indicate connections to the ports of the unit.

Figure 5: Hardware translation of an if-else statement in SSA form, using execution bits (dashed lines).

$\includegraphics[width=3in,trim=0 -45.5 0 -45]{ifelse.eps}$

L0:	if (b < 5)
L1:	a₀ = b + 1;
	else
L2:	a₁ = c + 2;
L3:	a₂ = $\phi$ (a₀,a₁);

Each basic block has an additional signal associated with it, which we call the ``execution bit.'' This bit is set to pass the flow of control, and corresponds to control flow graph edges. Branch instructions switch the bit according to their target, and phi functions become multiplexers controlled by the bits. Figure 5 illustrates.

Figure 6: Hardware translation of a simple for-loop, illustrating the handling of control-flow backedges. L_1a is the logical OR of L₀ and L_1c.

$\includegraphics[width=3in]{loop.eps}$

L0:	i₀ = 0;
L1:	i₁ = $\phi$ (i₀,i₂);
	i₂ = i₁ + 1;
	if (i₂ < 8) goto L1;
L2:	a = i₂ + b;

Back edges in the control flow graph have registers added in-line, as figure 6 illustrates. Note that the input execution bit (L₀) is only active for one clock cycle; consequently the output execution bit (L₂) is also active for only one cycle. This is the minimum cycle implementation of the simple loop structure of the figure; the first loop iteration is overlapped with the evaluation of pre-loop code, and the post-loop code executes on the same clock cycle as the final loop iteration. The execution bits form a one-hot encoding [6] of the loop state in a simple finite-state machine.

The function prototype becomes an entity and port description for the generated logic.