Advantages of transport-triggered architectures

The advantages of TTAs fall into two groups: implementation advantages and new compiler optimization possibilities. The most important implementation advantages are:

Ideal for employing superpipelining at both operation and data transport level. FU pipelines can be stretched to make shorter cycle times possible. The only lower bound on the clock cycle is the register-register transfer time across the network. The network itself can also be superpipelined; in that case the achievable clock cycle time reduces to the register-register transfer time within one network cluster, which can be very short.

Ideal for employing functional parallelism at both operation and data transport level. FUs and transport capacity can be added to increase parallelism. Unlike OTAs, CPUs using a TTA do not need three busses and three register ports for each operation per instruction (e.g. twelve ports for a VLIW issuing four operations per instruction). Three busses and ports per operation is a worst-case assumption, since many operations do not need them: some operations need only a single source operand, and others do not produce a result. In addition, many results are bypassed directly to the next FU without being stored in a GPR. Furthermore, the compiler can perform many TTA-specific optimizations that reduce the needed transport capacity, and therefore the network requirements (such as the number of transport busses and the network connectivity), even further. As a consequence, the required VLSI area is reduced and the cycle time is further improved.
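The port arithmetic above can be made concrete with a small sketch. The `Op` class, both function names and the four-operation instruction are hypothetical, chosen only for illustration; the counts are not measurements from the MOVE project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    sources: int       # source operands the operation reads
    has_result: bool   # True if the operation produces a result


def worst_case_ports(ops_per_instruction: int) -> int:
    """Worst case from the text: two source reads and one result write per operation."""
    return 3 * ops_per_instruction


def needed_transports(ops) -> int:
    """Transports required when every operand and result is moved individually."""
    return sum(op.sources + (1 if op.has_result else 0) for op in ops)


# Hypothetical four-operation instruction, chosen only for illustration.
instruction = [
    Op("add", sources=2, has_result=True),
    Op("neg", sources=1, has_result=True),
    Op("store", sources=2, has_result=False),
    Op("jump", sources=1, has_result=False),
]

print(worst_case_ports(len(instruction)))  # 12
print(needed_transports(instruction))      # 8
```

For this assumed mix, eight transports suffice where the worst case provisions twelve ports, and this is before any bypassing is taken into account.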

Ideal for the design of application specific processors (ASPs). FU parameters (like number of FUs, supported operations, latencies, throughput and pipelining degree) and interconnection network parameters (like topology, number of busses and pipelining degree) can be set according to the needs of an application domain.
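As a rough sketch, the parameters listed above could be collected in a description like the following. All class names, field names and example values are assumptions made for illustration; they are not the MOVE framework's actual template format.

```python
from dataclasses import dataclass


@dataclass
class FUSpec:
    name: str
    operations: tuple         # supported operations
    latency: int              # cycles from trigger to available result
    throughput: float = 1.0   # operations started per cycle
    pipeline_stages: int = 1  # pipelining degree
    count: int = 1            # number of FUs of this kind


@dataclass
class NetworkSpec:
    topology: str
    busses: int
    pipeline_stages: int = 1  # pipelining degree of the network


@dataclass
class ProcessorSpec:
    fus: list
    network: NetworkSpec

    def total_fus(self) -> int:
        """Total number of FU instances in the processor."""
        return sum(fu.count for fu in self.fus)


# Hypothetical parameter values for a small application domain.
spec = ProcessorSpec(
    fus=[
        FUSpec("alu", ("add", "sub", "and", "or"), latency=1, count=2),
        FUSpec("mul", ("mul",), latency=3, pipeline_stages=3),
    ],
    network=NetworkSpec(topology="fully connected", busses=4),
)

print(spec.total_fus())  # 3
```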

Ideal for automatic generation. The CPU has a very simple design; this is a consequence of having independent FUs and of the reduced data transport requirements. Furthermore, the network is less complex: no complex bypassing hardware is needed, because bypassing is done in software. This simplicity makes it possible to use a silicon compiler for automatic layout generation. The inputs to the silicon compiler are a template description, VLSI building blocks (e.g. FUs), and values for the architecture parameters. The output is a CPU silicon layout ready for fabrication.

Perfectly suited for incorporating register-mapped interprocessor communication support; the concept of register-mapped functionality is inherently supported by a TTA. Register-mapped communication enables very short latencies and a high communication bandwidth. This opens the possibility of integrating systolic communication within a general purpose processor framework.

Besides the traditional compiler optimizations, TTAs offer the following unique optimizations:

More scheduling freedom. TTAs divide operations into smaller data transport components. This makes parallelism more fine-grained, and the resulting code schedules are more efficient and achieve higher CPU execution performance.

Software bypassing. A result of an operation can be used for another operation in the same cycle if software bypassing is applied. Bypassing means getting the value from the FU that produced it instead of the RU where it will be stored. Bypassing reduces the delay between (true) dependent operations.

Unlike OTAs, TTAs do not need special (associative) hardware to do bypassing; all bypassing can be done in software under control of the compiler. For example, when an operand move r0 -> add_O is scheduled in the same cycle as the result move add_T -> r0 that produces its value, the operand move must be changed into add_T -> add_O so that the result is taken from the FU instead of the RU.
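The rewrite in this example can be sketched in a few lines. The representation is an assumption made for illustration: a cycle is a list of (source, destination) moves, FU ports are recognised by an underscore in their name, and the function names are hypothetical. This is not the MOVE compiler's implementation.

```python
def is_fu_port(name: str) -> bool:
    # Assumption: FU ports look like "add_O" or "add_T"; GPRs look like "r0".
    return "_" in name


def apply_bypassing(cycle):
    """Rewrite operand moves that read a GPR written in the same cycle."""
    # Map each GPR written in this cycle to the FU port producing its value.
    producers = {
        dst: src
        for src, dst in cycle
        if is_fu_port(src) and not is_fu_port(dst)
    }
    rewritten = []
    for src, dst in cycle:
        if is_fu_port(dst) and src in producers:
            # Take the value from the FU instead of the register.
            rewritten.append((producers[src], dst))
        else:
            rewritten.append((src, dst))
    return rewritten


cycle = [("add_T", "r0"), ("r0", "add_O")]
print(apply_bypassing(cycle))
# [('add_T', 'r0'), ('add_T', 'add_O')]
```

The result move to r0 is kept here; whether it is still needed depends on other uses of r0.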

Operand sharing. When two successive operations on the same FU are guaranteed to have the same value in the operand register, one operand move can be saved. We call this operand sharing, and it can be viewed as a special form of common subexpression elimination. When one operand move is shared among all iterations of a loop, then the operand move is loop invariant and can be placed before the loop.
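A minimal sketch of operand sharing over a straight-line sequence of moves follows. It assumes that operand registers are named with an `_O` suffix, that an operand register keeps its value until it is written again, and that moves are given in program order; `share_operands` is a hypothetical name.

```python
def share_operands(moves):
    """Drop operand moves whose operand register already holds the value."""
    held = {}    # operand register -> GPR whose current value it holds
    result = []
    for src, dst in moves:
        if dst.endswith("_O"):
            if held.get(dst) == src:
                continue          # same value already present: share it
            held[dst] = src
        else:
            # dst is overwritten, so operand registers loaded from its old
            # value no longer match a fresh read of dst.
            held = {port: reg for port, reg in held.items() if reg != dst}
        result.append((src, dst))
    return result


moves = [("r1", "add_O"), ("r2", "add_T"),
         ("r1", "add_O"), ("r3", "add_T")]
print(share_operands(moves))
# [('r1', 'add_O'), ('r2', 'add_T'), ('r3', 'add_T')]
```

If r1 were written between the two additions, the second operand move would be kept, because the operand register would no longer be guaranteed to hold the right value.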

Dead result move elimination. When all uses of a result are served via software bypassing, the result move can be eliminated. This saves one move and the usage of one GPR. Since most results are only used once or twice (e.g. temporaries), dead result move elimination occurs frequently.
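The following sketch shows the idea on a flat list of moves in program order. It assumes GPRs are named r0, r1, ..., that `live_out` lists the registers still needed after the sequence, and that a single pass is sufficient; all function names are hypothetical.

```python
def is_gpr(name: str) -> bool:
    # Assumption: GPRs are named "r" followed by digits.
    return name[:1] == "r" and name[1:].isdigit()


def is_used_later(moves, index, reg, live_out):
    """True if reg is read after position index, before being overwritten."""
    for src, dst in moves[index + 1:]:
        if src == reg:
            return True
        if dst == reg:
            return False
    return reg in live_out


def eliminate_dead_result_moves(moves, live_out=frozenset()):
    """Drop moves into GPRs whose value is never read afterwards."""
    kept = []
    for i, (src, dst) in enumerate(moves):
        if is_gpr(dst) and not is_used_later(moves, i, dst, live_out):
            continue   # every use was bypassed, so the move is dead
        kept.append((src, dst))
    return kept


# After bypassing, r0 is no longer read by any move.
moves = [("add_T", "r0"), ("add_T", "add_O"), ("r5", "add_T")]
print(eliminate_dead_result_moves(moves))
# [('add_T', 'add_O'), ('r5', 'add_T')]
```

Passing `live_out={"r0"}` keeps the result move, since r0 would then still be needed after the sequence.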

Reduced GPR demand. TTAs need fewer GPRs since (1) results are directly bypassed from FU to FU, and (2) operations can stay in a pipeline longer than is needed to execute them, which makes GPR lifetimes shorter.

Initial experiments [11] have shown that the results of these new optimizations are very promising. The experiments show that TTAs perform 20-50% better than OTAs with similar hardware for scientific code using adapted software pipelining techniques.

In [12] TTAs are analysed for general purpose applications, using a basic block scheduling compiler. It is shown that operation-based architectures require up to 30% more transport capacity; however, to exploit this capacity efficiently with a TTA, a scheduling scope far beyond basic block boundaries is needed. The latter is currently under development. First results will be published in [13]. It is shown that scheduling beyond basic blocks gives a 40% performance improvement.

For a more extensive description and evaluation of the MOVE concept see e.g. [15][11][14][10].

Last modified on March 18th, 1997 by Irek Karkowski, email I.Karkowski@et.tudelft.nl