1 Architecture Template

1 Transport Triggered Architecture

Application-specific processors designed with TCE are derived from a processor template called Transport Triggered Architecture (TTA). For a detailed description of the ideas behind TTA, refer to [Cor97]. A short introduction is presented in [Bou01].

TTA is based on the VLIW (Very Long Instruction Word) processor paradigm but solves its major bottlenecks: the complexity of the register file (RF) and of the register file bypass network. Like VLIW, TTA is statically scheduled at compile time and supports instruction-level parallelism (ILP). TTA instructions are commonly hundreds of bits wide. TTAs can have multiple independent function units (FUs) and a customized interconnection network, as illustrated in Fig. 7.1.

The term transport-triggered means that instruction words control the operand transfers on the interconnection network and the operation execution happens as a side effect of these transfers. An operation is triggered to execute when a data operand is transferred to a specific trigger input port of an FU. Each instruction contains one slot for each transport bus. For example, ``FU0.out0 -> LSU.trig.stw'' moves data from output port 0 of function unit 0 to the trigger input port of the load-store unit (LSU). After that, the LSU starts executing the triggered operation, in this case a store of a word to memory.
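
The triggering behaviour described above can be illustrated with a minimal Python sketch. It is a toy model, not part of TCE: the class `ToyFU`, its method names, and the single-cycle execution are assumptions made for illustration, and real FUs have latencies and pipelines that are ignored here.

```python
# Toy model of transport triggering (hypothetical, not a TCE API):
# an FU executes only when its trigger port is written.
class ToyFU:
    """Function unit with one operand port, one trigger port and one output port."""

    def __init__(self, operations):
        self.operations = operations  # opcode name -> two-argument function
        self.operand = 0              # register of the non-triggering input port
        self.out = 0                  # register of the output port

    def write_operand(self, value):
        # A move to a non-triggering port only latches the value.
        self.operand = value

    def write_trigger(self, opcode, value):
        # A move to the trigger port starts the operation as a side effect.
        self.out = self.operations[opcode](self.operand, value)


alu = ToyFU({"add": lambda a, b: a + b, "sub": lambda a, b: a - b})

alu.write_operand(7)         # corresponds to a move "7 -> ALU.in1"
result_before = alu.out      # nothing has executed yet
alu.write_trigger("add", 5)  # corresponds to a move "5 -> ALU.in2t.add"
result_after = alu.out       # the addition ran because the trigger port was written
```

The first move changes only the stored operand; the output port changes only after the move to the trigger port.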

Figure 7.1: TTA basic structure

In the basic case, all FU input and output ports are registers, which relieves the pressure on the register file. Thanks to the programming model, operands can be bypassed directly from one FU to another. This is called software bypassing. Additionally, if the bypassed operand is not needed by any other operation, there is no need to write it to a register file at all. This optimization technique is called dead result elimination. Combining software bypassing with dead result elimination helps to reduce register file traffic. Moreover, the TCE TTA template gives freedom for partitioning register files. For example, there can be several small and simple register files instead of one centralized multiported RF.
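
As a rough illustration of software bypassing and dead result elimination, the following Python sketch rewrites a list of moves for the expression (r1 + r2) * r3. The move representation, the function `software_bypass`, and the port and register names are hypothetical. The sketch also assumes that each FU output port still holds its result when the bypassed read happens, which a real scheduler has to verify.

```python
def software_bypass(moves, dead_registers):
    """Rewrite (source, destination) moves; names starting with 'RF.' are registers.

    dead_registers: registers whose values are not needed after this sequence.
    """
    producer = {}  # register -> FU output port that currently holds the same value
    bypassed = []
    for src, dst in moves:
        if src in producer:
            src = producer[src]           # software bypassing: read the FU output directly
        if dst.startswith("RF."):
            if src.startswith("RF."):
                producer.pop(dst, None)   # register copy: no FU port to bypass from
            else:
                producer[dst] = src
        bypassed.append((src, dst))
    # Dead result elimination: drop writes to dead registers that nobody reads any more.
    still_read = {src for src, _ in bypassed if src.startswith("RF.")}
    return [(s, d) for s, d in bypassed
            if not (d in dead_registers and d not in still_read)]


moves = [
    ("RF.r1", "ALU.in1"),
    ("RF.r2", "ALU.in2t.add"),
    ("ALU.out", "RF.r4"),       # temporary result
    ("RF.r4", "MUL.in1"),
    ("RF.r3", "MUL.in2t.mul"),
    ("MUL.out", "RF.r5"),
]
optimized = software_bypass(moves, dead_registers={"RF.r4"})


def rf_accesses(move_list):
    # Count register file reads and writes in a move list.
    return sum(s.startswith("RF.") for s, _ in move_list) + \
           sum(d.startswith("RF.") for _, d in move_list)
```

In this example the temporary result travels directly from `ALU.out` to `MUL.in1`, and the write to `RF.r4` disappears, so the sequence needs fewer moves and fewer register file accesses.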

Transport programming is also beneficial because it allows easy scalability of the architecture and the compiler, and supports varying pipeline depths in FUs. At the same time, the number of FU inputs and outputs is not restricted, unlike in most processor templates, which support only instructions with 2 input values and 1 output value. The user can also create instruction set extensions with a special function unit (SFU), which can have an arbitrary number of I/O operands. Instruction set extension is a powerful way of enhancing the performance of certain applications.

Table 7.1: Configurable aspects of the TCE's TTA template
Property                 Values                            Example
Functional unit          Type, count                       3x ALU, 2x LSU, 1x MUL, 1x ctrl...
Register file (RF)       #registers, #RFs, #ports, width   16x 32b RF 2x rd + 2x wr ports, 16x 1b boolean RF
Interconnection network  #buses, #sockets                  5 buses, total of 43 write and 44 read sockets
Memory interfaces        Count, type                       2x LSU for SRAM w/ 32b data & 32b addr
Special FU               User-defined functionality        dct, semaphor_lock, FIFO I/O

2 Immediates/Constants

The TTA template supports two ways of transporting program constants in instructions, such as ``add value+5''. Short immediates are encoded in the move slot's source field, and thus consume a part of a single move slot. The constants transported in the source field should usually be relatively small in size, otherwise the width of a move slot is dominated by the immediate field.

Wider constants can be transported by means of so-called long immediates. Long immediates can be defined using an ADF parameter called instruction template. The instruction template defines which slots are used for pieces of the long immediate and which for defining the transports (moves). A slot cannot be used for regular data transports while it is used for transporting a piece of a long immediate.

An instruction template defining a long immediate also provides a target to which the long immediate must be transported. The target register resides in a so-called immediate unit, which is written directly from the control unit, not through the transport buses. The immediate unit is like a register file except that it contains only read ports and is written only by the instruction decoder in the control unit when it detects an instruction with a long immediate (see Fig. 7.1).

Thus, in order to support the special long immediate encoding, one has to add a) an instruction template that supports transporting the pieces of the immediate using full move slots, and b) at least one long immediate unit (a read-only register file) to which the instruction writes the immediates and from whose registers the immediates can be read into the datapath.
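
The choice between the two encodings can be sketched in Python as follows. The field widths (an 8-bit signed short immediate, a 32-bit long immediate carried in two 16-bit pieces) and the most-significant-piece-first ordering are assumptions chosen for illustration; in a real design they follow from the ADF, and the helper names are hypothetical.

```python
def fits_short_immediate(value, width, signed=True):
    """Check whether value fits a short immediate field of the given bit width."""
    if signed:
        return -(1 << (width - 1)) <= value < (1 << (width - 1))
    return 0 <= value < (1 << width)


def split_long_immediate(value, piece_widths):
    """Split value into pieces, one per move slot, most significant piece first."""
    total = sum(piece_widths)
    value &= (1 << total) - 1          # two's complement wrap to the total width
    pieces = []
    shift = total
    for width in piece_widths:
        shift -= width
        pieces.append((value >> shift) & ((1 << width) - 1))
    return pieces


small_fits = fits_short_immediate(5, width=8)            # small constant: short immediate
large_fits = fits_short_immediate(0x12345678, width=8)   # too wide for the source field
pieces = split_long_immediate(0x12345678, [16, 16])      # carried by two move slots
```

Here the constant 5 fits the assumed 8-bit source field, whereas 0x12345678 has to be split into pieces that occupy two full move slots of the instruction.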

3 Operations, Function Units, and Operand Bindings

Due to the way TCE abstracts operations and function units, an additional concept of operand binding is needed to connect the two in processor designs.

Operations in TCE are defined in a separate database (OSAL, Sections 2.2.6 and 4.3) in order to allow defining a reusable database of ``operation semantics''. The operations are used in processor designs by adding function units (FUs) that implement the wanted operations. Operands of the operations can be mapped to different ports of the implementing FU, which affects programming of the processor. The mapping of operation operands to the FU ports must therefore be described explicitly by the processor designer.

Example. A designer adds an FU called 'ALU' which implements operations 'ADD', 'SUB', and 'NOT'. ALU has two input ports called 'in1' and 'in2t' (triggering), and an output port called 'out'. A logical binding of the 'ADD' and 'SUB' operands to ALU ports is the following:

 ADD.1 (the first input operand) bound to ALU.in1
 ADD.2 (the second input operand) bound to ALU.in2t
 ADD.3 (the output operand) bound to ALU.out

 SUB.1 (the first input operand) bound to ALU.in1
 SUB.2 (the second input operand) bound to ALU.in2t
 SUB.3 (the output operand) bound to ALU.out

However, the operation 'NOT' (bitwise negation) has only one input, so that input must be bound to the triggering port 'ALU.in2t' in order for the operation to be triggered:

 NOT.1 bound to ALU.in2t
 NOT.2 (the output operand) bound to ALU.out

Because we have a choice in how we bind the 'ADD' and 'SUB' input operands, the binding has to be explicit in the architecture definition. The operand binding described above defines an architecturally different TTA function unit from the following:

 SUB.2 bound to ALU.in1
 SUB.1 bound to ALU.in2t
 SUB.3 bound to ALU.out

The rest of the operands are bound as in the first example.

Due to the differing 'SUB' input bindings, one cannot run code scheduled for the first processor on a machine whose ALU uses the latter operand bindings. This small detail is important to understand when designing more complex FUs that implement multiple operations with different numbers of operands of varying sizes, but it is usually transparent to the basic user of TCE.
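
The incompatibility can be made concrete with a small Python sketch. The dictionaries and the function `execute_sub` are hypothetical modelling aids rather than TCE data structures; the port names follow the example above.

```python
# Operand bindings for SUB: (operation, operand index) -> ALU port name.
BINDING_FIRST = {("SUB", 1): "in1", ("SUB", 2): "in2t", ("SUB", 3): "out"}
BINDING_SECOND = {("SUB", 1): "in2t", ("SUB", 2): "in1", ("SUB", 3): "out"}


def execute_sub(port_values, binding):
    """Compute SUB.1 - SUB.2, taking each operand from the port it is bound to."""
    operand1 = port_values[binding[("SUB", 1)]]
    operand2 = port_values[binding[("SUB", 2)]]
    return operand1 - operand2


# Code scheduled for the first binding computes 10 - 3 with these two moves:
#   10 -> ALU.in1
#    3 -> ALU.in2t.sub
port_values = {"in1": 10, "in2t": 3}

result_first = execute_sub(port_values, BINDING_FIRST)    # intended machine
result_second = execute_sub(port_values, BINDING_SECOND)  # same moves, other binding
```

The same two moves yield 10 - 3 on the first machine but 3 - 10 on the second, which is why the binding is part of the architecture definition.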

Reasons for wanting to fine-tune the operand bindings might include using input ports of a smaller width for some operation operands. For example, the width of the address operands in the memory access operations of a load-store unit is often smaller than the data width. Similarly, the second operand of a shift operation, which defines the number of bits to shift, requires fewer bits than the shifted data operand.

4 Datapath Connectivity Levels

The datapath interconnection network of TTAs is visible to the programmer (in practice, the compiler). This enables full customization of the connectivity, making it possible to remove connections that are rarely, if ever, used by the programs the processor at hand is designed to run. This offers notable savings in HW area. However, the fewer connections the machine has, the more challenging it becomes to automatically produce efficient code for it. This section describes the different TTA ``connectivity levels'' and their support in the TCE design flow.

The currently identified and supported connectivity levels are, in the order of descending level of connectivity, as follows:

  1. Fully connected. Completely connected interconnection network ``matrix''. All bus-socket and socket-port connections are there. There is a shortcut for creating this type of connectivity in the ProDe tool.

    The easiest target for the high-level language compiler tcecc. However, usually not a realistic design due to its high implementation costs.

  2. Directly reachable. The connectivity has been reduced. However, there is still at least one direct connection from each function unit (FU) and register file (RF) output to all inputs.

    An easy target for tcecc.

  3. Fully RF connected. All FUs are connected to all RFs. That is, you can read and write any general purpose register (GPR) from any FU with a single move. However, some or all bypass connections between FUs might be missing.

    An easy target for tcecc. However, reduction of bypass connections means that less software bypassing can be done.

  4. Reachable. All FUs are connected to at least one RF and all RFs (and thus other FUs) can be reached via one or more additional register copy moves.

    Compilation is fully supported by tcecc. The number of copies is not limited by tcecc. However, this style of connectivity results in suboptimal code, because the additional register copies introduce extra moves, consume registers, and add dependencies to the code that hinder parallelism.

  5. RF disconnected. Some FUs are not connected to any RF or there are ``separated islands'' without connectivity to other ``islands''.

    Not supported by tcecc. However, any connectivity type is supported by the TCE assembler. Thus, one can resort to manual TTA assembly programming in this case.
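
The levels listed above can be approximated with a short Python sketch that classifies a machine from its unit-to-unit connections. This is a simplified model under stated assumptions, not how TCE analyses an ADF: buses, sockets and ports are abstracted into a set of (source unit, destination unit) pairs, self-connections are ignored, and level 1 cannot be told apart from level 2 at this level of abstraction. The function and unit names are hypothetical.

```python
from itertools import product


def reachable_from(start, edges, nodes):
    """Return the nodes reachable from start by following edges within nodes."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in nodes:
            if (node, nxt) in edges and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classify(fus, rfs, connections):
    """Classify connectivity; connections holds (source unit, destination unit) pairs."""
    units = fus | rfs
    if all((s, d) in connections for s, d in product(units, units) if s != d):
        return "directly reachable"
    if all((f, r) in connections and (r, f) in connections
           for f, r in product(fus, rfs)):
        return "fully RF connected"
    # Every FU must be able to read from some RF and write to some RF.
    every_fu_has_rf = all(
        any((r, f) in connections for r in rfs) and
        any((f, r) in connections for r in rfs)
        for f in fus)
    # Every RF must reach every other RF through register copy moves.
    rfs_linked = all(reachable_from(r, connections, rfs) == rfs for r in rfs)
    if every_fu_has_rf and rfs_linked:
        return "reachable"
    return "RF disconnected"


fus, rfs = {"ALU", "LSU"}, {"RF0", "RF1"}
both_ways = lambda a, b: {(a, b), (b, a)}

full = {(s, d) for s, d in product(fus | rfs, fus | rfs) if s != d}
rf_only = {pair for f, r in product(fus, rfs) for pair in both_ways(f, r)}
islands = both_ways("ALU", "RF0") | both_ways("LSU", "RF1")
chained = islands | both_ways("RF0", "RF1")

levels = [classify(fus, rfs, c) for c in (full, rf_only, chained, islands)]
```

In the `chained` machine a value produced by the ALU reaches the LSU only through RF0 and an additional copy to RF1, which corresponds to the extra register copy moves described for level 4.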

Pekka Jääskeläinen 2016-11-24