1 TCE Tour

This tutorial goes through most of the tools in TCE using a fairly simple example application. It starts from C code and ends up with VHDL of the processor and a bit image of the parallel program. This tutorial will also explain how to accelerate an algorithm by customizing the instruction set, i.e. by using custom operations. The total time to go through the tutorial is about 2 to 3 hours.

The tutorial file package is available at:

http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz

Unpack it to a working directory and cd to tce_tutorials/tce_tour.

1 The Sample Application

The test application computes a 32-bit CRC (Cyclic Redundancy Check). The C implementation was written by Michael Barr and is published in the public domain. It contains two versions of the CRC computation, but we will use only the fast version.

The program consists of two separate files: main.c contains the simple main function and crc.c contains the actual implementation. Open crc.c in your preferred editor and take a look at the code. The main difference between the crcSlow and crcFast implementations is that crcFast uses a table of precalculated values. This is a common optimization technique.
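
To make the difference between the two styles concrete, here is a minimal sketch of a bit-by-bit and a table-driven CRC-32 loop. It is not the code in crc.c: the names crc_slow, crc_init and crc_fast are hypothetical, the 32-bit type is an assumption, and the bit reflection and final XOR steps of the tutorial implementation are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x04C11DB7u   /* CRC-32 generator polynomial */
#define INIT 0xFFFFFFFFu   /* initial remainder */

static uint32_t table[256];

/* Bit-by-bit version: eight shift/XOR steps per message byte. */
uint32_t crc_slow(const uint8_t *msg, size_t n)
{
    uint32_t rem = INIT;
    for (size_t i = 0; i < n; ++i) {
        rem ^= (uint32_t)msg[i] << 24;
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem & 0x80000000u) ? (rem << 1) ^ POLY : (rem << 1);
    }
    return rem;
}

/* Precompute the remainder of every possible top byte once. */
void crc_init(void)
{
    for (uint32_t d = 0; d < 256; ++d) {
        uint32_t rem = d << 24;
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem & 0x80000000u) ? (rem << 1) ^ POLY : (rem << 1);
        table[d] = rem;
    }
}

/* Table-driven version: one lookup per message byte. */
uint32_t crc_fast(const uint8_t *msg, size_t n)
{
    /* table[1] equals POLY once crc_init() has been called */
    assert(table[1] == POLY);
    uint32_t rem = INIT;
    for (size_t i = 0; i < n; ++i) {
        uint8_t idx = (uint8_t)(msg[i] ^ (rem >> 24));
        rem = table[idx] ^ (rem << 8);
    }
    return rem;
}
```

Both functions return the same remainder for the same message. The table version does one lookup per byte where the bit-by-bit version does eight shift and XOR steps, at the cost of a 256-entry table.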

2 Starting Point Processor Architecture

We use the minimal architecture as the starting point. File minimal.adf describes a minimalistic architecture containing just enough resources that the TCE compiler can still compile programs for it. Function units in minimal.adf are selected from the hardware database (HDB, Section 2.2.2) so we are able to generate a VHDL implementation of the processor automatically later in the tutorial.

Copy the minimal.adf included in the TCE distribution to a new ADF file, which will be your starting point architecture:

cp $(tce-config --prefix)/share/tce/data/mach/minimal.adf start.adf

Take a look at the starting point architecture using the graphical Processor Designer (ProDe, Section 4.1) tool. Start ProDe with:

prode start.adf &

If you have GHDL installed and want to simulate the generated processor, it is advisable to reduce the amount of memory in the processor. Otherwise the GHDL-generated testbench might consume a tremendous amount of memory on your computer. To do this, select Edit -> Address Spaces in ProDe. Then edit the bit widths of the data and instruction address spaces and set them to 15 bits, which should be plenty for our case.

3 Evaluating the Starting Point Architecture

Now we want to know how well the starting point architecture executes our program. We must first compile the source code for the starting point architecture. This can be done with the command:

tcecc -O3 -a start.adf -o crc.tpef -k result main.c crc.c

This will produce a parallel program called crc.tpef that can be executed with the processor design start.adf. The parallel program is now tied to the specified architecture, so it can only be executed on that architecture. The switch -k tells the compiler to keep the result symbol in the generated program so that it can be accessed by name in the simulator later on.

After successfully compiling the program we can simulate it. Let's use the graphical user interface version of the simulator, called Proxim (Section 6.1.5). You can start it with:

proxim start.adf crc.tpef &

The simulator will load the architecture definition and the program and wait for commands. To execute the program click ``Run''. The simulator will then execute the program code and display the cycle count in the bottom bar of the simulator. Write down this cycle count for future comparison.

You can check the result straight from the processor's memory by writing this command to the command line at the bottom of the simulator:

x /u w result

The correct checksum result is 0x62488e82.

Processor resource utilization data can be viewed with command:

info proc stats

This will output a lot of information like the utilization of transport buses, register files and function units.

Proxim can also show various other pieces of information about the program's execution and its processor resource utilization. For example, to check the utilization of the resources of our architecture, select View>Machine Window from the top menu. The parts of the processor that are utilized the most are shown in a darker red color.

4 Accelerating the Algorithm

Custom operations implement application specific functionality in TTA processors. In this part of the tutorial we accelerate the CRC computation by adding a custom operation to the starting point processor design.

1 Evaluating Custom Operation Candidates

First of all, it is quite simple and efficient to implement the CRC calculation entirely in hardware. However, using the whole CRC function as a custom operation would be quite pointless, as the benefits of using a processor-based implementation would diminish. Instead, we will concentrate on accelerating smaller parts of the algorithm, picking a custom operation that is potentially useful for algorithms other than CRC as well.

2 Finding the Bottlenecks

In this case finding the operation to be optimized is quite obvious if you look at the function crcFast(). It consists of a for-loop in which the function reflect() is called through the macro REFLECT_DATA. If you look at the actual function you can see that it is quite simple to implement in hardware, but requires many instructions if done with basic operations in software. The function ``reflects'' the bit pattern around the middle point like a mirror. For example, the bit pattern 0101 0100 would look like this after reflection: 0010 1010. The main complexity of the function is that the bit pattern width is not fixed. Fortunately, the width cannot be arbitrary. If you examine the crcFast() function and the reflect macros you can see that reflect() is only called with bit widths of 8 and 32 (unsigned char and 'crc', which is an unsigned long).
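
As a reference for what the operation computes, a minimal C sketch of such a reflection is given here. The function name reflect_bits and the use of uint32_t are assumptions made for illustration; the code in crc.c may differ in details.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror the lowest nbits bits of data; any higher input bits are ignored. */
uint32_t reflect_bits(uint32_t data, unsigned nbits)
{
    assert(nbits >= 1 && nbits <= 32);

    uint32_t out = 0;
    for (unsigned bit = 0; bit < nbits; ++bit) {
        /* If the current lowest bit is set, set its mirror position. */
        if (data & 1u)
            out |= (uint32_t)1 << (nbits - 1 - bit);
        data >>= 1;
    }
    return out;
}
```

With the example above, reflect_bits(0x54, 8) mirrors 0101 0100 into 0010 1010, i.e. 0x2A. The loop needs a test, a shift and possibly an OR for every bit, which is why the software version is expensive compared to simple rewiring in hardware.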

5 Analyzing the Custom Operation

A great advantage of TCE is that the operation semantics, processor architecture and implementation are separate abstractions. For custom operation design this means that you can simulate your design by simply defining the simulation behaviour of the operation and setting the latency of the operation in the processor architecture definition. This is convenient because you do not need an actual hardware implementation of the operation at this point of the design, but can evaluate different custom operation possibilities at the architectural level. However, this brings up an awkward question: how to determine the latency of the operation? Overly optimistic or overly pessimistic latency estimates can produce inaccurate performance results and bias the analysis.

One approach to the problem is to take an educated guess and simulate some test cases with different custom operation latencies. This way you can determine a latency range in which the custom operation would accelerate your application to a satisfactory level. After this you can sketch how the operation could be implemented in hardware, or consult someone knowledgeable in hardware design to figure out whether the custom operation is implementable within the latency constraint.

Another approach is to try to determine the latency by examining the operation itself and considering how it could be implemented. This approach requires some insight into digital design.

Besides latency you should also consider the size of the custom function unit. It will consume extra die area, but the size limit is always case-specific. An accurate size estimate requires the actual implementation and its synthesis.

Let us consider the reflect function. If we had a fixed width we could implement the reflection by hard wiring (and registering the output) because the operation only moves bits to other locations in the word. This could be done easily in one clock cycle. But we need two different bit widths, so things are a bit more complicated. We could design the hardware in such a way that it has two operations: one for 8-bit data and another for 32-bit data. One way to implement this in hardware is to have 32-bit wide crosswiring and register the output. In this case the 8-bit value would be reflected to the 8 most significant bits of the 32-bit wiring. Then we need to move those 8 bits to the least significant end and zero the rest. This move can be implemented using multiplexers. As for latency, all of this can easily be done within one clock cycle as not much logic is needed.
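
The crosswiring-plus-multiplexer idea can be modelled in a few lines of C. This is only a behavioural sketch of the structure described above, not the VHDL used later in the tutorial: the function names are hypothetical, and the opcode values (0 for REFLECT32, 1 for REFLECT8) are borrowed from the opcode assignment made later in the HDB step.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_REFLECT32 = 0, OP_REFLECT8 = 1 };

/* Fixed 32-bit crosswiring: output bit (31 - i) is driven by input bit i. */
static uint32_t crosswire32(uint32_t in)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; ++i)
        if (in & ((uint32_t)1 << i))
            out |= (uint32_t)1 << (31 - i);
    return out;
}

/* One pass through the modelled function unit. */
uint32_t fu_reflect_model(unsigned opcode, uint32_t in)
{
    assert(opcode == OP_REFLECT32 || opcode == OP_REFLECT8);

    uint32_t wired = crosswire32(in);

    /* Output multiplexer: for the 8-bit operation the reflected byte sits
     * in the 8 most significant bits, so move it down and zero the rest. */
    return (opcode == OP_REFLECT8) ? (wired >> 24) : wired;
}
```

The same wiring serves both operations; only the final selection differs. Shifting right by 24 also discards whatever the upper 24 input bits contributed, which matches the requirement to zero the rest of the word.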


6 Creating the Custom Operation

Now we have decided on the operation to be accelerated and its latency. Next we will create a function unit implementing the operation and add it to our processor design. First, a description of the semantics of the new operation must be added to the Operation Set Abstraction Layer (OSAL, Section 2.2.6). OSAL stores the semantic properties of the operation, which include the simulation behavior, operand count etc., but not the latency. OSAL definitions can be added by using the OSAL GUI, OSEd (Section 4.2.1).

If processors that use the custom operation are to be synthesized or simulated at the VHDL level, at least one function unit implementing the operation should be added to the Hardware Database (Section 2.2.2). Cost data of the function unit needs to be added to the cost database if cost estimates of a processor containing the custom function unit are wanted. In this tutorial we add the FU implementation for our custom operation so the processor implementation can be generated, but omit the cost data required for the cost estimation.

1 Using Operation Set Editor (OSEd) to add the operation data.

OSEd is started with the command 'osed'.

Create a new operation module, which is a container for a set of operations. You can add a new module in any of the predefined search paths, provided that you have sufficient file system access on the chosen directory.

For example, choose directory `/home/user/.tce/opset/custom', where user is the name of the user account being used for the tutorial. This directory is intended for the custom operations defined by the current user, and should always have sufficient access rights.

  1. Right-click on a path name in the left area of the main window. A drop-down menu appears below the mouse pointer.
  2. Select Add module menu item.
  3. Type in the name of the module (for example, `tutorial') and press OK. The module is now added under the selected path.

2 Adding the new operations.

We will now add the operation definitions to the newly created operation module.

  1. Select the module that you just added by right-clicking on its name, displayed in the left area of the main window. A drop down menu appears.
  2. Select Add operation menu item.
  3. Type `REFLECT8' as the name of the operation.
  4. Add one input by pressing the Add button under the operation input list. Select UIntWord as type.
  5. Add one output by pressing the Add button under the operation output list. Select UIntWord as type.
  6. After the inputs and the output of the operation have been added, close the dialog by pressing the OK button. A confirmation dialog will pop up. Press Yes to confirm the action. The operation definition is now added to the module.
  7. Then repeat the steps for operation `REFLECT32'

3 Defining the simulation behaviour of the operations

The new operations REFLECT8 and REFLECT32 do not yet have simulation behavior models, so we cannot simulate programs that use these operations with the TCE processor simulator. Open the operation property dialog again by right-clicking REFLECT8 and choosing Modify properties. Now press the Open button to open an empty behavior source file for the module. Copy-paste (or type if you have the time!) the following code in the editor window (you can copy most of this code from the reflect function in crc.c):

#include "OSAL.hh"
OPERATION(REFLECT8)
 TRIGGER

 unsigned long data = UINT(1);
 unsigned char nBits = 8;

 unsigned long  reflection = 0x00000000;
 unsigned char  bit;

 /*
  * Reflect the data about the center bit.
  */
 for (bit = 0; bit < nBits; ++bit)
 {
     /*
      * If the LSB bit is set, set the reflection of it.
      */
     if (data & 0x01)
     {
         reflection |= (1UL << ((nBits - 1) - bit));
     }

     data = (data >> 1);
 }

 IO(2) = static_cast<unsigned> (reflection);

 return true;
 END_TRIGGER;
 END_OPERATION(REFLECT8)

OPERATION(REFLECT32)
 TRIGGER

 unsigned long data = UINT(1);
 unsigned char nBits = 32;

 unsigned long  reflection = 0x00000000;
 unsigned char  bit;

 /*
  * Reflect the data about the center bit.
  */
 for (bit = 0; bit < nBits; ++bit)
 {
     /*
      * If the LSB bit is set, set the reflection of it.
      */
     if (data & 0x01)
     {
         reflection |= (1UL << ((nBits - 1) - bit));
     }

     data = (data >> 1);
 }

 IO(2) = static_cast<unsigned> (reflection);

 return true;
 END_TRIGGER;
 END_OPERATION(REFLECT32)

This code contains the behaviours of both operations. Each behavior definition reflects the input operand integer (with id 1), writes the result to the ``output operand'' (with id 2), which is the first output, and signals to the simulator that all results were computed successfully.

Open file crc.c in your preferred editor. Compare the behaviour definitions of the reflect operations with the original reflect function. They are mostly similar except for parameter passing. In the custom operation behavior definition the data is read from the function unit input port and written to the output port, and the nBits value is determined by the operation code (REFLECT8 or REFLECT32).

Save the code and close the editor. REFLECT8 and REFLECT32 operations now have TCE simulator behaviour models.

4 Compiling operation behavior.

The REFLECT operations have now been added to the module. Before we can simulate the behavior of our operations, the C++-based behavior description must be compiled to a plugin module that the simulator can call.

  1. Right-click on the module name ('tutorial') displayed in the left area to bring up the drop down menu.
  2. Select Build menu item.
  3. Hopefully, no errors were found during the compilation! Otherwise, re-open the behaviour source file and try to locate the errors with the help of the diagnostic information displayed in the build dialog.

After the operation simulation model has been added and compiled, the operation can be simulated. To save time we skip the operation simulation here. If you are interested in it, see Section 4.2.1.

5 Adding a Customized Function Unit to the Architecture.

Now the operation definitions of the custom operations have been added to the Operation Set Abstraction Layer (OSAL) database. Next we need to add at least one function unit (FU) that implements these operations so that they can be used in the processor design. Note the separation between an ``operation'' and a ``function unit'' that implements the operation(s). This separation allows the same OSAL operation definitions to be used in multiple FUs with different latencies.

First, add the architecture of the FU that implements the custom operations to the starting point processor architecture. Let's take a copy of the starting point processor design which we can freely modify and still be able to easily compare the architecture with and without the custom operation support later on:

cp start.adf custom.adf

Open the copy in ProDe:

prode custom.adf &

Then:

  1. Add a new function unit to the design: right-click the canvas and select Add>Function Unit. Name the FU ``REFLECTER''. Add one input port (named trigger) and one output port (output1) to the FU in the Function unit dialog. Set the input port (trigger) as triggering (click the port named trigger->Edit->check ``triggers''). This port starts the execution of the operation when it is written to.
  2. Add the operation ``REFLECT8'' we defined to the FU: Add from opset>REFLECT8>OK and set the latency to 1. Click on the REFLECT8 operation and ensure that the operation input is bound to the input port and the output is bound to the output port. Check that the operand usage is such that the input is read at cycle 0 and the result is written at the end of that cycle (so it can be read from the FU on the next cycle). Thus, the latency of the operation is 1 clock cycle.
  3. Repeat the previous step for operation ``REFLECT32''.
  4. Now an FU that supports the custom operations has been added to the architecture. Next, fully connect the machine to connect the FU to the rest of the architecture. This can be done by selecting Tools->Fully Connect IC. Save the architecture description by clicking Save.

7 Use the custom operation in C code.

To get some benefits from the added custom hardware, we must use it from the C code. This is done by replacing a C statement with a custom operation invocation.

Let us first make a copy of the original C code, so that the original stays intact and we edit only the copy.

cp crc.c crc_with_custom_op.c

Then open crc_with_custom_op.c in your preferred text editor.

  1. Add #include "tceops.h" to the top of the file. This includes automatically generated macros which allow us to use specific operations from C code without getting our hands dirty with inline assembly.

    Usage of these macros is as follows:

     _TCE_<opName>(input1, ... , inputN, output1, ... , outputN);
    

    where <opName> is the name of the operation in OSAL. Number of input and output operands depends on the operation. Input operands are given first and they are followed by output operands if any.

    In our case we need to write operands into the reflecter and read the result from it. We named the operations ``REFLECT8'' and ``REFLECT32'', thus the macros we are going to use are as follows:

     _TCE_REFLECT8(input1, output);
     _TCE_REFLECT32(input1, output);
    

    Now we will modify the crcFast function to use the custom op. First declare 2 new variables at the beginning of the function:

     crc input;
     crc output;
    

    These will help in using the reflect FU macro.

    Take a look at the REFLECT_DATA and REFLECT_REMAINDER macros. The first one has got a magic number 8 and ``variable'' X is the data. This is used in the for-loop.

    In the for-loop the input data of the reflect function is read from message[]. Let us modify this so that at the beginning of the loop the input data is read into the input variable. Then we will use the _TCE_REFLECT8 macro to run the custom operation, and finally replace the REFLECT_DATA macro with the output variable. After these modifications the body of the for-loop should look like this:

     input = message[byte];
     _TCE_REFLECT8(input, output);
     data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
     remainder = crcTable[data] ^ (remainder << 8);
    

    Next we will modify the return statement. Originally it uses REFLECT_REMAINDER macro where nBits is defined as WIDTH and data is remainder. Simply use _TCE_REFLECT32 macro before return statement and replace the original macro with the variable output:

     _TCE_REFLECT32(remainder, output);
     return (output ^ FINAL_XOR_VALUE);
    

    And now we are ready. Remember to save the file.

  2. Compile the C code that uses the custom operations into a parallel TTA program for the new architecture, which includes an FU with the custom operations:

    tcecc -O3 -a custom.adf -o crc_with_custom_op.tpef -k result \
    crc_with_custom_op.c main.c

  3. Simulate the parallel program. This time we will use the command line simulator ttasim. We will also enable writing of a bus trace, which means that the simulator writes a text file containing the bus values of the processor for every executed clock cycle. This bus trace data will be used to verify the processor RTL implementation. Start the simulator with the command:

    ttasim

    Then enable the bus trace setting:

    setting bus_trace 1

    Load architecture and program and run the simulation

    mach custom.adf

    prog crc_with_custom_op.tpef

    run

    Verify that the result is the same as before (x /u w result): it should again be 0x62488e82. Check the cycle count with info proc cycles and compare it to the cycle count of the version which does not use a custom operation. You should see a very noticeable drop compared to the starting point architecture without the custom operations. Write this cycle count down for a later step.

    The simulator execution also created a file crc_with_custom_op.tpef.bustrace which contains the bus trace.
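
The modified crcFast() from step 1 can also be tried on a workstation before moving to the TTA target. The following is a minimal sketch under several assumptions: tceops.h is not available on the host, so HOST_REFLECT8 and HOST_REFLECT32 are hypothetical stand-ins with the same (input, output) shape as _TCE_REFLECT8 and _TCE_REFLECT32. The names crcInit and reflect_bits, the uint32_t typedef, and the standard CRC-32 values for POLYNOMIAL and INITIAL_REMAINDER are likewise assumed rather than taken from crc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t crc;                 /* assumed 32-bit CRC type */
#define WIDTH             32
#define POLYNOMIAL        0x04C11DB7u /* assumed: standard CRC-32 */
#define INITIAL_REMAINDER 0xFFFFFFFFu /* assumed: standard CRC-32 */
#define FINAL_XOR_VALUE   0xFFFFFFFFu /* assumed: standard CRC-32 */

/* Software model of the reflecter FU, used on the host only. */
static uint32_t reflect_bits(uint32_t data, unsigned nbits)
{
    assert(nbits >= 1 && nbits <= 32);
    uint32_t out = 0;
    for (unsigned bit = 0; bit < nbits; ++bit) {
        if (data & 1u)
            out |= (uint32_t)1 << (nbits - 1 - bit);
        data >>= 1;
    }
    return out;
}

/* Hypothetical host stand-ins; on the TTA target the tceops.h macros
 * _TCE_REFLECT8 and _TCE_REFLECT32 would be used instead. */
#define HOST_REFLECT8(in, out)  ((out) = reflect_bits((in), 8))
#define HOST_REFLECT32(in, out) ((out) = reflect_bits((in), 32))

static crc crcTable[256];

/* Fill the lookup table with the remainder of every possible byte. */
void crcInit(void)
{
    for (uint32_t d = 0; d < 256; ++d) {
        crc rem = d << (WIDTH - 8);
        for (int bit = 0; bit < 8; ++bit)
            rem = (rem & 0x80000000u) ? (rem << 1) ^ POLYNOMIAL : (rem << 1);
        crcTable[d] = rem;
    }
}

crc crcFast(const unsigned char message[], size_t nBytes)
{
    crc remainder = INITIAL_REMAINDER;
    crc input;
    crc output;
    unsigned char data;

    for (size_t byte = 0; byte < nBytes; ++byte) {
        input = message[byte];
        HOST_REFLECT8(input, output);            /* was REFLECT_DATA */
        data = (unsigned char)(output ^ (remainder >> (WIDTH - 8)));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    HOST_REFLECT32(remainder, output);           /* was REFLECT_REMAINDER */
    return output ^ FINAL_XOR_VALUE;
}
```

Keeping the stand-ins in the same two-argument form means the loop body is textually identical to the target version apart from the macro names, so the host build can be used to cross-check the checksum that the simulator reports.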


8 Adding an implementation of the FU to the hardware database (HDB).

Now we have seen that the custom operation accelerates our application. Next we'll add a VHDL implementation of the custom FU to the Hardware Database (HDB). This way we will be able to generate a VHDL implementation of our processor.

If you want to skip this phase you can use the given tour_example.hdb instead of creating it yourself.

Start HDBEditor (see Section 4.5):

hdbeditor &

TCE needs some data of the FU implementation in order to be able to automatically generate processors that include the FU.

  1. Create a new HDB and name it tour.hdb. Add the ``reflecter'' function unit from the custom.adf file (edit->add->FU architecture from ADF). You can leave ``parametrized width'' and ``guard support'' unchecked. Then define an implementation for the added function unit entry: right-click reflect -> Add implementation....

  2. Open file tour_vhdl/reflect.vhdl that was provided in the tutorial package with the editor you prefer, and take a look. This is an example implementation of a TTA function unit performing the custom 'reflect8' and 'reflect32' operations.

  3. The HDB implementation dialog needs the following information from the VHDL:

    1. Name of the entity and the naming of the FU interface ports.

    Name the implementation after the top level entity: ``fu_reflect''.

    By examining the VHDL code you can easily spot the clock port (clk), reset (rstx) and global lock port (glock). Operation code (opcode) port is t1opcode. Write these into the appropriate text boxes. You do not have to fill the Global lock req. port field because the function unit does not need to cause a global lock to the processor during its execution.

    2. Opcodes.

    Set the operation codes as they are in the top of the vhdl file. REFLECT32 has operation code ``0'' and REFLECT8 has operation code ``1''.

    The operation codes must be always numbered according to the alphabetical order of the OSAL operation names, starting at 0. For example, in this case REFLECT32 is earlier than REFLECT8 in the alphabetical order.

    3. Parameters.

    Parameters can be found from the VHDL file. On top of the file there is one parameter: busw. It tells the width of the transport bus and thus the maximum width of input and output operands.

    Thus, add parameter named busw, type it as integer and set the value to 32 in the Parameter dialog.

    4. Architecture ports.

    These settings define the connection between the ports in the architectural description of the FU and the VHDL implementation. Each input data port in the FU is accompanied with a load port that controls the updating of the FU input port registers.

    Choose a port in the Architecture ports dialog and click edit. Name of the architecture port p1 is t1data and load port is t1load. Width formula is the parameter busw.

    Name the output port (p2) r1data. Its width formula is also busw because the output port writes to the bus. The output port does not have a load port.

    5. Add VHDL source file.

    Add the VHDL source file into the Source code dialog. Notice that the HDL files must be added in the compilation order (see section 4.5). But now we have only one source file so we can simply add it without considering the compilation order (Add -> Browse -> tour_vhdl/reflect.vhdl).

    Now you are done with adding the FU implementation. Click OK.
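
The alphabetical numbering rule from the Opcodes step can be illustrated with a short sketch. It assumes that ``alphabetical order'' means a plain character-by-character comparison as done by strcmp; whether TCE uses exactly this comparison is an assumption, and the function name assign_opcodes is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int by_name(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort the operation names in place. Afterwards the array index of each
 * name is the opcode it would get under the alphabetical rule. */
void assign_opcodes(const char *names[], size_t count)
{
    assert(names != NULL);
    qsort(names, count, sizeof names[0], by_name);
}
```

For the two operations of this tutorial the names first differ at the character after ``REFLECT'', where '3' sorts before '8'. REFLECT32 therefore ends up at index 0 and REFLECT8 at index 1, matching the opcodes entered in the dialog.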


9 Generating the Final Products

In this step we generate the VHDL implementation of the processor, and the bit image of the parallel program.

1 Select Function Unit Implementations

You can either use the given custom_operations.idf included in the tutorial files or select the implementations yourself. If you use the given file replace custom.idf with custom_operations.idf in the following commands.

Next, we must select implementations for all components in the architecture. Each architecture component can be implemented in multiple ways, so we must choose one implementation for each component to be able to generate the implementation for the processor.

This can be done in the ProDe tool:

prode custom.adf

Then we'll select implementations for the FUs which can be done in Tools>Processor Implementation.... Note that the selection window is not currently very informative about the different implementations, so a safe bet is to select an implementation with parametrizable width/size.

  1. Select implementation for RF: Click the RF name, 'Select RF implementation', find the TCE's default HDB file from your tce installation path (PREFIX/share/tce/hdb/asic_130nm_1.5V.hdb) and select an implementation for the RF from there.

  2. Next select implementation for the boolean RF like above. But this time select an implementation which is guarded i.e. select an implementation which has word ``guarded_0'' in its name.

  3. Similarly, select implementations for the function units from TCE's default HDB. Notice that it is vital that you choose the implementation for LSU from the asic_130nm_1.5V.hdb. Then select implementation for the reflecter but this time you have to use the tour.hdb created earlier to find the FU we added that supports the REFLECT custom operations.

  4. Next select the IC/Decoder generator plugin used to generate the decoder in the control unit and interconnection network: Browse... (installation_path)/share/tce/icdecoder_plugins/base/DefaultICDecoderPlugin.so>OK. This should be selected by default.

  5. Enable bus tracing from the Implementation-dialog's IC / Decoder Plugin tab. Set the bustrace plugin parameter to ``yes'' and the bustracestartingcycle to ``5''. The IC will now have a component which writes the bus value from every cycle to a text file. Notice that this option cannot be used if the processor is synthesized.

    You do not have to care about the HDB file text box because we are not going to use cost estimation data.

  6. Click ``Save IDF...''

2 Generate the VHDL for the processor using Processor Generator (ProGe).

You can start processor generation from ProDe's implementation selection dialog: click ``Generate Processor''. For Binary Encoding Map, select ``Generate new''. For the target directory, click ``Browse'', create a new directory proge-output and select it. Then click OK to create the processor.

Or alternatively execute ProGe from command line:

generateprocessor -t -i custom.idf -o proge-output custom.adf

The directory proge-output now includes the VHDL implementation of the designed processor, except for the instruction memory width package, which will be created by the Program Image Generator. Take a look at what the directory includes: the RF and FU implementations are collected under the vhdl subdirectory, and the interconnection network that connects the units has been generated in the gcu_ic subdirectory. The tb subdirectory contains testbench files for the processor core.

3 Generate instruction memory bit image using Program Image Generator.

Finally, to get our shiny new processor some bits to chew on, we use generatebits to create instruction memory and data memory images:

generatebits -d -w 4 -p crc_with_custom_op.tpef -x proge-output custom.adf

Now the file crc_with_custom_op.img includes the instruction memory image in ``ascii 0/1'' format. Each line in that file represents a single instruction. Thus, you can get the count of instructions by counting the lines in that file:

 wc -l crc_with_custom_op.img

Accordingly, the file crc_with_custom_op_data.img contains the data memory image of the processor. Program Image Generator also created file proge-output/vhdl/imem_mau_pkg.vhdl which contains the width of the instruction memory (each designed TTA can have a different instruction width).

4 Simulation and verification

If you have GHDL installed you can now simulate the processor VHDL. First cd to proge-output directory:

cd proge-output

Then compile and simulate the testbench:

./ghdl_compile.sh

./ghdl_simulate.sh

This will take some time as bus trace writing is enabled. The simulation produces the file ``bus.dump''. As the testbench is run for a constant number of cycles, we need to extract the relevant part of the bus dump for verification. This can be done with the command:

head -n <number of cycles> bus.dump > sim.dump

where the <number of cycles> is the number of cycles in the previous ttasim execution. Then compare the trace dumps from the VHDL simulation and the architecture simulation:

diff -u sim.dump ../crc_with_custom_op.tpef.bustrace

If the command does not print anything the dumps were equal.

10 Increasing performance by adding resources

As the current architecture is minimalistic we can increase the performance even further by adding resources to the processor.

1 Transport buses.

The architecture has only one transport bus, thus the compiler can't exploit any instruction level parallelism. Let's start architecture customization by adding another transport bus. After this there can be 2 moves per clock cycle. First copy the current architecture:

cp custom.adf modified.adf

and open the new architecture in ProDe:

prode modified.adf &

A new transport bus can be added simply by selecting the current bus and pressing ``ctrl+c'' to copy the bus and then pressing ``ctrl+v'' to paste it. Add 3 more buses. After you have added the buses you have to connect them to the sockets. The easiest way to do this is to select ``Tools->Fully connect IC''. Save the architecture, recompile the source code for the new architecture and simulate.

tcecc -O3 -a modified.adf -o crc.tpef -k result crc_with_custom_op.c main.c

ttasim -a modified.adf -p crc.tpef

Now when you check the cycle count from the simulator:

info proc cycles

You might see a significant drop in cycles. Also check the processor utilization statistics from the simulator with command:

info proc stats

2 Register files.

From the previous simulator statistics you can see from the ``operations'' table that a lot of load and store operations are being executed. As the architecture has only 5 general purpose registers, this tells us that there is a lot of register spilling to memory. Let's see how the number of registers affects the cycle count. There are two ways to add registers: we can either increase the number of registers in a register file or add a new register file.

Let's try the latter option because this way we increase the number of registers that can be accessed simultaneously in one clock cycle. This can be done by selecting the RF and using copy and paste. Then connect it to the IC. Simulation statistics should indicate a performance increase. As expected, the number of load and store operations decreased. But notice also that the number of add operations decreased quite a lot. The reason is simple: addition is used to calculate stack memory addresses for the spilled registers.

3 Function units.

The next candidate for a bottleneck is the ALU, as all the basic operations are now performed in a single function unit. From the simulator statistics you can see that logical operations and addition are quite heavily utilized. Instead of duplicating the ALU let's add more specific FUs from the Hardware Database. Select ``Edit->Add From HDB->Function Unit...''. Select an FU which has the operations and(1), ior(1), xor(1) and click ``Add''. Then select an FU with the operation add(1) and click ``Add''. Close the dialog, connect the function units and save the architecture. Recompile and simulate to see the effect on cycle count.

The architecture could be still modified even further to drop the cycle count but let's settle for this now.

11 Final Words

This tutorial is now finished. Now you should know how to make and use your own custom operations, how to customize the processor architecture and generate the processor implementation along with its instruction memory bit image.

In this tutorial we used a ``minimalistic'' processor architecture as our starting point. The machine had only one transport bus and 5 registers so it could not fully exploit the parallel capabilities of TTA. Then we added two simple custom operations to the starting point architecture and saw a huge improvement in cycle count. Then we increased resources in the processor and the cycle count dropped even further.

If you are interested, you can also add more resources to the starting point architecture and see how low a cycle count you can reach without using custom operations!

Pekka Jääskeläinen 2011-12-08