This tutorial goes through most of the tools in TCE using a fairly simple example application. It starts from C code and ends up with a VHDL implementation of the processor and a bit image of the parallel program. The tutorial also explains how to accelerate an algorithm by customizing the instruction set, i.e. by using custom operations. The total time to go through the tutorial is about 2 to 3 hours.
The tutorial file package is available at:
http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz
Unpack it to a working directory and cd to tce_tutorials/tce_tour.
The test application computes a 32-bit CRC (Cyclic Redundancy Check). The C implementation was written by Michael Barr and is published in the public domain. It contains two different versions of the CRC computation, but we will be using only the fast one.
The program consists of two separate files: main.c contains the simple main function and crc.c contains the actual implementation. Open crc.c in your preferred editor and take a look at the code. The main difference between the crcSlow and crcFast implementations is that crcFast exploits precalculated table values. This is a common method of algorithm optimization.
We use the minimal architecture as the starting point. File minimal.adf describes a minimalistic architecture containing just enough resources that the TCE compiler can still compile programs for it. Function units in minimal.adf are selected from the hardware database (HDB, Section 2.2.2) so we are able to generate a VHDL implementation of the processor automatically later in the tutorial.
Copy the minimal.adf included in TCE distribution to a new ADF file which is your starting point architecture:
cp $(tce-config -prefix)/share/tce/data/mach/minimal.adf start.adf
Take a look at the starting point architecture using the graphical Processor Designer tool (ProDe, Section 4.1). Start ProDe with:
prode start.adf &
If you have GHDL installed on your system and you want to simulate the generated processor, we suggest decreasing the amount of memory in the processor. Otherwise the GHDL-generated testbench might consume a tremendous amount of memory on your computer. To do this, select Edit -> Address Spaces in ProDe. Then edit the bit widths of the data and instruction address spaces and set them to 15 bits, which should be plenty for our case.
Now we want to know how well the starting point architecture executes our program. We must compile the source code for our starting point architecture. This can be done with command:
tcecc -O3 -a start.adf -o crc.tpef -k result main.c crc.c
This will produce a parallel program called crc.tpef that can be executed on the processor design start.adf. The parallel program is now tied to that specific architecture, so it can only be executed on that architecture. The switch -k tells the compiler to keep the result symbol in the generated program so that it can be accessed by name in the simulator later on.
After successfully compiling the program we can now simulate it. Let's use the graphical user interface version of the simulator, called Proxim (Section 6.1.5). You can start it with:
proxim start.adf crc.tpef &
The simulator will load the architecture definition and the program and wait for commands. To execute the program, click ``Run''. The simulator will then execute the program code and display the cycle count in the bottom bar of the simulator window. Write down this cycle count for future comparison.
You can check the result straight from the processor's memory by writing this command to the command line at the bottom of the simulator:
x /u w result
The correct checksum result is 0x62488e82.
Processor resource utilization data can be viewed with command:
info proc stats
This will output a lot of information like the utilization of transport buses, register files and function units.
Proxim can also show various other pieces of information about the program's execution and its processor resource utilization. For example, to check the utilization of the resources of our architecture, select View>Machine Window from the top menu. The parts of the processor that are utilized the most are visualized in a darker red color.
Custom operations implement application specific functionality in TTA processors. In this part of the tutorial we accelerate the CRC computation by adding a custom operation to the starting point processor design.
First of all, it is quite simple and efficient to implement the CRC calculation entirely in hardware. Naturally, using the whole CRC function as a custom operation would be quite pointless, as the benefits of a processor-based implementation would diminish. Instead, we will concentrate on accelerating smaller parts of the algorithm, picking a custom operation that is potentially useful for other algorithms besides CRC.
In this case, finding the operation to be optimized is quite obvious if you look at the function crcFast(). It consists of a for-loop in which the function reflect() is called through the macro REFLECT_DATA. If you look at the actual function you can see that it is quite simple to implement in hardware, but requires many instructions if done with basic operations in software. The function ``reflects'' the bit pattern around its middle point like a mirror. For example, the bit pattern 0101 0100 would look like this after reflection: 0010 1010. The main complexity of the function is that the bit pattern width is not fixed. Fortunately, the width cannot be arbitrary: if you examine the crcFast() function and the reflect macros, you can spot that reflect() is only called with 8 and 32 bit widths (an unsigned char and 'crc', which is an unsigned long).
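For reference, reflect() is built around the following loop, which moves the bits one at a time to their mirrored positions. This is a plain C sketch that closely follows the code in crc.c (the exact declaration may differ slightly there; the same loop reappears later as the simulation behavior of the custom operation):

static unsigned long reflect(unsigned long data, unsigned char nBits)
{
    unsigned long reflection = 0x00000000;
    unsigned char bit;

    /* Move the bits one by one: bit 0 goes to bit nBits-1, and so on. */
    for (bit = 0; bit < nBits; ++bit) {
        if (data & 0x01) {
            reflection |= (1 << ((nBits - 1) - bit));
        }
        data = (data >> 1);
    }
    return reflection;
}

For example, reflect(0x54, 8) returns 0x2A, i.e. 0101 0100 becomes 0010 1010 as described above.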
A great advantage of TCE is that the operation semantics, the processor architecture, and the implementation are separate abstractions. For custom operation design this means that you can simulate your design simply by defining the simulation behavior of the operation and setting the latency of the operation in the processor architecture definition. This is convenient, as you do not need an actual hardware implementation of the operation at this point of the design, but can evaluate different custom operation possibilities at the architectural level. However, this brings up an awkward question: how do you determine the latency of the operation? Unrealistic or too pessimistic latency estimates can produce inaccurate performance results and bias the analysis.
One approach to the problem is to take an educated guess and simulate some test cases with different custom operation latencies. This way you can determine a latency range in which the custom operation would accelerate your application to a satisfactory level. After this you can sketch how the operation could be implemented in hardware, or consult someone knowledgeable in hardware design to figure out whether the custom operation is implementable within the latency constraint.
Another approach is to try to determine the latency by examining the operation itself and considering how it could be implemented. This approach requires some insight into digital design.
Besides latency you should also consider the size of the custom function unit. It will consume extra die area, but the size limit is always case-specific. For accurate size estimation you need to have the actual implementation and synthesis.
Let us consider the reflect function. If the width were fixed, we could implement the reflection by hard wiring (and registering the output), because the operation only moves bits to other locations in the word. This could easily be done in one clock cycle. But we need two different bit widths, so things are a bit more complicated. We can design the hardware so that it has two operations: one for 8-bit data and another for 32-bit data. In hardware, one way to implement this is to have a 32-bit wide cross-wiring and register the output. In this case the 8-bit value would be reflected to the 8 MSB bits of the 32-bit wiring. Then we need to move the 8 MSB bits to the LSB end and zero the rest. This moving can be implemented with multiplexers. So, concerning the latency, this can all easily be done within one clock cycle, as not much logic is needed.
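To make the hard wiring point concrete, here is the fixed 8-bit case written out in C as a pure bit permutation; in hardware each term is simply a wire from an input bit to an output bit. This is an illustration only, the function reflect8_wired does not appear in the tutorial sources:

/* Fixed-width reflection only rearranges bits: no arithmetic is needed,
 * so in hardware it is plain rewiring plus an output register. */
unsigned char reflect8_wired(unsigned char x)
{
    return (unsigned char)(((x & 0x01) << 7) | ((x & 0x02) << 5)
                         | ((x & 0x04) << 3) | ((x & 0x08) << 1)
                         | ((x & 0x10) >> 1) | ((x & 0x20) >> 3)
                         | ((x & 0x40) >> 5) | ((x & 0x80) >> 7));
}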
Now we have decided on the operation to be accelerated and on its latency. Next we will create a function unit implementing the operation and add it to our processor design. First, a description of the semantics of the new operation must be added to the Operation Set Abstraction Layer (OSAL, Section 2.2.6). OSAL stores the semantic properties of the operation, which include the simulation behavior, operand count etc., but not the latency. OSAL definitions can be added with the OSAL GUI, OSEd (Section 4.2.1).
If processors that use the custom operation are to be synthesized or simulated at the VHDL level, at least one function unit implementing the operation should be added to the Hardware Database (Section 2.2.2). Cost data of the function unit needs to be added to the cost database if cost estimates of a processor containing the custom function unit are wanted. In this tutorial we add the FU implementation for our custom operation so the processor implementation can be generated, but omit the cost data required for the cost estimation.
OSEd is started with the command 'osed'.
Create a new operation module, which is a container for a set of operations. You can add a new module in any of the predefined search paths, provided that you have sufficient file system access on the chosen directory.
For example, choose directory `/home/user/.tce/opset/custom', where user is the name of the user account being used for the tutorial. This directory is intended for the custom operations defined by the current user, and should always have sufficient access rights.
#include "OSAL.hh" OPERATION(REFLECT8) TRIGGER unsigned long data = UINT(1); unsigned char nBits = 8; unsigned long reflection = 0x00000000; unsigned char bit; /* * Reflect the data about the center bit. */ for (bit = 0; bit < nBits; ++bit) { /* * If the LSB bit is set, set the reflection of it. */ if (data & 0x01) { reflection |= (1 << ((nBits - 1) - bit)); } data = (data >> 1); } IO(2) = static_cast<unsigned> (reflection); return true; END_TRIGGER; END_OPERATION(REFLECT8) OPERATION(REFLECT32) TRIGGER unsigned long data = UINT(1); unsigned char nBits = 32; unsigned long reflection = 0x00000000; unsigned char bit; /* * Reflect the data about the center bit. */ for (bit = 0; bit < nBits; ++bit) { /* * If the LSB bit is set, set the reflection of it. */ if (data & 0x01) { reflection |= (1 << ((nBits - 1) - bit)); } data = (data >> 1); } IO(2) = static_cast<unsigned> (reflection); return true; END_TRIGGER; END_OPERATION(REFLECT32)
This code contains the behavior definitions for both operations. Each definition reads the integer input operand (with id 1), reflects it, writes the result to the ``output operand'' (with id 2), which is the first output, and signals the simulator that all results were computed successfully.
Open crc.c in your preferred editor and compare the behavior definitions of the reflect operations with the original reflect() function. They are mostly identical except for the parameter passing: in the custom operation behavior definition the data is read from the function unit input ports and written to the output ports, and the nBits value is determined by the operation code (REFLECT8 or REFLECT32).
Save the code and close the editor. REFLECT8 and REFLECT32 operations now have TCE simulator behaviour models.
After the operation simulation model has been added and compiled, the operation can be simulated. For the sake of brevity we will skip the operation simulation here; if you are interested in it, see Section 4.2.1.
Now the definitions of the custom operations have been added to the Operation Set Abstraction Layer (OSAL) database. Next we need to add at least one function unit (FU) that implements these operations so that they can be used in the processor design. Note the separation between an ``operation'' and a ``function unit'' that implements the operation(s): it allows using the same OSAL operation definitions in multiple FUs with different latencies.
First, add the architecture of the FU that implements the custom operations to the starting point processor architecture. Let's take a copy of the starting point processor design so that we can freely modify it and still easily compare the architectures with and without custom operation support later on:
cp start.adf custom.adf
Open the copy in ProDe:
prode custom.adf &
Then add a new function unit to the architecture and add the REFLECT8 and REFLECT32 operations (now available from OSAL) to it, setting the latency of both operations to 1 as decided earlier. Finally, connect the new unit's ports to a socket so that it is reachable from the transport buses.
To get some benefits from the added custom hardware, we must use it from the C code. This is done by replacing a C statement with a custom operation invocation.
Let us first make a copy of the original C code; the copy is the file we will modify:
cp crc.c crc_with_custom_op.c
Then open crc_with_custom_op.c in your preferred text editor.
TCE provides macros for invoking custom operations directly from C code. Usage of these macros is as follows:
_TCE_<opName>(input1, ... , inputN, output1, ... , outputN);
where <opName> is the name of the operation in OSAL. The number of input and output operands depends on the operation. Input operands are given first, followed by the output operands, if any.
In our case we need to write operands into the reflecter and read the result from it. We named the operations ``REFLECT8'' and ``REFLECT32'', thus the macros we are going to use are as follows:
_TCE_REFLECT8(input1, output);
_TCE_REFLECT32(input1, output);
Now we will modify the crcFast() function to use the custom operations. First declare two new variables at the beginning of the function:
crc input;
crc output;
These will help in using the reflect FU macro.
Take a look at the REFLECT_DATA and REFLECT_REMAINDER macros. The first one has a magic number 8, and the ``variable'' X is the data. This macro is used in the for-loop.
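For comparison, the original body of the for-loop in crcFast() looks roughly like this (check the exact lines in your copy of crc.c):

data = REFLECT_DATA(message[byte]) ^ (remainder >> (WIDTH - 8));
remainder = crcTable[data] ^ (remainder << 8);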
In the for-loop, the input data of the reflect function is read from message[]. Let us modify the loop so that at its beginning the input data is read into the input variable. Then we use the _TCE_REFLECT8 macro to run the custom operation and replace the REFLECT_DATA macro with the output variable. After these modifications the body of the for-loop should look like this:
input = message[byte];
_TCE_REFLECT8(input, output);
data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
remainder = crcTable[data] ^ (remainder << 8);
Next we will modify the return statement. Originally it uses the REFLECT_REMAINDER macro, where nBits is defined as WIDTH and data is remainder. Simply use the _TCE_REFLECT32 macro before the return statement and replace the original macro with the variable output:
_TCE_REFLECT32(remainder, output);
return (output ^ FINAL_XOR_VALUE);
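After both changes the whole crcFast() function should look roughly like the sketch below. The declarations and their exact order may differ slightly in your copy of crc.c; only the input/output variables and the two macro calls are the actual modifications:

crc crcFast(unsigned char const message[], int nBytes)
{
    crc remainder = INITIAL_REMAINDER;
    unsigned char data;
    int byte;
    crc input;
    crc output;

    /* Divide the message by the polynomial, a byte at a time,
     * using the REFLECT8 custom operation for the input reflection. */
    for (byte = 0; byte < nBytes; ++byte) {
        input = message[byte];
        _TCE_REFLECT8(input, output);
        data = (unsigned char) output ^ (remainder >> (WIDTH - 8));
        remainder = crcTable[data] ^ (remainder << 8);
    }

    /* The final remainder, reflected with the REFLECT32 custom operation,
     * is the CRC result. */
    _TCE_REFLECT32(remainder, output);
    return (output ^ FINAL_XOR_VALUE);
}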
And now we are ready. Remember to save the file. Then compile the program for the architecture with the custom operation support:
tcecc -O3 -a custom.adf -o crc_with_custom_op.tpef -k result \
crc_with_custom_op.c main.c
This time, let's use the command line simulator ttasim, because we also want to produce a bus trace for verifying the processor VHDL simulation later on. Start it with:

ttasim
Then enable the bus trace setting:
setting bus_trace 1
Load the architecture and the program, and run the simulation:
mach custom.adf
prog crc_with_custom_op.tpef
run
Verify that the result is the same as before (x /u w result); it should still be 0x62488e82. Check the cycle count with info proc cycles and compare it to the cycle count of the version that does not use the custom operation. You should see a very noticeable drop compared to the starting point architecture without the custom operations. Write this cycle count down for a later step.
The simulator execution also created a file crc_with_custom_op.tpef.bustrace which contains the bus trace.
Now we have seen that the custom operation accelerates our application. Next we'll add a VHDL implementation of the custom FU to the Hardware Database (HDB). This way we will be able to generate a VHDL implementation of our processor.
If you want to skip this phase you can use the given tour_example.hdb instead of creating it yourself.
Start HDBEditor (see Section 4.7):
hdbeditor &
TCE needs some data about the FU implementation in order to be able to automatically generate processors that include the FU.
1. Name of the entity and the naming of the FU interface ports.
Name the implementation after the top level entity: ``fu_reflect''.
By examining the VHDL code you can easily spot the clock port (clk), the reset port (rstx) and the global lock port (glock). The operation code (opcode) port is t1opcode. Write these into the appropriate text boxes. You do not have to fill in the Global lock req. port field, because the function unit does not need to request a global lock on the processor during its execution.
2. Opcodes.
Set the operation codes as they are defined at the top of the VHDL file: REFLECT32 has operation code ``0'' and REFLECT8 has operation code ``1''.
The operation codes must always be numbered according to the alphabetical order of the OSAL operation names, starting from 0. In this case REFLECT32 comes before REFLECT8 in alphabetical order.
3. Parameters.
The parameters can be found in the VHDL file. At the top of the file there is one parameter: busw. It gives the width of the transport bus and thus the maximum width of the input and output operands.
Thus, in the Parameter dialog add a parameter named busw, set its type to integer and its value to 32.
4. Architecture ports.
These settings define the connection between the ports in the architectural description of the FU and the VHDL implementation. Each input data port of the FU is accompanied by a load port that controls the updating of the FU input port registers.
Choose a port in the Architecture ports dialog and click Edit. The name of the architecture port p1 is t1data and its load port is t1load. The width formula is the parameter busw.
Name the output port (p2) r1data; its width formula is also busw, because the output port writes to the bus. The output port does not have a load port.
5. Add VHDL source file.
Add the VHDL source file in the Source code dialog. Notice that HDL files must be added in compilation order (see Section 4.7). Here we have only one source file, so we can simply add it without considering the compilation order (Add -> Browse -> tour_vhdl/reflect.vhdl).
Now you are done with adding the FU implementation. Click OK.
In this step we generate the VHDL implementation of the processor, and the bit image of the parallel program.
Next, we must select implementations for all components in the architecture. Each architecture component can be implemented in multiple ways, so we must choose one implementation for each component to be able to generate the implementation of the processor. You can either select the implementations yourself or use the custom_operations.idf file included in the tutorial files. If you use the given file, replace custom.idf with custom_operations.idf in the following commands.
This can be done in the ProDe tool:
prode custom.adf
Then we'll select implementations for the FUs which can be done in Tools>Processor Implementation.... Note that the selection window is not currently very informative about the different implementations, so a safe bet is to select an implementation with parametrizable width/size.
You do not have to care about the HDB file text box because we are not going to use cost estimation data.
You can start processor generation from ProDe's implementation selection dialog: click ``Generate Processor''. For the Binary Encoding Map, select ``Generate new''. For the target directory, click ``Browse'', create a new directory named proge-output and select it. Then click OK to create the processor.
Or alternatively execute ProGe from command line:
generateprocessor -t -i custom.idf -o proge-output custom.adf
Now the proge-output directory contains the VHDL implementation of the designed processor, except for the instruction memory width package, which will be created by the Program Image Generator. Take a look at what the directory includes: the RF and FU implementations are collected under the vhdl subdirectory, and the interconnection network generated to connect the units is in the gcu_ic subdirectory. The tb subdirectory contains testbench files for the processor core.
Finally, to get our shiny new processor some bits to chew on, we use generatebits to create instruction memory and data memory images:
generatebits -d -w 4 -p crc_with_custom_op.tpef -x proge-output custom.adf
Now the file crc_with_custom_op.img includes the instruction memory image in ``ascii 0/1'' format. Each line in that file represents a single instruction. Thus, you can get the count of instructions by counting the lines in that file:
wc -l crc_with_custom_op.img
Accordingly, the file crc_with_custom_op_data.img contains the data memory image of the processor. Program Image Generator also created file proge-output/vhdl/imem_mau_pkg.vhdl which contains the width of the instruction memory (each designed TTA can have a different instruction width).
If you have GHDL installed you can now simulate the processor VHDL. First cd to proge-output directory:
cd proge-output
Then compile and simulate the testbench:
./ghdl_compile.sh
./ghdl_simulate.sh
This will take some time, as bus trace writing is enabled. The simulation produces a file called ``bus.dump''. As the testbench is run for a constant number of cycles, we need to extract the relevant part of the bus dump for verification. This can be done with command:
head -n <number of cycles> bus.dump > sim.dump
where the <number of cycles> is the number of cycles in the previous ttasim execution. Then compare the trace dumps from the VHDL simulation and the architecture simulation:
diff -u sim.dump ../crc_with_custom_op.tpef.bustrace
If the command does not print anything the dumps were equal.
As the current architecture is minimalistic, we can increase the performance even further by adding resources to the processor. First, take a copy of the architecture with the custom operation support:
cp custom.adf modified.adf
and open the new architecture in ProDe:
prode modified.adf &
A new transport bus can be added simply by selecting the current bus and pressing ``ctrl+c'' to copy it and ``ctrl+v'' to paste it. Add 3 more buses. After you have added the buses, you have to connect them to the sockets. The easiest way to do this is to select ``Tools->Fully connect IC''. Save the architecture, then recompile the source code for the new architecture and simulate:
tcecc -O3 -a modified.adf -o crc.tpef -k result crc_with_custom_op.c main.c
ttasim -a modified.adf -p crc.tpef
Now when you check the cycle count from the simulator:
info proc cycles
You might see a significant drop in cycles. Also check the processor utilization statistics from the simulator with command:
info proc stats
From the ``operations'' table of the simulator statistics you can see that a lot of load and store operations are executed. As the architecture has only 5 general purpose registers, this tells us that there is a lot of register spilling to memory. Let's see how the number of registers affects the cycle count. There are two ways to add registers: we can either increase the number of registers in a register file or add a new register file.
Let's try the latter option, because this way we also increase the number of registers that can be accessed simultaneously in one clock cycle. This can be done by selecting the RF and using copy and paste. Then connect it to the IC, save, recompile and simulate again. The simulation statistics should indicate a performance increase. As expected, the number of load and store operations decreased. But notice also that the number of add operations decreased quite a lot. The reason is simple: addition is used to calculate stack memory addresses for the spilled registers.
The architecture could still be modified further to drop the cycle count, but let's settle for this now.
This tutorial is now finished. Now you should know how to make and use your own custom operations, how to customize the processor architecture and generate the processor implementation along with its instruction memory bit image.
In this tutorial we used a ``minimalistic'' processor architecture as our starting point. The machine had only one transport bus and 5 general purpose registers, so it could not fully exploit the parallel capabilities of a TTA. We then added two simple custom operations to the starting point architecture and saw a huge improvement in cycle count. Finally, we increased the resources of the processor and the cycle count dropped even further.
If you are interested, you can also add more resources to the starting point architecture and see how low a cycle count you can reach without using custom operations!
Pekka Jääskeläinen 2012-06-07