5 Implementing Programs in Parallel Assembly Code

This tutorial introduces TTA assembly programming. Working through it is recommended because it shows in practice how a TTA architecture works.

1 Preparations

For the tutorial you need to download the file package from http://tce.cs.tut.fi/tutorial_files/tce_tutorials.tar.gz and unpack it to a working directory. Then cd to the parallel_assembly directory.

The first thing to do is to compile the custom operation set called cos16, which is shipped in the parallel_assembly directory. The easiest way to do this is:

buildopset cos16

This should create a file named cos16.opb in the directory.

2 Introduction to DCT

This part introduces the TCE assembly language and the use of the assembler. Your task is to write TCE assembly code for a 2-dimensional 8x8-point Discrete Cosine Transform (DCT_8x8). First take a look at the C code of DCT_8x8 in dct_8x8_16_bit_with_sfus.c. The code is written for a fixed-point data type with a sign bit and 15 fraction bits, which covers the range from $-1$ to $1-2^{-15}$. The fixed-point multiplier (function mul_16_fix) and the fixed-point adder (function add_16_fix) used in the code scale their inputs automatically to prevent overflow. Function cos16 takes $x(2i+1)$ as input and returns the corresponding cosine value $cos\left(\frac{x(2i+1)\pi}{16}\right)$. The code calculates the following equations:


\begin{displaymath}
F(x) = \frac{C(x)}{2}\sum_{i = 0}^7 \left[f(i)cos\left(\frac{x
\left(2i+1\right)\pi}{16}\right)\right] \nonumber
\end{displaymath}  


\begin{displaymath}
F(y) = \frac{C(y)}{2}\sum_{i = 0}^7 \left[f(i)cos\left(\frac{y
\left(2i+1\right)\pi}{16}\right)\right] \nonumber
\end{displaymath}  


\begin{displaymath}
C(i) =
\left\{
\begin{array}{c c}
\frac{2}{\sqrt2} &, i = 0 \\
1 &, else
\end{array} \right . . \nonumber
\end{displaymath}  


\begin{displaymath}
F(x,y) = F(x)F(y) \nonumber
\end{displaymath}  
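As a cross-check for the equations above, the following Python sketch evaluates the 1-D transform in floating point and shows how a real value maps to the 16-bit fixed-point format. It is only an illustration: the helper names (to_q15, from_q15, c, dct_1d) are hypothetical and do not come from the tutorial package, and the saturating conversion is an assumption, not the documented behavior of mul_16_fix or add_16_fix.

```python
import math

Q15_ONE = 1 << 15  # scale factor for 15 fraction bits


def to_q15(value):
    """Quantize a real number to a signed 16-bit value (assumed: saturating)."""
    raw = int(round(value * Q15_ONE))
    return max(-Q15_ONE, min(Q15_ONE - 1, raw))


def from_q15(raw):
    """Convert a signed 16-bit fixed-point value back to a real number."""
    return raw / Q15_ONE


def c(k):
    """Scaling factor C(k), exactly as defined in the equations above."""
    return 2 / math.sqrt(2) if k == 0 else 1.0


def dct_1d(f, x):
    """Evaluate F(x) for an 8-sample sequence f in floating point."""
    total = sum(f[i] * math.cos(x * (2 * i + 1) * math.pi / 16) for i in range(8))
    return c(x) / 2 * total
```

For example, to_q15(0.5) gives 16384, and to_q15(1.0) saturates to 32767, the largest representable value.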

3 Introduction to TCE assembly

First take a look at the assembly example in file example.tceasm to get familiar with the syntax. More help can be found in section 5.3.

The example code is compiled with the command:

tceasm -o example.tpef dct_8x8_16_bit_with_sfus.adf example.tceasm

The assembler will give some warnings saying "Source is wider than destination.", but these can be ignored.

The compiled tceasm code can be simulated with the TCE simulator, using either ttasim (command line):

ttasim -a dct_8x8_16_bit_with_sfus.adf -p example.tpef

or proxim (GUI):

proxim dct_8x8_16_bit_with_sfus.adf example.tpef

Using proxim is recommended because it makes the execution easier to follow, especially if you open the Machine Window (View -> Machine Window) and step through the program.

Check the result of the example code with the following command (you can also type it in proxim's command line at the bottom of the main window):

x /a IODATA /n 1 /u b 2

The output of x should be 0x40.

4 Implementing DCT on TCE assembly

Next, try to write assembly code that implements the same functionality as the C code. The assembly code must work on the given machine dct_8x8_16_bit_with_sfus.adf. Take a look at the processor by using prode:

prode dct_8x8_16_bit_with_sfus.adf &

The processor's specifications are the following:

Supported operations
Operations supported by the machine are: mul, mul_16_fix, add, add_16_fix, ldq, ldw, stq, stw, shr, shl, eq, gt, gtu, jump, cos16 and immediate transport.

When you program in TTA assembly you need to take operation latencies into account. The jump latency is four clock cycles and the load latencies (ldq and ldw) are three cycles. The latency of the multiplications (mul and mul_16_fix) is two clock cycles.

Address spaces
The machine has two separate address spaces, one for data and another for instructions. The data memory contains 128 memory slots and its MAU is 16 bits. The instruction memory has 1024 memory slots, which means that the maximum number of instructions is 1024.

Register files
The machine contains 4 register files, each of which has 4 16-bit registers, for a total of 16 16-bit registers. The first register file has 2 read ports.

Transport buses
The machine has 3 16-bit buses, which means a maximum of 3 concurrent transports. Each bus can carry an 8-bit short immediate.

Immediates
Because the transport buses can only carry 8-bit short immediates, you must use the immediate unit if you want to use longer immediates. The immediate unit can hold a 16-bit immediate. There is an example of immediate unit usage in file immediate_example.tceasm. Basically, you need to transfer the value to the immediate register. The value of the immediate register can be read on the next cycle.
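When planning your moves by hand, it can help to keep the figures from the list above in one place. The Python sketch below does that. It is a planning aid, not part of TCE: all names are hypothetical, the rule that a result triggered in cycle t with latency n can be read from cycle t + n is an assumption, and so is the choice of sign or zero extension for immediates, which the tutorial text does not state (check the ADF with prode).

```python
# Figures copied from the processor specification listed above.
LATENCY = {"jump": 4, "ldq": 3, "ldw": 3, "mul": 2, "mul_16_fix": 2}
BUS_COUNT = 3          # maximum number of concurrent transports
SHORT_IMM_BITS = 8     # short immediate carried on a bus
LONG_IMM_BITS = 16     # immediate held by the immediate unit


def earliest_read_cycle(operation, trigger_cycle):
    """Assumed convention: result is readable latency cycles after the trigger."""
    return trigger_cycle + LATENCY[operation]


def fits(value, bits, signed):
    """Return True if value is representable in the given number of bits."""
    if signed:
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    return 0 <= value < (1 << bits)


def immediate_kind(value, signed=True):
    """Classify a constant as a 'short' (bus) or 'long' (immediate unit) immediate."""
    if fits(value, SHORT_IMM_BITS, signed):
        return "short"
    if fits(value, LONG_IMM_BITS, signed):
        return "long"
    raise ValueError("value does not fit in a 16-bit immediate")
```

Under these assumptions, a load triggered in cycle 10 can be read from cycle 13, and the constant 300 needs the immediate unit while 0x40 fits on a bus.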

The initial input data is written to memory locations 0-63 in the file assembler_tutorial.tceasm. Write your assembly code in that file.

1 Verifying the assembly program

The reference output is given in reference_output. You need to compare your assembly program's simulation result to the reference output. The comparison can be done by first dumping the memory contents in the TCE simulator with the following command:

x /a IODATA /n 64 /u b 0

The command assumes that output data is stored to memory locations 0-63.

The easiest way to dump the memory into a text file is to execute ttasim with the following command:

ttasim -a dct_8x8_16_bit_with_sfus.adf -p assembler_tutorial.tpef < input_command.txt > dump.txt

After this you should use sed to divide the memory dump into separate lines to help comparison between your output and the reference output. Use the following command to do this (there is an empty space between the first two slashes of the sed expression):

cat dump.txt | sed 's/ /\n/g' > output.txt

And then compare the result with reference:

diff -u output.txt reference_output
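If you prefer to do the splitting and the comparison in one step, the same check can be sketched in Python. This is an alternative to the sed and diff commands above, not a tool shipped with TCE: the function names are hypothetical, and the sketch assumes that the dump consists of whitespace-separated values and that reference_output has one value per line.

```python
import difflib


def split_dump(dump_text):
    """Return the dumped values, one list entry per whitespace-separated value."""
    return dump_text.split()


def compare(dump_text, reference_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    output_lines = split_dump(dump_text)
    reference_lines = reference_text.split()
    return list(
        difflib.unified_diff(
            output_lines,
            reference_lines,
            fromfile="output.txt",
            tofile="reference_output",
            lineterm="",
        )
    )


def compare_files(dump_path, reference_path):
    """Read both files and compare them value by value."""
    with open(dump_path) as dump_file, open(reference_path) as reference_file:
        return compare(dump_file.read(), reference_file.read())
```

Calling compare_files("dump.txt", "reference_output") returns an empty list when the dump matches the reference; otherwise the returned lines show the differing values.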

When the TCE simulator memory dump is the same as the reference output, your assembly code works and you have completed this tutorial. Of course, you might wish to improve your assembly code to minimize the cycle count, the instruction count, or both.

You should also compile and run the C program, because it prints more detailed information that can be used as reference data if you need to debug your assembly code.

To compile the C code, enter:

gcc -o c_version dct_8x8_16_bit_with_sfus.c

If you want the program to print its output to a text file, you can use the following command:

./c_version > output.txt

To give some idea of the performance the machine can reach, one assembly implementation has 52 instructions and runs the DCT8x8 in 3298 cycles.

Pekka Jääskeläinen 2018-03-12