8 How to print from Altera FPGAs

TCE comes with a special function unit (SFU) which enables character output from Altera FPGAs. This SFU utilizes the Altera JTAG UART IP core and the nios2-terminal host program for this functionality. Neither of them is included in TCE, so you need the appropriate Altera software and libraries (namely Quartus II and Nios II EDS) in order to use the printing capability.

1 Hello World 2.0

If you haven't already downloaded the tutorial file package, you should download it now from:

Then unpack it to a working directory and cd to tce_tutorials/fpga_stdout

Let's begin by examining our tutorial architecture. Open the given test.adf architecture in ProDe:

prode test.adf &

As you can see, the architecture includes an IO function unit which implements the STDOUT operation. In addition, the architecture has a timer function unit which will also be used in this tutorial.

1 Examine the first version

Open the source code file std_print.c in your preferred text editor. As you can see, the code includes stdio.h and uses printf() for printing "Hello World!". Furthermore, the operation RTC is used to measure how long the printing takes, and this time is then printed at the end of the program. Writing the value 0 to the RTC resets the real time clock in the timer FU. When the RTC is invoked with a non-zero value, the current time is written to the given variable (in this case to the variable timestamp). RTC counts the time in milliseconds and, by default, the instruction set simulator assumes a clock frequency of 100 MHz.
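To relate an RTC reading to clock cycles, the arithmetic can be sketched as below. This is only a host-side illustration: the helper name ms_to_cycles is hypothetical and not part of TCE, and only the millisecond unit and the 100 MHz default are taken from the text.

```c
#include <stdint.h>

/* Hypothetical helper (not part of TCE): converts a duration in
 * milliseconds to clock cycles for a clock frequency given in Hz.
 * Frequencies that are not a multiple of 1000 Hz are rounded down. */
uint64_t ms_to_cycles(uint64_t ms, uint64_t clock_hz)
{
    uint64_t cycles_per_ms = clock_hz / 1000u;
    return ms * cycles_per_ms;
}
```

With the simulator default of 100 MHz (100000000 Hz), a reading of 1 ms corresponds to 100 000 clock cycles.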

Now compile the source code with the following command:

tcecc -O0 -swfp -a test.adf -o std_print.tpef std_print.c

The compilation command contains a few flags that should be explained in more detail. Let's first look into the -swfp flag: it tells the compiler to link the program with the floating point emulation library. Floating point support is needed because the printf() function includes support for printing floating point values and our architecture does not contain floating point function units. To emphasize the lesson of this tutorial, it is important that you compile the code without optimizations, i.e. with the -O0 flag. Otherwise our witty compiler will optimize the first printf() call (which just prints a constant character string) into inlined _TCE_STDOUT operations.

After compiling, execute the program in the instruction set simulator:

ttasim -a test.adf -p std_print.tpef

You should see the printed text in your terminal. At the time of writing this tutorial, printing "Hello World" with printf() took 61 ms (the time may vary depending on the TCE version).

But speaking of the lesson, execute the following command to examine the minimum memory consumption of this program:

dumptpef -m std_print.tpef

As you should notice, the instruction count is high (around 45 000 at the time of writing this tutorial) on this simple sequential TTA architecture with optimizations disabled. The major part of the instructions is spent on the printf() and floating point emulation functions.

The first step in reducing the instruction count is to use iprintf() instead of printf(). Function iprintf() is a simplified version of printf() which drops the support for printing floating point values. In our test case we don't need to print floats, so open the source code file std_print.c in your preferred text editor and change the two printf() calls to iprintf(). Alternatively, you can do this with sed:

sed -i 's/printf/iprintf/g' std_print.c

Now recompile the program and check the instruction count:

tcecc -O0 -a test.adf -o std_print.tpef std_print.c

dumptpef -m std_print.tpef

You should see a significant drop in the instruction count. If you simulate the new program, you will notice no difference in the behavior (except that the measured time might be a bit lower).

2 Light weight printing

TCE includes a Light Weight PRinting (lwpr) library, which provides small and simple functions for printing strings and integers and further reduces the instruction count overhead of print support. Take a look at the source code file lightweight_print.c to see how these functions are used. The library is included with the header lwpr.h. Function lwpr_print_str() is used to output strings and function lwpr_print_int() is used for printing integers. There is also a function for printing integers in hexadecimal format called lwpr_print_hex(), but it is not used in this tutorial.
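To give an idea of why such functions can stay small, the following host-side sketch prints a string and a decimal integer one character at a time, without any format string parsing. It is not the lwpr implementation: the names emit_char(), sketch_print_str(), sketch_format_int() and sketch_print_int() are hypothetical, putchar() merely stands in for the character output of the target, and a 32-bit int is assumed.

```c
#include <stdio.h>

/* Hypothetical stand-in for the character output of the target;
 * on the host it simply forwards to putchar(). */
void emit_char(char c)
{
    putchar(c);
}

/* Outputs a NUL-terminated string one character at a time. */
void sketch_print_str(const char *s)
{
    while (*s != '\0') {
        emit_char(*s++);
    }
}

/* Writes the decimal form of value into buf, which must hold at least
 * 12 bytes (sign, 10 digits and terminator for a 32-bit int).
 * Returns the number of characters written, excluding the terminator. */
int sketch_format_int(int value, char *buf)
{
    char digits[10];
    int ndigits = 0;
    int len = 0;
    unsigned int magnitude = (unsigned int)value;

    if (value < 0) {
        buf[len++] = '-';
        magnitude = 0u - magnitude; /* well defined even for INT_MIN */
    }
    /* Collect digits least significant first. */
    do {
        digits[ndigits++] = (char)('0' + magnitude % 10u);
        magnitude /= 10u;
    } while (magnitude != 0u);
    /* Copy them out in reverse to get the usual reading order. */
    while (ndigits > 0) {
        buf[len++] = digits[--ndigits];
    }
    buf[len] = '\0';
    return len;
}

/* Formats and outputs a decimal integer. */
void sketch_print_int(int value)
{
    char buf[12];
    sketch_format_int(value, buf);
    sketch_print_str(buf);
}
```

Calling sketch_print_str("Time: ") followed by sketch_print_int(42) would output the text Time: 42. Because no format string is interpreted at run time, none of the general formatting code of printf() needs to be linked in.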

Compile the new code to see the difference of using lwpr:

tcecc -O0 -llwpr -a test.adf -o lightweight_print.tpef lightweight_print.c

Notice the new compilation flag -llwpr for including the light weight printing library.

First, check the instruction count with:

dumptpef -m lightweight_print.tpef

You should notice that the instruction count has dropped dramatically and also that the initialized data memory is a lot smaller than previously.

Next, simulate the program:

ttasim -a test.adf -p lightweight_print.tpef

The printed text is the same as previously, except that the measured duration has dropped significantly.

2 FPGA execution

The next step is to get the program running on an FPGA. The prerequisites for this step are that you have the Altera Quartus II and nios2-terminal programs installed and that you have a Stratix II DSP FPGA board. Don't worry if you don't have this specific board: this tutorial can be completed with other Altera FPGA boards as well. If you are using an alternative board, you must make a few manual changes before synthesizing and executing the design. Disclaimer: we assume no liability in case you fry your FPGA :)

1 Preparations

Before starting to generate the processor we first must adjust the address space sizes of the architecture. In order to do this, open the architecture in ProDe and open the address space dialog (Edit -> Address Spaces...).

prode test.adf &

Adjust both the data and instruction address space to be 10 bits wide. This should be enough for our application according to the dumptpef output.
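As a quick check of the arithmetic, an address space that is n bits wide covers 2^n addresses, so 10 bits cover addresses 0 to 1023. The sketch below computes the smallest width that fits a given number of addresses. The helper name min_addr_width is hypothetical; the counts to feed it are the ones dumptpef reports for your own build.

```c
/* Hypothetical helper: returns the smallest address width in bits
 * such that 2^width >= count. Widths are capped at 31 bits here so
 * that the shift stays within a 32-bit unsigned long. */
unsigned min_addr_width(unsigned long count)
{
    unsigned width = 0;
    while (width < 31 && (1ul << width) < count) {
        width++;
    }
    return width;
}
```

For example, min_addr_width(1024) gives 10, while a program needing 1025 addresses would already require 11 bits.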

Next recompile the application so it adapts to the new address space sizes:

tcecc -O0 -llwpr -a test.adf -o lightweight_print.tpef lightweight_print.c

If you wish you can execute the program with ttasim to verify that it still works after the address space sizes changed.

Before generating the processor, you must also select the RF and FU implementations. If you wish to skip this step, you can use the given IDF by renaming it:

mv preselected.idf test.idf

Then move on to the section "Generate the processor". If you choose to learn and do this step manually, keep following the instructions. In case you don't already have the architecture open, do it now:

prode test.adf &

Then select Tools -> Processor Implementation... to open the implementation dialog. First select the register file implementations from asic_130nm_1.5V.hdb. It doesn't matter which implementation you choose for this tutorial.

After selecting the RFs, open the Function Unit tab. Now you must choose the implementations carefully:

  1. LSU: Change the HDB to stratixII.hdb (don't worry if you are using another Altera FPGA) and select the fu_lsu_with_bytemask_always_3.

  2. ALU: Change the HDB back to asic_130nm_1.5V.hdb and select any available implementation.

  3. IO: Change the HDB to altera_jtag_uart.hdb and select altera_jtag_uart_stdout_always_1 as the implementation.

  4. TIMER: Change the HDB again to stratixII.hdb and select the timer implementation with id 5. This implementation is for 100 MHz clock frequency (you can check it by opening the stratixII.hdb with hdbeditor and examining the generic parameters).

Remember to save the IDF before closing the dialog.

2 Generate the processor

Now we will use the Platform Integrator feature (see section 4.6 for more information) of ProGe. Execute the following command to generate the processor implementation:

generateprocessor -i test.idf -o proge-out -g Stratix2DSP -d onchip -f onchip -p lightweight_print.tpef -e hello_tta test.adf

The new flag -g selects the Stratix II DSP board Platform Integrator, and the flags -d and -f tell the Platform Integrator to implement the instruction and data memories as FPGA onchip memory. Flag -p defines the name of the program we want to execute on the processor (it is only needed for naming the memory initialization files) and flag -e defines the toplevel entity name for our processor.

Next, generate the memory images from the tpef program:

generatebits -d -w 4 -f mif -o mif -x proge-out -p lightweight_print.tpef -e hello_tta test.adf

Important flags to notice here are -x, which tells where the HDL files generated by ProGe are stored, and -e, which defines the toplevel entity name (it must be the same one you gave to ProGe). The data and instruction memory image formats are defined as mif (Altera's memory initialization format for onchip memories).

3 Modifications for using alternative Altera FPGA boards

If you are using the Stratix II DSP FPGA board you can skip this section. Otherwise you must complete the tasks described here to get the processor running on your alternative FPGA.

1 HDL changes

Open the file proge-out/platform/hello_tta_toplevel.vhdl in your preferred text editor. Change the value of the toplevel generic dev_family_g according to your FPGA. For example, if you are using the Altera DE2 board with a Cyclone II FPGA, change the string from "Stratix II" to "Cyclone II". Save the file before exiting.

2 Changes to Quartus II project files

Open file hello_tta_toplevel.qsf in your preferred text editor. You must locate and change the following settings according to the FPGA board you are using (all the examples are given for the DE2 board for illustrative purposes):

  1. FAMILY: Set the FPGA device family string according to your FPGA. Example: "Cyclone II"

  2. DEVICE: Set the specific FPGA device name according to your FPGA. Example: EP2C35F672C6

  3. FMAX_REQUIREMENT: Set the target clock frequency according to the oscillator you are using to drive the clock signal. Example: "50 MHz"

  4. Pin assignments: Change the pin assignments for clk and rstx according to your FPGA board. The reset signal rstx is active low, so take this into consideration in the pin mapping. Example: PIN_N2 -to clk (50 MHz oscillator to clk signal) and PIN_G26 -to rstx (push button KEY0 to rstx)

4 Synthesize and execute

Synthesize the processor simply by executing the generated script:


Assuming that your FPGA is set up, on and properly connected via USB Blaster to your PC, you can program the FPGA with the following script:


Or if you prefer, you can also use the graphical Quartus programmer tool for FPGA programming.

After the programmer has finished, open nios2-terminal program in order to capture the printed characters:

nios2-terminal -i 0

After the connection is open and the program starts to run, you will see the characters printed to the terminal. Notice that the measured printing time may vary according to your FPGA board and clock frequency, as the timer implementation was set for a 100 MHz clock.
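If the board clock differs from the 100 MHz that the selected timer implementation expects, the printed duration can be rescaled by hand. The sketch below assumes that the timer derives milliseconds by counting clock cycles, so that the reported value scales linearly with the clock frequency. This is an assumption made for illustration, not a statement about the actual timer hardware, and the function name is hypothetical.

```c
/* Hypothetical helper: rescales a duration reported by a timer that was
 * configured for timer_mhz when the board actually runs at board_mhz.
 * Assumes the timer counts clock cycles, so a slower board clock makes
 * the reported time shorter than the real elapsed time.
 * board_mhz must be non-zero. */
unsigned long actual_ms(unsigned long reported_ms,
                        unsigned long timer_mhz,
                        unsigned long board_mhz)
{
    return reported_ms * timer_mhz / board_mhz;
}
```

Under this assumption, a reported 10 ms on a 50 MHz board would correspond to 20 ms of real time, whereas on a 100 MHz board the reported value needs no correction.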

3 Caveats in printing from FPGA

There are a few caveats in printing from an FPGA that you should be aware of. First of all, the transfer speed between the host PC and the FPGA is finite, and in order to avoid extra stalls the STDOUT FU uses a character buffer. When this buffer is full, the TTA processor pipeline is stalled until there is space in the buffer again. This happens especially when nios2-terminal is not open, i.e. there is no host process to clear the buffer. In other words, if your application uses STDOUT, you must open nios2-terminal with the correct instance number to avoid the execution getting stuck.

Because of the stalling behavior, you should avoid printing during profiling, as the results may be affected. You should also not print during the execution of timing critical code, as the real time characteristics cannot be guaranteed due to the possibility of stalls.
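The stalling behavior can be modelled on a host PC with a small bounded FIFO. This is only an illustrative sketch: the buffer depth of 4 and all type and function names are hypothetical and do not describe the actual STDOUT FU hardware.

```c
#include <stdbool.h>

#define FIFO_DEPTH 4 /* hypothetical depth, chosen only for illustration */

typedef struct {
    char data[FIFO_DEPTH];
    int head;  /* index of the oldest buffered character */
    int count; /* number of buffered characters */
} char_fifo;

/* Producer side: returns false when the buffer is full, which in this
 * model corresponds to the processor pipeline stalling. */
bool fifo_push(char_fifo *f, char c)
{
    if (f->count == FIFO_DEPTH) {
        return false;
    }
    f->data[(f->head + f->count) % FIFO_DEPTH] = c;
    f->count++;
    return true;
}

/* Consumer side: models the host terminal draining one character.
 * Returns false when there is nothing to read. */
bool fifo_pop(char_fifo *f, char *out)
{
    if (f->count == 0) {
        return false;
    }
    *out = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Returns the index of the first character of msg that cannot be
 * written when no host process drains the buffer, or -1 if the whole
 * message fits. */
int first_stall_index(const char *msg)
{
    char_fifo f = { {0}, 0, 0 };
    int i;
    for (i = 0; msg[i] != '\0'; i++) {
        if (!fifo_push(&f, msg[i])) {
            return i;
        }
    }
    return -1;
}
```

In this model, first_stall_index("Hello") returns 4: the first four characters fit in the buffer and the fifth write is where execution would wait until the consumer side calls fifo_pop().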

4 Summary

This tutorial illustrated how to print from Altera FPGAs. In addition this tutorial discussed different ways to print from C code and demonstrated their impact on the instruction count.

For demonstration purposes, all the compilations were done without optimizations. However, when optimizations are applied, printf() or iprintf() may sometimes produce the smallest code overhead when printing only constant strings. This is possible when the compiler is able to reduce the function call into direct _TCE_STDOUT macro calls and fill NOP slots with the STDOUT operations. In general, however, it is advisable to use the light weight printing library functions whenever applicable.

Pekka Jääskeläinen 2016-11-24