9 Designing Floating-point Processors with TCE

TCE supports single-precision floating-point calculations. They can be performed by using the float datatype in C code, or by using macros from tceops.h, such as _TCE_ADDF. If the compilation target architecture does not support these operations, they can be emulated in software using integer arithmetic. Passing the switch -swfp to tcecc links in the software emulation library.
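
As a minimal sketch of the first option, the function below uses only the plain float datatype, so the same source can be built either for a target with floating-point FUs or with software emulation. The function name dot is made up for illustration, and how the compiler actually schedules the operations depends on the target architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two float vectors of length n.
 * Only the float datatype is used: on a target with floating-point
 * FUs the multiply and add can map to the mulf and addf operations,
 * otherwise the same code relies on software emulation (-swfp). */
float dot(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Keeping application code in this form means that the choice between hardware FUs and emulation stays a build-time decision rather than a source change.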

A set of floating-point FU implementations is included with TCE, in a HDB file named fpu_embedded.hdb, which can be found at PREFIX/share/tce/hdb/fpu_embedded.hdb. The FUs operate with 32-bit, single-precision floating point numbers. Supported operations include addition, subtraction, negation, absolute value, multiplication, division, square root, conversion between floats and integers, and various comparisons.

The FUs are based on the VHDL-2008 support library (http://www.vhdl.org/fphdl/), which is in the public domain. Changes include:

The FUs are optimized for synthesis on Altera Stratix II FPGAs, and they have been benchmarked on both a Stratix II EP2S180F1020C3 and a Stratix III EP3SL340H1152C2. Their maximum frequencies are between 190 and 200 MHz on the Stratix II, and between 230 and 280 MHz on the Stratix III. Compared to an earlier implementation based on the Milk coprocessor (coffee.cs.tut.fi), they are between 30% and 200% faster.

1 Restrictions

The FUs are not IEEE compliant, but instead comply with the less strict OpenCL Embedded Profile standard, which trades off accuracy for speed. Differences include:

The TCE Processor Simulator uses IEEE-compliant floats. When a processor is simulated on GHDL or synthesized on actual hardware, the calculation results are thus slightly different from those produced by the Processor Simulator.

2 Function Units

The fpu_embedded function units are described in detail below.

fpu_sp_add_sub

Supported operations: addf, subf

Latency: 5

A straightforward floating-point adder.

fpu_sp_mul

Supported operations: mulf

Latency: 5

A straightforward floating-point multiplier.

fpu_sp_div

Supported operations: divf

Latency: 15 (mw/2+3)

A radix-4 floating-point divider.

fpu_sp_sqrt

Supported operations: sqrtf

Latency: 26 (mw+3)

A floating-point square root FU, using Hain's algorithm.

Note that the C standard function sqrt does not take advantage of hardware acceleration; the _TCE_SQRTF macro must be used instead.

fpu_sp_conv

Supported operations: cif, cifu, cfi, cfiu

Latency: 4

Converts between 32-bit signed and unsigned integers, and single-precision floats. OpenCL embedded allows no loss of accuracy in these conversions, so rounding is to nearest even.
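
To illustrate what round-to-nearest-even means for the integer-to-float direction, the sketch below uses the host compiler's own conversion rather than the FU. It assumes an IEEE 754 host in its default rounding mode, and the function names are made up for illustration. Integers above 2^24 no longer all fit in the 24-bit significand, so exact ties are resolved toward the value with an even significand.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side int-to-float conversion (assumed to round to nearest
 * even, as on an IEEE 754 host in its default rounding mode).
 * The volatile store forces the value down to single precision. */
float int_to_float(int32_t v)
{
    volatile float f = (float)v;
    return f;
}

/* Checks two tie cases just above 2^24 = 16777216, where only
 * every second integer is representable as a float. */
void check_ties_to_even(void)
{
    /* 2^24 + 1 is halfway between 2^24 and 2^24 + 2: goes down. */
    assert(int_to_float(16777217) == 16777216.0f);
    /* 2^24 + 3 is halfway between 2^24 + 2 and 2^24 + 4: goes up. */
    assert(int_to_float(16777219) == 16777220.0f);
    /* Representable values are converted exactly. */
    assert(int_to_float(16777218) == 16777218.0f);
}
```

This only demonstrates the rounding rule itself; it makes no claim about how the VHDL implementation of fpu_sp_conv realizes it.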

fpu_sp_compare

Supported operations: absf, negf, eqf, nef, gtf, gef, ltf, lef

Latency: 1

A floating-point comparator. Also supports the absolute value and negation operations, which are extremely simple for floating-point numbers (the former sets the sign bit to 0, the latter inverts it).
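
The sign-bit manipulation described above can be sketched in C as follows. This is a software illustration only, assuming a 32-bit float with the sign in the most significant bit; the names absf_sketch and negf_sketch are hypothetical and are not part of TCE.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

#define SIGN_BIT UINT32_C(0x80000000)

/* Reinterpret a float as its raw 32-bit pattern and back. */
static uint32_t float_to_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

static float bits_to_float(uint32_t u)
{
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Absolute value: clear the sign bit. */
float absf_sketch(float x)
{
    return bits_to_float(float_to_bits(x) & ~SIGN_BIT);
}

/* Negation: flip the sign bit. */
float negf_sketch(float x)
{
    return bits_to_float(float_to_bits(x) ^ SIGN_BIT);
}
```

Because neither operation touches the exponent or mantissa, no rounding or normalization is involved, which is consistent with the single-cycle latency listed for this FU.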

3 Benchmark results

The FPUs have been benchmarked on the FPGAs Stratix II EP2S180F1020C3 and Stratix III EP3SL340H1152C2. As a baseline, a simple TTA processor was synthesized that had enough functionality to support an empty C program. After this, each of the FPUs was added to the baseline processor and synthesized. The results are shown below in Tables 3.1 and 3.2.


Table 3.1: Synthesis results for Stratix II EP2S180F1020C3
             mul     add_sub  sqrt    conv    comp    div     baseline
Comb ALUTs   1263    1591     4186    1500    1012    2477    907
Total regs   892     967      2444    917     669     1942    567
DSP blocks   8       0        0       0       0       0       0
Fmax (MHz)   196.39  198.81   194.78  191.5   192.2   199.32  222.82
Latency      5       5        26      4       1       15      -



Table 3.2: Synthesis results for Stratix III EP3SL340H1152C2
             mul     add_sub  sqrt    conv    comp    div     baseline
Comb ALUTs   1253    1630     4395    1507    1002    2597    1056
Total regs   819     1007     2401    997     665     2098    710
DSP blocks   4       0        0       0       0       0       0
Fmax (MHz)   272.03  252.4    232.07  232.88  244.32  260.82  286.45
Latency      5       5        26      4       1       15      -


4 Alternative bit widths

The fpu_embedded Function Units have mantissa width and exponent width as generic parameters, so they can be used for float widths other than the IEEE single precision. The FPUs are likely prohibitively slow for double-precision calculation, but half-precision floats should be usable.

The parameters are mw and ew for all FUs. In addition, the float-int converter FU fpu_sp_conv has a parameter intw, which decides the width of the integer to be converted.
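
The latency expressions given with the divider (mw/2+3) and the square root unit (mw+3) suggest how latency might change with the mantissa width. The sketch below evaluates them under two assumptions that are not stated in the text: that mw is 23 for single precision, and that mw/2 is rounded up, since that is what reproduces the listed latency of 15. The function names are made up for illustration.

```c
#include <assert.h>

/* Divider latency from the expression mw/2+3.
 * Rounding mw/2 up is an assumption: it is what makes
 * mw = 23 give the listed single-precision latency of 15. */
int div_latency(int mw)
{
    assert(mw > 0);
    return (mw + 1) / 2 + 3;
}

/* Square root latency from the expression mw+3
 * (mw = 23 gives the listed latency of 26). */
int sqrt_latency(int mw)
{
    assert(mw > 0);
    return mw + 3;
}
```

If the same expressions carry over to other widths, a half-precision configuration with mw = 10 would give div_latency(10) == 8 and sqrt_latency(10) == 13. These values should be checked against the actual FU before being used in an architecture definition.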

Use of these parameters has the following caveats:

5 Processor Simulator and Floating Point Operations

Designers of floating-point TTAs should note that, for speed reasons, ttasim uses the simulator host's floating-point (FP) hardware to simulate floating-point operations. The simulated results might therefore not match those of the implemented TTA, depending on standard compliance, default rounding modes, and other differences between the floating-point implementations.

Pekka Jääskeläinen 2012-06-07