TCE supports single-precision floating-point calculations. They can be performed by using the float datatype in C code, or by using macros from tceops.h, such as _TCE_FADD. If the compilation target architecture does not support these operations, they can be emulated in software using integer arithmetic. Passing the switch -swfp to tcecc links the software emulation library.
A set of floating-point FU implementations is included with TCE in an HDB file named fpu_embedded.hdb, which can be found at PREFIX/share/tce/hdb/fpu_embedded.hdb. The FUs operate on 32-bit, single-precision floating-point numbers. Supported operations include addition, subtraction, negation, absolute value, multiplication, division, square root, conversion between floats and integers, and various comparisons.
The FUs are based on the VHDL-2008 support library (http://www.vhdl.org/fphdl/), which is in the public domain. Changes include:
The FUs are optimized for synthesis on Altera Stratix II FPGAs, and they have been benchmarked on both a Stratix II EP2S180F1020C3 and a Stratix III EP3SL340H1152C2. Their maximum frequencies are between 190 and 200 MHz on the Stratix II, and between 230 and 280 MHz on the Stratix III. Compared to an earlier implementation based on the Milk coprocessor (coffee.cs.tut.fi), they are between 30% and 200% faster.
The FUs are not IEEE compliant; instead, they comply with the less strict OpenCL Embedded Profile standard, which trades accuracy for speed. Differences include:
The TCE Processor Simulator uses IEEE-compliant floats. Results from a processor simulated with GHDL or synthesized to actual hardware can therefore differ slightly from those produced by the Processor Simulator.
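Because the simulator and the hardware may disagree in the last bits, an exact equality check is a poor way to validate hardware results against simulator output. One common alternative is to compare the two values by their distance in units in the last place (ULPs). The following is a minimal sketch of that idea in host-side C; the function names and the tolerance parameter are hypothetical and not part of TCE, and the code assumes a 32-bit IEEE-style float layout and does not handle NaNs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that is ordered like the floats. */
static int64_t ordered_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);   /* assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    /* Negative floats are stored as sign + magnitude, so mirror them. */
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7FFFFFFFu) : (int64_t)u;
}

/* Number of representable steps between a and b.
   0 means bit-identical, or +0.0 versus -0.0. NaNs are not handled. */
static uint64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return (uint64_t)(d < 0 ? -d : d);
}

/* Hypothetical acceptance check; max_ulps is a tolerance chosen by the user. */
static int results_match(float simulated, float hardware, uint64_t max_ulps)
{
    return ulp_distance(simulated, hardware) <= max_ulps;
}
```

A test bench could then call results_match(sim_value, hw_value, 1) instead of comparing with ==. A suitable tolerance depends on the operations involved and is not specified here.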
The fpu_embedded function units are described in detail below.
Supported operations: addf, subf
Latency: 5
A straightforward floating-point adder.
Supported operations: mulf
Latency: 5
A straightforward floating-point multiplier.
Supported operations: divf
Latency: 15 (mw/2+3)
A radix-4 floating-point divider.
Supported operations: sqrtf
Latency: 26 (mw+3)
A floating-point square root FU, using Hain's algorithm.
Note that the C standard function sqrt does not take advantage of hardware acceleration; the _TCE_SQRTF macro must be used instead.
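The latency expressions given for the divider (mw/2+3) and the square root unit (mw+3) can be evaluated for a given mantissa width. The sketch below does this in plain C. The helper names are hypothetical, and rounding mw/2 upwards is an assumption made here because it reproduces the documented latency of 15 for mw = 23.

```c
#include <assert.h>

/* Divider latency from "mw/2+3". Rounding mw/2 up is an assumption that
   matches the documented 15 cycles for single precision (mw = 23). */
static int divf_latency(int mw)
{
    assert(mw > 0);
    return (mw + 1) / 2 + 3;
}

/* Square root latency from "mw+3"; gives 26 cycles for mw = 23. */
static int sqrtf_latency(int mw)
{
    assert(mw > 0);
    return mw + 3;
}
```

For single precision, divf_latency(23) evaluates to 15 and sqrtf_latency(23) to 26, matching the figures above. Whether the same expressions hold for other mantissa widths should be checked against the actual FU implementations.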
Supported operations: cif, cifu, cfi, cfiu
Latency: 4
Converts between 32-bit signed and unsigned integers, and single-precision floats. OpenCL embedded allows no loss of accuracy in these conversions, so rounding is to nearest even.
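Round-to-nearest-even matters for integers that need more bits than a single-precision significand can hold, that is, values above 2^24. The following host-side C sketch illustrates the rounding rule for an unsigned 32-bit input. It is an illustration of the rule only, not the FU's VHDL implementation, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an unsigned 32-bit integer to IEEE-754 single-precision bits,
   rounding to nearest, ties to even. */
static uint32_t u32_to_float_bits(uint32_t x)
{
    if (x == 0)
        return 0;

    int msb = 31;
    while (!(x >> msb))             /* find the leading one */
        msb--;

    uint32_t exp = 127u + (uint32_t)msb;   /* biased exponent */
    uint32_t mant;                          /* 24-bit significand */

    if (msb <= 23) {
        mant = x << (23 - msb);     /* fits exactly, no rounding needed */
    } else {
        int shift = msb - 23;       /* number of bits that must be dropped */
        uint32_t rem = x & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        mant = x >> shift;
        /* Round up if above half, or exactly half and the result is odd. */
        if (rem > half || (rem == half && (mant & 1u)))
            mant++;
        if (mant >> 24) {           /* rounding carried into a new bit */
            mant >>= 1;
            exp++;
        }
    }
    assert(mant < (1u << 24));
    return (exp << 23) | (mant & 0x7FFFFFu);
}
```

With this rule, 16777217 (2^24 + 1) lies exactly halfway between two representable values and rounds down to 16777216, whose significand is even, while 16777219 rounds up to 16777220.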
Supported operations: absf, negf, eqf, nef, gtf, gef, ltf, lef
Latency: 1
A floating-point comparator. It also supports the absolute value and negation operations, which are extremely simple for floating-point numbers: the former clears the sign bit to 0, and the latter inverts it.
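The sign-bit manipulation behind absf and negf can be illustrated in plain C on the host. This is a minimal sketch assuming a 32-bit float with the sign in the most significant bit; the function names are hypothetical and do not correspond to TCE macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bit pattern of a float into an integer. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    assert(sizeof f == sizeof u);   /* assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Copy a raw bit pattern back into a float. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Absolute value: clear the sign bit. */
static float absf_bits(float f)
{
    return bits_to_float(float_to_bits(f) & 0x7FFFFFFFu);
}

/* Negation: invert the sign bit. */
static float negf_bits(float f)
{
    return bits_to_float(float_to_bits(f) ^ 0x80000000u);
}
```

Neither operation touches the exponent or mantissa, which is consistent with the single-cycle latency listed for this FU.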
The FPUs have been benchmarked on the FPGAs Stratix II EP2S180F1020C3 and Stratix III EP3SL340H1152C2. As a baseline, a simple TTA processor was synthesized that had enough functionality to support an empty C program. After this, each of the FPUs was added to the baseline processor and synthesized. The results are shown below in Tables 3.1 and 3.2.
The fpu_embedded Function Units have mantissa width and exponent width as generic parameters, so they can be used for float widths other than the IEEE single precision. The FPUs are likely prohibitively slow for double-precision calculation, but half-precision floats should be usable.
The parameters are mw and ew for all FUs. In addition, the float-int converter FU fpu_sp_conv has a parameter intw, which decides the width of the integer to be converted.
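To get a feel for what different mw and ew values mean, the sketch below computes the total float width and exponent bias for a given pair. It assumes that the FUs use an IEEE-style layout of one sign bit, ew exponent bits and mw stored mantissa bits, with the usual bias of 2^(ew-1) - 1. The struct and function names are hypothetical, and the layout assumption should be verified against the FU sources.

```c
#include <assert.h>

/* Hypothetical description of a float format by its generic parameters. */
struct fp_format {
    int mw;   /* mantissa (fraction) width in bits */
    int ew;   /* exponent width in bits */
};

/* Total width: sign bit + exponent + mantissa (IEEE-style layout assumed). */
static int fp_total_bits(struct fp_format f)
{
    assert(f.mw > 0 && f.ew > 1);
    return 1 + f.ew + f.mw;
}

/* Exponent bias under the IEEE convention, 2^(ew-1) - 1. */
static int fp_exponent_bias(struct fp_format f)
{
    assert(f.ew > 1 && f.ew < 31);
    return (1 << (f.ew - 1)) - 1;
}
```

Under these assumptions, mw = 23 and ew = 8 describe the 32-bit single-precision format with a bias of 127, and mw = 10 and ew = 5 describe a 16-bit half-precision format with a bias of 15.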
Use of these parameters has the following caveats:
Designers of floating-point TTAs should note that, for speed reasons, ttasim uses the simulator host's floating-point (FP) hardware to simulate floating-point operations. The simulated results therefore might or might not match those of the implemented TTA, depending on standard compliance, default rounding modes, and other differences between floating-point implementations.
Pekka Jääskeläinen 2012-06-07