TCE supports single- and half-precision floating-point calculations. Single-precision calculations can be performed by using the float data type in C code, or by using macros from tceops.h, such as _TCE_ADDF.
If the compilation target architecture does not support these operations, they can be emulated in software using integer arithmetic. Passing the switch --swfp to tcecc links in the software emulation library.
A set of floating-point FU implementations is included with TCE, in an HDB file named fpu_embedded.hdb, which can be found at PREFIX/share/tce/hdb/fpu_embedded.hdb. The FUs operate on 32-bit, single-precision floating-point numbers. Supported operations include addition, subtraction, negation, absolute value, multiplication, division, square root, conversion between floats and integers, and various comparisons.
The FUs are based on the VHDL-2008 support library (http://www.vhdl.org/fphdl/), which is in the public domain; the implementations have been modified for TCE.
The FUs are optimized for synthesis on Altera Stratix II FPGAs, and they have been benchmarked on both a Stratix II EP2S180F1020C3 and a Stratix III EP3SL340H1152C2. Their maximum frequencies are between 190 and 200 MHz on the Stratix II, and between 230 and 280 MHz on the Stratix III. Compared to an earlier implementation based on the Milk coprocessor (coffee.cs.tut.fi), they are between 30% and 200% faster.
The FUs are not IEEE-compliant; instead, they comply with the less strict OpenCL Embedded Profile standard, which trades off accuracy for speed. The differences include rounding toward zero and a lack of support for denormal numbers.
The TCE Processor Simulator uses IEEE-compliant floats. When a processor is simulated with GHDL or synthesized on actual hardware, the calculation results can therefore differ slightly from those of the Processor Simulator.
The fpu_embedded and fpu_half function units are described in detail below.
Supported operations: addf, subf
Latency: 5
Straightforward floating-point adder.
Supported operations: mulf
Latency: 5
Straightforward floating-point multiplier.
Supported operations: divf
Latency: 15 (mw/2+3)
Radix-4 floating-point divider.
Supported operations: macf, msuf
Latency: 6
Single-precision fused multiply-accumulator.
Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c, and MSUF(a,b,c,d) to d=a-b*c. Special case handling is not yet supported.
Supported operations: macf, msuf, addf, subf, mulf
Latency: 6
Single-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpu_sp_mac_v2 will replace fpu_sp_mac completely if benchmarking shows it to be reasonably fast.
Parameters are ordered so that MACF(a,b,c,d) is equal to d=a+b*c, and MSUF(a,b,c,d) to d=a-b*c.
Supported operations: sqrtf
Latency: 26 (mw+3)
Floating-point square root FU, using Hain's algorithm.
Note that the C standard function sqrt does not take advantage of hardware acceleration; the _TCE_SQRTF macro must be used instead.
Supported operations: cif, cifu, cfi, cfiu
Latency: 4
Converts between 32-bit signed and unsigned integers and single-precision floats. The OpenCL Embedded Profile allows no loss of accuracy in these conversions, so rounding is to nearest even.
Supported operations: absf, negf, eqf, nef, gtf, gef, ltf, lef
Latency: 1
A floating-point comparator. Also includes the cheap absolute value and negation operations.
A set of half-precision arithmetic units is included with TCE in PREFIX/share/tce/hdb/fpu_half.hdb. In C and C++, half-precision operations can only be invoked with tceops.h macros. It may be helpful to define a half class with overloaded operators to wrap the macros; the test case testsuite/systemtest/proge/hpu is written using such a class. There is ongoing work to add acceleration for the half data type in OpenCL.
Like their single-precision counterparts, the half-precision FPUs round to zero and lack support for denormal numbers. In addition, they do not yet handle special cases such as INFs and NaNs.
The fpu_half function units are described in detail below.
Supported operations: cfh, chf
Latency: 1
Converter between half-precision and single-precision floating-point numbers.
Supported operations: addh, subh
Latency: 1
Straightforward half-precision floating-point adder.
Supported operations: mulh, squareh
Latency: 2
Straightforward half-precision floating-point multiplier. Also supports a squaring operation.
Supported operations: mach, msuh
Latency: 3
Half-precision fused multiply-accumulator.
Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c, and MSUH(a,b,c,d) to d=a-b*c.
Supported operations: mach, msuh, addh, subh, mulh
Latency: 3
Half-precision fused multiply-accumulator. Performs addition/subtraction by multiplying by 1, and multiplication by adding 0. fpmac_v2 will replace fpmac completely if benchmarking shows it to be reasonably fast.
Parameters are ordered so that MACH(a,b,c,d) is equal to d=a+b*c, and MSUH(a,b,c,d) to d=a-b*c.
Supported operations: invsqrth
Latency: 5
Half-precision fast inverse square root using Newton's iteration.
Supported operations: absh, negh, eqh, neh, gth, geh, lth, leh
Latency: 1
Half-precision floating-point comparator. Also includes the absolute value and negation operations.
Most of the single-precision FPUs have been benchmarked on the Stratix II EP2S180F1020C3 and Stratix III EP3SL340H1152C2 FPGAs. As a baseline, a simple TTA processor was synthesized with just enough functionality to support an empty C program. Each of the FPUs was then added to the baseline processor and synthesized. The results are shown below in Tables 3.2 and 3.3.

The fpu_embedded function units have the mantissa width and exponent width as generic parameters, so they can be used for float widths other than IEEE single precision. The FPUs are likely prohibitively slow for double-precision computation, and the fpu_half units should be a better fit for half-precision.
The parameters are mw (mantissa width) and ew (exponent width) for all FUs. In addition, the float-integer converter FU fpu_sp_conv has a parameter intw, which determines the width of the integer to be converted.
Use of these parameters comes with the following caveat:
Designers of floating-point TTAs should note that ttasim uses the simulator host's floating-point (FP) hardware to simulate floating-point operations, for speed reasons. It thus may or may not match the FP implementation of the actual implemented TTA, depending on standard compliance, default rounding modes, and other differences between floating-point implementations.
Pekka Jääskeläinen 20180312