When working on the WebAssembly port of FixScript language I've encountered a need for implementation of basic math functions. On the browser side I've initially just called into JavaScript to get access to these functions but felt a bit dirty to do that (also not a good idea for performance reasons). Later when I was adding WASI support I've realized that there are no such functions exported.
Since FixScript doesn't need that much from the C library (just a simple memory allocator and a bunch of math functions) there is no need to have an unnecessary bloat of a whole C library (small size is important).
There are of course already existing good implementations as open source. However they come with a lot of code (esp. for the whole libc implementations) or the license is not suitable for embedding (having to reproduce the license text with binaries is a major pain for various applications). I also wanted to compare the functions to some ground truth, a much more precise and accurate implementation to compare with.
Another reason is that I always wanted to know more about these things, people take these functions for granted and don't really think about them. It's also an important thing when designing a language to know every corner of it.
The task was straightforward: I needed a reasonably accurate and simple implementations of basic math functions and an ability to convert between strings and floats.
There are various algorithms available for this task, they are either quite complex, slow or inaccurate.
A simple approach is to just multiply the normalized number with the appropriate power of ten, getting the resulting integer, convert it into a string and put a decimal point at the right place.
This works (and is fast) but it's very inaccurate. Also we need to have a way to compute a very accurate powers of ten. And handle special cases. And rounding.
The more involved algorithms use either arbitrary precision arithmetic or a lot of tricks. So the implementation is quite slow and/or complex. It also paints a picture where this task seems to be impossible to handle for a mere mortal and one should never attempt it or other such silliness.
However there is a way to make the simple approach to work, we just need to compute it in a higher precision. This means for 32bit floats we need to use 64bit floats (not so bad) and for 64bit we need to use 128bit floats (uh oh).
You can test the conversion here:
For 32bit floats we need to extract up to 9 decimal digits and for 64bit floats we need up to 17 digits. In C the best way to print floats is to use this code:
printf("float=%.9g double=%.17g\n", float_value, double_value);
So if you were unsure about how many digits you should put into your constants or how many to print, here is your answer.
We need to implement them from scratch. Fortunatelly we need just three operations: a conversion from/to doubles and a multiplication. The conversion is quite straightforward, we must not forget to do the rounding.
The multiplication of floats is actually simpler than addition. Since floats are already stored in a normalized format, all it takes is to multiply both normalized numbers and sum the exponents. There is a little bit more work to do to handle overflows but it's very simple.
The addition is much more involved. You need to (virtually) align both numbers at the right position, add them and do more adjustments. Fortunatelly we don't need to do this operation.
Multiplication of two 128bit integers involves a bunch of operations but it is
straightforward: we have two numbers (a
and b
) stored
in an array of four 32bit integers, totalling of 128 bits for each. We need to
multiply the individual 32bit parts like so:
a0*b0 |
||||||
a0*b1 | ||||||
a0*b2 | a1*b0 | |||||
a0*b3 | a1*b1 | |||||
a1*b2 | a2*b0 | |||||
a1*b3 | a2*b1 | |||||
a2*b2 | a3*b0 | |||||
a2*b3 | a3*b1 | |||||
a3*b2 | ||||||
a3*b3 |
Each 32bit multiplication has a 64bit result, putting the higher 32 bits to the column on the left of the multiplication (that's why there are only 7 columns, in reality it's full 8 columns).
Then we need to sum all the columns, taking into account the proper position of each multiplication and propagate the carry from the right to the left. We will then get a 256bit integer of which only the higher 128 bits are used for the new normalized float128 number.
This task is basically just the opposite. The digits are parsed, rounded and then shifted by using the right power of ten. Since we're working with extra precision this is quite straightforward. Of course we have to be more careful here because we parse potentially malicious input.
Both conversions were tested for all possible 32bit floats with no problems. This includes a round trip test and a comparison with the C library. The 64bit, due to an impossibility to do a full test, was tested using a large prime increment to skip the low ~31 bits to maximize a chance to test oddities. As none were found and it used the same already tested algorithm it should work equally well.
FixScript needs a bunch of common math functions such as sin
, cos
, exp
,
log
, etc.
The first step was to create a very precise and accurate implementations of these functions for comparison with other faster versions. These are actually a bit harder to implement. I have always assumed there are some straightforward (but slow) implementations of these.
For some there are very accurate direct approximations (log
), for others
Taylor series can be used,
but only in a limited range so some range reduction tricks need to be done. For
exp
I had to inverse the log
function (by finding the root with
Newton-Raphson method). This
is pretty slow when using the already slow log
function but fine for a reference
implementation that is used for comparison only.
The implementation of the functions for the actual library was simpler as it could use more approximations since we can compare it to the reference implementation to measure the quality. It uses less terms to balance the speed vs the diminishing improvements when computing it directly in doubles.
Here you can test the quality of the math library on various expressions. It supports
the basic operators (+
, -
, *
, /
),
grouping using parenthesis and these functions:
log
,
log2
,
log10
,
exp
,
pow
,
sqrt
,
cbrt
,
sin
,
cos
,
tan
,
asin
,
acos
,
atan
,
min
,
max
.
You can also try some expressions such as:
1/x
,
1/((x-5)*(x+3))
,
x*0.1
or just try your own.
You can switch between the math library, JavaScript functions or unselect it to show the reference implementation that is used to determine the amount of the error. The red line at the bottom shows the error.
The reference implementation has some unresolved problems. As it is used only for comparison and the issues are not critical and not present in the actual library it is instead just listed here. If you know how to fix them feel free to leave a post in the comments.
cbrt
function has abrupt anomalies (show)You can download the code here. It is released into public domain (using CC0 license). It contains both the library and the reference implementation for comparison purposes. If you're interested to tinker with the web demo, you can download it as well.
You can also check my other public domain libraries (inflate/deflate/GZIP, PNG, binary diff/patch) here.
If you like the math library or my other public domain code you can support the effort by donating:
No comments.