Floating-Point Numbers as Approximations and the Errors of Using Them

NewPast

Jan 7, 2017

CPOL

Floating-point numbers are approximations; this article explains the errors that arise when using the Single and Double data types

Download source code - 1.4 KB

CFloat.vb is a class for handling and explaining floating-point numbers in VB.

What are Floating-point Numbers

Computers use floating-point numbers to represent real numbers with a fractional part. In many cases, the number we want to represent cannot be expressed exactly as a finite sequence of binary digits, so the nearest representable value is stored instead. As a result, math operations on floating-point numbers may give slightly different results than expected. For example, calculating 0.43 + 0.000001 - 0.430001 in C#, VB, VBA, Python, PHP, or Java does not return 0.
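As a minimal sketch in Python (one of the languages named above), the exact values stored for these literals can be inspected by converting each double to `Decimal`. The variable names are illustrative only.

```python
from decimal import Decimal

# Decimal(float) converts the stored binary value exactly, digit for digit.
stored_043 = Decimal(0.43)
stored_sum = Decimal(0.43 + 0.000001)
stored_target = Decimal(0.430001)

result = 0.43 + 0.000001 - 0.430001

print("0.43 is stored as    ", stored_043)
print("0.43 + 0.000001 gives", stored_sum)
print("0.430001 is stored as", stored_target)
print("difference:", result)  # a tiny nonzero value rather than 0
```

None of the three literals is stored exactly, and the rounded sum lands on a different double than the literal 0.430001, which is why the subtraction leaves a residue.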

Storage Diagram

The IEEE 754-2019 standard (also published as ISO/IEC 60559:2020) defines the formats and rules of floating-point arithmetic.

The [Float / Single] data type is stored in 4 bytes = 32 bits: 1 sign bit, 8 exponent bits, and 23 fraction bits, as shown in the binary storage diagram below. S is the sign bit, E marks the exponent bits, and F marks the fraction bits.
The fraction is also called the mantissa.

The binary format of a Single precision number is as follows:

S EEEE EEEE FFF FFFF FFFF FFFF FFFF FFFF

Value	Memory Hexadecimal	Sign Bit	Exponent Hexadecimal	Fraction Hexadecimal
Epsilon; smallest positive number 1.4E-45	0000 0001	0	00	00 0001
Zero; +0!	0000 0000	0	00	00 0000
Negative Zero; -0!	8000 0000	1	00	00 0000
1!	3F80 0000	0	7F	00 0000
The smallest number > 1; 1.00000012!	3F80 0001	0	7F	00 0001
Not a number; NaN	FFC0 0000	1	FF	40 0000
Infinity; ∞	7F80 0000	0	FF	00 0000
Negative Infinity; -∞	FF80 0000	1	FF	00 0000
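The single-precision layout can be checked with a short Python sketch that packs a value into 4 bytes and splits the bits into the three fields. The helper name `single_fields` is hypothetical, and Python's `struct` module is assumed here as a stand-in for inspecting memory directly.

```python
import struct

def single_fields(value):
    """Return (memory hex, sign bit, exponent hex, fraction hex) for a 32-bit single."""
    # Pack as a big-endian single, then reread the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return f"{bits:08X}", sign, f"{exponent:02X}", f"{fraction:06X}"

samples = [
    ("smallest positive", 2.0 ** -149),
    ("+0", 0.0),
    ("-0", -0.0),
    ("1", 1.0),
    ("+infinity", float("inf")),
    ("-infinity", float("-inf")),
]
for label, value in samples:
    print(label, single_fields(value))
```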

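The [Double] data type is stored in 8 bytes = 64 bits: 1 sign bit, 11 exponent bits, and 52 fraction bits. The binary format of a Double precision number is as follows: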

S EEEE EEEE EEE FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF

Value	Memory Hexadecimal	Sign Bit	Exponent Hexadecimal	Fraction Hexadecimal
Epsilon; smallest positive number 5E-324	0000 0000 0000 0001	0	000	0 0000 0000 0001
Zero; +0#	0000 0000 0000 0000	0	000	0 0000 0000 0000
Negative Zero; -0#	8000 0000 0000 0000	1	000	0 0000 0000 0000
1#	3FF0 0000 0000 0000	0	3FF	0 0000 0000 0000
The smallest number > 1; 1.0000000000000002#	3FF0 0000 0000 0001	0	3FF	0 0000 0000 0001
Not a number; NaN	FFF8 0000 0000 0000	1	7FF	8 0000 0000 0000
Infinity; ∞	7FF0 0000 0000 0000	0	7FF	0 0000 0000 0000
Negative Infinity; -∞	FFF0 0000 0000 0000	1	7FF	0 0000 0000 0000
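Going the other way, a double-precision bit pattern can be turned back into a value. The following Python sketch assumes big-endian byte order for readability; `double_from_hex` is a hypothetical helper, not part of CFloat.vb.

```python
import math
import struct
import sys

def double_from_hex(hex_digits):
    """Interpret 16 hexadecimal digits (spaces allowed) as a 64-bit double."""
    raw = bytes.fromhex(hex_digits.replace(" ", ""))
    return struct.unpack(">d", raw)[0]

one = double_from_hex("3FF0 0000 0000 0000")
next_after_one = double_from_hex("3FF0 0000 0000 0001")

print(one)                                        # 1.0
print(next_after_one - one == sys.float_info.epsilon)
print(double_from_hex("0000 0000 0000 0001"))     # smallest positive double
print(double_from_hex("7FF0 0000 0000 0000"))     # positive infinity
print(double_from_hex("FFF0 0000 0000 0000"))     # negative infinity
print(math.isnan(double_from_hex("FFF8 0000 0000 0000")))
```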

Unexpected Code Behavior Caused by Floating Point

Because floating-point numbers only approximate real numbers, addition and other mathematical operations do not always produce the exact result.

Addition test:

//C#
// Interaction.MsgBox comes from the Microsoft.VisualBasic namespace.
public void TestAdd()
{
    double V = 0.43 + 1E-06 - 0.430001;
    if (V != 0) {
        // V holds a tiny rounding residue, so this branch runs.
        Interaction.MsgBox("It will go here!");
    }
    float f = 0.43f + 1E-06f - 0.430001f;
    if (f != 0) {
        Interaction.MsgBox("It will go here!");
    }
}

'VB
Sub TestAdd()
    Dim V As Double
    V = 0.43# + 0.000001# - 0.430001#
    If V <> 0 Then
        MsgBox("It will go here!")
    End If
    Dim f As Single
    f = 0.43! + 0.000001! - 0.430001!
    If f <> 0 Then
        MsgBox("It will go here!")
    End If
End Sub
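One common way around the failed equality test is to compare within a tolerance, or to use a decimal type when exact decimal arithmetic is needed. The sketch below shows both in Python; it is one possible approach, not something taken from the downloadable source.

```python
import math
from decimal import Decimal

total = 0.43 + 0.000001
target = 0.430001

print(total == target)              # False: the two doubles differ in the last bit
print(math.isclose(total, target))  # True: equal within the default relative tolerance

# Decimal arithmetic on decimal strings is exact for these values.
exact = Decimal("0.43") + Decimal("0.000001") - Decimal("0.430001")
print(exact == 0)                   # True
```

The tolerance should be chosen for the data at hand; `math.isclose` defaults to a relative tolerance of 1e-09, and comparisons against zero need an explicit `abs_tol`.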

Special values:

'VB
Sub Test(x As Double)
    If x > 0 Then
        MsgBox("X > 0")
    ElseIf x = 0 Then
        MsgBox("X = 0")
    ElseIf x < 0 Then
        MsgBox("X < 0")
    Else
        ' Reached only when x is NaN: every comparison with NaN is False.
        MsgBox("This is a possible case! What is the value of X here?")
        Dim R =
        (Double.NaN = 0) = False AndAlso
        (Double.NaN < 0) = False AndAlso
        (Double.NaN > 0) = False
    End If
End Sub

Sub Test(x As Single)
    If x > 0 Then
        MsgBox("X > 0")
    ElseIf x = 0 Then
        MsgBox("X = 0")
    ElseIf x < 0 Then
        MsgBox("X < 0")
    Else
        ' Reached only when x is NaN: every comparison with NaN is False.
        MsgBox("This is a possible case! What is the value of X here?")
        Dim R =
        (Single.NaN = 0) = False AndAlso
        (Single.NaN < 0) = False AndAlso
        (Single.NaN > 0) = False
    End If
End Sub

//C#
public void Test(double x)
{
	if (x > 0) {
		Interaction.MsgBox("X > 0");
	} else if (x == 0) {
		Interaction.MsgBox("X = 0");
	} else if (x < 0) {
		Interaction.MsgBox("X < 0");
	} else {
		// Reached only when x is NaN: every comparison with NaN is false.
		Interaction.MsgBox("This is a possible case! What is the value of X here?");
		bool R = (double.NaN == 0) == false && (double.NaN < 0) == false &&
		(double.NaN > 0) == false;
	}
}

public void Test(float x)
{
	if (x > 0) {
		Interaction.MsgBox("X > 0");
	} else if (x == 0) {
		Interaction.MsgBox("X = 0");
	} else if (x < 0) {
		Interaction.MsgBox("X < 0");
	} else {
		// Reached only when x is NaN: every comparison with NaN is false.
		Interaction.MsgBox("This is a possible case! What is the value of X here?");
		bool R = (float.NaN == 0) == false &&
		(float.NaN < 0) == false && (float.NaN > 0) == false;
	}
}
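The same fall-through happens in Python, as the sketch below suggests. The function names are illustrative; the second version tests for NaN explicitly with `math.isnan` instead of relying on the final `else` branch.

```python
import math

def classify(x):
    """Mirror of the Test routine: the last branch is reached only for NaN."""
    if x > 0:
        return "x > 0"
    elif x == 0:
        return "x = 0"
    elif x < 0:
        return "x < 0"
    else:
        return "none of the above"

def classify_checked(x):
    """Same logic, but NaN is detected up front."""
    if math.isnan(x):
        return "x is NaN"
    return classify(x)

nan = float("nan")
print(classify(nan))          # none of the above
print(nan == nan)             # False: NaN is not equal even to itself
print(classify_checked(nan))  # x is NaN
```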

How to Get a Floating-Point Value from Its Exponent and Fraction

The following function computes the value, as a Double, of a floating-point number from its sign, exponent, and fraction fields. The exponent bias and fraction base depend on the floating-point type; see the table.

Type	Total Bits	Exponent bias	Fraction base
Half / Float16	16	15	2^10
Single / Float	32	127	2^23
Double	64	1023	2^52
Quad / Float128	128	16383	2^112
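The bias and base columns follow from the field widths: a format with e exponent bits has a bias of 2^(e-1) - 1, and a format with f fraction bits has a fraction base of 2^f. A small Python sketch, with the IEEE 754 field widths written out as an assumption in the `FORMATS` list:

```python
# (name, total bits, exponent bits, fraction bits); one more bit holds the sign.
FORMATS = [
    ("Half", 16, 5, 10),
    ("Single", 32, 8, 23),
    ("Double", 64, 11, 52),
    ("Quad", 128, 15, 112),
]

def bias_and_base(exponent_bits, fraction_bits):
    """Return (exponent bias, fraction base) for the given field widths."""
    return 2 ** (exponent_bits - 1) - 1, 2 ** fraction_bits

for name, total, e_bits, f_bits in FORMATS:
    bias, base = bias_and_base(e_bits, f_bits)
    print(f"{name}: {total} bits, bias {bias}, fraction base 2^{f_bits} = {base}")
```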

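The value of a number is calculated from the following formulas.

Exponent <> 0 and Exponent <> 2 * ExponentBias + 1 (normal numbers):

Value = ±2^(Exponent - ExponentBias) * (1 + Fraction / FractionBase)

Exponent = 0 (zero and subnormal numbers):

Value = ±2^(1 - ExponentBias) * (Fraction / FractionBase)

Exponent = 2 * ExponentBias + 1 (all exponent bits set):

Value = ±Infinity if Fraction = 0, otherwise NaN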
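As a worked example, take the Single with memory value 3E20 0000: the sign bit is 0, the exponent is 7C (124), and the fraction is 20 0000 (2^21), so Value = 2^(124 - 127) * (1 + 2^21 / 2^23) = 0.125 * 1.25 = 0.15625. The Python sketch below applies the same formulas and checks them against the `struct` module; `value_from_fields` and `single_from_bits` are hypothetical helper names, and the sketch only covers formats whose values fit in a double.

```python
import struct

def value_from_fields(is_negative, exponent, fraction, exponent_bias, fraction_base):
    """Apply the formulas above; an all-ones exponent means infinity or NaN."""
    sign = -1.0 if is_negative else 1.0
    if exponent == 2 * exponent_bias + 1:
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1.
        return sign * 2.0 ** (1 - exponent_bias) * (fraction / fraction_base)
    # Normal number: implicit leading 1.
    return sign * 2.0 ** (exponent - exponent_bias) * (1 + fraction / fraction_base)

def single_from_bits(bits):
    """Reference decoder: let struct interpret the 32 bits as a single."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

bits = 0x3E200000
sign, exponent, fraction = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
print(value_from_fields(sign == 1, exponent, fraction, 127, 2 ** 23))  # 0.15625
print(single_from_bits(bits))                                          # 0.15625
```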

'VB
Function GetDoubleValue(IsNegative As Boolean, Exponent As UInt16,
        Fraction As UInt64, ExponentBias As UInt16, FractionBase As UInt64) As Double
    If Exponent = 0 Then
        If Fraction = 0 Then
            Return If(IsNegative, -0#, 0#)
        End If
        Dim FractionRatio = Fraction / FractionBase
        Return If(IsNegative, -1#, 1#) * (2 ^ (1 - ExponentBias)) * FractionRatio
    Else
        If Exponent = 2 * ExponentBias + 1 Then
            If Fraction = 0 Then
                Return If(IsNegative, Double.NegativeInfinity, Double.PositiveInfinity)
            Else
                Return Double.NaN
            End If
        End If
        Dim FractionRatio = Fraction / FractionBase
        Return If(IsNegative, -1#, 1#) * _
               (2 ^ (CInt(Exponent) - ExponentBias)) * (1# + FractionRatio)
    End If
End Function

//C#
public double GetDoubleValue(bool IsNegative, UInt16 Exponent, UInt64 Fraction,
        UInt16 ExponentBias, UInt64 FractionBase)
{
    double Sign = IsNegative ? -1.0 : 1.0;
    // Cast first: dividing two UInt64 values would truncate the ratio to an integer.
    double FractionRatio = (double)Fraction / FractionBase;
    if (Exponent == 0) {
        if (Fraction == 0) {
            return IsNegative ? -0.0 : 0.0;
        }
        // Subnormal number: no implicit leading 1.
        return Sign * Math.Pow(2, 1 - ExponentBias) * FractionRatio;
    } else {
        if (Exponent == 2 * ExponentBias + 1) {
            if (Fraction == 0) {
                return IsNegative ? double.NegativeInfinity : double.PositiveInfinity;
            } else {
                return double.NaN;
            }
        }
        // Normal number: implicit leading 1.
        return Sign * Math.Pow(2, Exponent - ExponentBias) * (1 + FractionRatio);
    }
}

Database Primary Keys and Floating Point

Since a floating-point number is an approximation of a real number, it is not a good idea to use one as the primary key in a database.
For example:
If we have a table named Customers with a field id of data type Single, then looking up a row by id may return no record even though the record is in the database.
When we insert 0.4301 as the id, the value actually stored differs slightly from 0.4301, which may lead to unexpected results in a database update or select.
The following SQL may return no record, depending on the database driver and how it converts numbers from decimal to the floating-point (float) type.

Select Customer From Customers where id = 0.4301
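The mismatch can be reproduced without a database. The Python sketch below assumes the id column holds a 32-bit single while the literal in the query is parsed as a 64-bit double; real drivers may convert differently. `as_single` is a hypothetical helper.

```python
import struct

def as_single(value):
    """Round a Python float (a double) to the nearest 32-bit single and back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

stored_id = as_single(0.4301)  # what a Single column would hold
query_id = 0.4301              # the literal, kept as a double

print(stored_id)                         # close to, but not exactly, 0.4301
print(stored_id == query_id)             # False: the lookup finds no row
print(stored_id == as_single(query_id))  # True once both sides are singles
```

An integer or fixed-point decimal key avoids the problem because its values are stored exactly.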