Notation & Number Bases
A subscript indicates the base: $x_2$ denotes a number in base 2 (binary), and $x_{10}$ denotes a number in base 10 (decimal).
For example:
- $13_{10}$ means thirteen in decimal.
- $1101_2$ means the binary number “1101.”
Decimal Normalization (Scientific Notation)
In decimal scientific notation, we represent a number so that there is only one nonzero digit to the left of the decimal point. This “normalized” form looks like this:

$$d.ddd\ldots \times 10^{e}, \quad 1 \le d \le 9$$
Examples:
- $13_{10}$ can be written as $1.3 \times 10^1$. (We moved the decimal point 1 place to the left.)
- $1500_{10}$ can be written as $1.5 \times 10^3$. (We moved the decimal point 3 places to the left.)
Moving the Decimal Point with Powers of 10
- Positive Exponent: $10^3$ means move the decimal point 3 places to the right: $1.5 \times 10^3 = 1500$.
- Negative Exponent: $10^{-3}$ means move the decimal point 3 places to the left: $1.5 \times 10^{-3} = 0.0015$.

The general rule for positive and negative exponents of 10:
- Positive exponents: $10^n = \underbrace{10 \times 10 \times \cdots \times 10}_{n \text{ times}}$
- Negative exponents: $10^{-n} = \dfrac{1}{10^n}$, or equivalently $10^{-n} = \left(\dfrac{1}{10}\right)^n$
Binary Normalization & Its Relation to Scientific Notation
Floating‑point numbers in computers work similarly to scientific notation, but in base 2. In binary, normalized numbers are written as:

$$\pm 1.bbb\ldots_2 \times 2^{e}$$

where each $b$ is a binary digit (0 or 1).
Key Points:
- Normalized Form (Binary): There is always one nonzero digit to the left of the binary point. In binary, that digit is always 1 (except for special cases called subnormals).
- Implicit Leading 1: Because the leftmost digit in any normalized binary number is always 1, it is not stored. This “hidden bit” saves space and improves precision.
Example:
- The number $13_{10}$ in binary is $1101_2$. Normalizing it:

$$1101_2 = 1.101_2 \times 2^3$$
Notice that the “1” before the binary point is implicit when stored.
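A quick numeric check of this normalization (a minimal Rust sketch, using the worked values above):

```rust
fn main() {
    let significand = 1.0 + 0.5 + 0.125; // 1.101₂ = 1.625₁₀
    assert_eq!(significand * 2f64.powi(3), 13.0); // matches 1101₂ = 13₁₀
}
```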
Components of a Floating-Point Number
An IEEE 754 floating‑point number (e.g., in 32‑bit single precision) is divided into three main parts:
- Sign Bit (1 bit):
- 0 means positive.
- 1 means negative.
- Exponent Field (k bits):
- Stored as an unsigned integer.
- Uses a bias to allow for both positive and negative exponents.
- Mantissa (Significand) (remaining bits):
- Contains the fractional part of the normalized number.
- The implicit leading “1” is not stored for normalized numbers.
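As a quick illustration of this layout, here is a minimal sketch using the standard `f32::to_bits`; the shifts and masks follow the 1/8/23-bit split described above:

```rust
fn main() {
    let bits = 13.25f32.to_bits();       // raw IEEE 754 bit pattern as u32
    let sign = bits >> 31;               // bit 31: sign
    let exponent = (bits >> 23) & 0xFF;  // bits 30..=23: biased exponent
    let mantissa = bits & 0x7F_FFFF;     // bits 22..=0: fraction only
    println!("sign={sign} exponent={exponent} mantissa={mantissa:023b}");
}
```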
Understanding the Bias
Why Use a Bias?
- Problem: Exponents can be positive or negative (e.g., $2^3$ or $2^{-5}$), but hardware prefers to work with unsigned integers.
- Solution: Add a fixed bias to the actual exponent so it can be stored as an unsigned number.
Bias Formula:

$$\text{Bias} = 2^{k-1} - 1$$

where $k$ is the number of exponent bits.
- For 32-bit floats ($k = 8$): Bias = $2^7 - 1 = 127$
- For 64-bit floats ($k = 11$): Bias = $2^{10} - 1 = 1023$
How It Works:
- Storing: Stored Exponent = Actual Exponent + Bias
- Retrieving: Actual Exponent = Stored Exponent - Bias
Example:
- If the actual exponent is $3$ (in single precision), the stored exponent is $3 + 127 = 130$.
- If we read a stored exponent of $120$, then the actual exponent is $120 - 127 = -7$.
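In code, each direction is a one-liner; a small sketch assuming single precision (bias 127), with hypothetical helper names:

```rust
const BIAS: i32 = 127; // 2^(8-1) - 1 for single precision

// Hypothetical helpers for illustration.
fn stored_exponent(actual: i32) -> u8 {
    (actual + BIAS) as u8
}

fn actual_exponent(stored: u8) -> i32 {
    stored as i32 - BIAS
}

fn main() {
    assert_eq!(stored_exponent(3), 130);  // 3 + 127
    assert_eq!(actual_exponent(120), -7); // 120 - 127
}
```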
How Does the Bias Help with Comparisons?
Because each exponent value is stored as a non-negative (biased) integer, we can compare exponents by simply comparing their stored (unsigned) integer representations—provided the sign bit for the whole floating-point number is the same.
Key idea: If we keep the number format in [sign | exponent + bias | fraction], then for two positive floating-point numbers:
- If the stored exponent of $A$ is greater than the stored exponent of $B$, then $A > B$, regardless of the fraction (assuming both are normalized and not edge cases like NaN or infinity).
- If the stored exponent of $A$ equals the stored exponent of $B$, then comparing the fraction bits determines which is bigger.
Hence, hardware can do much simpler comparisons by treating the exponent field as an unsigned magnitude. We avoid messing around with a separate sign bit for the exponent, and the ordering of floats becomes more straightforward under typical cases.
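A small sketch of this property: for positive, finite, non-NaN values, the raw bit patterns order the same way the floats do.

```rust
fn main() {
    let (a, b) = (13.25f32, 2.5f32);
    assert!(a > b);
    // The unsigned bit patterns agree with the float ordering:
    assert!(a.to_bits() > b.to_bits());
}
```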
Why the Mantissa Always Starts with 1
- Normalization: In binary, to have only one nonzero digit to the left of the binary point, we adjust the number so it is always in the form $1.bbb\ldots_2 \times 2^{e}$.
- Storage Efficiency: Since the leading digit is always 1 (in normalized numbers), it does not need to be stored. This saves one bit and allows more bits to be used for the fraction.
- Exception: Subnormal Numbers When the exponent field is all zeros, the number is too small to be normalized. In this case, the leading digit is 0 instead of 1.
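For example, the smallest positive `f32` is a subnormal: its exponent field is all zeros and its mantissa is 1. A quick sketch using the standard `f32::from_bits` and `is_normal`:

```rust
fn main() {
    let tiny = f32::from_bits(1); // exponent field all zeros, mantissa = 1
    assert!(tiny > 0.0 && tiny < f32::MIN_POSITIVE); // below the smallest normal
    assert!(!tiny.is_normal()); // classified as subnormal
    println!("{tiny:e}"); // ≈ 1e-45, i.e. 2^(-149)
}
```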
Converting Between Numbers and IEEE 754 Representation
Let’s walk through two detailed examples.
Example A: Converting 13.25 to IEEE 754 Single Precision
A.1. Convert 13.25 to Binary
- Integer Part (13): $13_{10} = 1101_2$.
- Fractional Part (0.25): Use the “multiply-by-2” method:
  - $0.25 \times 2 = 0.5$ => Bit: 0 (because $0.5 < 1$)
  - $0.5 \times 2 = 1.0$ => Bit: 1 (since we reached 1, subtract 1 and continue; here, the remainder becomes 0)
  - So, $0.25_{10} = 0.01_2$.
- Combine: $13.25_{10} = 1101.01_2$.
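The multiply-by-2 method is easy to mechanize; here is a sketch (the function name and digit limit are just for illustration):

```rust
/// Convert a decimal fraction in [0, 1) to its binary digits,
/// stopping after `max_digits` to avoid looping on repeating fractions.
fn fraction_to_binary(mut frac: f64, max_digits: usize) -> String {
    let mut digits = String::new();
    for _ in 0..max_digits {
        if frac == 0.0 {
            break;
        }
        frac *= 2.0;
        if frac >= 1.0 {
            digits.push('1');
            frac -= 1.0; // reached 1: record the bit and keep the remainder
        } else {
            digits.push('0');
        }
    }
    digits
}

fn main() {
    assert_eq!(fraction_to_binary(0.25, 8), "01"); // 0.25₁₀ = 0.01₂
}
```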
A.2. Normalize the Binary Number
Normalize to the form $1.bbb\ldots_2 \times 2^{e}$:

$$1101.01_2 = 1.10101_2 \times 2^3$$

Normalized significand: $1.10101_2$. Actual exponent: $3$.
A.3. Determine the IEEE 754 Fields
- Sign Bit: $13.25$ is positive → Sign = 0.
- Exponent Field:
  - Actual exponent = $3$
  - Bias (for single precision) = $127$
  - Stored exponent = $3 + 127 = 130$, in binary (8 bits): $10000010_2$
- Mantissa (Fraction) Field:
  - From the normalized significand $1.10101_2$, the stored mantissa is the bits after the leading 1: $10101$
  - The mantissa field is 23 bits, so pad with zeros: $10101000000000000000000$
A.4. Final IEEE 754 Single Precision Representation
Combine the fields:
- Sign: $0$
- Exponent: $10000010$
- Mantissa: $10101000000000000000000$

Final 32-bit representation: `0 10000010 10101000000000000000000`
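We can confirm this bit pattern directly; a quick check using `f32::to_bits`:

```rust
fn main() {
    let bits = 13.25f32.to_bits();
    //                 sign  exponent  mantissa
    assert_eq!(bits, 0b0_10000010_10101000000000000000000);
    println!("{bits:032b}");
}
```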
A.5. Converting Back from Binary to Number
Extract the Fields:
- Sign: $0$ → positive
- Exponent field: $10000010_2 = 130$ => Actual exponent = $130 - 127 = 3$
- Mantissa: $10101$ (plus 18 zeros) → Normalized significand = $1.10101_2$

Compute the Value:

$$1.10101_2 \times 2^3 = 1101.01_2 = 13.25_{10}$$

Converting $1.10101_2$ to decimal gives $1.65625$, and multiplying by $2^3 = 8$ yields $13.25$.
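The reverse direction also checks out in code; a quick sketch using `f32::from_bits`:

```rust
fn main() {
    let value = f32::from_bits(0b0_10000010_10101000000000000000000);
    assert_eq!(value, 13.25); // the fields decode back to 13.25
}
```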
To convert the binary significand $1.10101_2$ to its decimal equivalent $1.65625$, we can use two different methods:
Method 1: Weighted Sum (Bit-by-Bit) Method
- Write the number as $1.10101_2$. Separate it into the integer part ($1$) and the fractional part ($0.10101_2$).
- The integer part is simply $1$.
- For the fractional part $0.10101_2$:
  - The first digit after the point ($1$) represents $1 \times 2^{-1} = 0.5$.
  - The second digit ($0$) represents $0 \times 2^{-2} = 0$.
  - The third digit ($1$) represents $1 \times 2^{-3} = 0.125$.
  - The fourth digit ($0$) represents $0 \times 2^{-4} = 0$.
  - The fifth digit ($1$) represents $1 \times 2^{-5} = 0.03125$.
- Add these contributions together: $1 + 0.5 + 0 + 0.125 + 0 + 0.03125 = 1.65625$.
Method 2: Arithmetic Division Method
- Write the number as $1.10101_2$, where the fractional part is $10101_2$.
- Convert the fractional part directly from binary to decimal as an integer: $10101_2 = 21$.
- Count the number of digits in the fractional part. Here, there are 5 digits, so the denominator will be $2^5 = 32$.
- Compute the fractional value by dividing the integer (21) by 32: $21 / 32 = 0.65625$.
- Add the integer part (which is 1) to this fraction: $1 + 0.65625 = 1.65625$.
Both methods show that $1.10101_2$ equals $1.65625$ in decimal.
Rust Code Example:

```rust
/// A structure that decomposes a 32-bit floating-point number (f32)
/// into its sign, exponent, and mantissa components.
#[derive(Debug, Clone, Copy)]
pub struct FloatRepr {
/// `true` if the number is negative.
sign: bool,
/// The raw 8-bit exponent (stored with a bias of 127).
exponent: u8,
/// The raw 23-bit mantissa (fractional part).
mantissa: u32,
}
impl FloatRepr {
/// Constructs a `FloatRepr` from an `f32` value by extracting its underlying bit fields.
///
/// # Details
/// - The sign is the most significant bit.
/// - The next 8 bits represent the exponent (with a bias of 127).
/// - The remaining 23 bits represent the mantissa.
pub fn from_f32(value: f32) -> Self {
let bits = value.to_bits();
let sign = (bits >> 31) != 0;
let exponent = ((bits >> 23) & 0xFF) as u8;
let mantissa = bits & 0x7FFFFF; // 23 bits
Self {
sign,
exponent,
mantissa,
}
}
/// Returns the unbiased exponent.
///
/// The exponent in IEEE 754 format is stored with a bias of 127.
/// Subtracting 127 yields the actual exponent.
pub fn unbiased_exponent(&self) -> i32 {
self.exponent as i32 - 127
}
/// Computes the fractional part of the mantissa using arithmetic division.
///
/// The mantissa is stored as a 23-bit integer representing the fractional part.
/// Dividing by 2²³ normalizes it to the range [0, 1).
///
/// # Note
/// For normalized numbers, the full significand is `1 + mantissa_fraction()`.
pub fn mantissa_fraction_arithmetic(&self) -> f32 {
// 1 << 23 equals 2^23 (a bitwise left shift)
self.mantissa as f32 / (1 << 23) as f32
}
/// Computes the fractional part of the mantissa using a weighted bit-by-bit summation.
///
/// This method iterates over each bit (0 to 22) of the 23-bit mantissa.
/// Each bit at position `i` represents a fraction with weight 2^(i - 23),
/// meaning that the least-significant bit (position 0) has a weight of 2^(-23)
/// and the most significant bit (position 22) has a weight of 2^(-1).
///
/// For each bit position, we test if the bit is set (i.e. equals 1).
/// The expression `(self.mantissa >> i) & 1 == 1` does that:
/// - `self.mantissa >> i` shifts the mantissa right by `i` bits, moving the i-th bit to the rightmost position.
/// - `& 1` masks all other bits, isolating just that least significant bit.
/// - We then compare the result with 1. If the result is 1, then the bit at position `i` was set.
///
/// The weighted contribution for that bit if it is set is computed as 2^(i - 23)
/// and added to the running total. In the end, the sum of all contributions represents
/// the decoded fractional part of the mantissa.
pub fn mantissa_fraction_weighted(&self) -> f32 {
let mut result = 0.0;
for i in 0..23 {
if (self.mantissa >> i) & 1 == 1 {
// If the bit is set, we add its fractional contribution
result += 2f32.powf(i as f32 - 23.0);
}
}
result
}
}
```
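A short usage sketch (assuming the `FloatRepr` definition above is in scope in the same module), decoding 13.25 back from its parts:

```rust
fn main() {
    let repr = FloatRepr::from_f32(13.25);
    assert_eq!(repr.unbiased_exponent(), 3); // stored 130 - bias 127

    // Reassemble the value: sign × (1 + fraction) × 2^exponent.
    let significand = 1.0 + repr.mantissa_fraction_arithmetic(); // 1.65625
    let sign = if repr.sign { -1.0 } else { 1.0 };
    assert_eq!(sign * significand * 2f32.powi(repr.unbiased_exponent()), 13.25);

    // The two decoding methods agree:
    assert_eq!(
        repr.mantissa_fraction_arithmetic(),
        repr.mantissa_fraction_weighted()
    );
}
```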
Methods:
The methods `mantissa_fraction_arithmetic` and `mantissa_fraction_weighted` implement two equivalent techniques for converting the mantissa back to its fractional decimal form, leveraging a bit of bitwise manipulation. They correspond to the two conversion methods described earlier in these notes.
- The `mantissa_fraction_arithmetic` method:
  - Conciseness: A single, clear arithmetic expression.
  - Efficiency: Likely optimized by the compiler and directly mapped to a hardware division.
  - Readability: It directly maps the mathematical operation, making it easier to grasp once we're familiar with the theory.
- The `mantissa_fraction_weighted` method:
  - Educational Value: It clearly shows the bit-level contributions to the overall fraction, which can help in understanding how binary fractions work.
  - Explicitness: We see the logic behind the summation of weighted bits.
  - Verbosity: It takes several lines and a loop to achieve what one arithmetic division does.
  - Performance: The loop and conditional checks can be less efficient compared to a single division.
- Bitwise visualization help:
```
Mantissa: 0b101 (5 in decimal); 3-bit example, so bit i has weight 2^(i-3)
Bit positions: [2][1][0]

i=0 (weight 2^-3 = 0.125)
┌────────────────────┐
│ Shift right 0: 101 │ → LSB=1 → add 0.125
└────────────────────┘

i=1 (weight 2^-2 = 0.25)
┌────────────────────┐
│ Shift right 1:  10 │ → LSB=0 → no addition
└────────────────────┘

i=2 (weight 2^-1 = 0.5)
┌────────────────────┐
│ Shift right 2:   1 │ → LSB=1 → add 0.5
└────────────────────┘

Total: 0.125 + 0.5 = 0.625 (equivalent to 5/8)
```
Some considerations:
- IEEE 754 Standard: Defines the format for 32‑bit (single precision) and 64‑bit (double precision) floats, as well as special values such as NaN (Not a Number), ±Infinity, and ±0.
- Subnormal (Denormal) Numbers: When the exponent field is all zeros, numbers are too small to be normalized. In these cases, the leading bit is assumed to be 0, which allows representation of values very close to zero.
- Precision Limitations: Not all decimal numbers can be exactly represented in binary (e.g., 0.1 is repeating in binary). This leads to rounding errors in arithmetic operations.
- Range vs. Precision:
- More exponent bits → Wider range of representable numbers.
- More mantissa bits → Higher precision (more exact fractional representation).
- Common Pitfalls:
- Avoid direct equality checks with floats: instead of `if a == b`, use a tolerance check like `abs(a - b) < 1e-6` (see the sketch after this list).
- Be aware of rounding errors when converting between decimal and binary.
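For instance, a classic demonstration of the equality pitfall (the tolerance threshold here is arbitrary):

```rust
fn main() {
    let a = 0.1f64 + 0.2; // 0.1 and 0.2 are repeating fractions in binary
    let b = 0.3f64;
    assert_ne!(a, b);              // exact comparison fails: a is 0.30000000000000004
    assert!((a - b).abs() < 1e-9); // a tolerance check succeeds
}
```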