Notation & Number Bases
A subscript indicates the base: $x_2$ denotes a number in base 2 (binary), and $x_{10}$ denotes a number in base 10 (decimal).
For example:
- $13_{10}$ means thirteen in decimal.
- $1101_2$ means the binary number “1101.”
Decimal Normalization (Scientific Notation)
In decimal scientific notation, we represent a number so that there is only one nonzero digit to the left of the decimal point. This “normalized” form looks like this:

$$d.ddd\ldots \times 10^{e}, \quad 1 \le d \le 9$$
Examples:
- $13_{10}$ can be written as $1.3 \times 10^1$. (We moved the decimal point 1 place to the left.)
- $1500_{10}$ can be written as $1.5 \times 10^3$. (We moved the decimal point 3 places to the left.)
Moving the Decimal Point with Powers of 10
- Positive Exponent: $10^3$ means move the decimal point 3 places to the right: $1.5 \times 10^3 = 1500$.
- Negative Exponent: $10^{-3}$ means move the decimal point 3 places to the left: $1.5 \times 10^{-3} = 0.0015$.

The general rule for positive and negative exponents of 10:
- Positive exponents: $10^n = \underbrace{10 \times 10 \times \cdots \times 10}_{n \text{ times}}$
- Negative exponents: $10^{-n} = \dfrac{1}{10^n}$, or equivalently $10^{-n} = \left(\dfrac{1}{10}\right)^n$
Binary Normalization & Its Relation to Scientific Notation
Floating‑point numbers in computers work similarly to scientific notation, but in base 2. In binary, normalized numbers are written as:

$$\pm 1.bbb\ldots_2 \times 2^{e}$$

where each $b$ is a binary digit (0 or 1).
Key Points:
- Normalized Form (Binary): There is always one nonzero digit to the left of the binary point. In binary, that digit is always 1 (except for special cases called subnormals).
- Implicit Leading 1: Because the leftmost digit in any normalized binary number is always 1, it is not stored. This “hidden bit” saves space and improves precision.
Example:
- The number $13_{10}$ in binary is $1101_2$. Normalizing it:

$$1101_2 = 1.101_2 \times 2^3$$
Notice that the “1” before the binary point is implicit when stored.
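A quick numeric check of this normalization (a minimal Rust sketch, using the worked values above):

```rust
fn main() {
    let significand = 1.0 + 0.5 + 0.125; // 1.101₂ = 1.625₁₀
    assert_eq!(significand * 2f64.powi(3), 13.0); // matches 1101₂ = 13₁₀
}
```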
Components of a Floating-Point Number
An IEEE 754 floating‑point number (e.g., in 32‑bit single precision) is divided into three main parts:
- Sign Bit (1 bit):
- 0 means positive.
- 1 means negative.
- Exponent Field (k bits):
- Stored as an unsigned integer.
- Uses a bias to allow for both positive and negative exponents.
- Mantissa (Significand) (remaining bits):
- Contains the fractional part of the normalized number.
- The implicit leading “1” is not stored for normalized numbers.
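As a quick illustration of this layout, here is a minimal sketch using the standard `f32::to_bits`; the shifts and masks follow the 1/8/23-bit split described above:

```rust
fn main() {
    let bits = 13.25f32.to_bits();       // raw IEEE 754 bit pattern as u32
    let sign = bits >> 31;               // bit 31: sign
    let exponent = (bits >> 23) & 0xFF;  // bits 30..=23: biased exponent
    let mantissa = bits & 0x7F_FFFF;     // bits 22..=0: fraction only
    println!("sign={sign} exponent={exponent} mantissa={mantissa:023b}");
}
```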
Understanding the Bias
Why Use a Bias?
- Problem: Exponents can be positive or negative (e.g., $2^3$ or $2^{-5}$), but hardware prefers to work with unsigned integers.
- Solution: Add a fixed bias to the actual exponent so it can be stored as an unsigned number.
Bias Formula:

$$\text{Bias} = 2^{k-1} - 1$$

where $k$ is the number of exponent bits.
- For 32-bit floats ($k = 8$): Bias = $2^7 - 1 = 127$
- For 64-bit floats ($k = 11$): Bias = $2^{10} - 1 = 1023$
How It Works:
- Storing: Stored Exponent = Actual Exponent + Bias
- Retrieving: Actual Exponent = Stored Exponent - Bias
Example:
- If the actual exponent is $3$ (in single precision), the stored exponent is $3 + 127 = 130$.
- If we read a stored exponent of $120$, then the actual exponent is $120 - 127 = -7$.
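In code, each direction is a one-liner; a small sketch assuming single precision (bias 127), with hypothetical helper names:

```rust
const BIAS: i32 = 127; // 2^(8-1) - 1 for single precision

// Hypothetical helpers for illustration.
fn stored_exponent(actual: i32) -> u8 {
    (actual + BIAS) as u8
}

fn actual_exponent(stored: u8) -> i32 {
    stored as i32 - BIAS
}

fn main() {
    assert_eq!(stored_exponent(3), 130);  // 3 + 127
    assert_eq!(actual_exponent(120), -7); // 120 - 127
}
```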
How Does the Bias Help with Comparisons?
Because each exponent value is stored as a non-negative (biased) integer, we can compare exponents by simply comparing their stored (unsigned) integer representations—provided the sign bit for the whole floating-point number is the same.
Key idea: If we keep the number format in [sign | exponent + bias | fraction], then for two positive floating-point numbers:
- If the stored exponent of $A$ is greater than the stored exponent of $B$, then $A > B$, regardless of the fraction (assuming both are normalized and not edge cases like NaN or infinity).
- If the stored exponent of $A$ equals the stored exponent of $B$, then comparing the fraction bits determines which is bigger.
Hence, hardware can do much simpler comparisons by treating the exponent field as an unsigned magnitude. We avoid messing around with a separate sign bit for the exponent, and the ordering of floats becomes more straightforward under typical cases.
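A small sketch of this property: for positive, finite, non-NaN values, the raw bit patterns order the same way the floats do.

```rust
fn main() {
    let (a, b) = (13.25f32, 2.5f32);
    assert!(a > b);
    // The unsigned bit patterns agree with the float ordering:
    assert!(a.to_bits() > b.to_bits());
}
```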
Why the Mantissa Always Starts with 1
- Normalization: In binary, to have only one nonzero digit to the left of the binary point, we adjust the number so it is always in the form $1.bbb\ldots_2 \times 2^{e}$.
- Storage Efficiency: Since the leading digit is always 1 (in normalized numbers), it does not need to be stored. This saves one bit and allows more bits to be used for the fraction.
- Exception: Subnormal Numbers When the exponent field is all zeros, the number is too small to be normalized. In this case, the leading digit is 0 instead of 1.
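For example, the smallest positive `f32` is a subnormal: its exponent field is all zeros and its mantissa is 1. A quick sketch using the standard `f32::from_bits` and `is_normal`:

```rust
fn main() {
    let tiny = f32::from_bits(1); // exponent field all zeros, mantissa = 1
    assert!(tiny > 0.0 && tiny < f32::MIN_POSITIVE); // below the smallest normal
    assert!(!tiny.is_normal()); // classified as subnormal
    println!("{tiny:e}"); // ≈ 1e-45, i.e. 2^(-149)
}
```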
Converting Between Numbers and IEEE 754 Representation
Let’s walk through two detailed examples.
Example A: Converting 13.25 to IEEE 754 Single Precision
A.1. Convert 13.25 to Binary
- Integer Part (13): $13_{10} = 1101_2$.
- Fractional Part (0.25): Use the “multiply-by-2” method:
  - $0.25 \times 2 = 0.5$ => Bit: 0 (because $0.5 < 1$)
  - $0.5 \times 2 = 1.0$ => Bit: 1 (since we reached 1, subtract 1 and continue; here, the remainder becomes 0)
  - So, $0.25_{10} = 0.01_2$.
- Combine: $13.25_{10} = 1101.01_2$.
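The multiply-by-2 method is easy to mechanize; here is a sketch (the function name and digit limit are just for illustration):

```rust
/// Convert a decimal fraction in [0, 1) to its binary digits,
/// stopping after `max_digits` to avoid looping on repeating fractions.
fn fraction_to_binary(mut frac: f64, max_digits: usize) -> String {
    let mut digits = String::new();
    for _ in 0..max_digits {
        if frac == 0.0 {
            break;
        }
        frac *= 2.0;
        if frac >= 1.0 {
            digits.push('1');
            frac -= 1.0; // reached 1: record the bit and keep the remainder
        } else {
            digits.push('0');
        }
    }
    digits
}

fn main() {
    assert_eq!(fraction_to_binary(0.25, 8), "01"); // 0.25₁₀ = 0.01₂
}
```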
A.2. Normalize the Binary Number
Normalize to the form $1.bbb\ldots_2 \times 2^{e}$:

$$1101.01_2 = 1.10101_2 \times 2^3$$

Normalized significand: $1.10101_2$. Actual exponent: $3$.
A.3. Determine the IEEE 754 Fields
- Sign Bit: $13.25$ is positive → Sign = 0.
- Exponent Field:
  - Actual exponent = $3$
  - Bias (for single precision) = $127$
  - Stored exponent = $3 + 127 = 130$, in binary (8 bits): $10000010_2$
- Mantissa (Fraction) Field:
  - From the normalized significand $1.10101_2$, the stored mantissa is the bits after the leading 1: $10101$
  - The mantissa field is 23 bits, so pad with zeros: $10101000000000000000000$
A.4. Final IEEE 754 Single Precision Representation
Combine the fields:
- Sign: $0$
- Exponent: $10000010$
- Mantissa: $10101000000000000000000$

Final 32-bit representation: `0 10000010 10101000000000000000000`
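We can confirm this bit pattern directly; a quick check using `f32::to_bits`:

```rust
fn main() {
    let bits = 13.25f32.to_bits();
    //                 sign  exponent  mantissa
    assert_eq!(bits, 0b0_10000010_10101000000000000000000);
    println!("{bits:032b}");
}
```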
A.5. Converting Back from Binary to Number
Extract the Fields:
- Sign: $0$ → positive
- Exponent field: $10000010_2 = 130$ => Actual exponent = $130 - 127 = 3$
- Mantissa: $10101$ (plus 18 zeros) → Normalized significand = $1.10101_2$

Compute the Value:

$$1.10101_2 \times 2^3 = 1101.01_2 = 13.25_{10}$$

Converting $1.10101_2$ to decimal gives $1.65625$, and multiplying by $2^3 = 8$ yields $13.25$.
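The reverse direction also checks out in code; a quick sketch using `f32::from_bits`:

```rust
fn main() {
    let value = f32::from_bits(0b0_10000010_10101000000000000000000);
    assert_eq!(value, 13.25); // the fields decode back to 13.25
}
```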
To convert the binary significand $1.10101_2$ to its decimal equivalent $1.65625$, we can use two different methods:
Method 1: Weighted Sum (Bit-by-Bit) Method
- Write the number as $1.10101_2$. Separate it into the integer part ($1$) and the fractional part ($0.10101_2$).
- The integer part is simply $1$.
- For the fractional part $0.10101_2$:
  - The first digit after the point ($1$) represents $1 \times 2^{-1} = 0.5$.
  - The second digit ($0$) represents $0 \times 2^{-2} = 0$.
  - The third digit ($1$) represents $1 \times 2^{-3} = 0.125$.
  - The fourth digit ($0$) represents $0 \times 2^{-4} = 0$.
  - The fifth digit ($1$) represents $1 \times 2^{-5} = 0.03125$.
- Add these contributions together: $1 + 0.5 + 0 + 0.125 + 0 + 0.03125 = 1.65625$.
Method 2: Arithmetic Division Method
- Write the number as $1.10101_2$, where the fractional part is $10101_2$.
- Convert the fractional part directly from binary to decimal as an integer: $10101_2 = 21$.
- Count the number of digits in the fractional part. Here, there are 5 digits, so the denominator will be $2^5 = 32$.
- Compute the fractional value by dividing the integer (21) by 32: $21 / 32 = 0.65625$.
- Add the integer part (which is 1) to this fraction: $1 + 0.65625 = 1.65625$.
Both methods show that $1.10101_2$ equals $1.65625$ in decimal.
Rust Code Example:

```rust
/// A structure that decomposes a 32-bit floating-point number (f32)
/// into its sign, exponent, and mantissa components.
#[derive(Debug, Clone, Copy)]
pub struct FloatRepr {
/// `true` if the number is negative.
sign: bool,
/// The raw 8-bit exponent (stored with a bias of 127).
exponent: u8,
/// The raw 23-bit mantissa (fractional part).
mantissa: u32,
}
impl FloatRepr {
/// Constructs a `FloatRepr` from an `f32` value by extracting its underlying bit fields.
///
/// # Details
/// - The sign is the most significant bit.
/// - The next 8 bits represent the exponent (with a bias of 127).
/// - The remaining 23 bits represent the mantissa.
pub fn from_f32(value: f32) -> Self {
let bits = value.to_bits();
let sign = (bits >> 31) != 0;
let exponent = ((bits >> 23) & 0xFF) as u8;
let mantissa = bits & 0x7FFFFF; // 23 bits
Self {
sign,
exponent,
mantissa,
}
}
/// Returns the unbiased exponent.
///
/// The exponent in IEEE 754 format is stored with a bias of 127.
/// Subtracting 127 yields the actual exponent.
pub fn unbiased_exponent(&self) -> i32 {
self.exponent as i32 - 127
}
/// Computes the fractional part of the mantissa using arithmetic division.
///
/// The mantissa is stored as a 23-bit integer representing the fractional part.
/// Dividing by 2²³ normalizes it to the range [0, 1).
///
/// # Note
/// For normalized numbers, the full significand is `1 + mantissa_fraction()`.
pub fn mantissa_fraction_arithmetic(&self) -> f32 {
// 1 << 23 equals 2^23 (a bitwise left shift)
self.mantissa as f32 / (1 << 23) as f32
}
/// Computes the fractional part of the mantissa using a weighted bit-by-bit summation.
///
/// This method iterates over each bit (0 to 22) of the 23-bit mantissa.
/// Each bit at position `i` represents a fraction with weight 2^(i - 23),
/// meaning that the least-significant bit (position 0) has a weight of 2^(-23)
/// and the most significant bit (position 22) has a weight of 2^(-1).
///
/// For each bit position, we test if the bit is set (i.e. equals 1).
/// The expression `(self.mantissa >> i) & 1 == 1` does that:
/// - `self.mantissa >> i` shifts the mantissa right by `i` bits, moving the i-th bit to the rightmost position.
/// - `& 1` masks all other bits, isolating just that least significant bit.
/// - We then compare the result with 1. If the result is 1, then the bit at position `i` was set.
///
/// The weighted contribution for that bit if it is set is computed as 2^(i - 23)
/// and added to the running total. In the end, the sum of all contributions represents
/// the decoded fractional part of the mantissa.
pub fn mantissa_fraction_weighted(&self) -> f32 {
let mut result = 0.0;
for i in 0..23 {
if (self.mantissa >> i) & 1 == 1 {
// If the bit is set, we add its fractional contribution
result += 2f32.powf(i as f32 - 23.0);
}
}
result
}
}
```
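A short usage sketch (assuming the `FloatRepr` definition above is in scope in the same module), decoding 13.25 back from its parts:

```rust
fn main() {
    let repr = FloatRepr::from_f32(13.25);
    assert_eq!(repr.unbiased_exponent(), 3); // stored 130 - bias 127

    // Reassemble the value: sign × (1 + fraction) × 2^exponent.
    let significand = 1.0 + repr.mantissa_fraction_arithmetic(); // 1.65625
    let sign = if repr.sign { -1.0 } else { 1.0 };
    assert_eq!(sign * significand * 2f32.powi(repr.unbiased_exponent()), 13.25);

    // The two decoding methods agree:
    assert_eq!(
        repr.mantissa_fraction_arithmetic(),
        repr.mantissa_fraction_weighted()
    );
}
```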
Methods:
The methods `mantissa_fraction_arithmetic` and `mantissa_fraction_weighted` implement two equivalent techniques for converting the mantissa back to its fractional decimal form, leveraging a bit of bitwise manipulation. They correspond to the two conversion methods described earlier in these notes.
- The `mantissa_fraction_arithmetic` method:
  - Conciseness: A single, clear arithmetic expression.
  - Efficiency: Likely optimized by the compiler and directly mapped to a hardware division.
  - Readability: It directly maps the mathematical operation, making it easier to grasp once we're familiar with the theory.
- The `mantissa_fraction_weighted` method:
  - Educational Value: It clearly shows the bit-level contributions to the overall fraction, which can help in understanding how binary fractions work.
  - Explicitness: We see the logic behind the summation of weighted bits.
  - Verbosity: It takes several lines and a loop to achieve what one arithmetic division does.
  - Performance: The loop and conditional checks can be less efficient compared to a single division.
- Bitwise visualization help:
```
Mantissa: 0b101 (5 in decimal); 3-bit example, so bit i has weight 2^(i-3)
Bit positions: [2][1][0]

i=0 (weight 2^-3 = 0.125)
┌────────────────────┐
│ Shift right 0: 101 │ → LSB=1 → add 0.125
└────────────────────┘

i=1 (weight 2^-2 = 0.25)
┌────────────────────┐
│ Shift right 1:  10 │ → LSB=0 → no addition
└────────────────────┘

i=2 (weight 2^-1 = 0.5)
┌────────────────────┐
│ Shift right 2:   1 │ → LSB=1 → add 0.5
└────────────────────┘

Total: 0.125 + 0.5 = 0.625 (equivalent to 5/8)
```
Some considerations:
- IEEE 754 Standard: Defines the format for 32‑bit (single precision) and 64‑bit (double precision) floats, as well as special values such as NaN (Not a Number), ±Infinity, and ±0.
- Subnormal (Denormal) Numbers: When the exponent field is all zeros, numbers are too small to be normalized. In these cases, the leading bit is assumed to be 0, which allows representation of values very close to zero.
- Precision Limitations: Not all decimal numbers can be exactly represented in binary (e.g., 0.1 is repeating in binary). This leads to rounding errors in arithmetic operations.
- Range vs. Precision:
- More exponent bits → Wider range of representable numbers.
- More mantissa bits → Higher precision (more exact fractional representation).
- Common Pitfalls:
- Avoid direct equality checks with floats: instead of `if a == b`, use a tolerance check like `abs(a - b) < 1e-6` (see the sketch after this list).
- Be aware of rounding errors when converting between decimal and binary.
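For instance, a classic demonstration of the equality pitfall (the tolerance threshold here is arbitrary):

```rust
fn main() {
    let a = 0.1f64 + 0.2; // 0.1 and 0.2 are repeating fractions in binary
    let b = 0.3f64;
    assert_ne!(a, b);              // exact comparison fails: a is 0.30000000000000004
    assert!((a - b).abs() < 1e-9); // a tolerance check succeeds
}
```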