IEEE 754 Formats, Special Values and Rounding Modes

Revision as of 14:39, 20 October 2025 by Bfh-sts

This page covers the standard IEEE 754 formats for floating-point numbers, including their bit structure, special values, and rounding behaviors.

IEEE 754 Formats

IEEE 754 defines several binary floating-point formats. The three most common are:

Format              | Bits | Sign | Exponent | Mantissa | Bias
-------             | ---- | ---- | -------- | -------- | ----
Single precision    | 32   | 1    | 8        | 23       | 127
Double precision    | 64   | 1    | 11       | 52       | 1023
Quadruple precision | 128  | 1    | 15       | 112      | 16383

For normalized numbers the leading 1 of the significand is not stored (the *hidden bit*), so the effective precision is one bit more than the stored mantissa width: 24 bits for single, 53 for double, and 113 for quadruple precision.
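
The field layout can be checked by pulling a value apart bit by bit. The following is a minimal sketch, assuming Python's standard `struct` module for the byte-level conversion; the names `FORMATS`, `decompose` and `normal_value` are hypothetical helpers introduced only for this illustration, and quadruple precision is left out because `struct` has no format code for it.

```python
import math
import struct

# (struct code, exponent bits, mantissa bits, bias), taken from the format table.
# Quadruple precision is omitted: the struct module has no code for it.
FORMATS = {
    "single": (">f", 8, 23, 127),
    "double": (">d", 11, 52, 1023),
}


def decompose(x, fmt="double"):
    """Split x into its (sign, biased exponent, stored mantissa) fields."""
    code, exp_bits, man_bits, _bias = FORMATS[fmt]
    bits = int.from_bytes(struct.pack(code, x), "big")
    mantissa = bits & ((1 << man_bits) - 1)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (exp_bits + man_bits)
    return sign, exponent, mantissa


def normal_value(sign, exponent, mantissa, fmt="double"):
    """Rebuild a normalized value; the added 1 is the hidden bit."""
    _code, _exp_bits, man_bits, bias = FORMATS[fmt]
    significand = 1 + mantissa / (1 << man_bits)
    magnitude = math.ldexp(significand, exponent - bias)
    return -magnitude if sign else magnitude


print(decompose(0.15625, "single"))
print(normal_value(0, 124, 0x200000, "single"))
```

Here 0.15625 is 1.25 × 2^(−3), so the single-precision exponent field holds −3 + 127 = 124 and the stored mantissa encodes only the fractional part 0.25.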

Special Values

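Certain bit patterns are reserved for special cases. The exponent fields below are shown for single precision (8 bits); the other formats use the same scheme with their all-zeros and all-ones exponents: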

Exponent          | Mantissa | Value             | Description
------------------|----------|-------------------|---------------------------
00000000          | 000...0  | ±0                | Zero (positive or negative)
00000000          | ≠0       | ±0.m × 2^(1−bias) | Denormalized small number
00000001–11111110 | any      | ±1.m × 2^(e−bias) | Normalized number
11111111          | 000...0  | ±∞                | Infinity
11111111          | ≠0       | NaN               | Not a Number

Special values are important in computation, as they define behavior for overflows, underflows, and invalid operations.
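
The rows of the table translate directly into a small classifier. This is a sketch for single precision only; `classify_single` and `bits_to_float` are hypothetical names chosen for the example, not part of any library.

```python
import struct


def classify_single(bits):
    """Classify a 32-bit pattern using the exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0x00:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "nan"
    return "normalized"


def bits_to_float(bits):
    """Reinterpret a 32-bit integer as a single-precision value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]


for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:#010x}", classify_single(pattern), bits_to_float(pattern))
```

The sign bit is ignored by the classifier, which is why `0x80000000` (negative zero) lands in the same row as `0x00000000`.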

Comments on Special Values

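  • **Zero (±0):** The sign records the direction from which a result underflowed. +0 and −0 compare equal, but they behave differently in some operations (e.g., 1 ÷ +0 = +∞, 1 ÷ −0 = −∞).
  • **Infinity (±∞):** Produced when a result exceeds the representable range (overflow) or when a non-zero value is divided by zero.
  • **NaN (Not a Number):** Represents undefined or invalid results (e.g., 0/0). A NaN compares unequal to every value, including itself.
  • **Denormalized values:** Fill the gap between zero and the smallest normalized number (gradual underflow), but carry fewer significant bits and can slow computation on some hardware.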

Example behaviors:

  • +∞ + 3 = +∞
  • 5 ÷ +0 = +∞ (and 5 ÷ −0 = −∞)
  • 0 ÷ 0 = NaN
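
Signed zero and denormalized values can be observed from Python, whose `float` is a double on common platforms. The sketch below assumes the default rounding mode and no flush-to-zero setting; all variable names are illustrative.

```python
import math
import sys

# Signed zero: the two zeros compare equal, but the sign bit is observable.
zeros_equal = (0.0 == -0.0)
sign_of_neg_zero = math.copysign(1.0, -0.0)

# Smallest positive denormalized double: 2**-1022 * 2**-52 = 2**-1074.
smallest_denormal = sys.float_info.min * sys.float_info.epsilon

# A negative result too small even for a denormalized number becomes -0.
underflowed = -smallest_denormal / 4

# Reduced precision: 3 units halved is 1.5 units, which cannot be stored,
# so halving and doubling does not return the starting value.
x = 3 * smallest_denormal
round_trip_denormal = (x / 2) * 2
round_trip_normal = (3.0 / 2) * 2

print(zeros_equal, sign_of_neg_zero)
print(underflowed, math.copysign(1.0, underflowed))
print(round_trip_denormal == x, round_trip_normal == 3.0)
```

The last line contrasts the two regimes: the same halve-then-double sequence is exact for the normalized value 3.0 but loses information near the bottom of the denormalized range.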

IEEE 754 Operations with Special Values

Operation       | Result
----------------|--------
finite x ÷ ±∞   | ±0
±∞ × ±∞         | ±∞
±non-zero ÷ ±0  | ±∞
±0 ÷ ±0         | NaN
∞ + ∞           | ∞
∞ − ∞           | NaN
±∞ ÷ ±∞         | NaN
±∞ × ±0         | NaN
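
Most rows of this table can be reproduced with Python floats. One caveat, which is a property of Python rather than of IEEE 754: float division by zero raises `ZeroDivisionError` instead of returning an infinity or NaN, so those rows are only shown through the exception. The `results` dictionary is just a convenience for this sketch.

```python
import math

inf = math.inf

results = {
    "1.0 / inf": 1.0 / inf,
    "-inf * inf": -inf * inf,
    "inf + inf": inf + inf,
    "inf - inf": inf - inf,
    "inf / inf": inf / inf,
    "inf * 0.0": inf * 0.0,
}
for expression, value in results.items():
    print(f"{expression:>10} = {value}")

# NaN compares unequal to everything, itself included; use math.isnan to test.
nan = results["inf - inf"]
print(nan == nan, math.isnan(nan))

# Python-specific behavior: division by zero raises instead of returning +inf.
try:
    1.0 / 0.0
except ZeroDivisionError as error:
    print("ZeroDivisionError:", error)
```

Languages that expose the hardware behavior directly (C, for example) would return ±∞ or NaN for the division-by-zero rows instead of raising.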

Rounding Modes

Because most results cannot be represented exactly, IEEE 754 defines five rounding modes:

Mode                                  | Description
--------------------------------------| ------------
Round to nearest, ties to even        | Round to the nearest representable value; if exactly halfway, choose the one with an even least significant bit (default).
Round to nearest, ties away from zero | Round to the nearest representable value; if exactly halfway, choose the one with the larger magnitude.
Round toward zero                     | Choose the nearest representable value that is not larger in magnitude (truncation).
Round toward +∞                       | Choose the nearest representable value that is not less than the exact result.
Round toward −∞                       | Choose the nearest representable value that is not greater than the exact result.
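
The default mode can be seen at work when a double is narrowed to single precision. Between 2^24 and 2^25 single precision can only represent even integers, so every odd integer in that range is an exact tie. The sketch assumes that the `struct` module's conversion to the `f` format uses the platform's default rounding, which is round to nearest, ties to even, on common systems; `to_single` is a hypothetical helper name.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# 16777217 = 2**24 + 1 lies halfway between 16777216 and 16777218.
# 16777219 = 2**24 + 3 lies halfway between 16777218 and 16777220.
for value in (16777217.0, 16777219.0):
    print(value, "->", to_single(value))
```

The first tie resolves downward and the second upward, because in each case the neighbor whose stored mantissa ends in a 0 bit is chosen.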

Rounding Examples

Rounding mode                   | +11.5 | +12.5 | −11.5 | −12.5
--------------------------------| ----- | ----- | ----- | -----
To nearest, ties to even        | +12.0 | +12.0 | −12.0 | −12.0
To nearest, ties away from zero | +12.0 | +13.0 | −12.0 | −13.0
Toward zero                     | +11.0 | +12.0 | −11.0 | −12.0
Toward +∞                       | +12.0 | +13.0 | −11.0 | −12.0
Toward −∞                       | +11.0 | +12.0 | −12.0 | −13.0
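
The table can be reproduced with Python's `decimal` module, whose rounding constants correspond to the five modes. Note the assumption: `decimal` performs decimal arithmetic in software, so this illustrates the rounding rules themselves rather than the binary hardware modes. The mapping in `MODES` and the helper `round_to_integer` are written for this example.

```python
from decimal import (
    Decimal,
    ROUND_CEILING,
    ROUND_DOWN,
    ROUND_FLOOR,
    ROUND_HALF_EVEN,
    ROUND_HALF_UP,
)

MODES = {
    "To nearest, ties to even": ROUND_HALF_EVEN,
    "To nearest, ties away from zero": ROUND_HALF_UP,
    "Toward zero": ROUND_DOWN,
    "Toward +inf": ROUND_CEILING,
    "Toward -inf": ROUND_FLOOR,
}
INPUTS = ["11.5", "12.5", "-11.5", "-12.5"]


def round_to_integer(text, mode):
    """Round a decimal string to an integer under the given rounding mode."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=mode))


for name, mode in MODES.items():
    row = [round_to_integer(value, mode) for value in INPUTS]
    print(f"{name:32}", row)
```

Despite its name, `ROUND_HALF_UP` rounds ties away from zero, which is why it maps to the second row rather than to rounding toward +∞.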

Significance of Rounding

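  • Every operation rounds its result, and these small errors can accumulate over repeated operations.
  • Different CPUs or compilers can produce slightly different results, for example because of a different active rounding mode, extended-precision intermediate values, or fused multiply-add instructions.
  • Non-associativity: because each intermediate result is rounded, (a + b) + c may not equal a + (b + c).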
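
Both effects show up with ordinary decimal fractions, none of which is exactly representable in binary. The following sketch uses Python doubles; the accumulation is written as an explicit loop because the built-in `sum` may apply compensated summation in newer Python versions, and `math.fsum` is shown as one standard-library way to reduce the accumulated error.

```python
import math

# Non-associativity: the grouping decides which intermediate sum gets rounded.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right, left == right)

# Accumulation: adding 0.1 ten times rounds after every single step.
total = 0.0
for _ in range(10):
    total += 0.1
print(total, total == 1.0)

# math.fsum keeps track of the lost low-order bits and rounds only once.
exact_sum = math.fsum([0.1] * 10)
print(exact_sum, exact_sum == 1.0)
```

The two groupings differ only in the last bit, which is typical: the discrepancy is small per operation but is not guaranteed to stay small over long computations.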