Text & Data Types: Interpreting Bits

This page explains what a data type is, how the same bit pattern can mean very different things, and why consistent interpretation is critical for text processing and security.

Why data types matter

Inside a computer, everything is stored as 0s and 1s. The meaning comes from the data type: how we interpret the bit pattern and which operations we allow on it.

Typical questions:

  • 01000010011010010110010101101100: is this an integer, a date, a float, or the ASCII text Biel?
  • 00110100 00110010: is this the two ASCII characters '4' and '2', or a 16-bit integer (13362 if read big-endian)? Note that the integer 42 itself would be stored as 00101010.
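
As a minimal sketch of the two questions above, the following Python snippet reads the same bits in several ways. The variable names are illustrative only, and big-endian byte order is an assumption made for the numeric readings.

```python
import struct

# First bullet: one 32-bit pattern, several possible readings.
bits = "01000010011010010110010101101100"
raw = int(bits, 2).to_bytes(4, byteorder="big")

as_text = raw.decode("ascii")                      # "Biel"
as_uint_be = int.from_bytes(raw, byteorder="big")  # 0x4269656C
(as_float_be,) = struct.unpack(">f", raw)          # a float32 between 58 and 59

# Second bullet: two bytes, read as text or as one 16-bit integer.
pair = bytes([0b00110100, 0b00110010])
pair_as_text = pair.decode("ascii")                # "42"
pair_as_uint16 = int.from_bytes(pair, "big")       # 13362

# The integer 42 itself needs only one byte with a different pattern.
forty_two_bits = format(42, "08b")                 # "00101010"

print(as_text, hex(as_uint_be), as_float_be)
print(pair_as_text, pair_as_uint16, forty_two_bits)
```

Nothing in the bytes themselves says which reading is intended; that information has to come from the agreed data type.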

Three perspectives on data types

A data type can be viewed along three complementary axes:

  • Interpretation and encoding: how the bit pattern maps to a value (e.g. UTF-8 character, IEEE-754 float, two’s-complement integer).
  • Allowed set of values: which bit patterns are valid (e.g. weekday has only seven valid values; some encodings reserve ranges).
  • Permitted operations: which operations are meaningful (dividing a string by 3 is not; concatenation is).
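
The three axes can be illustrated with a small sketch. The `Weekday` enum, its 1 to 7 numbering, and the helper names are assumptions chosen for this example, not part of any standard.

```python
from enum import Enum


class Weekday(Enum):
    # Allowed set of values: exactly seven members (numbering is an assumption).
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7


def weekday_from_int(stored: int) -> Weekday:
    """Interpretation: map a stored integer to a Weekday.

    Raises ValueError for anything outside 1..7.
    """
    return Weekday(stored)


def operation_allowed(fn) -> bool:
    """Return True if calling fn() does not raise a TypeError."""
    try:
        fn()
        return True
    except TypeError:
        return False


# Permitted operations: strings support concatenation but not division.
concat_ok = operation_allowed(lambda: "abc" + "def")  # True
divide_ok = operation_allowed(lambda: "abc" / 3)      # False

print(weekday_from_int(7), concat_ok, divide_ok)
```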

Same bits, different meanings

The same bytes can decode differently depending on type and encoding.

Bytes (hex)    Interpreted as                       Meaning
42 00 00 00    32-bit little-endian unsigned int    66
00 00 00 42    32-bit big-endian unsigned int       66
34 32          ASCII/UTF-8 text                     "42"
42 6F 6F 6B    ASCII/UTF-8 text                     "Book"
42 6F 6F 6B    IEEE-754 float32 (big-endian)        approximately 59.86

Key takeaway: writer and reader must agree on the type, the byte order, and (for text) the character encoding.
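
The rows of the table can be checked with a short Python sketch. The float32 row is read as big-endian here, which is an assumption; the table does not state a byte order for it.

```python
import struct

# Integer rows: same value, different byte order on the wire.
le_value = int.from_bytes(bytes.fromhex("42000000"), "little")  # 66
be_value = int.from_bytes(bytes.fromhex("00000042"), "big")     # 66

# Text rows.
digits = bytes.fromhex("3432").decode("ascii")      # "42"
word = bytes.fromhex("426F6F6B").decode("utf-8")    # "Book"

# The bytes that spell "Book", read as a big-endian float32 instead.
(book_as_float,) = struct.unpack(">f", bytes.fromhex("426F6F6B"))

print(le_value, be_value, digits, word, round(book_as_float, 2))
```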

Numbers vs text: canonical encodings

Numbers

  • Integers: fixed width (8/16/32/64 bits); signed values are usually stored in two’s-complement. Endianness matters when they are serialized.
  • Floating point: IEEE-754 (float32, float64), made up of sign, exponent, and significand. Many decimal fractions, such as 0.1, have no exact binary representation.
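
A brief sketch of both points, using only the Python standard library. Python's own `float` is a float64 on common platforms; `struct` is used here to look at the float32 form.

```python
import struct

# Two's-complement: the byte 0xFF is -1 when signed and 255 when unsigned.
minus_one = struct.pack("b", -1)                              # b"\xff"
as_signed = int.from_bytes(b"\xff", "big", signed=True)       # -1
as_unsigned = int.from_bytes(b"\xff", "big", signed=False)    # 255

# Floating point: 0.1 and 0.2 are stored as nearby binary fractions,
# so their sum is not exactly the stored value of 0.3.
sum_is_exact = (0.1 + 0.2 == 0.3)                             # False

# float32 encoding of 0.1: sign, exponent, and significand packed in 4 bytes.
tenth_f32 = struct.pack(">f", 0.1).hex()                      # "3dcccccd"

print(minus_one, as_signed, as_unsigned, sum_is_exact, tenth_f32)
```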

Text

  • Sequence of code points from a coded character set (e.g. Unicode).
  • Serialized using a character encoding (e.g. UTF-8, UTF-16LE/BE, UTF-32).
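
To make the distinction between code points and encodings concrete, here is a small sketch. The sample word is chosen for illustration, and `\u00fc` is written explicitly so the example does not depend on how the source file itself is encoded.

```python
text = "Z\u00fcrich"  # "Zürich" with ü as the single code point U+00FC

# The coded character set assigns one code point per character.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Different character encodings serialize the same code points differently.
utf8 = text.encode("utf-8")         # ü becomes the two bytes C3 BC
utf16le = text.encode("utf-16-le")  # two bytes per character here
utf32be = text.encode("utf-32-be")  # four bytes per character

print(code_points)
print(utf8.hex(" "), len(utf8))
print(len(utf16le), len(utf32be))
```

All three byte sequences decode back to the same six code points, provided the reader uses the encoding the writer chose.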

Dates/times

  • Commonly epoch-based counts (e.g. Unix time = seconds since 1970-01-01 00:00:00 UTC).
  • Human-readable forms are text, not numbers.
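
The following sketch shows the difference between the numeric and the textual form of a timestamp. The sample value 1700000000 is arbitrary and chosen only for illustration.

```python
from datetime import datetime, timezone

# Numeric form: a count of seconds since the Unix epoch.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
ts = 1_700_000_000
moment = datetime.fromtimestamp(ts, tz=timezone.utc)

# Human-readable form: text that must be parsed before doing arithmetic.
as_text = moment.isoformat()                       # "2023-11-14T22:13:20+00:00"
round_trip = int(datetime.fromisoformat(as_text).timestamp())

print(epoch.isoformat(), as_text, round_trip == ts)
```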

Data type errors and their impact

When producer and consumer disagree on the type or encoding, errors occur:

  • Misrendered text (mojibake), e.g. Zürich becomes ZÃ¼rich when UTF-8 bytes are decoded as ISO-8859-1.
  • Wrong arithmetic or comparisons when text "10" is compared lexicographically instead of numerically.
  • Corrupted data if endianness is mismatched while reading binary integers.
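
Each of the three failure modes can be reproduced in a few lines. This is only a sketch with made-up sample values.

```python
# 1. Mojibake: UTF-8 bytes decoded with the wrong character encoding.
original = "Z\u00fcrich"
garbled = original.encode("utf-8").decode("iso-8859-1")   # "ZÃ¼rich"

# 2. Text compared as text: "10" sorts before "9" character by character.
text_says_smaller = "10" < "9"            # True
numbers_say_smaller = int("10") < int("9")  # False

# 3. Endianness mismatch: written little-endian, read big-endian.
written = (1).to_bytes(4, "little")       # 01 00 00 00
misread = int.from_bytes(written, "big")  # 16777216 instead of 1

print(garbled, text_says_smaller, numbers_say_smaller, misread)
```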

Security-relevant type confusions

  • Injection attacks occur when a program confuses data with instructions.
  • Examples: SQL injection, shell injection, cross-site scripting.
  • Technically both instruction and user input are strings, but semantically they are different types (code vs data). Inputs must be encoded/escaped or, better, passed as typed parameters (prepared statements).
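
As a hedged illustration of the code-versus-data distinction, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table, its contents, and the malicious input are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a1"), ("bob", "b2")],
)

user_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the instruction string, so part of it
# is parsed as SQL code and the WHERE clause becomes always true.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the input is passed as a typed parameter and stays pure data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
conn.close()
```

With the parameterized form, the database never parses the input as part of the statement, so no user matches the odd-looking name.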

Text as data: from characters to bytes

Working with text involves at least two layers:

  • Abstract characters (graphemes/characters): human symbols like A, ä, あ, 🙂
  • Encodings: concrete byte sequences representing those characters for storage/transmission.

Minimum checklist for safe text handling:

  • Know the source encoding; standardize to UTF-8 where possible.
  • Validate and normalize text if your domain requires it.
  • Be explicit about end-of-line conventions when exchanging files (LF on Unix, CRLF on Windows; many tools accept both).
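
One possible way to apply the checklist is sketched below. The function name `to_clean_text` is hypothetical, and the choice of NFC normalization and LF line endings is an assumption; other domains may need different targets.

```python
import unicodedata


def to_clean_text(raw: bytes, source_encoding: str) -> str:
    """Decode with a known encoding, normalize to NFC, and use LF line ends."""
    text = raw.decode(source_encoding)          # know the source encoding
    text = unicodedata.normalize("NFC", text)   # one canonical form per character
    return text.replace("\r\n", "\n")           # CRLF -> LF


# "ä" can arrive as one code point or as "a" plus a combining diaeresis.
composed = "\u00e4"
decomposed = "a\u0308"
differ_before = composed != decomposed          # True: different code points

incoming = "a\u0308\r\nb".encode("utf-16-le")
cleaned = to_clean_text(incoming, "utf-16-le")  # "ä\nb" in NFC
as_utf8 = cleaned.encode("utf-8")               # standardized for storage

print(differ_before, repr(cleaned), as_utf8)
```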

Practical exercises

1) Identify the type. You receive the bytes 43 41 46 45 C0 DE. Decide whether this is text or binary. Hint: 43 41 46 45 decodes as "CAFE" in ASCII/UTF-8, but C0 can never appear in valid UTF-8 and DE is a lead byte with no continuation byte after it, so the full sequence is likely binary structured data rather than plain UTF-8 text.
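
One way to approach exercise 1 is to let a strict UTF-8 decoder decide. This is a sketch, not the only possible check; the variable names are illustrative.

```python
data = bytes.fromhex("43414645C0DE")

bad_offset = None
try:
    data.decode("utf-8")        # strict by default
    is_valid_utf8 = True
except UnicodeDecodeError as exc:
    is_valid_utf8 = False
    bad_offset = exc.start      # index of the first offending byte

ascii_prefix = data[:4].decode("ascii")

print(is_valid_utf8, bad_offset, ascii_prefix)
```

The decoder accepts the first four bytes and stops at the byte C0, which supports reading the sequence as binary data with a text-like prefix.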

2) Round-trip safety. Serialize the 32-bit integer 0x1A2B3C4D to bytes and back using big-endian and little-endian byte order. Verify that the numeric value survives only when writer and reader use the same byte order.
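
A possible starting point for exercise 2, using the `struct` module with explicit byte-order prefixes so the result does not depend on the machine running it:

```python
import struct

value = 0x1A2B3C4D

big = struct.pack(">I", value)      # 1a 2b 3c 4d
little = struct.pack("<I", value)   # 4d 3c 2b 1a

# Matching byte order on both sides: the value survives.
(same_big,) = struct.unpack(">I", big)
(same_little,) = struct.unpack("<I", little)

# Mismatched byte order: the bytes are read in reverse.
(swapped,) = struct.unpack("<I", big)

print(big.hex(" "), little.hex(" "))
print(hex(same_big), hex(same_little), hex(swapped))
```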

3) Encoding identification. Given a file that displays as ZÃ¼rich in your editor, determine the actual bytes and encoding. Then convert to UTF-8 so it displays Zürich. Tools to try: iconv, recode, or your editor’s encoding switch.
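
For exercise 3, a sketch of the repair in Python, assuming the garbled form is ZÃ¼rich and that it arose from exactly one round of UTF-8 bytes being decoded as ISO-8859-1. The helper name `repair_mojibake` is hypothetical. If the file's bytes are already correct UTF-8 and only the editor misreads them, switching the editor's encoding is enough and no conversion is needed.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as ISO-8859-1'."""
    return text.encode("iso-8859-1").decode("utf-8")


garbled = "Z\u00c3\u00bcrich"             # displays as ZÃ¼rich
repaired = repair_mojibake(garbled)       # "Zürich"

# Inspect the actual bytes a correct UTF-8 file would contain.
correct_bytes = repaired.encode("utf-8").hex(" ")

print(repaired, correct_bytes)
```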

Summary

  • Bits are meaningless without an agreed data type, byte order, and (for text) character encoding.
  • The same bytes can represent integers, floats, dates, or text depending on interpretation.
  • Treat user input as data, not code; use typed parameters and proper escaping to prevent injection.
  • For text, prefer Unicode and UTF-8 end-to-end to avoid ambiguity.