Text & Data Types: Interpreting Bits
This page explains what a data type is, how the same bit pattern can mean very different things, and why consistent interpretation is critical for text processing and security.
Why data types matter
Inside a computer, everything is stored as 0s and 1s. The meaning comes from the data type: how we interpret the bit pattern and which operations we allow on it.
Typical questions:
- 01000010011010010110010101101100: is this an integer, a date, a float, or the ASCII text Biel?
- 00110100 00110010: is this the two ASCII characters '4' and '2' (the text "42"), or the 16-bit big-endian integer 13362? The integer 42 itself would be stored as 00101010.
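Both questions can be tried out directly. The following is a minimal sketch using only Python's standard library; the variable names are illustrative, and big-endian byte order is assumed wherever the bits are read as a number.

```python
import struct

# The 32-bit pattern from the first question, packed into 4 raw bytes.
bits = "01000010011010010110010101101100"
raw = int(bits, 2).to_bytes(4, byteorder="big")

as_text = raw.decode("ascii")                # 'Biel'
as_uint_be = int.from_bytes(raw, "big")      # 1114203500
as_uint_le = int.from_bytes(raw, "little")   # a different number: byte order matters
as_float_be = struct.unpack(">f", raw)[0]    # a float between 58 and 59

# The 16-bit pattern from the second question.
raw2 = bytes([0b00110100, 0b00110010])
digits = raw2.decode("ascii")                # '42' (two characters)
parsed = int(digits)                         # 42, but only after parsing the text
as_uint16 = int.from_bytes(raw2, "big")      # 13362 when the bytes are read as an integer
```

None of these readings is wrong in itself; each is correct under a different assumed data type.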
Three perspectives on data types
A data type can be viewed along three complementary axes:
- Interpretation and encoding: how the bit pattern maps to a value (e.g. UTF-8 character, IEEE-754 float, two’s-complement integer).
- Allowed set of values: which bit patterns are valid (e.g. weekday has only seven valid values; some encodings reserve ranges).
- Permitted operations: which operations are meaningful (dividing a string by 3 is not; concatenation is).
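The second and third axes can be illustrated with a short sketch. `Weekday`, `to_weekday`, and `can_divide` are hypothetical names introduced only for this example; the numbering 1 to 7 is an assumption, not a standard.

```python
from enum import Enum
from typing import Optional


class Weekday(Enum):
    """A type whose allowed set of values has exactly seven members."""
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7


def to_weekday(n: int) -> Optional[Weekday]:
    """Return the Weekday for n, or None if n is outside the allowed set."""
    try:
        return Weekday(n)
    except ValueError:
        return None


def can_divide(value) -> bool:
    """Report whether division by 3 is a permitted operation for value."""
    try:
        value / 3
        return True
    except TypeError:
        return False


joined = "ab" + "cd"   # concatenation is meaningful for strings
```

Here `to_weekday(8)` yields `None` because 8 is not a valid value, and `can_divide("abc")` is `False` because division is not defined for strings.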
Same bits, different meanings
The same bytes can decode differently depending on type and encoding.
| Bytes (hex) | Interpreted as | Meaning |
|---|---|---|
| 42 00 00 00 | 32-bit little-endian unsigned int | 66 |
| 00 00 00 42 | 32-bit big-endian unsigned int | 66 |
| 34 32 | ASCII/UTF-8 text | "42" |
| 42 6F 6F 6B | ASCII/UTF-8 text | "Book" |
| 42 6F 6F 6B | IEEE-754 float32 (big-endian) | ≈ 59.86 |
Key takeaway: writer and reader must agree on the type, the byte order, and (for text) the character encoding.
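The rows of the table can be reproduced with Python's `struct` module. This is a sketch, not a general parser; the format codes `<I`, `>I`, and `>f` mean little-endian unsigned 32-bit, big-endian unsigned 32-bit, and big-endian float32.

```python
import struct

little = bytes.fromhex("42000000")
big = bytes.fromhex("00000042")
book = bytes.fromhex("426F6F6B")

value_le = struct.unpack("<I", little)[0]        # 66
value_be = struct.unpack(">I", big)[0]           # 66
wrong_order = struct.unpack(">I", little)[0]     # 1107296256: same bytes, wrong byte order

text_42 = bytes.fromhex("3432").decode("ascii")  # '42'
text_book = book.decode("utf-8")                 # 'Book'
float_book = struct.unpack(">f", book)[0]        # roughly 59.86
```

The `wrong_order` line shows what happens when the reader assumes a different byte order than the writer: no error is raised, the value is silently different.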
Numbers vs text: canonical encodings
Numbers
- Integers: often two’s-complement, fixed width (8/16/32/64 bits), endianness matters when serialized.
- Floating point: IEEE-754 (float32, float64), has sign, exponent, significand; not all decimal values are exact.
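A brief sketch of both number families, using only the standard library. It shows that signedness is part of the interpretation and that a decimal literal such as 0.1 is stored as the nearest binary float.

```python
import math
from decimal import Decimal

# Two's complement: the same byte is 255 when unsigned but -1 when signed.
byte = b"\xff"
unsigned_value = int.from_bytes(byte, "big", signed=False)   # 255
signed_value = int.from_bytes(byte, "big", signed=True)      # -1

# Fixed width: -1 as a 32-bit two's-complement value is all one-bits.
minus_one_32 = (-1).to_bytes(4, "big", signed=True)          # b'\xff\xff\xff\xff'

# IEEE-754 binary floats cannot represent 0.1 exactly.
exact_sum_matches = (0.1 + 0.2 == 0.3)                       # False
close_enough = math.isclose(0.1 + 0.2, 0.3)                  # True
stored_tenth = Decimal(0.1)                                  # exact value actually stored for 0.1
```

Comparing floats with a tolerance, as `math.isclose` does, is the usual way to cope with the inexact representation.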
Text
- Sequence of code points from a coded character set (e.g. Unicode).
- Serialized using a character encoding (e.g. UTF-8, UTF-16LE/BE, UTF-32).
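To see how one code point turns into different byte sequences, the sketch below encodes four sample characters in several Unicode encodings and records the byte counts. The helper `encoded_sizes` is a hypothetical name used only here.

```python
samples = ["A", "ä", "あ", "🙂"]
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le"]


def encoded_sizes(ch):
    """Map each encoding name to the number of bytes needed for ch."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


table = {ch: encoded_sizes(ch) for ch in samples}

for ch in samples:
    # Print the code point and the byte count per encoding.
    print(f"U+{ord(ch):04X}", table[ch])

# Byte order is visible in UTF-16: the same code point, two byte sequences.
a_umlaut_le = "ä".encode("utf-16-le")   # b'\xe4\x00'
a_umlaut_be = "ä".encode("utf-16-be")   # b'\x00\xe4'
```

UTF-8 uses one to four bytes per code point, UTF-16 two or four, and UTF-32 always four, which is why a byte count alone says little about the number of characters.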
Dates/times
- Commonly epoch-based counts (e.g. Unix time = seconds since 1970-01-01 00:00:00 UTC).
- Human-readable forms are text, not numbers.
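The difference between the numeric and the textual form of a timestamp can be sketched as follows, assuming Unix time in whole seconds and UTC throughout.

```python
from datetime import datetime, timezone

# Numeric form -> human-readable text.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
epoch_text = epoch.isoformat()            # '1970-01-01T00:00:00+00:00'

# Calendar date -> numeric form.
y2k = datetime(2000, 1, 1, tzinfo=timezone.utc)
y2k_seconds = int(y2k.timestamp())        # 946684800

# Text must be parsed before it can be used in date arithmetic.
parsed = datetime.fromisoformat("2000-01-01T00:00:00+00:00")
seconds_between = (parsed - epoch).total_seconds()
```

`epoch_text` is a string and supports string operations only; `y2k_seconds` is a number and supports arithmetic.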
Data type errors and their impact
When producer and consumer disagree on the type or encoding, errors occur:
- Misrendered text (mojibake), e.g. Zürich becomes ZÃ¼rich when UTF-8 bytes are decoded as ISO-8859-1.
- Wrong arithmetic or comparisons when text "10" is compared lexicographically instead of numerically.
- Corrupted data if endianness is mismatched while reading binary integers.
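Each of the three failure modes can be reproduced in a few lines. This is an illustrative sketch; the values are chosen only to make the effect visible.

```python
import struct

# 1) Mojibake: UTF-8 bytes decoded with the wrong character encoding.
garbled = "Zürich".encode("utf-8").decode("iso-8859-1")   # 'ZÃ¼rich'

# 2) Text compared as text: "10" sorts before "9" character by character.
text_order = "10" < "9"                  # True
numeric_order = int("10") < int("9")     # False

# 3) Endianness mismatch: written little-endian, read big-endian.
written = struct.pack("<I", 1)           # bytes 01 00 00 00
misread = struct.unpack(">I", written)[0]   # 16777216
```

None of these cases raises an exception, which is what makes such errors hard to notice.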
Security-relevant type confusions
- Injection attacks occur when a program confuses data with instructions.
- Examples: SQL injection, shell injection, cross-site scripting.
- Technically, both the instructions and the user input are strings, but semantically they are different types (code vs data). Inputs must be encoded or escaped or, better, passed as typed parameters (prepared statements).
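The difference between splicing input into code and passing it as a typed parameter can be sketched with Python's built-in `sqlite3` module. The table `users`, its contents, and the attack string are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a1"), ("bob", "b2")],
)

user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL code itself.
unsafe_sql = "SELECT name FROM users WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()      # matches every row

# Safe: the input is passed as a parameter and stays data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()                                           # no user has that name

conn.close()
```

In the safe variant the database never parses the input as SQL, so the quote characters have no special meaning.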
Text as data: from characters to bytes
Working with text involves at least two layers:
- Abstract characters: human-readable symbols such as A, ä, あ, 🙂, independent of how they are stored.
- Encodings: concrete byte sequences representing those characters for storage/transmission.
Minimum checklist for safe text handling:
- Know the source encoding; standardize to UTF-8 where possible.
- Validate and normalize text if your domain requires it.
- Be explicit about end-of-line conventions when exchanging files (LF on Unix, CRLF on Windows; many tools accept both).
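The checklist could be turned into a small helper along the following lines. `to_canonical_text` is a hypothetical function; the choice of NFC normalization and LF line endings is an assumption that may not suit every domain.

```python
import unicodedata


def to_canonical_text(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode raw bytes strictly, normalize to NFC, and use LF line endings."""
    # Strict decoding raises UnicodeDecodeError instead of guessing.
    text = raw.decode(source_encoding)
    # NFC merges sequences such as 'a' + combining diaeresis into one 'ä'.
    text = unicodedata.normalize("NFC", text)
    # Convert Windows line endings to Unix line endings.
    return text.replace("\r\n", "\n")


decomposed = "a\u0308\r\nb".encode("utf-8")
canonical = to_canonical_text(decomposed)      # 'ä\nb'
as_utf8 = canonical.encode("utf-8")            # standardized output bytes
```

Decoding strictly at the boundary and encoding to UTF-8 on output keeps the rest of the program working with one well-defined text representation.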
Practical exercises
1) Identify the type: You receive the bytes 43 41 46 45 C0 DE. Decide whether this is text or binary. Hint: 43 41 46 45 decodes as "CAFE" in ASCII/UTF-8, but C0 never occurs in valid UTF-8 and DE is a lead byte that would need a continuation byte after it, so the full sequence is likely binary structured data rather than plain UTF-8 text.
2) Round-trip safety: Serialize the 32-bit integer 0x1A2B3C4D to bytes and back using big-endian and little-endian byte order. Verify that the numeric value is unchanged only when both sides use the same byte order.
3) Encoding identification: Given a file that displays as ZÃ¼rich in your editor, determine the actual bytes and encoding. Then convert to UTF-8 so it displays Zürich. Tools to try: iconv, recode, or your editor’s encoding switch.
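One possible way to check your answers is sketched below in Python. The helper `is_valid_utf8` is a hypothetical name, and the third part assumes the garbling came from UTF-8 bytes being read as ISO-8859-1, which is only one of several possible causes.

```python
import struct


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Exercise 1: the prefix is readable text, the whole sequence is not UTF-8.
data = bytes.fromhex("43414645C0DE")
prefix_text = data[:4].decode("ascii")        # 'CAFE'
whole_is_text = is_valid_utf8(data)           # False

# Exercise 2: the value survives only if both sides use the same byte order.
value = 0x1A2B3C4D
big = struct.pack(">I", value)                # bytes 1A 2B 3C 4D
little = struct.pack("<I", value)             # bytes 4D 3C 2B 1A
round_trip = struct.unpack(">I", big)[0]      # 0x1A2B3C4D again
swapped = struct.unpack("<I", big)[0]         # 0x4D3C2B1A

# Exercise 3: undo the wrong decoding step, then decode as UTF-8.
displayed = "ZÃ¼rich"
repaired = displayed.encode("iso-8859-1").decode("utf-8")   # 'Zürich'
```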
Summary
- Bits are meaningless without an agreed data type, byte order, and (for text) character encoding.
- The same bytes can represent integers, floats, dates, or text depending on interpretation.
- Treat user input as data, not code; use typed parameters and proper escaping to prevent injection.
- For text, prefer Unicode and UTF-8 end-to-end to avoid ambiguity.