# Binary Data

## Introduction

• All data is stored as 1's and 0's
• But those 1's and 0's may represent:
• Characters that can be displayed as text
• Something else
• That “something else” is (misleadingly) called binary data
• Usually means “anything you can't manipulate with a standard text editor”
• This lecture describes how binary values are stored and manipulated
• Please, don't write code to manipulate binary formats unless you absolutely have to
• Good libraries exist for working with every image, sound, and video format out there

## You Can Skip This Lecture If...

• You know what two's complement is
• You know what bit shifting is
• You know that roundoff errors are not random
• You know how to pack and unpack binary values

## Why Binary?

• Size: `"10239472"` is 8 bytes long, but the 32-bit integer it represents is 4 bytes
• Speed: takes dozens of operations to add the integer represented by `"34"` to the one represented by `"57"`
• Hardware interfaces: someone has to convert the electrical signal from the gas chromatograph to a readable number
• Lack of anything better
• It's possible to represent images as text (ASCII art, PostScript)
• But sound? Or movies?

## How Numbers Are Stored

• Positive numbers stored in base-2 format
• 1001₂ is (1×2³)+(0×2²)+(0×2¹)+(1×2⁰) = 9
• Could use sign-and-value for negative numbers
• First bit is 0 for positive, 1 for negative
• 0011₂ is 3₁₀, and 1011₂ is -3₁₀
• Problem: there are two zeroes (0000 and 1000)

## Two's Complement

• Almost all computers use two's complement instead
• “Roll over” when going below zero, like a car's odometer
• 1111₂ is -1₁₀, 1110₂ is -2₁₀, etc.
• 1000₂ is the most negative 4-bit integer, 0111₂ the most positive
• Figure 18.1: Two's Complement

• Asymmetric: there is one more negative number than positive
• Since there has to be room for 0 in the middle
• Can still tell whether a number is positive or negative by looking at the first bit
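The odometer idea can be sketched in a few lines. This is only an illustration, assuming 4-bit values by default; the names `to_twos_complement` and `from_twos_complement` are made up for this example and are not part of any library.

```python
def to_twos_complement(value, width=4):
    '''Return the unsigned bit pattern that stores value in width bits.'''
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError('%d does not fit in %d bits' % (value, width))
    # Masking makes negative numbers "roll over" from the top of the range.
    return value & ((1 << width) - 1)

def from_twos_complement(pattern, width=4):
    '''Interpret an unsigned bit pattern as a signed integer.'''
    sign_bit = 1 << (width - 1)
    if pattern & sign_bit:
        # First bit is 1, so the value is negative.
        return pattern - (1 << width)
    return pattern

for value in (7, 1, 0, -1, -2, -8):
    pattern = to_twos_complement(value)
    assert from_twos_complement(pattern) == value
    print('%3d is stored as %s' % (value, format(pattern, '04b')))
```

With 4 bits the representable range is -8 to 7, which is the asymmetry described above.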

## Bitwise Operators

• Like most languages, Python has four operators that work on bits

| Name | Symbol | Purpose | Example |
|------|--------|---------|---------|
| And | `&` | 1 if both bits are 1, 0 otherwise | `0110 & 1010 = 0010` |
| Or | `\|` | 1 if either bit is 1 | `0110 \| 1010 = 1110` |
| Xor | `^` | 1 if the bits are different, 0 if they're the same | `0110 ^ 1010 = 1100` |
| Not | `~` | Flip each bit | `~0110 = 1001` |

Table 18.1: Bitwise Operators in Python

• The name “xor” is short for “exclusive or”, i.e., either/or
• Use these to write a function that displays the bits in an integer
```python
def format_bits(val, width=1):
    '''Create a base-2 representation of an integer.'''
    result = ''
    while val:
        if val & 0x01:
            result = '1' + result
        else:
            result = '0' + result
        val = val >> 1
    if len(result) < width:
        result = '0' * (width - len(result)) + result
    return result

tests = [
    [ 0, None, '0'],
    [ 0, 4,    '0000'],
    [ 5, None, '101'],
    [19, 8,    '00010011']
]

for (num, width, expected) in tests:
    if width is None:
        actual = format_bits(num)
    else:
        actual = format_bits(num, width)
    print '%4d %8s %10s %10s' % (num, width, expected, actual)
```
```
   0     None          0          0
   0        4       0000       0000
   5     None        101        101
  19        8   00010011   00010011
```
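For comparison, Python's built-in `format` and `bin` functions produce the same digits as `format_bits` for non-negative integers. The wrapper below is a sketch; the name `builtin_bits` is hypothetical. Note that `format_bits` as written only handles non-negative input: for a negative value its loop never ends, because `-1 >> 1` is still `-1` in Python.

```python
def builtin_bits(val, width=1):
    '''Base-2 text for a non-negative integer, zero-padded to width.'''
    return format(val, '0%db' % width)

for (num, width) in [(0, 1), (0, 4), (5, 1), (19, 8)]:
    print('%4d %10s' % (num, builtin_bits(num, width)))

# bin() gives the same digits with a '0b' prefix.
prefixed = bin(19)
```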

## Shifting

• Shifting an integer's bits left N places is written as `x << N`
• Each leftward shift corresponds to multiplying by 2
• Just as shifting a decimal number left corresponds to multiplying by 10
• Example: 3<<2 is 0011₂<<2, or 1100₂, which is 12
• Shifting a number right corresponds to division by 2 (throwing away the remainder)
• 7₁₀>>1 is 0111₂>>1, or 0011₂, which is 3₁₀
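The two examples above, plus a quick check of the general rule for a few non-negative values, might look like this. It is a sketch only; the variable names are chosen for illustration.

```python
left = 3 << 2     # 0011 becomes 1100
right = 7 >> 1    # 0111 becomes 0011; the remainder is thrown away

for x in (3, 7, 12, 100):
    for n in (1, 2, 3):
        assert x << n == x * 2 ** n      # each left shift doubles
        assert x >> n == x // 2 ** n     # each right shift halves, rounding down

print('%d %d' % (left, right))
```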

## Cautions

• Shifting is not more efficient than multiplication and division on modern computers
• What happens if the top bit changes value as a result of a shift?
• 6₁₀<<1 = 0110₂<<1 = 1100₂
• On a 4-bit machine, this is -4₁₀, not 12₁₀
• Some machines preserve the sign bit when shifting down
• So 1100₂>>1 = 1110₂, instead of 0110₂
• Depends on the hardware being used
• Java provides separate operators for this: `>>` preserves the sign bit, while `>>>` always shifts in zeros

## Setting and Clearing Bits

• Can use bitwise `and`, `or`, and `not` to set specific bits to 1 or 0
• Do the same things to bits that logical operations do to Booleans
• To set the ith bit of `x` to 1:
• Create a value `mask` in which bit i is 1 and all others are 0
• Use `x = x | mask`
• To set the ith bit of `x` to 0:
• Create a value `mask` in which bit i is 1 and all others are 0
• Negate it using `~`, so that the ith bit is 0, and all the others are 1
• Use `x = x & mask`
• Figure 18.2: Setting and Clearing Bits

## Bit Flags

• Can use bitwise operators to store several Boolean flags in a single integer
• Slower than storing each in a separate variable
• But uses much less space
• Example: need to record whether a sample contains any mercury, phosphorus, or chlorine
• Define constants to test for particular elements
• Use bit 1 for mercury, bit 2 for phosphorus, bit 3 for chlorine
• Figure 18.3: Using Bits to Record Sets of Flags

```python
#            hex     binary
MERCURY    = 0x01  # 0001
PHOSPHORUS = 0x02  # 0010
CHLORINE   = 0x04  # 0100

# Sample contains mercury and chlorine
sample = MERCURY | CHLORINE
print 'sample: %04x' % sample

# Check for various elements
for (flag, name) in [[MERCURY,    "mercury"],
                     [PHOSPHORUS, "phosphorus"],
                     [CHLORINE,   "chlorine"]]:
    if sample & flag:
        print 'sample contains', name
    else:
        print 'sample does not contain', name
```
```
sample: 0005
sample contains mercury
sample does not contain phosphorus
sample contains chlorine
```

## Floating Point

• Floating point numbers are (much) more complicated
• A 32-bit float has:
• One bit for the sign
• 23 bits for the mantissa (or value)
• 8 bits for the exponent
• Figure 18.4: Floating Point Representation

• Floating point numbers are not real numbers
• Fixed number of bits per value means that only a limited set of values can be represented
• If the actual value isn't in that set, you must settle for the closest available approximation

## Floating Point Spacing

• Consequence #1: values are unevenly spaced
• Less absolute precision for numbers with larger magnitudes
• Example: 1 sign, 3 mantissa, 2 exponent bits for each number
• Figure 18.5: Uneven Spacing of Floating-Point Numbers
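The same uneven spacing can be measured for real 32-bit floats. The sketch below uses the `struct` module (covered later in this lecture) to view a float's bits as an integer, add one, and convert back, which yields the next representable value. The helper names `next_float32` and `gap_above` are made up for this example, and it only handles positive, finite inputs.

```python
import struct

def next_float32(x):
    '''Return the next 32-bit float above a positive, finite x.'''
    # '<f' is a 32-bit IEEE 754 float; '<I' is an unsigned 32-bit integer.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits + 1))[0]

def gap_above(x):
    '''Distance from x to the next representable 32-bit float.'''
    return next_float32(x) - x

for x in (1.0, 1024.0, 16777216.0):
    print('%12.1f %g' % (x, gap_above(x)))
```

The gap grows with the magnitude of the number: near 16777216 (2²⁴) neighbouring 32-bit floats are a full 2.0 apart.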

## Floating Point Roundoff

• Consequence #2: roundoff errors
• 6-bit system can represent 6, and ¼, but not 5¾
• So 6 - 0.25 is 6, not 5.75
• And if 6 - 0.25 - 0.25 - 0.25 - 0.25 is evaluated left to right, the answer is still 6
• This is not random
• Happens exactly the same way every time
• But it is very hard to reason about
• Which is why people get Ph.D.'s in numerical analysis
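The same effect shows up in ordinary Python floats. The sketch below assumes they are 64-bit IEEE 754 doubles, which is the case on essentially all current platforms; the variable names are illustrative.

```python
big = 2.0 ** 53          # just above this value, doubles are 2 apart

# Evaluated left to right, each + 1.0 is rounded away individually.
one_at_a_time = big + 1.0 + 1.0 + 1.0 + 1.0

# Summing the small values first gives a result large enough to register.
all_at_once = big + (1.0 + 1.0 + 1.0 + 1.0)

print(one_at_a_time == big)
print(all_at_once - big)
```

Running this twice gives the same answers both times: the error depends only on the values and the order of operations.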

## Binary I/O

• I/O routines seen so far are line-based
• Can also use byte-oriented routines
• `f.read(N)` reads (up to) next N bytes
• Result is returned as a string, but there's no guarantee its contents are characters
• If the file `f` is empty (or has already been read to the end), returns the empty string `''`, not `None`
• `f.write(str)` writes the bytes in the string `str`
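A short round trip through a temporary file shows these calls in action. This is a sketch that uses the standard `tempfile` module to get a scratch file name; the byte values are arbitrary, and the file is opened in binary mode for the reason given in the next section.

```python
import os
import tempfile

payload = b'\x00\x01\xfe\xff'      # four bytes, none of them printable text

handle, path = tempfile.mkstemp()  # create an empty scratch file
os.close(handle)
try:
    with open(path, 'wb') as f:
        f.write(payload)
    with open(path, 'rb') as f:
        first = f.read(2)          # exactly two bytes
        rest = f.read(100)         # asks for 100, gets the 2 that remain
        at_end = f.read(100)       # nothing left: an empty result
finally:
    os.remove(path)

print(repr(first))
print(repr(rest))
print(repr(at_end))
```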

## Binary I/O Mode

• Caution: must open files in binary mode on Windows
• `input = open(filename, 'rb')` (and similarly for output)
• Otherwise, the low-level routines Python relies on convert Windows line endings `"\r\n"` to Unix-style `"\n"`
• …which is an unkind thing to do to an image
• Example: open a file using `"r"`, then in `"rb"`
• Identical on Unix, but different on Windows
```python
import sys
print sys.platform
for mode in ('r', 'rb'):
    f = open('open_binary.py', mode)
    s = f.read(40)  # read the first 40 bytes (count chosen for illustration)
    f.close()
    print repr(s)
```
```
cygwin
'import sys\r\nprint sys.platform\r\nfor mode'
```
```
linux
'import sys\nprint sys.platform\nfor mode in '
```

## Packing and Unpacking

• In C and Fortran, an integer is a raw 32-bit value
• `fwrite(&array, sizeof(int), 3, file)` will write 3 4-byte integers to a file
• Python, Java, and other languages usually don't use raw values
• There's no guarantee that things like lists are stored contiguously in memory…
• …so programs need to pack data into contiguous bytes for writing…
• …and unpack those bytes to recreate the structures when reading
• Figure 18.6: C Storage vs. Python Storage

## Packing Data

• Packing looks a lot like formatting a string
• A format specifies the data types being packed (including sizes, where appropriate)
• This format exactly determines how much memory is required by the packed representation
• The result of packing is a chunk of bytes
• Stored as a string in Python
• But as mentioned above, it's not a string of characters
• Figure 18.7: Packing Data

## Unpacking Data

• Unpacking reverses this process
• Read bytes from a “string” according to a format
• Use the data in these bytes to create Python data structures
• Return the result as a tuple of values

## The struct Module

• Use Python's `struct` module to pack and unpack
• `pack(fmt, v1, v2, …)` packs the values `v1`, `v2`, etc. according to `fmt`, returning a string
• `unpack(fmt, str)` unpacks the values in `str` according to `fmt`, returning a tuple
```python
import struct

fmt = 'hh' # two 16-bit integers
x = 31
y = 65
binary = struct.pack(fmt, x, y)
print 'binary representation:', repr(binary)
normal = struct.unpack(fmt, binary)
print 'back to normal:', normal
```
```
binary representation: '\x1f\x00A\x00'
back to normal: (31, 65)
```

• What's `'\x1f\x00A\x00'`?
• If Python finds a character in a string that doesn't have a printable representation, it prints a 2-digit hexadecimal (base-16) escape sequence
• Uses the letters A-F (or a-f) to represent the digits from 10 to 15
• So this string represents the four bytes `['\x1f', '\x00', 'A', '\x00']`
• 1f₁₆ is (1×16 + 15), or 31
• ASCII code for the letter `"A"` is 65₁₀

## Format Specifiers

| Format | Meaning |
|--------|---------|
| `"c"` | Single character (i.e., string of length 1) |
| `"B"` | Unsigned 8-bit integer |
| `"h"` | Short (16-bit) integer |
| `"i"` | 32-bit integer |
| `"f"` | 32-bit float |
| `"d"` | Double-precision (64-bit) float |
| `"s"` | String of fixed size (see below) |

Table 18.2: Packing Format Specifiers

• Any format can be preceded by a count
• E.g., `"4i"` is four integers
• How much data is packed is specified by the format
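• Can pack an integer into 8 or 16 bits using `"B"` or `"h"` instead of the full 32; mask it first (e.g., `x & 0xFF` for `"B"`), since current versions of `struct` raise an error for values that don't fit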

## Calculating Sizes

• Must always specify the size of strings
• E.g., `"4s"` for a 4-character string
• Otherwise, how would `unpack` know how much data to use?
• `calcsize(fmt)` calculates how large (in bytes) the data produced using `fmt` will be
• Data sizes can vary from platform to platform
• And the computer is better at doing arithmetic than you are
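A few sizes can be checked directly. The sketch below relies on one feature of `struct` not covered above: a leading `'<'` in the format selects standard sizes with no padding, so those results are the same on every platform, whereas formats without a prefix use the machine's native sizes and alignment. The dictionary name `sizes` is illustrative.

```python
import struct

# With the '<' prefix, sizes are fixed by the struct module itself.
formats = ('<B', '<h', '<i', '<f', '<d', '<4s', '<4i')
sizes = dict((fmt, struct.calcsize(fmt)) for fmt in formats)
for fmt in formats:
    print('%-4s %2d' % (fmt, sizes[fmt]))

# Without a prefix, padding may be inserted to align the integer.
native = struct.calcsize('ci')
packed = struct.calcsize('<ci')
print('%d %d' % (native, packed))
```

On many machines the native size of `'ci'` is 8 rather than 5, which is exactly the kind of platform difference `calcsize` is there to absorb.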

## Endianness

• Note that in the packed string `'\x1f\x00A\x00'` shown earlier, the least significant byte of each integer comes first
• This is called little-endian, and is used by all Intel processors
• Other chips put the most significant byte first (big-endian)
• If you move data from one architecture to another, it's your responsibility to flip the bytes…
• …because the machine doesn't know what the bytes mean
```python
import struct

packed = struct.pack('4c', 'a', 'b', 'c', 'd')
print 'packed string:', repr(packed)

left16, right16 = struct.unpack('hh', packed)
print 'as two 16-bit integers:', left16, right16

all32 = struct.unpack('i', packed)
print 'as a single 32-bit integer', all32[0]

float32 = struct.unpack('f', packed)
print 'as a 32-bit float', float32[0]
```
```
packed string: 'abcd'
as two 16-bit integers: 25185 25699
as a single 32-bit integer 1684234849
as a 32-bit float 1.67779994081e+22
```
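The `struct` module can do the byte flipping itself: a format that starts with `'<'` is packed little-endian and one that starts with `'>'` is packed big-endian, regardless of the machine the code runs on. A brief sketch, with illustrative variable names:

```python
import struct

little = struct.pack('<i', 1)    # least significant byte first
big = struct.pack('>i', 1)       # most significant byte first
print(repr(little))
print(repr(big))

# Reading little-endian bytes as if they were big-endian gives a
# different number: the bytes are the same, only the interpretation changes.
misread = struct.unpack('>i', little)[0]
print(misread)
```

Choosing one explicit byte order for a file format means the data can be moved between architectures without further conversion.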

## Packing Variable-Length Data

• How to store a variable-length vector of integers?
• Store the number of elements in a fixed-size header
• Then store that many integers one by one
• Figure 18.8: Packing a Variable-Length Vector

• Packing is easy:
```python
def pack_vec(vec):
    buf = struct.pack('i', len(vec))
    for v in vec:
        buf += struct.pack('i', v)
    return buf
```

## Unpacking Variable-Length Data

• Unpacking is a little harder
• Have to step up to the right location in the packed string on each pass through the unpacking loop
```python
def unpack_vec(buf):

    # Get the count of the number of elements in the vector.
    int_size = struct.calcsize('i')
    count = struct.unpack('i', buf[0:int_size])[0]

    # Get 'count' values, one by one.
    pos = int_size
    result = []
    for i in range(count):
        v = struct.unpack('i', buf[pos:pos+int_size])
        result.append(v[0])
        pos += int_size

    return result
```
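Because a format can carry a repeat count, the loop can also be replaced by a single call whose format is built from the vector's length. This is an alternative sketch, not the lecture's version: the names `pack_vec_once` and `unpack_vec_once` are hypothetical, and the `'<'` prefix (fixed size, little-endian) is an added assumption so the result does not depend on the platform.

```python
import struct

def pack_vec_once(vec):
    '''Pack a count followed by that many integers in a single call.'''
    fmt = '<i%di' % len(vec)           # e.g. '<i3i' for three values
    return struct.pack(fmt, len(vec), *vec)

def unpack_vec_once(buf):
    '''Recover the list written by pack_vec_once.'''
    int_size = struct.calcsize('<i')
    count = struct.unpack('<i', buf[:int_size])[0]
    end = int_size + count * int_size
    return list(struct.unpack('<%di' % count, buf[int_size:end]))

original = [3, -1, 42]
packed = pack_vec_once(original)
assert unpack_vec_once(packed) == original
print(len(packed))
```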

## Dynamic Formats

• Problem: what if you want to pack strings, but don't know their length in advance?
• Solution: create the format string on the fly, and save the string's length as well as its characters
```python
def pack_strings(strings):
    result = ''
    for s in strings:
        length = len(s)
        format = 'i%ds' % length
        result += struct.pack(format, length, s)
    return result
```

## Unpacking Dynamic Formats

• Unpacking is the same as it was for vectors
```python
def unpack_strings(buf):
    int_size = struct.calcsize('i')
    pos = 0
    result = []
    while pos < len(buf):
        length = struct.unpack('i', buf[pos:pos+int_size])[0]
        pos += int_size
        format = '%ds' % length
        s = struct.unpack(format, buf[pos:pos+length])[0]
        pos += length
        result.append(s)
    return result
```

## Metadata

• Metadata literally means “data about data”
• I.e., data that describes other data, such as the date it was collected, or its format
• When creating binary files, put a header at the start of the file that describes the format of the data the file contains
• One parser handles all data files
• Can't lose the format: programs come and go, but data is forever
• Slower (generality always is)
• Reader is more complicated than a single special-purpose reader would be…
• …but simpler than the sum of all the special-purpose readers you'd have to write…
• …and you only have to debug it once

## Metadata File Structure

• Files have a three-part structure:
• Integer (fixed size) recording the length of the metadata
• Metadata (N bytes) describing the format of the records in the file
• The records themselves
• Figure 18.9: Structure of a Binary File With Metadata

## Packing with Metadata

• First step is to store a list of identically-structured records to a file
```python
def store(outf, format, values):
    '''Store a list of lists, each of which has the same structure.'''
    length = struct.pack('i', len(format))
    outf.write(length)
    outf.write(format)
    for v in values:
        temp = [format] + v
        binary = struct.pack(*temp)
        outf.write(binary)
```
• Notice how `struct.pack` is called
• It takes each value to be packed as a separate argument, rather than taking a list of values
• First argument has to be the format
• So create a list with the format, and the values to be packed, and apply `struct.pack` to it
• Common pattern when using variable number of arguments
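The calling pattern can be seen in isolation. In this sketch the format `'<if'` and the values are arbitrary examples; it shows that building a list and applying the function to it gives the same bytes as writing the arguments out by hand, and that the `*` can also be applied to the values alone.

```python
import struct

fmt = '<if'
values = [17, 18.0]

spelled_out = struct.pack(fmt, 17, 18.0)     # arguments written by hand
via_list = struct.pack(*([fmt] + values))    # one list, spread by '*'
via_star = struct.pack(fmt, *values)         # format first, then the spread values

assert spelled_out == via_list == via_star
print(struct.unpack(fmt, via_star))
```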

## Unpacking with Metadata

• Second step is to unpack the bytes created by `store`
• Read the size of the metadata, then the metadata, then the data
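```python
def retrieve(inf):
    '''Retrieve data from a self-describing file.'''
    # Read the fixed-size integer holding the length of the metadata.
    data = inf.read(struct.calcsize('i'))
    format_length = struct.unpack('i', data)[0]
    # Read the metadata itself: the format of each record.
    format = inf.read(format_length)
    record_size = struct.calcsize(format)
    # Read one record at a time until the data runs out.
    result = []
    while True:
        data = inf.read(record_size)
        if not data:
            break
        values = list(struct.unpack(format, data))
        result.append(values)
    return result
```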

## Testing

• Final step is to test that everything works
• Just as important as steps 1 and 2
```python
from cStringIO import StringIO

tests = [
    ['i',  [[17]]],
    ['ii', [[17, 18]]],
    ['ii', [[17, 18], [19, 20], [21, 22]]],
    ['if', [[17, 18.0], [19, 20.0]]]
]

for (format, original) in tests:
    storage = StringIO()
    store(storage, format, original)
    storage.seek(0)
    final = retrieve(storage)
    assert original == final
```
• Note that there's no output: tests should only ask for attention when something goes wrong

## Summary

• Binary data is to programming what chemistry is to biology
• You don't want to spend any more time thinking at its level than you have to…
• …but when you do have to, there's no substitute
• Remember: libraries already exist to handle (almost) every binary format ever created
• The easiest code to debug is the code you didn't actually have to write