This post originally appeared on steadbytes.com
See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.
Challenge 1
Design a small address book database (name, phone number, and so on) using a straightforward binary representation in your language of choice. Do this before reading the rest of this challenge.
- Translate that format into a plain-text format using XML or JSON.
- For each version, add a new, variable-length field called directions in which you might enter directions to each person’s house.
What issues come up regarding versioning and extensibility? Which form was easier to modify? What about converting existing data?
Full code can be found on GitHub.
Version 1
Data Model
Each address book record is represented by a Person class containing basic personal information and address fields. A unique id is also provided for each record using a UUID. Storing addresses universally is quite complex; however, as this is not a challenge about data modelling, I have assumed a very basic model of a UK address:
# address_book/models.py
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    id: str = field(default_factory=lambda: str(uuid4()))
I'm using Python 3.7 dataclasses because Person is mainly (apart from id generation) a Data Transfer Object (DTO). Usage:
>>> Person("Ben", "Steadman", "+1-087-184-1440", "1", "A Road", "My Town", "CB234")
Person(first_name='Ben', last_name='Steadman', phone_number='+1-087-184-1440', house_number='1', street='A Road', town='My Town', postcode='CB234', id='a14fe77b-b5d2-46e7-b42c-9392b4bbec28')
To aid testing, generate_people will generate arbitrary Person instances using the excellent Faker library:
# address_book/models.py
from typing import Iterable

from faker import Faker

fake = Faker("en_GB")


def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
        )
Usage:
>>> list(generate_people(2))
[
    Person(
        first_name="Victor",
        last_name="Pearce",
        phone_number="01184960739",
        house_number="2",
        street="Mohamed divide",
        town="Charleneburgh",
        postcode="LS7 0DJ",
        id="cb242277-44dd-4836-98c7-ddbe10183fb4",
    ),
    Person(
        first_name="Stanley",
        last_name="Ashton",
        phone_number="(0131) 496 0908",
        house_number="2",
        street="Karen bridge",
        town="Port Gailland",
        postcode="L3J 2YF",
        id="ef85cfd1-08eb-4629-8747-3d8be1580fc7",
    ),
]
Binary Representation
As this challenge is about data formats and not building a database, I'm interpreting address book database as a file containing a list of address book records - not a DBMS.
To convert between the Person class and a binary representation, the Python struct module can be used.
Performs conversions between Python values and C structs represented as Python bytes objects.
-- Python struct documentation
Person can be represented using the following Struct:
import struct

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")
Which corresponds to the following C struct:
struct Person {
    char first_name[50];
    char last_name[50];
    char phone_number[30];
    char house_number[10];
    char street[50];
    char town[50];
    char postcode[10];
    char id[36];
};
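As a quick sanity check of that correspondence, the size of the packed record can be compared with the sum of the field widths. This is only a small sketch; field_sizes and record_size are names introduced here for illustration and are not part of the address book code.

```python
import struct

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")

# Each "Ns" code occupies exactly N bytes; char arrays need no alignment
# padding, so the record size is simply the sum of the field widths.
field_sizes = [50, 50, 30, 10, 50, 50, 10, 36]
record_size = PersonStruct.size

print(record_size, sum(field_sizes))
```

Every record therefore occupies the same number of bytes on disk, however short the actual values are.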
Binary packing/unpacking usage:
>>> as_bytes = PersonStruct.pack(b'Ben', b'Steadman', b'+44(0)116 4960124', b'1', b'A Road', b'My Town', b'CB234', b'b36cb798-946e-4dca-b89c-f393616feb7b') >>> as_bytes b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00b36cb798-946e-4dca-b89c-f393616feb7b' >>> PersonStruct.unpack(as_bytes)(b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'1\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'CB234\x00\x00\x00\x00\x00', 
b'b36cb798-946e-4dca-b89c-f393616feb7b')
- Note how the values of the tuple returned from PersonStruct.unpack are padded with \x00 (null bytes) due to the struct format specifying a larger length string than the original values provided. These will need to be removed during unpacking into Person objects.
To provide a higher level of abstraction over these raw bytes, the conversion functionality can be wrapped up into some functions which deal with Person objects:
# address_book/binary.py
import struct
from dataclasses import astuple

from .models import Person

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")


def from_bytes(buffer: bytes) -> Person:
    return Person(
        *(
            # remove null bytes added by string packing
            x.decode("utf-8").rstrip("\x00")
            for x in PersonStruct.unpack(buffer)
        )
    )


def to_bytes(p: Person) -> bytes:
    return PersonStruct.pack(*(s.encode("utf-8") for s in astuple(p)))
Usage:
>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_bytes = to_bytes(me)
>>> me_again = from_bytes(as_bytes)
>>> me == me_again
True
These Person conversion functions can be used in higher level functions to read and write an entire address book database:
# address_book/binary.py
from functools import partial
from pathlib import Path
from typing import Iterable, List


def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        # read the file in fixed-size chunks, one packed record at a time
        for chunk in iter(partial(f.read, PersonStruct.size), b""):
            people.append(from_bytes(chunk))
    return people


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        f.write(b"".join(to_bytes(p) for p in people))
Usage:
>>> people = list(generate_people(50))
>>> db = Path("data/address-book.bin")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
Plain Text Representation
I've chosen JSON as the plain text format due to the excellent Python standard library json module making it easy to work with. Using the same Person model, the functions from_dict and to_dict are analogous to from_bytes and to_bytes respectively, as the json module converts JSON objects to and from Python dictionaries.
# address_book/plain_text.py
from dataclasses import asdict

from .models import Person


def from_dict(d: dict) -> Person:
    return Person(**d)


def to_dict(p: Person) -> dict:
    return asdict(p)
Usage:
>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_dict = to_dict(me)
>>> me_again = from_dict(as_dict)
>>> me == me_again
True
These can then be used to create JSON versions of read_address_book and write_address_book:
# address_book/plain_text.py
import json
from pathlib import Path
from typing import Iterable, List


def read_address_book(db: Path) -> List[Person]:
    with db.open() as f:
        return [from_dict(d) for d in json.load(f)]


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("w") as f:
        json.dump([to_dict(p) for p in people], f)
Usage:
>>> people = list(generate_people(50))
>>> db = Path("data/address-book.json")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
Tests
Each implementation is also covered by a set of simple unit tests, asserting the correctness of the conversions to and from their respective formats:
import pytest

from address_book import binary, plain_text
from address_book.models import Person, generate_people


@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again


@pytest.mark.parametrize("p", generate_people(50))
def test_to_dict_inverts_from_dict(p):
    p_dict = plain_text.to_dict(p)
    p_again = plain_text.from_dict(p_dict)
    assert p == p_again


@pytest.mark.parametrize(
    "module,fname", [(binary, "address-book.bin"), (plain_text, "address-book.json")]
)
def test_write_address_book_inverts_read_address_book(module, fname, tmp_path):
    db = tmp_path / fname
    # sanity check
    assert db.exists() is False
    people = list(generate_people(50))
    module.write_address_book(db, people)
    assert db.exists() is True
    assert db.stat().st_size > 0
    people_again = module.read_address_book(db)
    assert people == people_again
Version 2 (variable length directions)
Adding the additional directions field to the model is simple enough:
from dataclasses import dataclass, field
from typing import Iterable
from uuid import uuid4

from faker import Faker

fake = Faker("en_GB")


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    directions: str  # new
    id: str = field(default_factory=lambda: str(uuid4()))


def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
            # new
            fake.text(),  # random latin is about as useful as most directions
        )
Binary Representation
Since the struct module deals with C structs, strings are represented as C char arrays of a fixed length specified in the format string, e.g. struct.pack("11s", b"hello world"). Supporting variable length strings in general is quite an involved process, and if you need to do this for a real application, using a third party library such as NetStruct would be recommended. For the purpose of this challenge, however, I won't be using it, nor will I be implementing a general solution: the code for packing/unpacking records is very tightly coupled to the structure of the records and I would not recommend following this approach in a real application. However, it does demonstrate the difficulties that can arise when using binary formats.
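To make the fixed-length behaviour concrete, here is a small sketch of how the "s" format code treats values that are shorter or longer than the declared length, and what happens when it is given a str rather than bytes. The variable names are only for illustration.

```python
import struct

# Shorter than the declared length: padded on the right with null bytes
padded = struct.pack("11s", b"hello")

# Longer than the declared length: silently truncated to fit
truncated = struct.pack("5s", b"hello world")

# The "s" code only accepts bytes; passing a str raises struct.error
try:
    struct.pack("11s", "hello world")
    str_accepted = True
except struct.error:
    str_accepted = False

print(padded, truncated, str_accepted)
```

The silent truncation is the important part: a fixed-width field cannot hold arbitrary directions without either wasting space or losing data.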
Since the size of the directions field is variable, the complete format string for packing/unpacking of records using struct must be dynamically created:
>>> me = Person( "Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234", "Take a left at the roundabout", ) >>> fmt = "50s50s30s10s50s50s10s{}s36s".format(len(me.directions)) >>> struct.pack(fmt, *(s.encode("utf-8") for s in astuple(me))) b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00Take a left at the roundaboutbfe3c3e5-8b65-4e49-8d26-3981257a0dee'
Furthermore, since each packed record will be of a different size, the database file cannot simply be read in equal sized chunks and passed to from_bytes as in the first implementation. To solve this, each record is preceded by its size in bytes. This value can be used to determine the next chunk size to read from the file and pass to from_bytes:
# address_book/binary.py
from typing import Tuple

PERSON_STRUCT_FMT = "50s50s30s10s50s50s10s{}s36s"


def to_bytes(p: Person) -> Tuple[bytes, int]:
    # dynamically add size to format for variable length directions field;
    # use the encoded byte length so non-ASCII directions are not truncated
    fmt = PERSON_STRUCT_FMT.format(len(p.directions.encode("utf-8")))
    return (
        struct.pack(fmt, *(s.encode("utf-8") for s in astuple(p))),
        struct.calcsize(fmt),
    )


RecordSizeStruct = struct.Struct("I")


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        # prefix each packed record with its size in bytes
        records_with_sizes = (
            RecordSizeStruct.pack(size) + p_bytes
            for p_bytes, size in (to_bytes(p) for p in people)
        )
        f.write(b"".join(records_with_sizes))
from_bytes still receives a buffer of bytes representing an entire packed record; however, to handle the variable length directions field it needs to calculate the position within buffer at which the directions field must begin, split it accordingly and unpack each section individually:
# address_book/binary.py
def from_bytes(buffer: bytes) -> Person:
    # calculate sizes of non-variable formats
    before_fmt, after_fmt = PERSON_STRUCT_FMT.split("{}s")
    before_start = struct.calcsize(before_fmt)
    after_start = len(buffer) - struct.calcsize(after_fmt)
    before, direction, after = (
        buffer[:before_start],
        buffer[before_start:after_start],
        buffer[after_start:],
    )
    # dynamically build struct format string for variable length field
    direction_fmt = "{}s".format(len(direction))
    data = (
        struct.unpack(before_fmt, before)
        + struct.unpack(direction_fmt, direction)
        + struct.unpack(after_fmt, after)
    )
    return Person(*(x.decode("utf-8").rstrip("\x00") for x in data))


def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        while True:
            # each record preceded by its size in bytes, use to determine number
            # of bytes to read from db for the entire record
            size_buf = f.read(RecordSizeStruct.size)
            if not size_buf:
                break  # reached end of db
            record_size = RecordSizeStruct.unpack(size_buf)[0]
            people.append(from_bytes(f.read(record_size)))
    return people
A slight adjustment to the tests is needed to account for to_bytes now returning a tuple:
@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes, size = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again
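For comparison, a common alternative to splitting the buffer around the variable field is to prefix the variable length string itself with its byte length. The following is only a minimal sketch of that idea, not the approach used in this post; pack_str, unpack_str and LengthPrefix are hypothetical names introduced here.

```python
import struct
from typing import Tuple

# Hypothetical 4-byte little-endian unsigned length prefix
LengthPrefix = struct.Struct("<I")


def pack_str(s: str) -> bytes:
    """Encode s as UTF-8 and prefix it with its length in bytes."""
    data = s.encode("utf-8")
    return LengthPrefix.pack(len(data)) + data


def unpack_str(buffer: bytes, offset: int = 0) -> Tuple[str, int]:
    """Read one length-prefixed string, returning it and the offset just past it."""
    (length,) = LengthPrefix.unpack_from(buffer, offset)
    start = offset + LengthPrefix.size
    end = start + length
    return buffer[start:end].decode("utf-8"), end


# Two variable length fields packed back to back
packed = pack_str("Take a left at the roundabout") + pack_str("CB234")
directions, next_offset = unpack_str(packed)
postcode, _ = unpack_str(packed, next_offset)
print(directions, postcode)
```

With this layout no null byte padding is needed and any number of variable length fields can follow one another, at the cost of no longer being able to seek directly to a given record or field.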
Plain Text Representation
Other than the changes to the Person class, no further changes are required to support the new variable length field.
Summary
Though I already agreed with the authors' preference for plain text formats, this challenge certainly demonstrated that for most cases plain text is the appropriate format to use.
The binary representation is more difficult to extend and (at least in this example) required breaking changes to do so. This made any data written using the first version (prior to the introduction of the variable length directions field) incompatible with any data written using the second version. A versioning scheme would need to be devised and represented within the binary format, for example using a pre-defined 'header' block of bytes to contain some metadata.
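As a rough illustration of what such a header could look like, the sketch below writes a short marker followed by a format version at the start of the file. The layout, the ADBK marker and the function names are all assumptions made for this example; nothing like this exists in the address book code.

```python
import struct

# Hypothetical header: 4-byte magic marker followed by an unsigned 16-bit version
HeaderStruct = struct.Struct("<4sH")
MAGIC = b"ADBK"


def write_header(version: int) -> bytes:
    """Build the header bytes to place at the start of the database file."""
    return HeaderStruct.pack(MAGIC, version)


def read_version(buffer: bytes) -> int:
    """Validate the marker and return the format version stored in the header."""
    magic, version = HeaderStruct.unpack_from(buffer)
    if magic != MAGIC:
        raise ValueError("not an address book file")
    return version


# Header followed by (placeholder) record bytes
file_bytes = write_header(2) + b"...records..."
print(read_version(file_bytes))
```

A reader could then dispatch on the version to choose the matching record format, which is exactly the bookkeeping the plain text format avoids.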
The plain text representation was simple to implement using standard, built-in tools and was simple to extend. If the directions field is deemed optional, any data written in the first version is fully compatible with that of the second version. Converting the data would be a simple text transformation and could in fact be achieved directly in the shell using a tool such as jq. Here's an example to add the directions field, setting it to a default of null:
$ cat data/address-book.json | jq 'map(. + {"directions": null})'
[
  {
    "first_name": "Fiona",
    "last_name": "Power",
    "phone_number": "01314960440",
    "house_number": "91",
    "street": "Sam fields",
    "town": "North Shanebury",
    "postcode": "M38 1FH",
    "directions": null,
    "id": "264bfab6-f1a5-4adc-a86b-28ae8e41817b"
  },
  {
    "first_name": "Lorraine",
    "last_name": "Richards",
    "phone_number": "+448081570114",
    "house_number": "9",
    "street": "Ashleigh loaf",
    "town": "North William",
    "postcode": "M4H 5PW",
    "directions": null,
    "id": "b0b98056-c8ff-4b4e-a68b-b31e8ae43ac3"
  },
  ...
]
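The compatibility claim can also be sketched on the Python side. Assuming directions is made optional with a default of None (a change from the Version 2 model above, where it is required), a record written by Version 1 loads without any conversion step:

```python
import json
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    # assumption: directions is optional, so Version 1 records still load
    directions: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid4()))


# a Version 1 record, written before the directions field existed
v1_record = json.loads(
    """
    {
        "first_name": "Fiona",
        "last_name": "Power",
        "phone_number": "01314960440",
        "house_number": "91",
        "street": "Sam fields",
        "town": "North Shanebury",
        "postcode": "M38 1FH",
        "id": "264bfab6-f1a5-4adc-a86b-28ae8e41817b"
    }
    """
)

person = Person(**v1_record)
print(person.directions)
```

Because JSON objects are keyed by field name rather than by position, the missing key simply falls back to the default.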