art with code

2008-10-08

I/O in programming languages: structs and serialization

Hi, I'm going to talk about structured I/O and serialization today. By structured I/O I mean reading and writing values that aren't bytes. At the simplest level, we can treat the data we read as native values. For example, if you wanted to read an int (in binary form) from stdin in C, you could do int num; if (fread(&num, sizeof(num), 1, stdin) != sizeof(num)) error();. Not only is that easy, it's very efficient as well, and you can write any flat value the same way.

See also the other parts in this series of posts talking about I/O:
Part 1: open and read
Part 2: writing
Part 3: basic piping
Part 4: piping to processes
Part 5: structs and serialization -- you are here

Here's an example that reads and writes an array of structs:

typedef struct {
int x;
int y;
float z;
char name[40];
} foo;

foo s[20];

FILE *fd = fopen("foos.dat", "r+");
fread(&s, sizeof(foo), 20, fd);

s[4].z = 24.8;

fseek(fd, -(sizeof(foo)*20), SEEK_CUR);
fwrite(&s, sizeof(foo), 20, fd);

fclose(fd);

It's even easier if you use mmap to treat the file as an array:

// Note that foos.dat needs to be at least as large as the mmapped area for the
// edits to have any effect, i.e. mmap can't append to a file.

int fd = open("foos.dat", O_RDWR);
foo *s = (foo*)mmap(NULL, sizeof(foo)*20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

s[4].z = 24.8;

munmap(s, sizeof(foo)*20);
close(fd);

And yes, mmap is a syscall.

.equ MMAP2, 192
# ...
movl $MMAP2, %eax
movl $0, %ebx # NULL
movl $length, %ecx
movl $3, %edx # PROT_READ = 1, PROT_WRITE = 2, 1|2 = 3
movl $1, %esi # MAP_SHARED = 1
movl $filedes, %edi
movl $0, %ebp # offset in pages
int $0x80

Fun with ntohl and htonl


If you need to write a portable way to store ints, you need to specify the size and the endianness of the number. Glibc has the tools for that: the datatype uint32_t and the functions htonl and ntohl. Htonl stands for "host to network long" and converts a host-endian integer into a big-endian integer. Ntohl is "network to host long" and converts a big-endian integer into a host-endian integer.

To use htonl and ntohl to write a number to a file, you'd do uint32_t num = 10; uint32_t num_be = htonl(num); fwrite(&num_be, sizeof(uint32_t), 1, fd);. To read the number, the procedure is uint32_t num, num_be; fread(&num_be, sizeof(uint32_t), 1, fd); num = ntohl(num_be);.

Ntohl and htonl share the same implementation: reverse bytes of the value if on a little-endian machine, otherwise do nothing. The glibc implementation is a couple preprocessor macros in two source files: netinet/in.h and bits/byteswap.h. In netinet/in.h, if the current architecture is big-endian, ntohl and htonl are identity operators, i.e. #define ntohl(x) (x). If the architecture is little-endian, ntohl(x) and htonl(x) equate to __bswap_32(x), which is a preprocessor macro in bits/byteswap.h.

On i486 and above, there's a bswap opcode, so the implementation of __bswap_32 is a single assembly instruction, bswap %0. On i386, the implementation is three rotate instructions: rorw $8, %w0; rorw $16, %w0; rorw $8, %w0. The first instruction rotates the first word (sixteen bits) by 8 bits, swapping its bytes (12 34 -> 21 34.) The second instruction rotates the whole 32-bit int by 16 bits, swapping the words (21 34 -> 34 21.) The third instruction again swaps the bytes of the first word (34 21 -> 43 21.)

Pack and unpack


To move up a bit, Ruby (and Perl) has an interesting way to deal with binary data. Ruby's String#unpack converts a string into an array of values, and Array#pack converts an array of values into a string. They use a small data definition language to specify how to do the conversion.

For example, if you wanted to write an ipv4 address as big-endian long, hostname length as big-endian short and the hostname string to a file, you could do File.open("addrs", "a"){|f| f.write( [ip, hostname.length, hostname].pack("Nna*") ) }. To read such a struct from the file, you'd first read the address and the hostname length and then read the hostname: File.open("addrs", "r"){|f| ip,hlen = f.read(6).unpack("Nn"); hostname = f.read(hlen) }. For details on the Perl version, read the (very comprehensive) perldoc page for pack().

Serialization


To store and load arbitrary values, we need a serializer and a deserializer. A serializer takes a value and turns it into a string. A deserializer takes a string and turns it into a value. To use Ruby as an example, you serialize an object with s = Marshal.dump(my_object) and deserialize by Marshal.load(s). OCaml does it in a very similar fashion as well, let s = Marshal.to_string my_value [] and let my_value = Marshal.from_string s 0. The way to serialize objects in Python is pickle, e.g. import pickle; s = pickle.dumps(my_obj); obj = pickle.loads(s).

If you want a more human-readable serialization format, JSON is IMHO the nicest format for simple data (read: 99% of data.) There are JSON libraries in pretty much all languages, JSON uses Unicode strings, the output is reasonably small (compresses well too), and you can easily pass it to JavaScript in a web app. In Ruby you need to install the json library and then can do require 'json'; s = my_hash.to_json; h = JSON.parse(s).

I do also have to mention XML. There.

Madness


Let's finish off by writing a serializer. Or more like, I'll paste two hundred lines of C that serializes and deserializes a binary tree of (int, string)-nodes in a platform-independent manner, and you'll mutter under your breath every time my code does something stupid. Have fun reading!

#include <unistd.h>
#include <string.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

typedef struct {
uint32_t length;
char *ptr;
} string;

/* simple binary tree */
struct btree {
uint32_t id;
string name;
struct btree *left;
struct btree *right;
};
typedef struct btree *btree_t;

/* tags for the data types */
const char NULL_TAG = 0;
const char INT_TAG = 1;
const char STRING_TAG = 2;
const char BTREE_TAG = 3;

string create_string(size_t);
void destroy_string(string);
string tail_string(string, size_t);
string concat_strings(const string *, size_t);


/* int serializer */

string dump_uint32_t(uint32_t i) {
uint32_t j = htonl(i);
string s = create_string(sizeof(uint32_t)+1);
s.ptr[0] = INT_TAG;
*(uint32_t*)(&s.ptr[1]) = j;
return s;
}

uint32_t load_uint32_t(string s, size_t *advance) {
assert(s.ptr[0] == INT_TAG);
assert(s.length >= sizeof(uint32_t)+1);
uint32_t j = *(uint32_t*)(&s.ptr[1]);
*advance = sizeof(uint32_t)+1;
return ntohl(j);
}


/* string serializer */

string dump_string(string s) {
string ds = create_string(s.length+sizeof(uint32_t)+1);
ds.ptr[0] = STRING_TAG;
*(uint32_t*)(ds.ptr+1) = htonl(s.length);
memcpy(ds.ptr+sizeof(uint32_t)+1, s.ptr, s.length);
return ds;
}

string load_string(string ds, size_t *advance) {
assert(ds.ptr[0] == STRING_TAG);
assert(ds.length >= sizeof(uint32_t)+1);
size_t len = ntohl(*(uint32_t*)(ds.ptr+1));
assert(ds.length >= sizeof(uint32_t)+1 + len);
string s = create_string(len);
memcpy(s.ptr, ds.ptr+sizeof(uint32_t)+1, s.length);
*advance = 1+sizeof(uint32_t)+s.length;
return s;
}


/* btree serializer */

string null() {
string null_s = create_string(1);
null_s.ptr[0] = NULL_TAG;
return null_s;
}

string dump_btree(btree_t t) {
string strings[5], dump;
strings[0] = create_string(1);
strings[0].ptr[0] = BTREE_TAG;
strings[1] = dump_uint32_t(t->id);
strings[2] = dump_string(t->name);
strings[3] = (t->left != NULL) ? dump_btree(t->left) : null();
strings[4] = (t->right != NULL) ? dump_btree(t->right) : null();
dump = concat_strings(strings, 5);
int i;
for (i=0; i<5; i++) destroy_string(strings[i]);
return dump;
}

btree_t load_btree(string s, size_t *adv) {
if (s.ptr[0] == NULL_TAG) {
*adv = 1;
return NULL;
}
assert(s.ptr[0] == BTREE_TAG);
size_t advance;
size_t idx = 1;

btree_t t = (btree_t)malloc(sizeof(struct btree));
t->id = load_uint32_t(tail_string(s, idx), &advance); idx += advance;
t->name = load_string(tail_string(s, idx), &advance); idx += advance;
t->left = load_btree(tail_string(s, idx), &advance); idx += advance;
t->right = load_btree(tail_string(s, idx), &advance); idx += advance;

*adv = idx;
return t;
}



/* string utils */

string create_string(size_t length) {
string s;
s.ptr = (char*)malloc(length);
assert(s.ptr != NULL);
s.length = length;
return s;
}

string cstr(const char cstr[]) {
size_t len = strlen(cstr);
string s = create_string(len);
memcpy(s.ptr, cstr, len);
return s;
}

void destroy_string(string s) {
s.length = 0;
if (s.ptr != NULL) free(s.ptr);
}

string tail_string(string s, size_t first_idx) {
string s2;
s2.length = s.length - first_idx;
assert(s2.length >= 0);
s2.ptr = s.ptr + first_idx;
return s2;
}

string concat_strings(const string ss[], size_t len) {
size_t i, total = 0;
for (i=0; i<len; i++) total += ss[i].length;
string s = create_string(total);
total = 0;
for (i=0; i<len; i++) {
memcpy(s.ptr+total, ss[i].ptr, ss[i].length);
total += ss[i].length;
}
return s;
}



/* btree utils */

btree_t create_btree(int id, string name) {
btree_t b = (btree_t)malloc(sizeof(struct btree));
b->id = id;
b->name = name;
b->left = NULL;
b->right = NULL;
return b;
}
void destroy_btree(btree_t b) {
destroy_string(b->name);
if (b->left != NULL) destroy_btree(b->left);
if (b->right != NULL) destroy_btree(b->right);
free(b);
}

int btree_sum_ids(btree_t b) {
int bleft = (b->left == NULL) ? 0 : btree_sum_ids(b->left);
int bright = (b->right == NULL) ? 0 : btree_sum_ids(b->right);
return b->id + bleft + bright;
}


/* test program */

int main (int argc, char *argv[]) {
btree_t b;
b = create_btree(1, cstr("root"));
b->left = create_btree(2, cstr("l"));
b->left->left = create_btree(3, cstr("ll"));
b->left->right = create_btree(4, cstr("lr"));
printf("created tree\n");
int bsum = btree_sum_ids(b);
printf("original sum: %d\n", bsum);

size_t advance;
string ds = dump_btree(b);
printf("dumped tree, %d bytes\n", ds.length);
btree_t c = load_btree(ds, &advance);
printf("loaded tree, advance %d bytes\n", advance);
destroy_string(ds);

int csum = btree_sum_ids(c);
printf("loaded sum: %d\n", csum);

destroy_btree(b);
destroy_btree(c);

return ((bsum == csum) ? 0 : 2);
}

What this exciting program prints out is even more exciting!

created tree
original sum: 10
dumped tree, 58 bytes
loaded tree, advance 58 bytes
loaded sum: 10

Now, a smarter person would've created a data definition language that generates code for the structs and serializers, something like:

serializable btree {
uint id;
string name;
btree left;
btree right;
};

An even smarter person would've made a serializer that converts a serialized string into the data structure in-place by making pointers relative and adding the serialized string's address to each pointer to make them alive. That way you could mmap a file as MAP_PRIVATE, breathe life to it and get fast and easy read access. And your data structure will be contiguous in memory too, what could be better?! (Well, the malloc-recreated data structure is also contiguous in memory, and it's certainly less painful...)

And now I go running off into the autumn afternoon sunshine, laughing. Ha ha ha ha ~~

1 comment:

hyperlogic said...

Nice post. You mention code generation & pointer fixup toward the end. Most console game developers have hand rolled tools that do this. I'm having a hard time finding any open source implementations tho...

Blog Archive