fhtr: October 2008

2008-10-08

I/O in programming languages: structs and serialization

Hi, I'm going to talk about structured I/O and serialization today. By structured I/O I mean reading and writing values that aren't bytes. At the simplest level, we can treat the data we read as native values. For example, if you wanted to read an int (in binary form) from stdin in C, you could do int num; if (fread(&num, sizeof(num), 1, stdin) != sizeof(num)) error();. Not only is that easy, it's very efficient as well, and you can write any flat value the same way.

See also the other parts in this series of posts talking about I/O:
Part 1: open and read
Part 2: writing
Part 3: basic piping
Part 4: piping to processes
Part 5: structs and serialization -- you are here

Here's an example that reads and writes an array of structs:


typedef struct {
  int x;
  int y;
  float z;
  char name[40];
} foo;

foo s[20];

FILE *fd = fopen("foos.dat", "r+");
fread(&s, sizeof(foo), 20, fd);

s[4].z = 24.8;

fseek(fd, -(sizeof(foo)*20), SEEK_CUR);
fwrite(&s, sizeof(foo), 20, fd);

fclose(fd);

It's even easier if you use mmap to treat the file as an array:


// Note that foos.dat needs to be at least as large as the mmapped area for the
// edits to have any effect, i.e. mmap can't append to a file.

int fd = open("foos.dat", O_RDWR);
foo *s = (foo*)mmap(NULL, sizeof(foo)*20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

s[4].z = 24.8;

munmap(s, sizeof(foo)*20);
close(fd);

And yes, mmap is a syscall.


.equ MMAP2, 192
# ...
movl   $MMAP2, %eax
movl       $0, %ebx # NULL
movl  $length, %ecx
movl       $3, %edx # PROT_READ = 1, PROT_WRITE = 2, 1|2 = 3
movl       $1, %esi # MAP_SHARED = 1
movl $filedes, %edi
movl       $0, %ebp # offset in pages
int $0x80

Fun with ntohl and htonl

If you need to write a portable way to store ints, you need to specify the size and the endianness of the number. Glibc has the tools for that: the datatype uint32_t and the functions htonl and ntohl. Htonl stands for "host to network long" and converts a host-endian integer into a big-endian integer. Ntohl is "network to host long" and converts a big-endian integer into a host-endian integer.

To use htonl and ntohl to write a number to a file, you'd do uint32_t num = 10; uint32_t num_be = htonl(num); fwrite(&num_be, sizeof(uint32_t), 1, fd);. To read the number, the procedure is uint32_t num, num_be; fread(&num_be, sizeof(uint32_t), 1, fd); num = ntohl(num_be);.

Ntohl and htonl share the same implementation: reverse bytes of the value if on a little-endian machine, otherwise do nothing. The glibc implementation is a couple preprocessor macros in two source files: netinet/in.h and bits/byteswap.h. In netinet/in.h, if the current architecture is big-endian, ntohl and htonl are identity operators, i.e. #define ntohl(x) (x). If the architecture is little-endian, ntohl(x) and htonl(x) equate to __bswap_32(x), which is a preprocessor macro in bits/byteswap.h.

On i486 and above, there's a bswap opcode, so the implementation of __bswap_32 is a single assembly instruction, bswap %0. On i386, the implementation is three rotate instructions: rorw $8, %w0; rorw $16, %w0; rorw $8, %w0. The first instruction rotates the first word (sixteen bits) by 8 bits, swapping its bytes (12 34 -> 21 34.) The second instruction rotates the whole 32-bit int by 16 bits, swapping the words (21 34 -> 34 21.) The third instruction again swaps the bytes of the first word (34 21 -> 43 21.)

Pack and unpack

To move up a bit, Ruby (and Perl) has an interesting way to deal with binary data. Ruby's String#unpack converts a string into an array of values, and Array#pack converts an array of values into a string. They use a small data definition language to specify how to do the conversion.

For example, if you wanted to write an ipv4 address as big-endian long, hostname length as big-endian short and the hostname string to a file, you could do File.open("addrs", "a"){|f| f.write( [ip, hostname.length, hostname].pack("Nna*") ) }. To read such a struct from the file, you'd first read the address and the hostname length and then read the hostname: File.open("addrs", "r"){|f| ip,hlen = f.read(6).unpack("Nn"); hostname = f.read(hlen) }. For details on the Perl version, read the (very comprehensive) perldoc page for pack().

Serialization

To store and load arbitrary values, we need a serializer and a deserializer. A serializer takes a value and turns it into a string. A deserializer takes a string and turns it into a value. To use Ruby as an example, you serialize an object with s = Marshal.dump(my_object) and deserialize by Marshal.load(s). OCaml does it in a very similar fashion as well, let s = Marshal.to_string my_value [] and let my_value = Marshal.from_string s 0. The way to serialize objects in Python is pickle, e.g. import pickle; s = pickle.dumps(my_obj); obj = pickle.loads(s).

If you want a more human-readable serialization format, JSON is IMHO the nicest format for simple data (read: 99% of data.) There are JSON libraries in pretty much all languages, JSON uses Unicode strings, the output is reasonably small (compresses well too), and you can easily pass it to JavaScript in a web app. In Ruby you need to install the json library and then can do require 'json'; s = my_hash.to_json; h = JSON.parse(s).

I do also have to mention XML. There.

Madness

Let's finish off by writing a serializer. Or more like, I'll paste two hundred lines of C that serializes and deserializes a binary tree of (int, string)-nodes in a platform-independent manner, and you'll mutter under your breath every time my code does something stupid. Have fun reading!


#include <unistd.h>
#include <string.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

typedef struct {
  uint32_t length;
  char *ptr;
} string;

/* simple binary tree */
struct btree {
  uint32_t id;
  string name;
  struct btree *left;
  struct btree *right;
};
typedef struct btree *btree_t;

/* tags for the data types */
const char NULL_TAG = 0;
const char INT_TAG = 1;
const char STRING_TAG = 2;
const char BTREE_TAG = 3;

string create_string(size_t);
void destroy_string(string);
string tail_string(string, size_t);
string concat_strings(const string *, size_t);


/* int serializer */

string dump_uint32_t(uint32_t i) {
  uint32_t j = htonl(i);
  string s = create_string(sizeof(uint32_t)+1);
  s.ptr[0] = INT_TAG;
  *(uint32_t*)(&s.ptr[1]) = j;
  return s;
}

uint32_t load_uint32_t(string s, size_t *advance) {
  assert(s.ptr[0] == INT_TAG);
  assert(s.length >= sizeof(uint32_t)+1);
  uint32_t j = *(uint32_t*)(&s.ptr[1]);
  *advance = sizeof(uint32_t)+1;
  return ntohl(j);
}


/* string serializer */

string dump_string(string s) {
  string ds = create_string(s.length+sizeof(uint32_t)+1);
  ds.ptr[0] = STRING_TAG;
  *(uint32_t*)(ds.ptr+1) = htonl(s.length);
  memcpy(ds.ptr+sizeof(uint32_t)+1, s.ptr, s.length);
  return ds;
}

string load_string(string ds, size_t *advance) {
  assert(ds.ptr[0] == STRING_TAG);
  assert(ds.length >= sizeof(uint32_t)+1);
  size_t len = ntohl(*(uint32_t*)(ds.ptr+1));
  assert(ds.length >= sizeof(uint32_t)+1 + len);
  string s = create_string(len);
  memcpy(s.ptr, ds.ptr+sizeof(uint32_t)+1, s.length);
  *advance = 1+sizeof(uint32_t)+s.length;
  return s;
}


/* btree serializer */

string null() {
  string null_s = create_string(1);
  null_s.ptr[0] = NULL_TAG;
  return null_s;
}

string dump_btree(btree_t t) {
  string strings[5], dump;
  strings[0] = create_string(1);
  strings[0].ptr[0] = BTREE_TAG;
  strings[1] = dump_uint32_t(t->id);
  strings[2] = dump_string(t->name);
  strings[3] = (t->left != NULL) ? dump_btree(t->left) : null();
  strings[4] = (t->right != NULL) ? dump_btree(t->right) : null();
  dump = concat_strings(strings, 5);
  int i;
  for (i=0; i<5; i++) destroy_string(strings[i]);
  return dump;
}

btree_t load_btree(string s, size_t *adv) {
  if (s.ptr[0] == NULL_TAG) {
    *adv = 1;
    return NULL;
  }
  assert(s.ptr[0] == BTREE_TAG);
  size_t advance;
  size_t idx = 1;

  btree_t t = (btree_t)malloc(sizeof(struct btree));
  t->id = load_uint32_t(tail_string(s, idx), &advance); idx += advance;
  t->name = load_string(tail_string(s, idx), &advance); idx += advance;
  t->left = load_btree(tail_string(s, idx), &advance); idx += advance;
  t->right = load_btree(tail_string(s, idx), &advance); idx += advance;

  *adv = idx;
  return t;
}



/* string utils */

string create_string(size_t length) {
  string s;
  s.ptr = (char*)malloc(length);
  assert(s.ptr != NULL);
  s.length = length;
  return s;
}

string cstr(const char cstr[]) {
  size_t len = strlen(cstr);
  string s = create_string(len);
  memcpy(s.ptr, cstr, len);
  return s;
}

void destroy_string(string s) {
  s.length = 0;
  if (s.ptr != NULL) free(s.ptr);
}

string tail_string(string s, size_t first_idx) {
  string s2;
  s2.length = s.length - first_idx;
  assert(s2.length >= 0);
  s2.ptr = s.ptr + first_idx;
  return s2;
}

string concat_strings(const string ss[], size_t len) {
  size_t i, total = 0;
  for (i=0; i<len; i++) total += ss[i].length;
  string s = create_string(total);
  total = 0;
  for (i=0; i<len; i++) {
    memcpy(s.ptr+total, ss[i].ptr, ss[i].length);
    total += ss[i].length;
  }
  return s;
}



/* btree utils */

btree_t create_btree(int id, string name) {
  btree_t b = (btree_t)malloc(sizeof(struct btree));
  b->id = id;
  b->name = name;
  b->left = NULL;
  b->right = NULL;
  return b;
}
void destroy_btree(btree_t b) {
  destroy_string(b->name);
  if (b->left != NULL) destroy_btree(b->left);
  if (b->right != NULL) destroy_btree(b->right);
  free(b);
}

int btree_sum_ids(btree_t b) {
  int bleft = (b->left == NULL) ? 0 : btree_sum_ids(b->left);
  int bright = (b->right == NULL) ? 0 : btree_sum_ids(b->right);
  return b->id + bleft + bright;
}


/* test program */

int main (int argc, char *argv[]) {
  btree_t b;
  b = create_btree(1, cstr("root"));
  b->left = create_btree(2, cstr("l"));
  b->left->left = create_btree(3, cstr("ll"));
  b->left->right = create_btree(4, cstr("lr"));
  printf("created tree\n");
  int bsum = btree_sum_ids(b);
  printf("original sum: %d\n", bsum);

  size_t advance;
  string ds = dump_btree(b);
  printf("dumped tree, %d bytes\n", ds.length);
  btree_t c = load_btree(ds, &advance);
  printf("loaded tree, advance %d bytes\n", advance);
  destroy_string(ds);

  int csum = btree_sum_ids(c);
  printf("loaded sum: %d\n", csum);

  destroy_btree(b);
  destroy_btree(c);
  
  return ((bsum == csum) ? 0 : 2);
}

What this exciting program prints out is even more exciting!


created tree
original sum: 10
dumped tree, 58 bytes
loaded tree, advance 58 bytes
loaded sum: 10

Now, a smarter person would've created a data definition language that generates code for the structs and serializers, something like:


serializable btree {
  uint id;
  string name;
  btree left;
  btree right;
};

An even smarter person would've made a serializer that converts a serialized string into the data structure in-place by making pointers relative and adding the serialized string's address to each pointer to make them alive. That way you could mmap a file as MAP_PRIVATE, breathe life to it and get fast and easy read access. And your data structure will be contiguous in memory too, what could be better?! (Well, the malloc-recreated data structure is also contiguous in memory, and it's certainly less painful...)

And now I go running off into the autumn afternoon sunshine, laughing. Ha ha ha ha ~~

2008-10-05

I/O in programming languages: piping to processes

This post is a part of a series where your intrepid host looks at I/O in different programming languages in search of understanding and interesting abstractions.

Part 1: open and read
Part 2: writing
Part 3: basic piping
Part 4: piping to processes -- you are here
Part 5: structs and serialization

Continuing from piping files, today we pipe to external processes. Piping data between programs is one of the basic tenets of Unix. A second basic tenet is "everything is a file." The two aren't really that different though, writing to a file is roughly equivalent to writing data to a filesystem program (if you thought of FUSE just now, have a cookie.)

Bash is built for process-to-process interaction, so it's very easy to write pipelines and do IO redirection. To demonstrate, here's a pipeline that builds a histogram of some HTTP return codes. It has five processes talking to each other (the fifth being bash itself):


cat /var/log/apache2/access.log | egrep -o ' (200|302|304) ' | sort | uniq -c

But how does that work on the low level -- how does Bash build such pipelines? Let's find out.

To write data to a program, we need to have a pipe such that we hold the write end and the other program holds the read end. The pipe syscall gives us the two ends of a pipe, which we can then pass to a forked process. Here's a pipe syscall example in x86 assembler (32-bit this time.)

First, just the syscall:


.equ PIPE, 42
.section .bss
.lcomm pipes, 8
# ...
# The pipe syscall.
# Writes the pipe file descriptors to the array pointed by ebx.
movl $PIPE, %eax
movl $pipes, %ebx 
int $0x80

Then a small demo program that creates a pipe, writes a string to its input, reads the string from its output, and prints out the read string (...hey, I had to come up with some kind of simple test):


# Define some globals
.equ EXIT, 1
.equ READ, 3
.equ WRITE, 4
.equ CLOSE, 6
.equ PIPE, 42

.equ STDOUT, 0

.equ PIPE_READ, 0
.equ PIPE_WRITE, 4

.section .data  # static memory segment
hello:
  .ascii "Hello\n"

.section .bss   # dynamic memory segment
.lcomm buf, 6   # buffer to read into
.lcomm pipes, 8 # array for the pipe file descriptors (4 bytes each)

.section .text  # program segment
.globl _start
_start:
    # The pipe syscall.
    # Writes the pipe file descriptors to the array pointed by ebx.
    movl $PIPE, %eax
    movl $pipes, %ebx
    int $0x80

    # Write 6 bytes from hello to the input pipe.
    movl $WRITE, %eax
    movl $pipes, %ecx
    movl PIPE_WRITE(%ecx), %ebx
    movl $hello, %ecx
    movl $6, %edx
    int $0x80

    # Close the input pipe. The fd is already in ebx.
    movl $CLOSE, %eax
    int $0x80

    # Read 6 bytes from the output pipe into buf. Buf should equal hello after this.
    movl $READ, %eax
    movl $pipes, %ecx
    movl PIPE_READ(%ecx), %ebx
    movl $buf, %ecx
    movl $6, %edx
    int $0x80

    # Close the output pipe.
    movl $CLOSE, %eax
    int $0x80

    # Write the contents of buf to stdout. Should print "Hello" with a linebreak.
    movl $WRITE, %eax
    movl $STDOUT, %ebx
    movl $buf, %ecx
    movl $6, %edx
    int $0x80

    # And we're done, exit with a zero.
    movl $EXIT, %eax
    movl $0, %ebx
    int $0x80

Now, to talk to an external process, we first need two pipes. Then we need to fork a new process and reopen its stdin and stdout to point to our pipes. Finally we call exec and use our ends of the pipes to talk to the new process. Here's a C example that does the equivalent of echo 'Hello there!' | tr '[:lower:]' '[:upper:]', it even has error checking in parts:


#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

/* forks the command and returns pipes connected to the command's stdin and stdout */

pid_t popen2(const char *command, char *const argv[], FILE **readable, FILE **writable)
{
  int fds_writable[2];
  int fds_readable[2];
  pid_t pid;

  // make two pipes, writable is connected to child stdin, readable to child stdout
  if (pipe(fds_writable) == -1) exit(1);
  if (pipe(fds_readable) == -1) exit(1);

  pid = fork();
  if (pid == -1) exit(2);

  if (pid == 0) { // child process
    // close the pipe ends used by the parent
    if (close(fds_readable[0]) == -1) exit(3);
    if (close(fds_writable[1]) == -1) exit(3);

    if (dup2(fds_readable[1], STDOUT_FILENO) == -1) exit(4); // connect to stdout
    if (dup2(fds_writable[0], STDIN_FILENO) == -1) exit(4); // connect to stdin
    execv(command, argv); // replace the current process with the command
    exit(1); // shouldn't be reached
  }

  // close the pipe ends used by the child
  if (close(fds_readable[1]) == -1) exit(3);
  if (close(fds_writable[0]) == -1) exit(3);

  *readable = fdopen(fds_readable[0], "r");
  *writable = fdopen(fds_writable[1], "w");
  return pid;
}

int main(int argc, char *argv[])
{
  FILE *r, *w;
  char command[] = "/usr/bin/tr",
       message[] = "Hello there!";
  char *args[] = { "tr", "[:lower:]", "[:upper:]", NULL };
  char *buf;

  int len = strlen(message);
  pid_t pid = popen2(command, args, &r, &w);

  buf = (char*)malloc(20);

  fwrite(message, 1, len, w);
  fclose(w);

  fgets(buf, 20, r);
  fclose(r);

  wait(pid);
  puts(buf);

  free(buf);

  return 0;
}

Whew, that was a pain. Which is why most languages built for Unix scripting have an equivalent of popen2 in their standard library. For example, Ruby has IO.popen, simplifying our task significantly:


IO.popen("tr '[:lower:]' '[:upper:]'", "r+"){|pipe|
  pipe.write "Hello there!"
  pipe.close_write
  puts pipe.read
}

As even that is often cumbersome, Ruby and Perl have a backtick operator that does something like popen("/bin/sh -c " + cmd, "r"){|p| p.read }. The following works fine in both Ruby and Perl:


print(`echo 'Hello there!' | tr '[:lower:]' '[:upper:]'`)

In the OCaml prelude.ml library, I have utilities to do similar things:


let () =
  withCmd ["tr"; "[:lower:]"; "[:upper:]"] (fun ic oc ->
    output_string oc "Hello there!";
    close_out oc;
    puts (readAll ic))

There's even a backtick equivalent:


let () = puts (readRawCmd "echo 'Hello there!' | tr '[:lower:]' '[:upper:]'")

Here is the implementation of withCmd and friends, it's got a base abstraction with a convenience garden built on top:


let withRawCmd cmd f =
  let ic,oc = Unix.open_process cmd in
  finally (fun _ -> close_out_noerr oc; close_in_noerr ic)
          (f ic) oc

let withRawCmdStdin args f = withRawCmd args (fun ic oc -> close_in_noerr ic; f oc)
let withRawCmdStdout args f = withRawCmd args (fun ic oc -> close_out_noerr oc; f ic)

let withCmd args = withRawCmd (escape_cmd args)
let withCmdStdin args = withRawCmdStdin (escape_cmd args)
let withCmdStdout args = withRawCmdStdout (escape_cmd args)

let readCmd args = withCmdStdout args readAll
let readRawCmd args = withRawCmdStdout args readAll

Let's build a multi-process pipeline next. What we want is a way to send all data from one pipe to another. We could spawn a thread for each connection, reading from one process and writing to another, but that's a bit of a pain to implement. Routing the data through the parent process also causes extra copying, which we would like to avoid. Let's do the piping like Bash and start the processes in sequence, setting a processes' stdin be the stdout of the previous process.

First we build a pipeline from a list of commands. The pipeline builder is a fold over commands, accumulating with a pid list and the previous command's stdout. The body of the fold is a connector that connects the current command's stdin to the previous command's stdout. The connector takes a command and a read-pipe and returns a pid and a read-pipe. The pipeline builder returns the list of pids in the pipeline and a read-pipe hooked to the end of the pipeline.

After executing the pipeline, we iterate over the pids and wait on them to get rid of the [defunct] processes that would otherwise be left hanging around. Sounds easy enough, let's get cracking!

I used OCaml this time, so there's some Unix vs. channels -junk there.

First some generic fork + dup2 utils from unix.ml:


open Prelude

(* adapted from unix.ml *)
let try_set_close_on_exec fd =
  try Unix.set_close_on_exec fd; true with Invalid_argument _ -> false

let open_proc cmd input output toclose =
  let cloexec = List.for_all try_set_close_on_exec toclose in
  match Unix.fork () with
     0 -> if input <> Unix.stdin then begin Unix.dup2 input Unix.stdin; Unix.close input end;
          if output <> Unix.stdout then begin Unix.dup2 output Unix.stdout; Unix.close output end;
          if not cloexec then List.iter Unix.close toclose;
          begin try Unix.execvp (head cmd) (Array.of_list cmd)
          with _ -> exit 127
          end
  | id -> id

Then a function to create a pipe segment:


(* Fork cmd, setting cmd's stdin to read_fd. Return the cmd's stdout. *)
let popenWithStdin ?(toclose=[]) read_fd cmd =
  let (in_read, in_write) = Unix.pipe () in
  let pid = open_proc cmd read_fd in_write (in_read::toclose) in
  Unix.close in_write;
  Unix.close read_fd;
  (pid, in_read)

And the pipeline builder and a withPipeline function for creating and running pipelines:


let rec buildPipeline ?toclose ifd =
  foldl (fun (pids, ifd) cmd ->
          let pid,ifd = popenWithStdin ?toclose ifd cmd in
          (pid::pids, ifd))
        ([], ifd)

let withPipeline cmds f =
  let ifd, ofd = Unix.pipe () in
  (* if we don't close ofd in every process, all will hang *)
  let pids, ifd = buildPipeline ~toclose:[ofd] ifd cmds in 
  let ic = Unix.in_channel_of_descr ifd
  and oc = Unix.out_channel_of_descr ofd in
  finally (fun _ ->  close_out_noerr oc;
                     close_in_noerr ic;
                     iter (fun pid -> ignore (Unix.waitpid [] pid)) pids)
          (f ic) oc

And a small main program to test it out:


let () =
  let pipeline = [
    ["tr"; "[:upper:]"; "[:lower:]"];
    ["grep"; "-o"; "ello"];
  ] in
  withPipeline pipeline
  (fun ic oc ->
    output_string oc "hElLo JeLlo";
    close_out oc;
    puts (readAll ic)
  )

That was quite educational to write, I spent hours debugging a bug that was caused by not closing the withPipeline ofd in the subprocesses. You really should close all unneeded fds before exec.

Speaking of which, Python's subprocess module is nice. It takes care of the hard parts. It even closes file descriptors. All we have to do is snap the processes together:


from subprocess import Popen, PIPE

p1 = Popen(["tr", "[:upper:]", "[:lower:]"], stdin=PIPE, stdout=PIPE, close_fds=True)
p2 = Popen(["grep", "-o", "ello"], stdin=p1.stdout, stdout=PIPE, close_fds=True)
p1.stdin.write("hElLo JeLlo")
p1.stdin.close()
output = p2.stdout.read()
p2.stdout.close()
print output

And that's it for today. The example programs got a bit too long for my liking, I just hope that they weren't too onerous to read. I thought of doing a Haskell example as well, but writing one pipe-passing popen fold is quite enough for one day, thank you very much. Structured I/O next (as in, reading something else than arrays of bytes.)

2008-10-03

I/O in programming languages: basic piping


import System.IO
import Char

main = do
  input <- hGetContents stdin
  let output = map (Char.toUpper) input
  hPutStr stdout output

Try running that and writing a couple lines of text to the program. It reads a character from stdin, uppercases it, and writes the uppercased character to stdout. Though, as stdin and stdout are line-buffered, it seems like it only does a line at a time.

The interesting part about the above Haskell program is that it runs in constant space, thanks to the lazy lists. With an eager list, you would need an explicit processing loop, like in this OCaml version:


let optE f x = try Some (f x) with _ -> None
let optEx ex f x = try Some (f x) with e when e = ex -> None
let optEOF = optEx End_of_file

let rec process f ic oc =
  match optEOF input_char ic with
    | None -> ()
    | Some c ->
      begin 
        match optE (fun c -> output_char oc c; flush oc) (f c) with
          | None -> ()
          | Some () -> process f ic oc
      end

let () = process Char.uppercase stdin stdout

Simple, no? Not to be outdone, Haskell has a function named interact that pipes stdin to stdout:


import Char
main = interact (map Char.toUpper)

I suppose the true winner here is Bash, as it is built around piping:


tr '[:lower:]' '[:upper:]'

Writing filename-way file -> file -maps using the above methods is quite trivial. Starting with Bash:


tr '[:lower:]' '[:upper:]' <src >dst

The Haskell version isn't quite as succint but it's no beast either:


import Char

main = mapFiles (map Char.toUpper) "src" "dst"

mapFiles f src dst = writeFile dst =<< mapFile f src
mapFile f filename = return . f =<< readFile filename

Using the above definition of process and withFiles from prelude.ml, the OCaml translation is straightforward:


let process_files f = withFiles (process f)

let () = process_files Char.uppercase "src" "dst"

We can apply the same strategy to a C version, first define a function to transform streams, then wrap it inside fopen-fclose pairs.


#include <stdio.h>

void process(char (*f)(char), FILE *src, FILE *dst)
{
  while(!feof(src))
    fputc(f(fgetc(src)), dst);
}

void process_files(char (*f)(char), const char *src, const char *dst)
{
  FILE *sf = fopen(src, "r");
    FILE *df = fopen(dst, "w");
      process(f, sf, df);
    fclose(df);
  fclose(sf);
}

char uppercase(char c) {
  char upper = 'A' - 'a';
  return (c >= 'a' && c <= 'z') ? (c + upper) : c;
}

int main(int argc, char *argv[])
{
  process_files(uppercase, "src", "dst");
  return 0;
}

So, piping characters, no big problem. How about something a bit more structured, like lines? Well, in Haskell it isn't a problem:


main = mapFiles (unlines . map reverse . lines) "src" "dst"

And Bash exports all lexing to external programs:


rev <src >dst

Then to the problematic cases. The process that we defined above operates on a per-character basis. What we want is something that operates on a per-line basis. We could write a process_lines that uses input_line instead of input_char, but that's a bit repetitive.

Taking a step back, we could also add in a tokenizer that reads a value from the input channel and calls the processing function with it, writing the output to the output channel. Something like this:


open Prelude

let rec untilEOF f x =
  match optEOF f x with
    | None -> ()
    | Some () -> untilEOF f x

let tokenized_pipe read write f ic oc =
  untilEOF (fun ic -> write oc (f (read ic))) ic

let output_line oc s = output_string oc s; output_char oc '\n'

let process_lines = tokenized_pipe input_line output_line

let () = withFiles (process_lines srev) "src" "dst"

Or extend the Stream module to remap a character stream into a line stream:


open Prelude

type 'a accum = Acc of 'a | Done of 'a

let optMap f x = match x with None -> None | Some y -> Some (f y)

let string_of_char c = String.make 1 c
let (^.) s c = s ^ string_of_char c

module Stream =
struct
  include Stream

  let nextOpt st =
    match peek st with
      | None -> None
      | x -> (junk st; x)

  let rec foldlBreak f init st =
    match nextOpt st with
      | None -> init
      | Some x ->
        begin match f init x with
          | Acc a -> foldlBreak f a st
          | Done b -> b
        end

  let rec foldl f init st = foldlBreak (fun s i -> Acc (f s i)) init st

  let map f st = from (fun _ -> optMap f (nextOpt st))

  let remap f init st =
    from (fun _ ->
      match peek st with
        | None -> None
        | Some x -> Some (foldlBreak f init st))

  let join s st = from (fun i -> if i mod 2 = 0 then nextOpt st else Some s)

end

let lines =
  Stream.remap (fun s c -> if c = '\n' then Done s else Acc (s ^. c)) ""

let () =
  withFiles (fun ic oc ->
    let s = Stream.of_channel ic in
    let rev_lines = Stream.map srev (lines s) in
    Stream.iter (output_string oc) (Stream.join "\n" rev_lines)
  ) "src" "dst"

Here's a Haskell version of remap:


data Accum a = Acc a | Done (a, a)

remap t acc [] = [acc]
remap t acc (x:xs) =
  case t acc x of
    Acc a -> remap t a xs
    Done (d, a) -> (d : remap t a xs)

line_lexer acc c =
  case c of
    '\n' -> Done (acc, "")
    otherwise -> Acc (acc ++ [c])

my_lines s = remap line_lexer "" s

main = do
  c <- readFile "src"
  writeFile "dst" (concatMap (\l -> reverse l ++ "\n") $ my_lines c)

And a less crack-rock version of my_lines:


my_lines "" = []
my_lines s = 
  (line : my_lines (chomp '\n' rest))
  where
    (line, rest) = break ('\n' ==) s

chomp v (x:xs) | v == x = xs
chomp _ xs = xs

...sorry, got a bit derailed there. You can actually do a quite OCaml-like tokenized process in C as well, though it is a bit more involved due to manual memory management and "arrays are pointers to memory."


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* first some utils, feel free to skip ahead */

typedef struct {
  size_t length;
  char *ptr;
} string;

string create_string(size_t length) {
  string s;
  s.length = length;
  s.ptr = (char*)malloc(length);
  if (s.ptr == NULL) exit(1);
  return s;
}

string reverse_string(string s)
{
  int i, nidx, hl = s.length / 2;
  char tmp;
  for (i=0; i < hl; i++) {
    tmp = s.ptr[i];
    nidx = s.length-1-i;
    s.ptr[i] = s.ptr[nidx];
    s.ptr[nidx] = tmp;
  }
  return s;
}

void destroy_string(string s) {
  s.length = 0;
  if (s.ptr != NULL) free(s.ptr);
  s.ptr = NULL;
}


/* the gist of the code */

void tokenize_pipe
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string),
  FILE *src, FILE *dst
)
{
  while (!feof(src))
    if (!write(f(read(src)), dst)) break;
}

void tokenize_pipe_files
(
  string (*read)(FILE*), int (*write)(string, FILE*), string (*f)(string), 
  const char *src, const char *dst
)
{
  FILE *sf = fopen(src, "r");
    FILE *df = fopen(dst, "w");
      tokenize_pipe(read, write, f, sf, df);
    fclose(df);
  fclose(sf);
}

/* reads a string from f */
string alloc_and_read(FILE *f) 
{
  string s = create_string(256);
  if (fgets(s.ptr, s.length, f) == NULL) 
    s.length = 0;
  else
    s.length = strlen(s.ptr);
  return s;
}

/* writes s to f and destroys s */
int write_and_free(string s, FILE *f)
{
  fwrite(s.ptr, 1, s.length, f);
  destroy_string(s);
  return 1;
}

int main(int argc, char *argv[])
{
  tokenize_pipe_files(alloc_and_read, write_and_free, reverse_string, "src", "dst");
  return 0;
}

I think I'll leave it there for basic piping. The next subjects will probably be: talking with external processes, structured I/O and mmap, event loop IO and sockets.

Drop examples in the comments if you want to add better ways to do things / more languages here. Pointers to interesting ways to do the above things are also most welcome.

Bis dann, have fun!

2008-10-01

I/O in programming languages: writing

This post is a part of a series where your intrepid host looks at I/O in different programming languages in search of understanding and interesting abstractions.

Part 1: open and read
Part 2: writing -- you are here
Part 3: basic piping
Part 4: piping to processes
Part 5: structs and serialization

Continuing from the previous post -- where we looked at opening, closing and reading files in languages ranging from GNU Assembly to Haskell -- today we'll do some writing. The languages of the day are ASM, Bash, C, Clean, Factor, Haskell, OCaml, Ruby and SML.

But before that, a small Perl example for line-wise reading, courtesy of Philip Taylor:


open $fd, "my_file";
while (<$fd>) {
  chomp;
  print scalar reverse;
  print $/;
}

It's quite similar to the Pythonic for line in fd:, the <$fd> reading the next line from $fd. Though, in a Perlish twist, the loop uses the $_ implicit argument to call the string munging functions, which makes the code look a bit like a stack language. See this Factor version for example:


USING: io io.encodings.utf8 io.files sequences ;

"my_file" utf8
file-lines [
  reverse
  print
] each

Both the Perl loop and the Factor loop use an implicit variable ($_ in Perl, the stack in Factor) to determine what value a procedure call should take as its argument.

But back to our subject for the day. The write syscall takes a file descriptor, a buffer and the buffer length, as demonstrated by this ASM version of "Hello, world!":


.equ EXIT, 1
.equ WRITE, 4
.equ STDOUT, 1

.section .data
    hello:
        .ascii "Hello, world!\n"
    .equ hello_len, 14

.section .text
.globl _start
_start:
    movq     $WRITE, %rax
    movq    $STDOUT, %rbx
    movq     $hello, %rcx
    movq $hello_len, %rdx
    int $0x80 

    movq $EXIT, %rax
    movq    $0, %rbx
    int $0x80

The C version is more convenient to write:


#include <stdio.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
  char hello[] = "Hello, world!\n";
  write(STDOUT_FILENO, hello, strlen(hello));
  return 0;
}

Moving up the abstraction ladder, OCaml does away with strlen and return 0:


let () = output_string stdout "Hello, world!\n"

The SML version is very similar, but uses a tuple instead of currying:


val () = TextIO.output (TextIO.stdOut, "Hello, world!\n")

Haskell uses PutStr where the ML derivatives use output. What's wrong with "write", anyhow?


import System.IO
main = hPutStr stdout "Hello, world!\n"

Ruby has no equivalent of main, top-level expressions are executed in the order they are found:


STDOUT.write "Hello, world\n"

Bash uses > to pipe output to a file, let's use that in this homespun version of echo (that uses echo for good measure...):


echo -n -e 'Hello, world!\n' > /dev/stdout

Clean uses uniqueness types to preserve referential transparency. In the following program, we get the standard IO pipes from the world and then have to use the return value of IO actions to do the next IO action:


module hello
import StdEnv

Start :: *World -> *World
Start world
  # (console, world) = stdio world
  # console1         = fwrites "Hello, world!\n" console
  # (ok,world)       = fclose console1 world
  | not ok           = abort "Cannot close console"
  | otherwise        = world

If we bungle the IO passing, the result is a broken program. I'll split the write into two parts to demonstrate. First a working version:


  # console1          = fwrites "Hello,"    console
  # console2          = fwrites " world!\n" console1
  # (ok,world)        = fclose              console2 world

If, instead of console1, I continue to use the original console, I end up with a screwed up program. Behold: the following snippet prints out only " world!\n" (it doesn't cause a compilation error though):


  # console1         = fwrites "Hello,"    console
  # console2         = fwrites " world!\n" console
  # (ok,world)       = fclose              console2 world

Moving on, the most common filename-level write operations are writing a string into the named file and appending a string to the named file. You sometimes need to do prepend as well, but that tends to make library writers squirm as most file systems only do fast truncates and appends.

Prelude.ml has utilities, as you might expect:


open Prelude
let () = 
  let fn = "my_file" in
  writeFile fn "hello there";
  appendFile fn "!";
  prependFile fn "Why, ";
  puts (readFile fn)

Ruby leverages the open mode flag, and I'm doing a simple memory-hungry prepend:


fn = "my_file"
File.open(fn, "w"){|f| f.write "hello there" }
File.open(fn, "a"){|f| f.write "!" }
File.open(fn, "r+"){|f|
  d = f.read
  f.truncate 0
  f.write "Why, "
  f.write d
}
puts File.read(fn)

A C version that uses fopen mode strings is pretty much the same as the Ruby version, except a lot more verbose (I don't even have error handling here):


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int write_file(const char *fn, const char *buf, size_t len)
{
  FILE *fd = fopen(fn, "w");
  int rv = fwrite(buf, 1, len, fd);
  fclose(fd);
  return rv;
}

int append_file(const char *fn, const char *buf, size_t len)
{
  FILE *fd = fopen(fn, "a");
  int rv = fwrite(buf, 1, len, fd);
  fclose(fd);
  return rv;
}

int read_file(const char *fn, char **buf, size_t *len)
{
  struct stat st;
  FILE *fd = fopen(fn, "r");
  fstat(fileno(fd), &st);
  *buf = (char*)malloc(st.st_size);
  int rv = fread(*buf, 1, st.st_size, fd);
  fclose(fd);
  *len = st.st_size;
  return rv;
}

int prepend_file(const char *fn, const char *buf, size_t len)
{
  char* tmp;
  size_t tlen;
  read_file(fn, &tmp, &tlen);
  write_file(fn, buf, len);
  int rv = append_file(fn, tmp, tlen);
  free(tmp);
  return rv;
}

int main ( int argc, char *argv[] )
{
  char fn[] = "my_file";
  char a[] = "hello there",
       b[] = "!",
       c[] = "Why, ";
  char *buf;
  size_t len;

  write_file(fn, a, strlen(a));
  append_file(fn, b, strlen(b));
  prepend_file(fn, c, strlen(c));
  read_file(fn, &buf, &len);

  fwrite(buf, 1, len, stdout);
  fwrite("\n", 1, 1, stdout);

  free(buf);
  return 0;
}

In Bash, writing and appending are easy, prepending requires a temp file:


echo -n 'hello there' > my_file;
echo -n '!' >> my_file;

echo -n 'Why, ' | cat - my_file > tmp &&
mv tmp my_file;

cat my_file && echo;

Haskell has writeFile and appendFile, but prependFile causes some extra work due to the lazy readFile; the naive buf <- readFile fn; writeFile fn (s ++ buf) fails with "openFile: resource busy (file is locked)":


import System.IO

main = do
  let fn = "my_file"
  writeFile fn "hello there"
  appendFile fn "!"
  prependFile fn "Why, "
  putStrLn =<< readFile fn

prependFile fn s = do
  buf <- readFileStrict fn
  writeFile fn (s ++ buf)

readFileStrict fn = do
  h <- openFile fn ReadMode
  buf <- hGetContents h
  let !b = buf
  hClose h
  return b

SML doesn't have the convenience functions, so we need to implement them:


fun bracket v finally f =
  let
    val rv = f v
    val () = finally v
  in
    rv
  end handle x => let
    val () = finally v handle _ => ()
  in
    raise x
  end

fun withTextIn file f = bracket (TextIO.openIn file) TextIO.closeIn f
fun withTextOut file f = bracket (TextIO.openOut file) TextIO.closeOut f
fun withTextAppend file f = bracket (TextIO.openAppend file) TextIO.closeOut f

fun readFile file = withTextIn file TextIO.inputAll
fun writeFile file s = withTextOut file (fn f => TextIO.output (f, s))
fun appendFile file s = withTextAppend file (fn f => TextIO.output (f, s))
fun prependFile file s =
  let
    val buf = readFile file
    val () = writeFile file s
    val () = appendFile file buf
  in
    ()
  end

val () =
  let
    val file = "my_file"
    val () = writeFile file "hello there"
    val () = appendFile file "!"
    val () = prependFile file "Why, "
  in
    TextIO.print (readFile file ^ "\n")
  end

Which is pretty much how prelude.ml works as well, except that, in prependFile, if the file is larger than 32 megabytes, prelude.ml uses the tempfile strategy (like the Bash version.)

Factor has set-file-contents for writing a file, for appending we need to use with-file-appender (compare with the withTextAppend above.) By the way, Factor's REPL workspace is really nice, it has incremental search for words and links them to a documentation browser. It also shows the current data stack, has separate panes for input and output and a built-in profiler. Here's the Factor version:


USING: io io.files io.encodings.utf8 ;

"hello there" "my_file" utf8 set-file-contents

"my_file" utf8 [ "!" write ] with-file-appender

"my_file" utf8 file-contents
"Why, " "my_file" utf8 set-file-contents
"my_file" utf8 [ write ] with-file-appender

"my_file" utf8 file-contents "\n" append write

I tried writing a Clean versions of readFile, writeFile, appendFile and prependFile, but my program segfaults when I run it. C'est la vie~

Here's the segfaulting version anyhow, maybe lazyweb knows what the problem with it is:


module why

import StdEnv

DoFile f mode filename files
  # (ok,file1,files1) = fopen filename mode files
  | not ok = abort ("Failed to open '" +++ filename +++ "'")
  # (res,file2) = f file1
    (closeok,files2) = fclose file2 files1
  | not closeok = abort ("Failed to close '" +++ filename +++ "'")
  | otherwise = (res,files2)

WriteFile_ mode filename str files =
  snd (DoFile (\f = (False, fwrites str f)) mode filename files)
WriteFile = WriteFile_ FWriteText
AppendFile = WriteFile_ FAppendText

flength f
  # (ok, f1) = fseek f 0 FSeekEnd
  | not ok = abort "seek failed"
  # (pos, f2) = fposition f1
  # (ok, f3) = fseek f2 0 FSeekSet
  | not ok = abort "seek failed"
  | otherwise = (pos, f3)

ReadAll f
  # (len, f1) = flength f
  = freads f1 len

ReadFile fn files = DoFile ReadAll FReadText fn files

PrependFile fn str files
  # (old, files1) = ReadFile fn files
  # files2 = WriteFile fn (str +++ old) files1
  = files2

Start world
  # (console,world1) = stdio world
  # world2 = WriteFile fn "hello there" world1
  # world3 = AppendFile fn "!" world2
  # world4 = PrependFile fn "Why, " world3
  # (str,world5) = ReadFile fn world4
  # console1 = fwrites (str +++ "\n") console
  # (ok,world6) = fclose console1 world5
  | not ok = abort "Cannot close console."
  | otherwise = world6
where
  fn = "my_file"

And that's it for simple writing. There wasn't all that much variation, the Bash version being perhaps the most novel. C and SML used imperative writes [OCaml stdlib too], Ruby and prelude.ml wrapped them in higher-order functions, Clean threaded state to preserve purity, and Haskell used monads for the same.

Piping in the next installment. Should give stream-oriented IO and lazy lists a workout.

fhtr