art with code

2008-09-30

I/O in programming languages: open and read

This post is a part of a series where your intrepid host looks at I/O in different programming languages in search of understanding and interesting abstractions.

Part 1: open and read -- you are here
Part 2: writing
Part 3: basic piping
Part 4: piping to processes
Part 5: structs and serialization

Different programming languages have different takes on handling I/O. The simplest way is to wrap kernel syscalls, like this open() call (in assembler):

# define some globals
.equ SYS_OPEN, 5
.equ LINUX_SYSCALL, 0x80
.section .data
filename:
.ascii "my_file\0"

# and a main that calls SYS_OPEN
.globl _start
_start:
movq $SYS_OPEN, %rax
movq $filename, %rbx
movq $0, %rcx # open as read-only
movq $0666, %rdx # permissions for the file
int $LINUX_SYSCALL


C has syntax sugar for doing function calls:

#include <stdio.h>

int main(int argc, char *argv[])
{
int fd;
fd = open("my_file", O_RDONLY);
}


In Python, yet more boilerplate is eliminated:

fd = open("my_file")


To read from a file, you need a file descriptor and a buffer to read into. The assembler goes:

.equ STDIN, 0
.equ SYS_READ, 3
.equ BUFSIZE, 4096
.section .bss
.lcomm buffer, BUFSIZE
.globl _start
_start:
movq $SYS_READ, %rax
movq $STDIN, %rbx # file descriptor
movq $buffer, %rcx # buffer
movq $BUFSIZE, %rdx # max bytes to read
int $0x80


In C, we call read():

const int bufsize = 4096;

int main(int argc, char *argv[])
{
char buffer[bufsize];
read(STDIN_FILENO, buffer, bufsize);
}


In Python, you read by calling the read-method of the file object. The buffer is allocated automatically:

fd = open("my_file")
buffer = fd.read(4096)


In Ruby, the procedure is much the same, except that Ruby can do the file open as a higher-order function, closing the file when the continuation exits:

buffer = File.open("my_file"){|f| f.read(4096) }


Liam Clarke notified me that Python 2.6 has a with-keyword for doing the equivalent:

with open("my_file") as f:
buffer = f.read(4096)

You can use with in Python 2.5 by doing from __future__ import with_statement.

The function passing technique for dealing with files probably comes from Common Lisp's WITH-OPEN-FILE.

In ASM and C, you need to close files explicitly. If you use the with-open-file-style in Python and Ruby, it takes care of closing the file. Both Python and Ruby close file descriptors on GC as well (along with Perl and probably most other garbage collected languages), so there's no real need to close a file if you know that you won't run into problems. Possible problems here being two: 1) running into the process open file limit, and 2) unflushed writes.

Closing a file in ASM:

.equ SYS_CLOSE, 6
movq $SYS_CLOSE, %rax
movq $my_fd, %rbx
int $0x80


The C version is shorter:

close(fd);


Python is object-oriented here:

fd.close()


As is Ruby:

fd.close


OCaml's standard library can't make up its mind on how to deal with files, it has IO channels in Pervasives and C-style file descriptors in the Unix library. The Unix way to read from a file is:

let buffer_size = 4096

let read_data =
let fd = Unix.openfile "my_file" [Unix.O_RDONLY] 0o666 in
let buffer = String.create buffer_size in
let read_count = Unix.read fd buffer 0 4096 in
Unix.close fd;
String.sub buffer 0 read_count

Or with the quirky input channels:

let data =
let rec my_read ic buf count total =
if count = total then String.sub buf 0 count
else
let rb = input ic buf count (total-count) in
match rb with
| 0 -> String.sub buf 0 count
| n -> my_read ic buf (n+count) total in
let ic = open_in "my_file" in
let buffer = String.create buffer_size in
let data = my_read ic buffer 0 buffer_size in
close_in ic;
data

Prelude.ml uses a WITH-OPEN-FILE variant:

let data = withFile "my_file" (read 4096)


SML's IMPERATIVE_IO interface makes things straightforward:

val data =
let
val input = BinIO.openIn "my_file"
val data = BinIO.inputN (input, 4096)
val () = BinIO.closeIn input
in
data
end

By replacing BinIO by TextIO, you can read in Strings and Chars instead of Word8Vectors and Word8s. You could also write a with-open-file without much trouble, see Part 2 for an example.

SML also has a STREAM_IO interface for doing functional reading from input streams, sort of like a lazy list. Conceptually it lies somewhere inside the IMPERATIVE_IO - Clean - Haskell -triangle. I'll try to add an example to Part 3 when I manage.

Haskell actually does treat file contents as lazy lists, and wraps them inside the IO monad to isolate file access from side-effect-free code:

import System.IO

main = do
fd <- openFile "my_file" ReadMode
contents <- hGetContents fd
let buffer = take 4096 contents


Manually closing files in Haskell is error-prone because hClose is strict and reads are lazy, so it's possible to close a file before a preceding read is evaluated. For example, the following code prints out nothing:

import System.IO

main = do
fd <- openFile "my_file" ReadMode
contents <- hGetContents fd
let buffer = take 4096 contents -- read up to 4096 bytes into buffer when buffer is first used
hClose fd -- close fd
putStrLn buffer -- oops

The buffer is empty because fd is closed before buffer is needed by putStrLn, hence nothing has been yet read into buffer, resulting in putStrLn trying to read a closed file. We can fix this either by strictness annotations, a WITH-OPEN-FILE or by using readFile.

Strictness annotations:

let !buffer = take 4096 contents -- forces eager evaluation of buffer

WITH-OPEN-FILE:

import Control.Exception
main =
bracket (openFile "my_file" ReadMode) hClose
(\fd -> do contents <- hGetContents fd
let buffer = take 4096 contents
putStrLn buffer)

readFile:

main = do
contents <- readFile "my_file"
let buffer = take 4096 contents
putStrLn buffer


Using readFile as a segue, let's see how to read the whole file into a memory buffer in the other languages. A simple C version would stat the file to get its size, allocate a buffer and call read a single time. Let's use the FILE functions this time for the heck of it:

int main(int argc, char *argv[])
{
stat st;
off_t size;
char *contents;
FILE *fd;
size_t read_sz;

if (0 != stat("my_file", &st))
exit(1);

contents = (char*)malloc(st.st_size);
if (NULL == contents) exit(3);

fd = fopen("my_file", "r");
if (NULL == fd) exit(2);

/* read st.st_size 1-byte elements into buf from fd */
read_sz = fread(contents, 1, st.st_size, fd);
fclose(fd);

if (read_sz != st.st_size)
exit(4);

/* do stuff */
}


Python gets rid of a good deal of the code:

fd = open("my_file")
contents = fd.read()
close(fd)


Ruby has a convenience function for this:

contents = File.read("my_file")


With the OCaml Unix library, it's close to the C version:

let () =
let sz = (Unix.stat "my_file").Unix.st_size in
let contents = String.create sz in
let fd = Unix.openfile "my_file" [O_RDONLY] 0o666 in
let rb = Unix.read fd contents 0 sz in
if rb <> sz then failwith "read wrong amount of bytes";
Unix.close fd
(* do stuff *)


The OCaml input channels version is like the Unix version but reads into a Buffer and calls Buffer.contents on End_of_file. In principle, the Unix version and the C version should do it this way as well, the C version calling feof to see if it is at the end of the file.

let buf_sz = 256

let () =
let rec my_read_all ic buf res =
let br = input ic buf 0 buf_sz in
if br = 0 then Buffer.contents res
else
let () = Buffer.add_substring res buf 0 br in
my_read_all ic buf res in
let ic = open_in "my_file" in
let buf = String.create buf_sz in
let res = Buffer.create (in_channel_length ic) in
let contents = my_read_all ic buf res in
(* do stuff *)


Prelude.ml has a convenience function:

let contents = readFile "my_file"


Which is an eagerly evaluated version of Haskell's readFile:

contents <- readFile "my_file"


SML has a relatively sane IO library, so there are things like inputAll:

val contents = let
val input = TextIO.openIn "my_file"
val data = TextIO.inputAll input
val () = TextIO.closeIn input
in data end


The ASM version works much like the OCaml input channel version; reading into a buffer, calling brk to increase the data segment size, and copying the buffer to the newly allocated space.

.equ EXIT, 1
.equ READ, 3
.equ WRITE, 4
.equ OPEN, 5
.equ BRK, 45

.equ STDOUT, 1

.equ BUF_SZ, 4096

.section .data
filename:
.ascii "my_file\0"

.section .bss
.lcomm buffer, BUF_SZ

.section .text
.globl _start
_start:
movq $buffer + BUF_SZ, %r12 # save bss end address to r12
movq %r12, %r14 # and a copy to r14

movq $OPEN, %rax
movq $filename, %rbx
movq $0, %rcx
movq $0666, %rdx
int $0x80

movq %rax, %r13 # save fd to r13

read_loop:
movq $READ, %rax
movq %r13, %rbx
movq $buffer, %rcx
movq $BUF_SZ, %rdx
int $0x80

cmpq $0, %rax # end of file
je read_loop_end

movq %r12, %rbp # copy old end to rbp
addq %rax, %r12 # add read sz to r12

movq $BRK, %rax # allocate to new end
movq %r12, %rbx
int $0x80

cmpq %rax, %r12 # if rax != r12, alloc failed
jne error_exit

# copy buf to newly allocated space
# rcx has the buffer address
# rbp has the target address
copy_loop:
cmpq %r12, %rbp
je end_copy
movb (%rcx), %al # load byte from rcx to al
movb %al, (%rbp) # store byte from al to rbp
incq %rcx
incq %rbp
jmp copy_loop

end_copy:

jmp read_loop

read_loop_end: # ok exit
movq $WRITE, %rax # print the file to stdout
movq $STDOUT, %rbx
movq %r14, %rcx
movq %r12, %rdx
subq %r14, %rdx # rdx = r12 - r14 = length of the read data
int $0x80

movq $EXIT, %rax
movq $0, %rbx
int $0x80

error_exit: # error exit
movq $EXIT, %rax
movq $1, %rbx
int $0x80


All of the high-level languages (read: not ASM) we've looked at thus far have convenience functions for processing files a line at a time. Python uses a syntax-level iterator:

f = open("my_file")
for line in f:
print line[-2::-1] # reverses string, strips out linefeed
f.close()

Ruby uses a higher-order function:

File.open("my_file"){|f|
f.each_line{|line|
puts line.chomp.reverse
}
}

Haskell doesn't need to use an iterator to save memory since it does lazy IO:

import Char
main =
mapM_ putStrLn . map reverse . lines =<< readFile "my_file"

C doesn't have a string reverse function, so I'm using a reverse print function:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void rprintln(const char *buf, int len)
{
int i;
for(i=len-1; i>=0; i--)
putchar(buf[i]);
putchar('\n');
}
const int buf_size = 4096;
int main(int argc, char *argv[])
{
FILE *fd;
char buf[buf_size];
int read_sz;

fd = fopen("my_file", "r");
if (NULL == fd) exit(1);

while (!feof(fd)) {
if (NULL == fgets(buf, buf_size, fd)) break;
rprintln(buf, strlen(buf)-1);
}
fclose(fd);
}

With input channels, OCaml has input_line (but sadly no string reverse):

let reverse_string s =
let len = String.length s - 1 in
let d = String.create (len + 1) in
for i=0 to len do
String.unsafe_set d (len-i) (String.unsafe_get s i)
done;
d

let () =
let rec iter_lines ic f =
let res = try Some (f (input_line ic))
with End_of_file -> None in
if res = None then ()
else iter_lines ic f in
let ic = open_in "my_file" in
iter_lines ic (fun line -> print_endline (reverse_string line));
close_in ic

The Unix library version is much the same, using in_channel_of_descr to convert the Unix.file_descr into an in_channel:

...
let fd = Unix.openfile "my_file" [Unix.O_RDONLY] 0o666 in
let ic = in_channel_of_descr fd in
iter_lines ic (fun line -> print_endline (reverse_string line));
Unix.close fd

Prelude.ml has convenience functions:

eachLine (puts @. srev) "my_file"


SML has TextIO.inputLine, which we loop 'til the end:

fun iter_lines f ic =
case TextIO.inputLine ic of
NONE => ()
| SOME l => (f l; iter_lines f ic)

fun srev s = String.implode (List.rev (String.explode s))

val () = let
val input = TextIO.openIn "my_file"
val () = iter_lines (fn l => TextIO.print (srev l)) input
val () = TextIO.closeIn input
in () end


Not really feeling like writing the ASM version, but it'd read into a reverse buffer a byte at a time, exiting on buffer end or newline, return the amount of bytes read, and write the buffer to stdout. Easy to describe, 50 lines to write :P

I am thinking of doing a more diverse comparison, looking at different ways to do piping, sockets and writing files. I'm contemplating dropping Python (too much like C and Ruby), ASM (too verbose, I'm no expert) and OCaml stdlib (sane people use a stdlib replacement) from the languages covered, while adding Clean (uniqueness types), SML (streams) and Bash (getting things done.) Suggestions are welcome!

[edit #1] Added Python's with-block, thanks to Liam Clarke
[edit #2] Fixed the input-using OCaml functions, thanks to mfp.
[edit #3] Added SML examples per request.

2008-09-29

Basics of I/O

Files and pipes


The kernel wraps your hardware devices into two basic kinds: pipes and files. A pipe is a lazy FIFO list, a file is a lazy array (well, more like a rope; a list of arrays.) A file can be split into two arrays, a read-only array, and a write-only array. Pipes are either read-only or write-only. A socket consists of a read-only pipe and a write-only pipe.

Things like input devices and network devices are modeled as pipes, while things like RAM and hard drives are modeled as files. Graphics hardware can be either thought of as a pipe or a file, depending on whether it's treated as a computing node that executes drawing instructions or as an editable frame buffer.

The decision on whether to represent a device as a pipe or a file lies in the access patterns possible for the device. If you can access the nth byte of the device without accessing all the preceding bytes, it's a file. If you need to access the bytes from 0 to n-1 in order to access the nth byte, it's a pipe.

If you only do sequential access, you can treat files as pipes. If you want to do random access, you need to treat files as arrays. The easiest way to achieve lazy array semantics for a file is to mmap it.

Data structures


Every file and pipe is a serialized data structure. To write a file, you need to convert your data structure into an array of bytes. To read a file, you need to convert an array of bytes into the corresponding data structure. With pipes, you have a FIFO list of bytes instead of an array.

Most file and especially pipe data structures are lists of structs with a possible header. For example, /etc/passwd is a headerless newline-separated list of structs. MP3 files are lists of MP3 frames, with an optional ID3 metadata header. PNG files have a PNG header, followed by a list of PNG chunks. JPEG files are lists of JPEG markers. UTF-encoded text files are lists of UTF characters with an optional BOM header.

Some file formats encode tree data structures. HTML, XML and JSON are examples of unindexed trees. XML has a header (the "<?xml..."-bit), HTML has an optional one ("<!DOCTYPE..."), and JSON doesn't. SQLite files are indexed trees, the index enabling random access to the tree.

An array-like (i.e. every struct is the same size) data structure also makes it possible to do random access to a file. Examples of array-like file formats are ASCII encoded text, UTF-16 encoded text, PNM images and uncompressed TGAs. Array-like files also make it very easy to do in-place editing.

Serializing data structures


As every data structure is manifest in memory, it is possible to turn a data structure into a continuous array of bytes with no external references. This conversion is called serialization.

To prove that it is possible for any data structure, consider a program with a data structure in memory. Now reallocate the memory used by the program so that the parts not associated with the data structure are in a block of memory separate from the data structure. Then move all data structure blocks after each other and substract the address of the start of the data structure from all the pointers in the data structure, making them relative pointers. You now have a serialized version of the data structure.

Converting the serialized data structure into one that the program can work with is called deserialization. For the above serialization scheme, deserialization is easy: load the serialized data into memory and add the address of the start of the data to every pointer in the data structure.

To make the serialized data structure more portable, you can write a header that describes the type signature of the data structure, e.g. "array of uint32_le". To make the data structure detect errors, you can add metadata for detecting errors: the length of the data and CRC32 or a cryptographic hash (e.g. MD5.)

Using a compression format that already does error checking is likely the best way: not only do you save yourself a headache, compression also reduces the amount of IO required. You can do around 50 cycles of "free" computation per every byte streamed from disk. For a random access at 10ms, the number of cycles is around 50 million. If the compressor takes less than 50 cycles per byte, it is possible to write a compressed file faster than an uncompressed file, and similarly for decompression and reading.

On file access costs


As you need at least one random access to read a file, any file smaller than 1MB takes more time for the seek than the read. Flash drives have 0.1ms seek times, which makes the random access cost "only" a half million cycles, making the 50:50 point lie at around 10kB.

In other words, with a disk, if your file is smaller than 1MB, it's cheaper to read the whole file than to do seeking reads. With flash, reading the whole file is cheaper only when the file is smaller than 10kB.

2008-09-26

Building an OCaml array library from basic operations

In writing Prelude.ml utility functions for array-like data structures, I noticed that they can be built from a very small set of base operations. The minimum you need is empty, fold and unfold, but those work best with lists and trees. For efficient arrays you want alloc, length, get and set. We can write either way in terms of the other, interestingly enough.

Empty, foldl and unfoldl in terms of alloc, length, get and set:

let empty = alloc 0

(* recurse with get, a counter and an accumulator *)
let foldl f i a =
let rec aux f acc idx len a =
if idx = len then acc
else aux f (f acc (get a idx)) (succ idx) len a in
aux f i 0 (length a) a

let push a v =
let b = alloc (length a + 1) in
for i = 0 to (length a - 1) do set b i (get a i) done;
set b (length a) v

(* recurse with push and an accumulator *)
let unfoldl f i =
let rec aux f i res =
match f i with
| None -> res
| Some (v, acc) -> aux f acc (push res v) in
aux f i empty


Alloc, length, get and set in terms of empty, foldl and unfoldl:

let zero = 0

(* unfold with a counter and set to zero *)
let alloc l =
unfoldl (fun i ->
if i = l then None
else Some (zero, succ i)
) 0

(* fold with a counter *)
let length a = foldl (fun l _ -> l+1) 0 a

(* fold with a counter and pick the value when counter equals wanted index *)
let get a i =
let l,v =
foldl (fun (j,res) v ->
if i = j then (j, Some v) else (succ j, None)
)
(0, None) a in
v

(* unfold with a counter and set the value when counter equals wanted index *)
let set a i v =
let l = length a in
unfoldl (fun j ->
if j = l then None
else Some (
if i = j then v else get a j,
succ j
)
) 0


As you may notice, unfold looks inefficient for arrays and set and get look inefficient for lists. Not to mention that the above implementation of get returns an option, while the fold and push expect a flat value. The type of the array elements is also set at library writing time with zero. As the above code is only for the purpose of demonstrating the interchangeability of the signatures, may the kind reader forgive these omissions (presenting better examples would be very welcome too.)

Let's take a look at finding values. Finds and gets can be abstracted into a fold with an early exit. A find is a fold that exits when f v equals true. A get is a fold that exits when counter equals wanted index.


(* findWithIndex using get *)
let findWithIndex f a =
let rec aux f i len a =
if i = len then None
else
let v = get a i in
if f v i then Some (v,i)
else aux f (succ i) len a in
aux f 0 (length a) a

let maybe d f v = match v with Some x -> f x | None -> d
let optMap f v = match v with Some x -> Some (f x) | None -> None

let find f a = optMap fst (findWithIndex (fun v _ -> f v) a)
let findIndex f a = optMap snd (findWithIndex (fun v _ -> f v) a)

(* findWithIndex using fold and an exception break *)

exception Found of 'a
let findWithIndex f a =
try ignore (foldl (fun i v -> if f v i then raise (Found (v, i)) else succ i) 0 a);
None
with Found x -> Some x

let get a i = optMap fst (findWithIndex (fun _ j -> i = j) a)


Many element query functions are variants of find, here we have any, all, elem and indexOf.

(* any is a find that returns true on match, false on not found *)
let any f a = find f a <> None

(* all is a variant of any: if nothing matches the negation of f, return true *)
let all f a = not (any (fun v -> not (f v)) a)

(* elem is a special case of any *)
let elem e a = any ((=) e) a

(* indexOf is a special case of findIndex *)
let indexOf e a = findIndex ((=) e) a


To do filtering, we fold with push (or collect to a list and do of_list at the end, depending on the cost of the operations.)

let filter f a = foldl (fun b v -> if f v then push b v else b) empty a

let filter f a = of_list (List.rev (foldl (fun b v -> f v then v::b else b) [] a))


Implementing map and iter in terms of fold is trivial:

let map f a = foldl (fun b i -> push b (f i)) empty a
let iter f a = foldl (fun () i -> f i) () a


To implement map in imperative terms, we should write an init function:

let init f l =
let rec aux f idx len a =
if idx = len then a
else (set a idx (f idx); aux f (succ idx) len a) in
aux f 0 l (alloc l)

let map f a = init (fun i -> f (get a i)) (length a)

let mapWithIndex f a = init (fun i -> f (get a i) i) (length a)
(* If you trust your compiler, let map f a = mapWithIndex (fun v _ -> f v) a *)

let foldlWithIndex f i a =
let rec aux f acc idx len a =
if idx = len then acc
else aux f (f acc (get a idx) idx) (succ idx) len a in
aux f i 0 (length a) a

let iterWithIndex f a = foldlWithIndex (fun () v i -> f v i) () a
(* (fun () v i -> f v i) = (fun () -> f) = (const f), where const x y = x *)


Reverse, copy and append are easy as well (I'm going to do only array-like versions from here on):

let reverse a =
let l = length a in
init (fun i -> get a (l-1-i)) l

let copy a = init (get a) (length a)

let append a b =
let la, lb = length a, length b in
let c = alloc (la + lb)
iterWithIndex (fun v i -> set c i v) a;
iterWithIndex (fun v i -> set c (i+la) v) b;
c


It is useful to have two indexing modes: left-indexing and right-indexing. The usual way is to have 0 be the first index from left and -1 be the first index from right. Easy enough:

let normalizeIndex a i = if i >= 0 then i else (length a + 1)


Now we can write sub and slice:

let sub i len a =
let i = normalizeIndex a i in
init (fun j -> get a (j+i)) len

let slice i j a =
let i = normalizeIndex a i in
let j = normalizeIndex a j + (if j < 0 then 1 else 0) in
sub i (j-i) a


A striding sub is also useful at times:

let subStride stride i len a =
let i = normalizeIndex a i in
init (fun j -> get a ((j*stride)+i)) len

(* We could even do let sub = subStride 1 *)


How about writing some subsequence iterators, they do help performance if the compiler isn't full of candy and magic. The usual way would be a for-loop with offset and stride. For-loops are error-prone, so iterators also move the source of bugs into a centralized location. Easier to fix one function than twenty.

let subFoldl i len f acc a =
let rec aux f acc i len a =
if i >= len then acc
else aux f (f acc (get a i)) (succ i) len a in
aux f acc (normalizeIndex a i) len a

(* let foldl f acc a = subFoldl 0 (length a) f acc a *)

let subIter i len f a = subFoldl i len (fun () v -> f v) () a
let subMap i len f a =
let i = normalizeIndex a i in
init (fun j -> f (get a (j+i))) len

(* slices can be done in the same manner *)
let sliceFoldl i j f acc a =
let i = normalizeIndex a i in
let j = normalizeIndex a j + (if j < 0 then 1 else 0) in
subFoldl i (j-i) f acc a

let sliceIter i j f a = sliceFoldl i j (fun () v -> f v) () a
let sliceMap i j f a =
let i = normalizeIndex a i in
let j = normalizeIndex a j + (if j < 0 then 1 else 0) in
init (fun k -> f (get a (k+i))) (j-i)

(* iterating subsequence iterators left as an exercise for the reader *)


To make the above functions faster and safer, we should do bounds-checking before iteration and use unsafe_get and unsafe_set, here's an example of an error-checking sub:

let sub i len a =
let i = normalizeIndex i in
if i + len > length a then raise Not_found;
init (fun j -> unsafe_get a (i+j)) len


Now for some list-like utils, more folds and list conversions:

let first a = if a = empty then raise Not_found else get a 0
let head = first
let last a = if a = empty then raise Not_found else get a (length a - 1)

let tail a = sub 1 (length a - 1) a
let popped a = sub 0 (length a - 1) a

let push a v = append a (init (fun _ -> v) 1)
let pop a = (last a, popped a)

let unshift a v = append (init (fun _ -> v) 1) a
let shift a = (first a, tail a)


let foldr f acc a =
let rec aux f acc i a =
if i < 0 then acc
else aux f (f (get a i) acc) (pred i) a in
aux f acc (length a - 1) a

let subFoldr i len f acc a =
let rec aux f acc len j a =
if len < 0 then acc
else aux f (f (get a (len+j)) acc) (pred len) j a in
let i = normalizeIndex a i in
if i + len > length a then raise Not_found;
aux f acc (len-1) i a

let foldl1 f a = subFoldl 1 (length a - 1) f (first a) a
let foldr1 f a = subFoldr 0 (length a - 1) f (last a) a


let to_list a = foldr (fun l v -> v::l) [] a
let of_list l =
let rec aux i a l =
match l with
| [] -> a
| h::t -> set a i h; aux (succ i) a t in
aux 0 (alloc (List.length l)) l


And from lists, a segue into lists of arrays:

let blit src src_idx dst dst_idx len =
if src_idx >= dst_idx then
for i=0 to len-1 do set dst (dst_idx + i) (get src (src_idx + i)) done
else
for i=len-1 downto 0 do set dst (dst_idx + i) (get src (src_idx + i)) done

let concat l =
let l = List.filter ((<>) empty) l in
let len = List.fold_left (fun s a -> a + length a) 0 l in
let res = alloc len in
ignore (List.fold_left (fun i a -> blit a 0 res i (length a); i + length a) 0 l);
res

let groupsOf n a =
if n < 1 then invalid_arg "groupsOf: n < 1";
let len = length a in
List.unfoldl (fun i ->
if i >= len then None
else Some (sub i (min (len-i) n) a, i+n)
) 0

let splitInto n a =
if length a <= 1 then groupsOf 1 a
else
let plen = int_of_float (ceil (float (length a) /. float n)) in
groupsOf plen a


Continuing with the splitting, some list utils from Haskell:

(* take is a special case of sub *)
let take n a = sub 0 n a

(* drop is a special case of slice *)
let drop n a = slice n (-1) a

let splitAt n a = (take n a, drop n a)

let flip f x y = f y x
let (<<) f g x = f (g x)

(* break breaks at the index of the first true to f *)
let break f a = maybe (a, empty) (flip splitAt a) (findIndex f a)

(* span splits at the index of the first false to f *)
let span f a = break (not << f) a

(* takeWhile is a take with findIndex *)
let takeWhile f a = maybe (copy a) (flip take a) (findIndex (not << f) a)

(* dropWhile is a drop with findIndex *)
let dropWhile f a = maybe (copy a) (flip drop a) (findIndex (not << f) a)


Ranges are nice for creating test data:

let range s e =
if s <= e
then init ((+) s) (e-s+1)
else init ((-) s) (s-e+1)

let (--) = range

(* (10--15) = [|10; 11; 12; 13; 14; 15|] *)


And that's all for now. The above code likely has bugs in it, I wrote it from scratch for this blog post, so there be dragons.

2008-09-24

A month and a half of iPhone 3G

I bought an iPhone 3G in early August, and have been using it pretty much daily. I'm trying to sum up my thoughts on it in this post, I guess you could call it a review.

Conclusion: An Apple product. That is, a basic device differentiated by nice animations and destroyed by an inherent mistrust in people.

Cost: 520e for a black 16GB model, 17e/month phone bill (2e for phone, 15e for flatrate 512kbps 3G.) Total cost over the 24 month lock-in period: 928e. If I change to a more open device, minimum cost for the iPhone is 568e after dropping the 3G.

First, I'm not exactly a typical phone user, I think I've fielded two or three calls during the whole time I've had the iPhone. So I couldn't really care less about the call quality or SMS or MMS. The phone side has worked ok for me, but I bought it as more of a tiny web tablet.

The exterior design of the iPhone is very Apple. A 60s-style rounded functionalist brick, with some shiny metal accents and even fine detail in the speaker grilles. The metal Apple logo makes a nice contrast in the black plastic back. It's a quality design, but it's a very safe design; very emotionally aerodynamic. Can't really hate it, can't really love it. I suppose it's as neutral as possible to bring out the contents of the large screen.

The user interface theme is a variation of the Aqua theme, a theme that was made for the colorful plastic G4 PowerMacs and the colorful iMacs. It feels slightly out of place in today's sterile kitchenware Macs with its bright colors and slightly playful demeanor. As usual, the Aqua theme is a bit lazy in its attention to detail, especially the bottom-screen grill feels unfinished.

The animations are nice, as you'd expect. I don't know if they're using them to hide latency or just as emotional appeals, but they work as both. Good job on that. They're also quite low-key, you don't find yourself annoyed by them.

My most used application by far is the Mobile Safari web browser. It's a very mixed bag, as with all the software on the iPhone. On one hand, they look polished, but on the other, they're lacking compared to desktop apps. Browsing pages is nice for the most part. Scrolling is easy, if a bit error-prone and laggy. Pinch zoom is something you only use if the double-tap zoom fails, which it sadly often does. Tables and navigation bars especially screw up the double-tap zoom. Clicking links without zooming in is a game of luck: link small, fingertip big. And as most links you want to click are in navigation bars, in which the double-tap zoom fails, you get a recipe for unhappiness.

When you scroll, the browser doesn't render the page below, but instead shows you a checkerboard pattern. I suppose it's to keep the UI responsive? The rendering might cause jitter in the frame rate? Then again, they do have threads, right?

The browser crashes roughly once every hour, which boots you back to the home screen, selectively nukes your cookies and what little the browser remembers of your HTTP basic auth credentials. Thankfully the browser does remember what tabs you had open when it crashed.

You can't save files (except images) from the browser. You can't upload files from the browser. This bears repeating: you CAN NOT save or upload files from the browser. That alone makes the device useless for reading PDFs.

Safari's unwillingness to cache downloaded data on the huge 16GB hard drive further exuberates the problem. Read two pages of a PDF, sleep the phone, and when you open it, Safari downloads the whole PDF again. Which is not nice when the PDF is 4.5 megs and you're currently in a GPRS zone of 3kB/s bandwidth.

There's no file manager, only Photos and iPod. Photos is an application that has a single folder called "Camera roll", which has all the photos you've taken and images you've saved. You can't organize or edit the pictures, only delete them. I'm hard pressed to find any use for it.

IPod is a complex application for playing your music files. I find it difficult to use. You can't play internet streams with it. You can only upload music files with iTunes.

The only way to listen to an internet stream is for the stream provider to offer a raw MP3 stream and open it in Safari. Radio Paradise does this, I don't know of any other stations who do. You can't use Safari when you're listening to an internet stream because the Quicktime plugin takes over the full screen and trying to get around it (home key, click a home screen bookmark) stops the playback. iPod can play music while you browse though.

I haven't used the App Store or the iTunes Store, so I can't offer an opinion on those. Haven't tried writing an app either since you can't write iPhone apps without a Mac.

The mail app's screen design prioritizes sender name over subject, which along with the lack of a threaded view makes reading mailing lists impossible. Hence the mail app is useless for me. However, the desktop version of GMail works well in the browser, I recommend it over both the mail app and the iPhone version of GMail.

The map is a nice toy, but I haven't found much use for it. The timer is useful. The calculator works for basic needs. The weather app is nice. The stocks app is of no use to me. The contacts app takes a very long time to open, which is an achievement for so little functionality. The YouTube icon is patronizing and the app is clunky. Haven't used the calendar or the notes app.

Most of my use has been WiFi, with a couple times on 3G or GPRS. GPRS and Edge are very slow, as expected. 3G speed is OK, speed test said 485kbps.

The battery lasts about 3-4h of surfing, which is somewhat to be expected with a screen that big. Battery life for streaming music over WiFi is 4-5h. Probably similar for 3G streaming.

The camera is crap.

If the iPhone acted more like a wireless portable hard drive with apps and a phone, it'd be a good device. Now it's a mediocre device.

2008-09-21

Gitbug - In-repo bug tracker for git

We needed a bug tracker for our project at work, and didn't want to use trac or suchlike. Tried ticgit but couldn't get it to run. So instead we took a couple hours and wrote our own simple bug tracker - gitbug.

Gitbug keeps bugs in a directory tree under bugs/ and uses one file per bug. Bug state is set by symlinking the bug file to open/ or done/. If you haven't edited your post-commit hook, it adds a hook to close bugs with commit messages that contain FIX[bug_id].

We haven't had much use for a bug tracker though, a fixme script that greps all instances of FIXME from the source tree takes care of the simple bugs. So our only bugs in the tracker tend to be things that span several subsystems.

prelude.ml: now on GitHub

Here's the GitHub page for prelude.ml. I'm in the middle of refactoring it, so the split-into-fragments source in src/ doesn't quite compile. The prelude.ml in the root directory does work however.

2008-09-14

Slow-motion Missile Fleet

Here a speed slider. The explosions get a bit epilepsy-inducing at around a fifth away the from leftmost end.

2008-09-12

prelude.ml: further modularization

Created List, Array, String and ByteString -modules, pasted the parallelization shunt in. And fixed some stupid bugs (yay for inline unit tests.) Thinking of putting all the new functions in the modules, moving the modules to top and writing a long list of aliases for convenience in Prelude.Pervasives.

The parallelization shunt is a piece of code that I've been copypasting in the collection modules (List, Array, etc.) It gets you the parallel combinators as long as you have a compatible module signature (I'm using copypaste as a ghetto functor.) The functions you need for all the combinators are: splitInto, foldl, foldl1, foldr, foldr1, iter, map, filter, concat, groupsOf, init, length, and, for pzipWith, either unsafe_get or zipWith, depending on whether you have an array-like or a list-like data structure:

let par_mapReduce ?process_count ~combine ~process l =
let process_count = process_count |? !global_process_count in
splitInto process_count l |> par_map ~process_count process |> combine

let pmapReduce combine process = par_mapReduce ~combine ~process

let pfoldl r f init = pmapReduce (List.foldl1 r) (foldl f init)
let pfoldl1 f = pmapReduce (List.foldl1 f) (foldl1 f)
let pfoldr r f init = pmapReduce (List.foldr1 r) (foldr f init)
let pfoldr1 f = pmapReduce (List.foldr1 f) (foldr1 f)

let piter f = pmapReduce ignore (iter f)
let pmap f = pmapReduce concat (map f)
let pfilter f = pmapReduce concat (filter f)

let pfoldlSeqN ?process_count n r f init l =
List.foldl (fun acc il -> r acc (pfoldl ?process_count r f init il))
init (groupsOf n l)

let piterSeqN ?process_count n r f l =
List.iter (fun l -> iter r (pmap ?process_count f l)) (groupsOf n l)

let pinit ?process_count f l =
let process_count = process_count |? !global_process_count in
let plen = int (ceil (float l /. float process_count)) in
let process i =
let start = plen * i in
let len = min plen (l - start) in
init (fun j -> f (start + j)) len in
concat (par_map ~process_count process (0--(process_count-1)))

(* for array-likes *)
let pzipWith ?process_count f a b =
let process_count = process_count |? !global_process_count in
let len = min (length a) (length b) in
pinit ~process_count (fun i ->
f (unsafe_get a i) (unsafe_get b i)
) len

(* for list-likes *)
let pzipWith ?process_count f a b =
let process_count = process_count |? !global_process_count in
let len = min (len a) (len b) in
let plen = int (ceil (float len /. float process_count)) in
let aspl = groupsOf plen a in
let bspl = groupsOf plen b in
concat (par_map ~process_count (uncurry (zipWith f)) (zip aspl bspl))

2008-09-10

prelude.ml: more combinatorial wanking

Well, this got hard to understand really fast.

let mapReduce partition distribute process combine input =
partition input |> distribute process |> combine

let par_mapReduce ?process_count ~combine ~process =
let process_count = process_count |? !global_process_count in
mapReduce (splitInto process_count) (par_map ~process_count) process combine

let pfoldlSeqN ?process_count n r f init =
mapReduce (groupsOf n) (flip foldl init)
(fun acc l -> r acc (pfoldl ?process_count r f init l)) id

let piterSeqN ?process_count n r f =
mapReduce (groupsOf n) iter (iter r @. pmap ?process_count f) ignore

Compared to the old versions of p*SeqN it's quite WTF indeed. Does it buy us anything? Heck do I know. A headache, maybe?
(* Old versions *)

let pfoldlSeqN ?process_count n r f init l =
foldl (fun acc il -> r acc (pfoldl ?process_count r f init il)) init (groupsOf n l)

let piterSeqN ?process_count n r f l =
iter (fun l -> iter r (pmap ?process_count f l)) (groupsOf n l)

[edit] Reverted back to the two above definitions. They're shorter and cleaner.

prelude.ml: range iterators

Ported the parallel is_prime example from Keep your multi-core CPU busy with F# to OCaml, which gave me an excuse to write a Range module for iterating over number ranges. Here the current version of prelude.ml. Also retired ($) and ($.) in favour of (|>) and (|>.) to make them more consistent with (@@) and (@.).

Here's the is_prime:

open Prelude

let is_prime n =
let max = int (sqrt (float n)) in
not ( (2 --> max) |> Range.find (fun d -> n mod d = 0) )

let primes = (1000000 --> 1100000) |> Range.pfilter is_prime

let () = puts (showInt (sum primes))

0.39s using Range.filter, 0.24s using Range.pfilter.

Oh right, added the following parallel functions too: pfilter, pfoldl1, pfoldr and pfoldr1. Most of the parallel combinators could be abstracted into splitInto process_count input |> par_map f |> combine, which is what I'll probably do next.

[edit] Did the refactoring, eliminating 70+ lines, here is the new version:

let par_mapReduce ?process_count ~rf ~mf l =
let process_count = process_count |? !global_process_count in
splitInto process_count l |> par_map ~process_count mf |> rf

let pmapReduce rf mf = par_mapReduce ~rf ~mf

let pfoldl r f init = pmapReduce (foldl1 r) (foldl f init)
let pfoldl1 f = pmapReduce (foldl1 f) (foldl1 f)
let pfoldr r f init = pmapReduce (foldr1 r) (foldr f init)
let pfoldr1 f = pmapReduce (foldr1 f) (foldr1 f)

let piter f = pmapReduce ignore (iter f)
let pmap f = pmapReduce concat (map f)
let pfilter f = pmapReduce concat (filter f)

If you have a collection module with splitInto and some of the other functions, you can basically paste this right in. With Range it wasn't quite that easy because par_map returns a list, so the foldls have to use List.foldl for the reduce fold. And I haven't made foldrs for Range, so couldn't add the pfoldrs.

It feels like a promising start towards generalizing the parallel iterators though.

2008-09-09

"Shared Memory" Parallelism

One thing that amazes me in discussions extolling the performance benefit of shared memory over message passing is that no one has apparently looked at the hardware in use. Because hardware has no shared memory. None. Nada. Zilch.

How a computer works in a nutshell

Accessing memory:

CPU: Hey, I'd like to read eight bytes from memory position 0x8008 to register %RAX.
Cache manager: Sure thing, wait a couple cycles. Hey L1, have you got the cache line with 0x8008?
L1: Sorry! (2 cycles if hit)
Cache manager: Hm, how about you L2? Got 0x8008?
L2: Oh, uh, sec, lemme see... nope. (9 cycles if hit)
Cache manager: Argh, OK, shit happens. DRAM, gimme 0x8008.
DRAM: Durr, 0x8008, yeah, 0x8008, yeah...
Cache manager: DRAM?
DRAM: ...
Cache manager: DRAM! 0x8008!
DRAM: Huh? Oh. Here you go. (200 cycles if hit, a couple dozen million if read from disk)

Shared memory in a cache coherent SMP machine:

CPU A: Read 0x8000!
L1: Here!
CPU B: Write %RAX to 0x8001!
Cache manager: Oh nuts. Hey CPU A, CPU B just wrote to a cache line you have, here's the new version.
CPU A: Ghh, OK. Write 0x8000!
Cache manager: CPU B! CPU A just wrote in 0x8000, you need a new version of the cache line!
CPU B: Damn it A, stop writing on my cache line!
CPU A: Your cache line? I'm so sorry, I didn't see your name on it!
...
Significant performance loss later:
Cache manager: The person who wrote this algorithm must be a some kind of retard.

Does that look like a shared memory architecture to you? No way. It's a message passing architecture, communicating in 64-byte cache lines. Treating it as shared memory will lead to performance problems.

Theoretically, a message passing system will yield the highest performance for present-day hardware. That is, if a compiler compiles the message passing in terms of CPUs talking to memory via cache lines. Anyone know whether such a mythical beast exists?

Related reading: What every programmer should know about memory, part 2: CPU caches by Ulrich Drepper.

2008-09-08

Non-copying forked workers using Bigarrays

Prelude.ml: Wrote some utils for Bigarrays and a parallel Bigarray init pbainit that uses mmapped /dev/zero as shared memory (thanks to mfp for the pointer) to allow the worker processes to write into a shared array, eliminating result copying overhead.

Here's a small program that averages two string arrays:

open Prelude

let () =
let sz = parseInt Sys.argv.(1) in
let s1 = String.create sz in
let s2 = String.create sz in
let ba = pbainit Bigarray.char
(fun i -> char ((ord (suget s1 i) + ord (suget s2 i)) / 2) )
sz in
puts (showInt (balen ba))

I tested it with sz=1G, which creates two 1GB strings and a third 1GB string/array created by averaging the two strings together. I have 4GB of RAM, so any copying will easily cause swapping to disk. I timed versions using sequential and parallel Bigarray and String inits:
  • bainit: 24.5s (sequential Bigarray init)

  • pbainit: 16.1s (parallel Bigarray init)

  • sinit: 9.5s (sequential String init)

  • psinit: started swapping, killed it after 80s (parallel String init)

As you can see from the above, while mmapped Bigarrays aren't faster than native strings, they can share results without copying. To tip the balance towards Bigarrays, let's make the compute kernel do more work:

open Prelude

let () =
let sz = parseInt Sys.argv.(1) in
let s1 = String.create sz in
let s2 = String.create sz in
let ba = pbainit Bigarray.char
begin fun i -> (* average of 11-wide averages *)
let a = saverageSlice (max 0 (i-5)) (min (sz-1) (i+5)) s1 in
let b = saverageSlice (max 0 (i-5)) (min (sz-1) (i+5)) s2 in
char ((a+b) / 2)
end
sz in
ignore ba

And the results with 1G strings:
  • bainit: 213s

  • pbainit: 111s

  • sinit: 188s

  • psinit: Killed after 240s as it had pushed swapfile use to 1G.

Now we actually see a 70% speedup with using parallel Bigarray init compared to sequential string init.

Conclusions: Bigarrays are helpful when you need to process large in-memory structures in parallel, provided that you do enough computation to mask the parallelisation overhead. Bigarrays are also good for communicating with libraries written in C or FORTRAN.

2008-09-06

Constant-space parallel combinators in OCaml

Updated prelude.ml with a couple new parallel iterators: pfoldl, pfoldlSeqN and piterSeqN. The pfoldl function is a two-phase foldl, it splits the list into process_count pieces, foldls each and foldls the results. To convey that, it takes two functions : the normal fold function and the reduce function.

The *SeqN functions are partially sequential, they synchronize after every N elements. To use the Mandelbrot example, if you're rendering a 64000 x 64000 image, it would take more than 4 gigs of RAM to do it with pbinit. For a sequential renderer that is no problem, since you can output each byte after it's computed. A parallel renderer can't really use that strategy, since the pixels are computed in parallel and outputting them as they come would lead to an unsorted image.


Sequential output:
# iter (fun i -> sleep (3-i); puts (showInt i)) (0--3);;
0
1
2
3
- : unit = ()

Parallel output:
# piter ~process_count:4 (fun i -> sleep (3-i); puts (showInt i)) (0--3);;
3
2
1
0
- : unit = ()


What we can do is turn the problem into a sequential iteration of internally parallel blocks. E.g. render a 1MB block of pixels in parallel, print it out before doing the next block. This way memory use is limited to block size and we still get a decent parallelization speedup.


Partially sequential iteration:
# iter (fun l -> iter puts @@ pmap (fun i -> sleep (3-i); showInt i) l) (groupsOf 2 (0--3));;
0
1
2
3

Or using piterSeqN:
# piterSeqN 2 puts (fun i -> sleep (3-i); showInt i) (0--3);;
0
1
2
3


Here's a larger example, using the already familiar Mandelbrot renderer. Starting with a sequential version:

open Prelude

let niter = 64
let limit = 4.0

let mandelbrot z x y =
let cr = 2.0 *. float x /. z -. 1.5 in
let ci = 2.0 *. float y /. z -. 1.0 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = zr *. zr -. zi *. zi +. cr
and nzi = 2.0 *. zr *. zi +. ci in
if nzr *. nzr +. nzi *. nzi > limit
then int (ceil (255. *. (float (niter-i) /. float niter)))
else aux (i+1) nzr nzi in
aux 0 0.0 0.0


(*vvv Interesting part starts here vvv*)

let draw_fractal z w h =
for y = 0 to h-1 do
for x = 0 to w-1 do
output_byte stdout (mandelbrot z x y)
done
done

let print_fractal w h =
puts (sprintf "P5 %i %i 255" w h);
draw_fractal (float (min w h)) w h

let () = print_fractal (parseInt Sys.argv.(1)) (parseInt Sys.argv.(1))


To make it parallel, we make draw_fractal create all scanlines and print them out:


let draw_scanline z w y =
binit (fun x -> mandelbrot z x y) w

let draw_fractal z w h =
let scanlines = pmap (fun y -> draw_scanline z w y) (0--(h-1)) in
iter putStr scanlines


And to make it constant-space parallel (well, O(width*256)), let's process in blocks of 256 scanlines:


let draw_fractal z w h =
piterSeqN 256 putStr (fun y -> draw_scanline z w y) (0--(h-1))


Done! Let's put it all together one more time:


open Prelude

let niter = 64
let limit = 4.0

let mandelbrot z x y =
let cr = 2.0 *. float x /. z -. 1.5 in
let ci = 2.0 *. float y /. z -. 1.0 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = zr *. zr -. zi *. zi +. cr
and nzi = 2.0 *. zr *. zi +. ci in
if nzr *. nzr +. nzi *. nzi > limit
then int (ceil (255. *. (float (niter-i) /. float niter)))
else aux (i+1) nzr nzi in
aux 0 0.0 0.0

let draw_scanline z w y =
binit (fun x -> mandelbrot z x y) w

let draw_fractal z w h =
piterSeqN 256 putStr (fun y -> draw_scanline z w y) (0--(h-1))

let print_fractal w h =
puts (sprintf "P5 %i %i 255" w h);
draw_fractal (float (min w h)) w h

let () = print_fractal (parseInt Sys.argv.(1)) (parseInt Sys.argv.(1))


We replaced five lines of sequential code with three lines of parallel code and got a version that scales nearly linearly with the number of cores. Not too bad, eh?

2008-09-04

Haskell on parallel hardware

Haskell for multicores - links to different ways to do parallel programming in Haskell.

GpuGen: Bringing the Power of GPUs into the Haskell World - they compile map, scan, fold, filter, zipWith, permute, bpermute, etc. to CUDA and run it on the GPU. 10x+ performance on large matrix multiplication.

2008-09-02

Almost Burning Ship

Have an almost Burning Ship fractal.




open Prelude

let niter = 32
let limit = 4.0

let burning_ship z x y =
let cr = 2.0 *. float x /. z -. 2.1 in
let ci = 2.0 *. float y /. z -. 1.8 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = absf (zr *. zr -. zi *. zi +. cr)
and nzi = 2.0 *. absf (zr *. zi +. ci) in
if nzr *. nzr +. nzi *. nzi > limit then niter-i
else aux (i+1) nzr nzi in
aux 0 0.0 0.0

let pbinit2D f w h = pbinit (fun i -> let y,x = quot_rem i w in f x y) (w*h)
let draw_fractal z w h = pbinit2D (burning_ship (float z *. 0.8)) w h
let pgm_fractal w h = sprintf "P5 %i %i %i\n" w h niter ^ draw_fractal (min w h) w h

let () = print_string @@ pgm_fractal (parseInt Sys.argv.(1)) (parseInt Sys.argv.(2))

2008-09-01

Adaptive blur filter Mandelbrot

To finish off today's Mandelbrot madness, here's a version that does a ghetto 2D box filter with kernel size controlled by the second image (i.e. blur more where second image is white, don't blur where it's black.) Using a second core gives an 80% speed boost.



open Prelude

let niter = 255
let limit = 4.0

let mandelbrot x y w h =
let cr = 2.0 *. x /. w -. 1.5 in
let ci = 2.0 *. y /. h -. 1.0 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = zr *. zr -. zi *. zi +. cr
and nzi = 2.0 *. zr *. zi +. ci in
if nzr *. nzr +. nzi *. nzi > limit then 255-i
else aux (i+1) nzr nzi in
aux 0 0.0 0.0

let draw_fractal xoff yoff w h =
pbinit (fun i ->
let y, x = quot_rem i w in
mandelbrot (float (xoff+x)) (float (yoff+y)) (float w) (float h)
) (w*h)

let blend a b w h =
let c = pbinit (fun i ->
let y, x = quot_rem i w in
let r = ord (suget b i) / 32 in
let x1 = max (x-r) 0
and x2 = min (x+r) (w-1) in
saverage (ssub (y*w+x1) (x2-x1+1) a)
) (slen a) in
pbinit (fun i ->
let y, x = quot_rem i w in
let r = ord (suget b i) / 32 in
let y1 = max (y-r) 0
and y2 = min (y+r) (h-1) in
saverage (ssubStride w (y1*w+x) (y2-y1+1) c)
) (slen a)

let square_fractal d =
let frac1 = draw_fractal 0 0 d d in
let frac2 = draw_fractal (d/4) (-d/4) d d in
let header = sprintf "P5 %i %i 255\n" d d in
header ^ (blend frac1 frac2 d d)

let () =
let img = square_fractal (parseInt Sys.argv.(1)) in
output_string stdout img

Prelude.ml - more multicore mandelbrot


Continuing from yesterday, landed a bunch of new parallel combinators in prelude.ml along with a small smattering of convenience functions for working with arrays and strings.

The following version of the parallel mandelbrot drawer demonstrates byte strings:

open Prelude

let mandelbrot w h x y =
let cr = 2.0 *. float x /. float w -. 1.0 in
let ci = 2.0 *. float y /. float h -. 1.5 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = zr *. zr -. zi *. zi +. cr
and nzi = 2.0 *. zr *. zi +. ci in
if nzr *. nzr +. nzi *. nzi > limit then 255-i
else aux (i+1) nzr nzi in
aux 0 0.0 0.0

(* pbinit f = psinit (fun i -> char (f i)) *)
let draw_fractal xoff yoff w h =
pbinit (fun i ->
let y,x = quot_rem i w in
mandelbrot w h (x+xoff) (y+yoff)
) (w*h)

(* blend the two images by averaging the values at each pixel *)
(* pbZipWith f = psZipWith (fun x y -> char (f (ord x) (ord y))) *)
let blend a b = pbZipWith average2

let blended_fractal d =
let a = draw_fractal 0 0 d d in
let b = draw_fractal (d/4) (-d/4) d d in
blend a b

let square_fractal d =
let header = sprintf "P5 %i %i 255\n" d d in
header ^ (blended_fractal d)

And we can further refactor draw_fractal into the following:

let pbinit2D f w h = pbinit (fun i -> let y,x = quot_rem i w in f x y) (w*h)
let draw_fractal w h = pbinit2D (mandelbrot w h) w h

We could also write the drawer in terms of arrays instead of strings:

let painit2D f w h = painit (fun i -> let y,x = quot_rem i w in f x y) (w*h)
let draw_fractal w h = painit2D (mandelbrot w h) w h
let blend a b = paZipWith (chr @.. average2)
let blended_fractal d =
let a = draw_fractal 0 0 d d in
let b = draw_fractal (d/4) (-d/4) d d in
implodeArray (blend a b)

But that's not a very good idea since ints are 4-8 times wider than chars, further lowering our ratio of computation to memory accesses.

One thing I found out in writing the blend was that it doesn't really get any noticeable speed-up from using the parallel zip vs. normal zip. I suppose it's because the copying overhead is so large compared to the very trivial arithmetic: if the copy takes 4 cycles per byte and the iteration takes 3 cycles per byte, doing it with a single thread would take 3 cycles per byte, while doing it with two threads would take 1.5+4= 5.5 cycles per byte (assuming linear speed boost in iteration.)

By comparison, the mandelbrot function is very multicore-happy since it's self-contained and does no memory reads. As all it does is run a few hundred cycles worth of math in the CPU registers for every byte that it writes to memory, it scales in a very linear fashion.

Blog Archive