art with code

2008-09-01

Prelude.ml - more multicore mandelbrot


Continuing from yesterday, landed a bunch of new parallel combinators in prelude.ml along with a small smattering of convenience functions for working with arrays and strings.

The following version of the parallel mandelbrot drawer demonstrates byte strings:

open Prelude

let mandelbrot w h x y =
let cr = 2.0 *. float x /. float w -. 1.0 in
let ci = 2.0 *. float y /. float h -. 1.5 in
let rec aux i zr zi = if i >= niter then 0 else
let nzr = zr *. zr -. zi *. zi +. cr
and nzi = 2.0 *. zr *. zi +. ci in
if nzr *. nzr +. nzi *. nzi > limit then 255-i
else aux (i+1) nzr nzi in
aux 0 0.0 0.0

(* pbinit f = psinit (fun i -> char (f i)) *)
let draw_fractal xoff yoff w h =
pbinit (fun i ->
let y,x = quot_rem i w in
mandelbrot w h (x+xoff) (y+yoff)
) (w*h)

(* blend the two images by averaging the values at each pixel *)
(* pbZipWith f = psZipWith (fun x y -> char (f (ord x) (ord y))) *)
let blend a b = pbZipWith average2

let blended_fractal d =
let a = draw_fractal 0 0 d d in
let b = draw_fractal (d/4) (-d/4) d d in
blend a b

let square_fractal d =
let header = sprintf "P5 %i %i 255\n" d d in
header ^ (blended_fractal d)

And we can further refactor draw_fractal into the following:

let pbinit2D f w h = pbinit (fun i -> let y,x = quot_rem i w in f x y) (w*h)
let draw_fractal w h = pbinit2D (mandelbrot w h) w h

We could also write the drawer in terms of arrays instead of strings:

let painit2D f w h = painit (fun i -> let y,x = quot_rem i w in f x y) (w*h)
let draw_fractal w h = painit2D (mandelbrot w h) w h
let blend a b = paZipWith (chr @.. average2)
let blended_fractal d =
let a = draw_fractal 0 0 d d in
let b = draw_fractal (d/4) (-d/4) d d in
implodeArray (blend a b)

But that's not a very good idea since ints are 4-8 times wider than chars, further lowering our ratio of computation to memory accesses.

One thing I found out in writing the blend was that it doesn't really get any noticeable speed-up from using the parallel zip vs. normal zip. I suppose it's because the copying overhead is so large compared to the very trivial arithmetic: if the copy takes 4 cycles per byte and the iteration takes 3 cycles per byte, doing it with a single thread would take 3 cycles per byte, while doing it with two threads would take 1.5+4= 5.5 cycles per byte (assuming linear speed boost in iteration.)

By comparison, the mandelbrot function is very multicore-happy since it's self-contained and does no memory reads. As all it does is run a few hundred cycles worth of math in the CPU registers for every byte that it writes to memory, it scales in a very linear fashion.
Post a Comment

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes, much bad code, much bad art.

Have freelanced for Verizon, Google, Mozilla, Warner Bros, Sony Pictures, Yahoo!, Microsoft, Valve Software, TDK Electronics.

Ex-Chrome Developer Relations.