art with code

2010-08-31

Branching factor generator

A function to generate a stream of tree branching factors to approximate a non-integer branching factor.

branches' x i sum =
  -- Note: on the first call sum/i is 0/0 = NaN, and NaN > x is False,
  -- so the stream starts with ceiling x.
  let v = (if sum/i > x then floor x else ceiling x) :: Int in
  v : branches' x (i+1) (sum + fromIntegral v)

branches n = branches' n 0 0

e_branches = branches (exp 1)
e_100000 = take 100000 $ e_branches
average l = sum l / fromIntegral (length l)
average (map fromIntegral e_100000) -- => 2.71828...

You could also write the average function as

average2 l = snd $ foldl (\(i, avg) e -> (i+1, (avg*(i-1) + e)/i)) (1, 0) l

average2 $ map fromIntegral e_100000

even though that is less accurate, as the running average accumulates floating-point rounding error at every step. You can use a similar trick for the branches' function to use a running average instead of a running sum.

branches2' x i avg =
  let v = (if avg > x then floor x else ceiling x) :: Int in
  v : branches2' x (i+1) ((avg*(i-1) + fromIntegral v) / i)

branches2 n = branches2' n 1 0

average $ map fromIntegral $ take 100000 $ branches2 (exp 1)

To generate a tree of a wanted size out of the branching factor list, you'd have a tree generator that eats values off the list and generates tree nodes breadth-first. It is kind of a pain to write.
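For what it's worth, here's a minimal sketch of one way to go about it. The Tree type and the function names are made up for this sketch, and it builds to a fixed depth rather than an exact node count, which dodges most of the painful bookkeeping: carve the factor stream into per-level chunks, then stitch each level to the one below it.

data Tree = Node [Tree] deriving Show

-- Carve the factor stream into levels, breadth-first:
-- level 0 is the root, and level k+1 has sum(level k) nodes.
levelFactors depth fs = go depth 1 fs
  where go 0 _ _ = []
        go d width fs = lvl : go (d-1) (sum lvl) rest
          where (lvl, rest) = splitAt width fs

-- Stitch the levels back into a tree: node i of a level takes the
-- next f_i nodes of the level below as its children, and the nodes
-- of the deepest level become leaves.
fromLevels lvls = head (foldr step [] lvls)
  where step lvl kids = go lvl kids
          where go [] _ = []
                go (f:fs) ks = Node (take f ks) : go fs (drop f ks)

-- e.g. fromLevels (levelFactors 4 (branches (exp 1)))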

And some Tuesday links

Man-Computer Symbiosis by J.C.R. Licklider. Continuing on the HCI vision golden oldies hitlist.

Twin hurricanes.

A Dust Mite.

2010-08-28

Folding, part 3: parallel folds

So, you can do a parallel fold if your folding function is associative.

import Control.Parallel (par, pseq)

-- Here i should be the neutral element of f, that is f i x = x.
-- E.g. 0 for + and 1 for *
parFold f i [] = i
parFold f i [x] = f i x
parFold f i xs =
  l `par` r `pseq` f l r
  where l = parFold f i left
        r = parFold f i right
        (left, right) = splitAt (length xs `div` 2) xs

-- See how associative functions work right with it
parFold (+) 0 [1..10] == foldl (+) 0 [1..10]
parFold (*) 1 [1..10] == foldl (*) 1 [1..10]

-- But others have problems
parFold (-) 0 [1..10] /= foldl (-) 0 [1..10]
parFold (/) 1 [1..10] /= foldl (/) 1 [1..10]

-- As - and / are defined as A + -B and A * 1/B,
-- we can work around these instances
parFold (+) 0 (map negate [1..10]) == foldl (-) 0 [1..10]
parFold (*) 1 (map recip [1..10]) == foldl (/) 1 [1..10]

But for the general case, we need something stronger. Roll in the reduceFold! It splits the foldable list into sublists, folds each sublist, and reduces the subresults into a single value.

reduceFold reduce f init [] = init
reduceFold reduce f init [x] = f init x
reduceFold reduce f init xs =
  l `par` r `pseq` reduce l r
  where l = reduceFold reduce f init left
        r = reduceFold reduce f init right
        (left, right) = splitAt (length xs `div` 2) xs

-- Associative operations work as usual
reduceFold (+) (+) 0 [1..10] == foldl (+) 0 [1..10]
reduceFold (*) (*) 1 [1..10] == foldl (*) 1 [1..10]

-- And we can now combine the subresults in a different fashion
reduceFold (+) (-) 0 [1..10] == foldl (-) 0 [1..10]
reduceFold (*) (/) 1 [1..10] == foldl (/) 1 [1..10]

Let's write some parallel list operations in terms of reduceFold.

parMap f lst = reduceFold (++) (\l i -> l ++ [f i]) [] lst
parFilter f lst = reduceFold (++) (\l i -> if f i then l++[i] else l) [] lst
parConcat lsts = reduceFold (++) (++) [] lsts
parConcatMap f lst = reduceFold (++) (\l i -> l++(f i)) [] lst

optLeft (Just x) _ = Just x
optLeft Nothing (Just x) = Just x
optLeft Nothing Nothing = Nothing

optFilter f i = if f i then Just i else Nothing

parFind f lst = reduceFold optLeft (\l i ->
  optLeft l (optFilter f i)) Nothing lst

optRight x y = optLeft y x

parFindLast f lst = reduceFold optRight (\l i ->
  optRight l (optFilter f i)) Nothing lst

-- Yeah, they pretty much work
divisibleBy n x = x `mod` n == 0

parMap (+ 10) [1..10] == [11..20]
parFilter (divisibleBy 3) [1..15] == [3,6,9,12,15]
parConcat [[1], [2,3,4], [5,6,7], [8,9], [10..20]] == [1..20]
parConcatMap (\x -> [x,x,x]) [1..3] == [1,1,1,2,2,2,3,3,3]
parFind (divisibleBy 127) [90,93..900] == Just 381
parFind (> 10) [1..10] == Nothing
parFindLast (divisibleBy 127) [90,93..900] == Just 762

You might've noticed that both parFold and reduceFold are creating a parallel evaluation spark for each element in the lists. This is not a very good idea as the sparking overhead is often going to dominate the execution time and kill performance.

So we need some way to control the reduce tree branching factor and clump up the leaf-level folds into longer operations. I don't know if I can manage to come up with an automagic or even semi-automagic way to do this. I would like to use yesterday's theoretical math as guidance for the tree shape optimizer. Maybe we'll find out in the next installment.
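No automagic yet, but the clumping half is easy enough to sketch by hand, assuming a chunk size picked manually (chunkedReduceFold is a made-up name): fold fixed-size chunks sequentially, and only spark the reduce tree over the chunk results.

chunkedReduceFold reduce f init chunkSize xs =
  parReduce (map (foldl f init) (chunksOf chunkSize xs))
  where
    parReduce [] = init
    parReduce [x] = x
    parReduce ys = l `par` r `pseq` reduce l r
      where l = parReduce left
            r = parReduce right
            (left, right) = splitAt (length ys `div` 2) ys
    chunksOf n [] = []
    chunksOf n l = c : chunksOf n rest
      where (c, rest) = splitAt n l

-- e.g. chunkedReduceFold (+) (+) 0 1000 [1..100000]
-- sparks a tree over 100 chunk sums instead of one spark per element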

A small hill-climb optimizer


hillClimb f current delta error 0 = current
hillClimb f current delta error steps | delta < error = current
hillClimb f current delta error steps =
  if curVal > prevVal && curVal > nextVal
    then hillClimb f current (delta / 2) error (steps-1)
    else if prevVal > nextVal
      then hillClimb f (current-delta) delta error (steps-1)
      else hillClimb f (current+delta) delta error (steps-1)
  where curVal = f current
        prevVal = f (current-delta)
        nextVal = f (current+delta)

May have bugs in it, only tested on (\x -> -(x**2)).

2010-08-27

Folding, part 2: parallel reduction math

Parallel folding is a bit of an oxymoron. The fold operation sequentially builds up an accumulator value from the elements of a data structure. So if it's inherently sequential, how can it be parallelized?

If the reduce function given to fold is associative, the fold can be parallelized. If it's not associative, the fold is sequential.

Associativity means that you can turn (a + b) + c into a + (b + c), or, in function call terms, (f (f a b) c) == (f a (f b c)). When you have a long call chain like a + b + c + d + e + f + g + h, it means that you can split it into (a + b) + (c + d) + (e + f) + (g + h), evaluate the parenthesized bits in parallel, and then reduce the results to the final value. The reduction can also be parallelized, as it is the fold (a_b + c_d) + (e_f + g_h). This sort of parallel tree reduction is what map-reduce is all about btw: compute a bunch of values in parallel and reduce them to the result value.
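Spelled out in code, the regrouping is just this (nothing beyond what the paragraph already says):

-- sequential and pairwise groupings agree because (+) is associative
sequentialSum = ((((((1+2)+3)+4)+5)+6)+7)+8
pairwiseSum   = ((1+2) + (3+4)) + ((5+6) + (7+8))
-- sequentialSum == pairwiseSum, both are 36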

Where it all goes by the wayside


Let's do the math to figure out the theoretical runtime for a reduction network. One parallel reduction step takes computeTime time. Moving the result to the next reduce node takes totalResultSize / bandwidth time. The variable totalResultSize is defined as resultSize * branchingFactor, as the branching factor is the number of nodes sending their reduce results to the reduce node. We can also split it into the more useful form branchingFactor * (resultSize / bandwidth) and write transferTime = resultSize / bandwidth. Finally, the number of parallel reduction steps required to reduce a tree of size N is log_branchingFactor(N) (with 1 < branchingFactor <= N).

Putting it all together, we get something like

treeReduceTime n branchingFactor computeTime transferTime =
  stepCount * stepTime
  where stepTime = computeTime + (branchingFactor * transferTime)
        stepCount = if branchingFactor == 1.0
                      then n
                      else log n / log branchingFactor

From the above we can come up with a couple of generalizations:

If computeTime is O(1) with regard to branchingFactor and you have infinite bandwidth or zero result size (both mean that transferTime is zero), you want a branching factor of N. That's easy enough to see:

With infinite bandwidth, the treeReduceTime is
(computeTime + (branchingFactor * 0)) * stepCount =
computeTime * stepCount

To minimize the above, we need to minimize stepCount.
The smallest stepCount we can have is 1.
The stepCount is 1 when branchingFactor is N.


If you have infinitely fast compute nodes (computeTime = 0), you want a branching factor of e. That's a bit tougher to prove[1] (or maybe I'm just lacking some essential bit of knowledge that would make the proof easier.)

Note that it's highly likely that computeTime depends on branchingFactor. If the reduce algorithm is O(n), computeTime = branchingFactor * elementComputeTime, and in that case the optimum branching factor is e (as elementComputeTime and transferTime have the same factor, they can be rolled together and the infinitely fast compute nodes proof applies). For O(n^2) algorithms, the optimal branching factor seems to be sqrt(e).

For O(1) computeTime, finite computational power and non-zero transfer time, the optimal branching factor is somewhere between e and N. I used a simple hill-climb optimizer to find optimal branching factors for different computation/transfer ratios, and they do seem to be independent of N (well, apart from the maximum branching factor). For 1:1 compute time : transfer time, the branching factor is around 3.6. For 1:10 (low bandwidth), the branching factor is a bit over 2.8. For 10:1 (slow compute), the branching factor is about 8.6. Here are some approximate ratios for integer branching factors: 3 -> 1:3.5, 4 -> 1.5:1, 5 -> 3:1, 6 -> 5:1; and for 1000:1 the factor is about 226.
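For reference, here's roughly how the hillClimb optimizer from the 2010-08-28 post above can be pointed at treeReduceTime to reproduce these numbers. optimalBranchingFactor is a made-up name, and since hillClimb maximizes, it gets the negated reduce time:

optimalBranchingFactor n computeTime transferTime =
  hillClimb (\b -> negate (treeReduceTime n b computeTime transferTime))
            2.0 1.0 1e-6 1000

-- optimalBranchingFactor 1e6 1 1   => about 3.6  (1:1)
-- optimalBranchingFactor 1e6 1 10  => about 2.8  (1:10, low bandwidth)
-- optimalBranchingFactor 1e6 10 1  => about 8.6  (10:1, slow compute)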

The above approximations can be used for branching-factor-dependent computation times too. You do that by treating the per-reduce-step overhead as the computation time and folding the per-element computeTime into transferTime. The new ratio is overhead : (computeTime + transferTime). For intuition on this: if the overhead dominates, you want more branching, as that decreases the overhead's contribution. If you have very little overhead, you want a branching factor close to e.

For an example, consider an email tournament that aims to pick the best email out of ten thousand. You receive some fixed amount of emails and your task is to pick the best one, then forward it to your assigned receiver (who does the same, etc., and the last reducer sends the mail to the award ceremony). Let's estimate that the transfer time is 5 seconds, the computing time is a minute per email, and the fixed overhead is 8 hours. This gives us an overhead:processing-ratio of 443. If each receiver processes 10 emails per session, running the tournament would take 33 hours. But if each receiver does 118 emails a go, the tournament would be done in 19.5 hours. And if they do 200 emails, it'd take ... 20.2 hours. (Yeah, 118 is the best branching factor here.)

Suppose you do the same tournament but with people glued to their mail apps. Now the overhead might be just one minute. This time, doing the tournament with 118 mails per session would take a snappy 4.1 hours. But as the role of overhead has fallen, we should reduce the branching factor as well. With 4 mails per session, the tournament would take only 35 minutes. (With 2 mails per session, 42 minutes. 4 is optimal.)
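These tournament figures drop straight out of treeReduceTime: the fixed overhead plays the role of computeTime and the 65 seconds of per-mail handling (5 s transfer + 60 s reading) goes into transferTime. tournamentHours is a made-up helper:

tournamentHours mailsPerSession overheadSeconds =
  treeReduceTime 10000 mailsPerSession overheadSeconds 65 / 3600

-- tournamentHours 10  28800 => about 33    (8 h overhead)
-- tournamentHours 118 28800 => about 19.5
-- tournamentHours 118 60    => about 4.1   (1 min overhead)
-- tournamentHours 4   60    => about 0.59  (the 35 minutes)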


With infinitely fast compute nodes, the treeReduceTime is
(branchingFactor * transferTime) * stepCount

For a branching factor of 1, that equals
transferTime * n

For other branching factors, the equation is
b * t * log_b(n), where b = branchingFactor and t = transferTime

From which we see that t is a fixed factor,
so the minimum of the above equation is the minimum of the equation
b * log_b(n)

Which can be written as
log_b n = log n / log b, b > 1
b * log_b(n) =
b * (log n / log b) =
(b / log b) * log n

From which we see that log n is a fixed factor,
so the minimum of b * t * log_b(n) is the minimum of
b / log b

To find the minimum, find the derivative
D(b/log b) = (1*log b - b*1/b) / (log b)^2
= (log b - 1) / (log b)^2

And since the derivative is zero at b = e,
(log e - 1) / (log e)^2 = (1 - 1) / 1^2 = 0 / 1 = 0,
we check that the values of b / log b on both sides of the
critical point are larger than the value at the critical point,
e / log e = e =~ 2.718
2 / log 2 =~ 2.885
3 / log 3 =~ 2.731,
thus confirming it as a local minimum.

Checking the behavior of b / log b as b approaches the
discontinuity at b = 1, where log b -> 0 and b / log b -> infinity,
and finding it larger than e / log e,
we conclude that the global minimum of b / log b lies at b = e.

Plugging that back into the original equation:
minimum of branchingFactor * transferTime * stepCount is
transferTime * min(n, e*ln(n))
and because n >= e*ln(n), the minimum is
transferTime * e * ln(n).
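As a numeric sanity check, the hillClimb optimizer from the 2010-08-28 post above lands on the same spot when aimed at b / log b (bOverLogBMinimum is a made-up name):

bOverLogBMinimum = hillClimb (\b -> negate (b / log b)) 2 0.5 1e-9 1000
-- => 2.71828... i.e. e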

2010-08-26

Folding, part 1

The fold function is a data structure accumulator: it goes through every element in the data structure and builds up an accumulator value from them. An example for folding lists:

foldl f acc [] = acc
foldl f acc (x:xs) = foldl f (f acc x) xs

With a fold you can do maps and filters of different kinds. Here are a few examples:

-- id x = x
reverseMap f lst = foldl (\l i -> (f i):l) [] lst
reverse = reverseMap id
map f lst = reverse (reverseMap f lst)

filter f lst =
  reverse (foldl (\l i ->
    if f i
      then i:l
      else l
    ) [] lst)

concat a b = foldl (\l i -> i:l) b (reverse a)

concatMap f lst = foldl (\l i -> concat l (f i)) [] lst

triplicate lst = concatMap (\i -> [i,i,i]) lst

drop n lst =
  reverse $ snd $ foldl (\(c,l) i ->
    if c >= n
      then (c, i:l)
      else (c+1, l)) (0,[]) lst

You can also do find operations with fold, but they're not as efficient as you'd like, because the fold goes through all the elements in the list. To make find efficient, we need a fold with break.

findInefficient f lst =
  foldl (\e i ->
    case e of
      Nothing -> (if f i then Just i else e)
      Just x -> e) Nothing lst

data Break a = Break a | Continue a

foldlWithBreak f acc [] = acc
foldlWithBreak f acc (x:xs) =
  case f acc x of
    Break a -> a
    Continue a -> foldlWithBreak f a xs

find f lst =
  foldlWithBreak (\e i ->
    if f i
      then Break (Just i)
      else Continue Nothing) Nothing lst

take n lst =
  reverse $ snd $ foldlWithBreak (\(c, l) i ->
    if c >= n
      then Break (c,l)
      else Continue (c+1, i:l)) (0, []) lst

any f lst =
  -- maybe False (const True) (find f lst)
  case find f lst of
    Just x -> True
    Nothing -> False

all f lst =
  -- not $ any (not.f) lst
  not (any (\x -> not (f x)) lst)

And that's it for today. Tomorrow, parallel folding.

FormData file uploads

XMLHttpRequest Level 2 FormData is super. Loving it.

var fd = new FormData();
fd.append("key", "value");
var files = dropEvent.dataTransfer.files;
for (var i=0; i<files.length; i++)
  fd.append("file", files[i]);

var xhr = new XMLHttpRequest();
xhr.open("POST", "/upload");
xhr.send(fd);

2010-08-25

sse.h update, HTML5 fiddling, more links

I updated sse.h with license info (MIT) and added software implementations of recip() and rsqrt() to double2 and double4. This Vector Classes library might be more useful for Real Serious Use, though.

Currently fiddling with HTML5 media tags (cursed be audio and video codecs. If it were possible to implement media codecs in JS, that would be a solution.) Also doing drag-and-drop file uploads. The event API is ... interesting. To receive a drop event, you need to preventDefault other drag-related events.

Failures are strengths.


Have some links to go:

Eruptions blog readership.
Activity at Sakurajima Volcano Intensifies.
Region-based memory management, ML Kit, A Simplified Account of Region Inference, Combining Region Inference and Garbage Collection, A Safe Runtime System in Cyclone.

2010-08-23

A new blog post!

No, wait, sorry, just more link copypasta. My bad.

Join, or Die. Benjamin Franklin, truly a master of Perl. Via Hark, a vagrant.

Ice Island Calves off Petermann Glacier, via Bad Astronomy.

2010-08-22

Yet more random links

Electric blue plankton blooms off Ireland.

Employer Expectations, Peer Effects and Productivity: Evidence from a Series of Field Experiments [PDF]. A study using Amazon Mechanical Turk, one finding was that peers reward effort rather than output quality. Quote from the implications section: "For example, it may be difficult to get workers to substitute easy, correct procedures for difficult, inefficient procedures. Ironically, the difficulty itself might make an outdated procedure harder to replace, as workers who adopt the easier method might be perceived to be shirking. The finding that exposure to low-output work lowers output, combined with the finding that low-productivity reduces willingness to punish, suggests the possibility of an organizational vicious cycle: after observing idiosyncratically bad work, workers may lower their own output and punish less in response, in turn reducing other workers’ incentives to be highly productive."

Forecasting bird migration for the Finnish Air Force. "Why? Bird is a threat."

Random links for another day

PacketShader: a GPU-Accelerated Software Router - eeh?

Enceladus plumes with anigifs. Cassini just keeps taking these awesome shots of the Saturn system.

Kirkenes harbour (and googling with the name of the ship got me this. Take that for good measure as well.)

Fantoft stave church, Pre-Columbian Mexican architecture

... the deep-sea food web is fueled by a rain of dead plants and animals from surface waters. Sediment for the sediment god, let the red clay sink back into the mantle.

Cryosphere Today.

That reminds me: We inhabit a few meter thick layer at the solid-gas interface of a 6500 km radius sphere of rock and iron. Go up or down a few meters and you'll die. Go to the solid-liquid or gas-liquid interface and you'll die. Go to the solid, go to the liquid or go to the gas and you'll die.

Bismuth crystal. I quite like the Mineralogical Record print mag. Lots of good pictures.

2010-08-21

Random links of the day

Use of Ada in a Student CubeSat Project (via Ada-Europe 2008 via SPARK Conference Papers). First they built some arctic sea ice buoys and then used the same HW/SW-platform to do a CubeSat, in a high-integrity dialect of Ada. (The conference proceedings have some other wacky stuff too, like a presentation about porting a naval command system used in Royal Navy destroyers to Ada 2005. And yes, the slides are basically a long list of stuff that's changed between language versions, so it's not that exciting.)

Neptune about to complete first orbit since discovery. Discovered in 1846, orbital period 165 years, orbit complete next summer. Celebration? (Via Uncertain Principles.)

The dangers of plein air. That reminds me, I've fallen off the drawing routine. Perhaps I should unfall. Also this.

"According to the meteorologists, it's never summer in Gamvik". The coast of the Arctic Sea is a jolly good place. Full of fun and excitement and days that last for months.

2010-08-19

Random programming ideas

  • Programming is about gathering knowledge about a piece of code. What does it do, what are its limitations, how does it perform, where can it be used, what is it doing right now, how do I extend it, how much memory does it use, how do I know that it works right, etc.
  • Testing, logging, documentation and profiling as an integral part of a language. Parse each function and module into a {body, tests, docs, logstream, profilestream}-struct that can be fiddled with at runtime.
  • Automatically test a function based on its type. Call it with different args, log the results and summarize detected patterns to give programmer a quick description of what the function does and what its time-space-complexity looks like. Add a way to flag found properties as want/do not want and check that subsequent versions of the function fulfill those (basically autogenerating QuickCheck rules).
  • Typed UNIX pipes. Use HTTP headers to tell the recipient what's coming through.
  • Auto-generate type->type-pipelines by putting conversion functions in a graph, run graph search on it with edge weight given by amount of information lost in each conversion (and secondarily performance).
  • Semantic type information to encode the effects of a function. Two functions that purport to do the same thing may well have semantic differences: suppose two image scaling functions where one assumes linear color space and the other does gamma correction. Both do ((RGB w h), new_width, new_height) -> (RGB new_width new_height), but the results differ.
  • Concurrency based on HW-level message passing. Processors read and write in cache lines, cache lines can't be shared between two processors => use cache lines as messages.
  • Array programming-like parallel array computation lib. Element-wise operations on N-wide vectors are easy to parallelize and vectorize: divide N by the amount of processors and the SIMD width. Needs to compile array-level pipes ((f . g . h) A) into element-level pipes (map (f' . g' . h') A) to maximize arithmetic per memory access. Maybe Data Parallel Haskell does this stuff already.
  • Uniform batteries-included interface for different data structures. Yeah, doable with typeclasses. Yeah, something like OCaml Batteries Included Enums. You can reduce data structures to either empty, fold and unfold (for streaming data structures) or alloc, length, get and set (for random access structures). A sketch of the streaming half follows this list.
  • Data structure -generic regular expressions (i.e. not tied to just strings), are they useful? See proof of concept. What you need is a fold over the data structure (and I guess that most data structures are foldable, how else would you free the memory their elements use?)
  • Instruction set with bounded & bordered LOADs and STOREs: LOAD reg addr minAddr maxAddr [borderValue]. Throw hardware exception if a bounded LOAD goes out of bounds, load borderValue if a bordered LOAD goes out of bounds. Bordered STORE writes to the address in borderValue instead. Faster than two compares and JMP? More compiler-friendly? Eliminates buffer overflows? (Random literature search: Security improvement in embedded systems via an efficient hardware bound checking architecture, Checking Array Bound Violation Using Segmentation Hardware, A Comprehensive Approach to Array Bounds Check Elimination for Java, x86 segmented access mode, some mainframe architectures are said to have something like that, mysteries of life.)
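To make the "empty, fold and unfold" reduction concrete, here's a minimal sketch of what the streaming-side typeclass could look like. Streamable and its member names are invented here, not an existing library:

class Streamable f where
  empty   :: f a
  sfold   :: (b -> a -> b) -> b -> f a -> b
  sunfold :: (b -> Maybe (a, b)) -> b -> f a

instance Streamable [] where
  empty = []
  sfold = foldl
  sunfold f b = case f b of
                  Nothing      -> []
                  Just (a, b') -> a : sunfold f b'

-- generic operations come for free once a structure is Streamable
slength :: Streamable f => f a -> Int
slength = sfold (\n _ -> n + 1) 0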

2010-08-16

China at #2

Hey, PRC thundered past Japan in nominal GDP this year. At this pace, China'll be passing USA and EU in a decade and maybe reach the same GDP per capita by 2040. At which point its annual growth rate will stabilize at around 4% and all the funds expecting 10% will cause a major economic crash, absorbed in part by an ascendant India? I guess the question is how China will transform from a fast-growing implementation economy into a slow-growing research one (IN THE FUTURE!!!1!one).

2010-08-10

A slow cache algorithm and a faster one

It took around 30 ms to draw some text titles. The weird thing was that not drawing the titles didn't really affect the draw time. So clearly the problem was not in the drawing speed. Investigating, my text bitmap LRU cache was hitting its max size, destroying all possible benefits of caching. But making the cache large enough didn't fix the performance problem. A mystery!

On closer look, my caching algorithm was doing something silly:

function makeCache(maxSize) {
  return { array: [], hash: {}, maxSize: maxSize };
}

function cacheItem(cache, item) {
  // shift old items off the array
  while (cache.array.length >= cache.maxSize) {
    var it = cache.array.shift();
    delete cache.hash[it.key];
  }
  cache.array.push(item);
  cache.hash[item.key] = item;
}

// move item with key to the end of cache array
function refresh(cache, key) {
  var it = cache.hash[key];
  var idx = cache.array.indexOf(it);
  cache.array.splice(idx,1);
  cache.array.push(it);
}

function get(cache, key) {
  refresh(cache, key);
  return cache.hash[key];
}

Why that is bad: O(n) get, O(n) cacheItem after hitting maxSize. If you're going through the whole cache every frame, that's O(n^2). If you're also at maxSize, double that! Aargh!

So I rewrote it to be less insane and now the drawing is nice and fast:

function makeCache(maxSize) {
  return { array: [], hash: {}, maxSize: maxSize, lastUsedIndex: 0 };
}

function cacheItem(cache, item) {
  // chop off the oldest items after reaching overflow limit
  if (cache.array.length >= cache.maxSize * 2) {
    // sort newest first, so the splice drops the oldest half
    cache.array.sort(function(a,b){ return b.lastUsed - a.lastUsed; });
    var deleted = cache.array.splice(cache.maxSize);
    for (var i=0; i<deleted.length; i++) {
      var it = deleted[i];
      delete cache.hash[it.key];
    }
  }
  item.lastUsed = cache.lastUsedIndex++; // stamp new items too
  cache.array.push(item);
  cache.hash[item.key] = item;
}

// update the cached item's lastUsed
function refresh(cache, key) {
  cache.hash[key].lastUsed = cache.lastUsedIndex++;
}

function get(cache, key) {
  refresh(cache, key);
  return cache.hash[key];
}

The gets are now O(1) and cacheItem with maxSize cache runs in amortized O((2n log 2n) / n) time. Or something. You could further optimize it by using a split instead of sort. Basically run an O(n) selection algorithm to find the maxSizeth element of the cache array, then use that as a pivot and do a single quicksort partition pass. That'd give you amortized O(2n / n), which is close enough to O(1) :P
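Sketching that split in Haskell for brevity (quickselect and evictOldest are hypothetical helpers, and the sketch assumes the lastUsed stamps are distinct):

-- quickselect k xs: the k-th smallest element of xs (0-based), O(n) average
quickselect k (x:xs)
  | k < l     = quickselect k smaller
  | k > l     = quickselect (k - l - 1) larger
  | otherwise = x
  where smaller = [y | y <- xs, y < x]
        larger  = [y | y <- xs, y >= x]
        l = length smaller

-- one partition pass against that pivot keeps the maxSize newest
-- stamps of the 2*maxSize in the overflowed cache, in O(n)
evictOldest maxSize stamps = [s | s <- stamps, s >= pivot]
  where pivot = quickselect maxSize stamps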
