Image correlation algorithm benchmark:
OpenCL:
240 GBps -- Athlon II X4 640, 3GHz (12GHz aggregate), 2MB L2
85 GBps -- Core 2 Duo E6400, 2.1GHz (4.3GHz aggregate), 2MB L2
SSE:
103 GBps -- Athlon II X4
45 GBps -- Core 2 Duo
Naive:
13 GBps -- Athlon II X4
5 GBps -- Core 2 Duo
Pretty much linear scaling with aggregate clock frequency in OpenCL: 240 vs 85 GBps is about 2.8x, and 12GHz vs 4.3GHz is also about 2.8x. Both CPUs have a 3-cycle L1 latency and the algorithm is very much an L1 cache benchmark, so this isn't too surprising. The SSE version has some bandwidth or load-balancing bottleneck going on, and the naive version is pretty much a pure memory bandwidth benchmark.
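For anyone who wants to check the scaling claim, here is a quick sketch that compares each speedup against the aggregate clock ratio. The throughput and clock figures are copied from the list above. Pairing them with OpenCL, SSE and naive in that order is my assumption from the order the three versions are discussed in, and the names `results` and `scaling_table` are made up for illustration.

```python
# Throughput in GBps as (Athlon II X4, Core 2 Duo), copied from the list above.
# ASSUMPTION: the three pairs are OpenCL, SSE and naive, in that order.
results = {
    "OpenCL": (240, 85),
    "SSE": (103, 45),
    "naive": (13, 5),
}

# Aggregate clock (cores x frequency) in GHz, as quoted in the list.
ATHLON_GHZ = 12.0
CORE2_GHZ = 4.3
clock_ratio = ATHLON_GHZ / CORE2_GHZ


def scaling_table(results, clock_ratio):
    """Return (name, speedup, speedup relative to the clock ratio) per version."""
    rows = []
    for name, (athlon, core2) in results.items():
        speedup = athlon / core2
        rows.append((name, speedup, speedup / clock_ratio))
    return rows


rows = scaling_table(results, clock_ratio)
for name, speedup, relative in rows:
    print(f"{name:7s} speedup {speedup:.2f}x, "
          f"clock ratio {clock_ratio:.2f}x, relative {relative:.2f}")
```

A relative value near 1.0 means the version tracks aggregate clock. Under the pairing assumed here, the first pair lands almost exactly on it and the second falls clearly short. The third pair is rounded to whole GBps, so its ratio is too coarse to read much into.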