art with code

2011-01-27

Intel X25-V SSD

Bought an Intel X25-V 40GB SSD to do some random read benchmarking. Don't stick it into a SATA port running in PATA compatibility mode, as that caps the bandwidth at 70 MB/s and pushes random access times up to 5 ms.

Sequential read speed on the unused disk was limited by the SATA bus. So I guess it's not actually reading anything for unused blocks, just generating a string of zeroes on the controller and sending it back. Would be pretty amusing to have a special disk driver that keeps a list of the unused blocks in system RAM and generates the data on the CPU for any read accesses to them. You could probably keep it all in L1 and get nice bogus 4k random read benchmark numbers. "1 ns access latency!?? 100 GB/s random read bandwidth??? Whaaaat?" (An allocation bitfield for 80 gigs in 512k blocks is about 20 kB. Increase blocksize or add RLE for bigger volumes.)
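For fun, here's what that hypothetical driver's read path might look like; a minimal sketch assuming 512k blocks and an 80 gig device, with made-up function names (read_block, read_block_from_device) rather than anything from a real block layer:

```c
/* Sketch of the "fake reads for unused blocks" idea: one bit per 512 kB
 * block, set if the block has ever been written.
 * 80 GB / 512 kB = 163840 blocks = 163840 bits = 20 kB of bitmap. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE (512 * 1024)
#define DISK_SIZE  (80ULL * 1024 * 1024 * 1024)
#define NUM_BLOCKS (DISK_SIZE / BLOCK_SIZE)

static uint8_t allocated[NUM_BLOCKS / 8]; /* 20 kB, easily cache-resident */

/* Stand-in for the real device read; an actual driver would hit the disk here. */
static void read_block_from_device(uint64_t block, void *buf)
{
    (void)block;
    memset(buf, 0xAA, BLOCK_SIZE); /* pretend we read something */
}

void read_block(uint64_t block, void *buf)
{
    if (allocated[block / 8] & (1 << (block % 8)))
        read_block_from_device(block, buf); /* block has real data */
    else
        memset(buf, 0, BLOCK_SIZE);         /* never written: fabricate zeroes */
}

int main(void)
{
    printf("bitmap size: %zu bytes\n", sizeof(allocated)); /* prints 20480 */
    return 0;
}
```

sizeof(allocated) comes out to 20480 bytes, i.e. the 20 kB mentioned above.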

Read speed on actual data was around 200 MB/s. Average random read time for a 4k block was 0.038 ms with 128 threads and 0.31 ms with a single thread, so the controller can do roughly 8 reads in parallel (0.31 / 0.038 ≈ 8). Which is nice for a cheapo SSD.
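For reference, this kind of benchmark is about a page of C: N threads doing O_DIRECT preads of 4k blocks at random offsets. A rough sketch along those lines follows (not the exact tool used for the numbers above; run it as root against the raw device, build with `gcc -O2 -pthread`):

```c
/* Hypothetical parallel 4k random read benchmark. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define READS_PER_THREAD 4096
#define BLOCK 4096

static int fd;
static unsigned long long dev_size;

static void *reader(void *arg)
{
    void *buf;
    posix_memalign(&buf, BLOCK, BLOCK);  /* O_DIRECT needs aligned buffers */
    unsigned int seed = (unsigned int)(size_t)arg;
    unsigned long long nblocks = dev_size / BLOCK;
    for (int i = 0; i < READS_PER_THREAD; i++) {
        off_t off = (off_t)(rand_r(&seed) % nblocks) * BLOCK;
        if (pread(fd, buf, BLOCK, off) != BLOCK)
            perror("pread");
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s /dev/sdX nthreads\n", argv[0]); return 1; }
    int nthreads = atoi(argv[2]);
    fd = open(argv[1], O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }
    if (ioctl(fd, BLKGETSIZE64, &dev_size) != 0) { perror("BLKGETSIZE64"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_t th[nthreads];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&th[i], NULL, reader, (void *)(size_t)(i + 1));
    for (int i = 0; i < nthreads; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    long total = (long)nthreads * READS_PER_THREAD;
    printf("%ld reads in %.2f s: %.3f ms avg latency, %.0f reads/s\n",
           total, secs, secs * 1000.0 * nthreads / total, total / secs);
    return 0;
}
```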

Didn't really test write performance apart from a cursory check. And yes, it is low bandwidth: 40 MB/s streaming writes. So it's best used as a random read drive.

Might make a nice random read array with 6 drives or somesuch. Hypothetically: 48 parallel reads, 160 000 4k reads per second (650 MB/s). And then all you need is software that can take advantage of that.

Getting flash chip latencies lower would be good.

2011-01-17

Stupid ideas to kick off 2011

They have those spinning disks with a couple gigs of flash as read cache, right? So, how about spending about 30e to put a bunch of fast flash chips and a couple gigs of DRAM on a motherboard to act as a SATA read cache. Read the flash into DRAM during the boot checks (I guess you have around 5 seconds to do it, so 0.5-1GB/s should suffice). Ooh, mysteriously the computer boots as if from a ramdisk.

Or alternatively, if you like software more than hardware, use the fast flash as the OS disk, suck it into RAM during the boot checks, and write the OS to be able to utilize that.

Also, maybe they could make a computer that's not actually a computer but a rock. And then you could have a mouse that's not a mouse but a rat and it'd bite your fingers off and give you rabies.

2011-01-15

How to take advantage of that memory?

[edit] Here's an rc.d init script I'm using to prefetch the FS to page cache: usr-prefetch. Copy it to /etc/init.d and run `update-rc.d usr-prefetch start 99 2 .` to add it to the runlevel 2 init scripts. I currently have 4GB of RAM, and the total size of the prefetched stuff is 3.7GB, so it might actually help. Maybe I should go buy an extra 2GB stick. It takes about a minute to do the prefetch run with cold cache, so the average read speed is around 60 MB/s. Which is pretty crappy compared to the 350 MB/s streaming read speed, so maybe there's some way to make the prefetch a streaming read.
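The script itself isn't reproduced here, but what it does amounts to roughly the following C, using Linux's readahead() call; the directory list is a guess at what's worth caching, not the actual contents of usr-prefetch:

```c
/* Rough C equivalent of a page-cache prefetch script: walk a tree and ask
 * the kernel to pull every regular file into the page cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

static int prefetch(const char *path, const struct stat *st, int type, struct FTW *ftw)
{
    (void)ftw;
    if (type != FTW_F)
        return 0;                      /* only regular files */
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0)
        fd = open(path, O_RDONLY);     /* O_NOATIME fails on files we don't own */
    if (fd < 0)
        return 0;
    readahead(fd, 0, st->st_size);     /* populate the page cache */
    close(fd);
    return 0;
}

int main(void)
{
    const char *dirs[] = { "/usr", "/lib", "/bin", "/sbin", "/etc" };
    for (unsigned i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
        nftw(dirs[i], prefetch, 64, FTW_PHYS);
    return 0;
}
```

Since this visits files in directory order rather than disk order, the disk seeks all over the place, which is presumably why the cold-cache run only manages ~60 MB/s.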

Ubuntu's ureadahead reads in the files accessed during boot in disk order. If you add all your OS files to it, it should be possible to stream the whole root filesystem to page cache in about ten seconds. I dunno.

And now I'm reading through ureadahead's sources to figure out how it sorts the files in disk order. It uses the fiemap ioctl to get the physical extents for the file inodes, then sorts the files according to the first physical block used by each file. To read the files into the page cache, it uses readahead(). If there were some way to readahead physical blocks instead of files, you could collapse the physical blocks into spans with > x percent occupancy, do a single streaming read per span, and cache only the blocks belonging to the files you want cached.
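Here's a sketch of that fiemap lookup: asking the kernel for a file's first physical extent so the files can be sorted into disk order. It's trimmed down rather than lifted from ureadahead, but the ioctl interface is the real one:

```c
/* Print the physical byte offset of each file's first extent, for sorting
 * files into disk order before readahead(). Error handling trimmed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Returns the physical byte offset of the file's first extent, or 0 on failure. */
static unsigned long long first_physical_block(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    /* struct fiemap is followed in memory by the extent array it fills in. */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    fm->fm_start = 0;
    fm->fm_length = ~0ULL;           /* whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;
    fm->fm_extent_count = 1;         /* we only care about the first extent */

    unsigned long long phys = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
        phys = fm->fm_extents[0].fe_physical;

    free(fm);
    close(fd);
    return phys;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%llu\t%s\n", first_physical_block(argv[i]), argv[i]);
    return 0;
}
```

Pipe the output through `sort -n` and you get the disk-order file list to feed to readahead().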

Another easy way to do fast caching would be to put all system files on a separate partition, then cache the entire partition with a streaming read. Or allocate a ramdisk, dd the root fs onto it at boot, use it through UnionFS with writes redirected to the non-volatile fs. But whether that's worth the bother is another thing.
[/edit]

In the previous post I put together a hypothetical machine with 16GB RAM and 1.5GB/s random access IO bandwidth. Which is total overkill for applications that were built for machines with 100x less resources. What is one to do? Is this expensive hardware a complete waste of money?

The first idea I had on taking advantage of that amount of RAM was to preload the entire OS disk to page cache. If the OS+apps are around 10GB in total, it'll take less than 10s to load them up at 1.5GB/s. And you'll still have 6GB unused RAM left, so the page cache isn't going to get pushed out too easily. With 10s boot time, you'd effectively get a 20GB/s file system 20 seconds after pushing the power button.

Once you start tooling around with actual data (such as video files), that is going to push OS files out of page cache, which may not be nice. But what can you do? Set some sort of caching policy? Only cache libs and exes, use the 1.5GB/s slow path for data files? Dunno.

If you add spinning disks to the memory hierarchy, you could do something ZFS-like and have a two-level cache hierarchy for that there filesystem. 10GB L1 in system RAM, 240GB L2 on SSDs, a couple TB on spinning disks. But again, how sensible is it to cache large media files (which will likely be the primary use of the spinning disks)? IT IS A MYSTERY AARGH

It'd also be nice to have 1-2GB of faster RAM between the CPU caches and the system RAM (with GPU-style 150GB/s bandwidth), but maybe you can only have such things if you have a large number of cores and don't care about latency.

But hmm, the human tolerance for latency is around 0.2s for discrete stuff and 0.015-0.03s for continuous movement. In 0.2s you can do 4GB worth of memory reads and 0.3GB of IO; in 0.02s the numbers are 0.4GB and 0.03GB respectively. On a 3.5 GHz quad-core CPU, you can execute around 3 billion instructions in 0.2s, which'd give you a ratio of 0.75 ops per byte for scalar code and 6 ops per byte for 8-wide vector instructions. On GPUs the ops/byte ratio is larger, close to 20 ops/byte for vector ops, which makes them require more computation-heavy algorithms than CPUs.
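Spelling that arithmetic out, using the 20GB/s RAM, 1.5GB/s IO and 4-core 3.5 GHz figures from the build below (the 0.75 and 6 ops/byte above come from rounding 2.8 billion instructions up to 3 billion):

```c
/* Latency-budget arithmetic: how much you can read and compute per frame. */
#include <stdio.h>

int main(void)
{
    double frame_times[] = { 0.2, 0.02 };   /* discrete action, continuous motion */
    double ram_bw = 20e9, io_bw = 1.5e9;    /* bytes per second */
    double ops_per_sec = 4 * 3.5e9;         /* scalar instructions, 4 cores */

    for (int i = 0; i < 2; i++) {
        double t = frame_times[i];
        printf("%.2fs budget: %.1f GB from RAM, %.2f GB from IO, "
               "%.1f Gops scalar (%.2f ops/byte, %.1f ops/byte with 8-wide vectors)\n",
               t, ram_bw * t / 1e9, io_bw * t / 1e9,
               ops_per_sec * t / 1e9,
               ops_per_sec / ram_bw,
               8 * ops_per_sec / ram_bw);
    }
    return 0;
}
```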

In 4GB you can fit 60 thousand uncompressed 128x128 images with some metadata. Or hmm, 6 thousand if each has ten animation frames.

2011-01-13

The computer hardware of 2011

Here's a 1500e computer for 2011. It's a build where I tried to minimize the bottlenecks and get a nice memory pyramid with GPU cache 2 TB/s, GPU RAM 300GB/s, CPU cache 200GB/s, RAM 20GB/s, IO 2GB/s. Throw in 100e worth of spinning disks there in RAID-0 for a trailing 0.2GB/s disk IO.

CPU: 4-6 cores, 3.5GHz, 150e.
Motherboard: 150e.
Case: 50e steel box with razor-sharp edges and PSU from hell.

Total: 350e.

Memory: 8GB @ 150e.
Memory bandwidth: 20GB/s.

Total: 500e.

IO subsystem: 4xSSD RAID-0 @ 500e.
2GB/s streaming IO bandwidth.
1.5GB/s random access IO bandwidth at 4k block size.

Total: 1000e.

GPU: two upper middle-class things @ 500e.
Computing power: 3 TFLOPS.
Memory bandwidth: 300GB/s.

Total: 1500e.

(Optionally 2x 500GB disks, 0.2GB/s IO bandwidth @ 100e for a total of 1600e.)


Reasons for the selections:

The performance difference between a 150e CPU and a 300e CPU is ~15%, and around 25% with overclocking. The point is more to get high memory bandwidth with minimal cost. Non-bargain mobo selected for 6Gbps SATA and dual GPUs. Only 8GB memory to cut costs. All money saved was tossed on getting random access IO onto the order-of-magnitude curve and the GPUs to cap the top end of the memory bandwidth curve (and for a disproportionate amount of computing power). The SSD numbers are based on recently announced SATA 6Gbps drives. If you don't care about GPU performance, get a 300e CPU, use IGP, double the RAM and buy an extra SSD.

Now, programming this thing is going to be interesting. Not only do the CPU and GPU require parallelism to extract reasonable performance, but the random access IO subsystem does too. The IO subsystem's latency is probably in the 0.2ms range, but it can do 64 parallel accesses.

A single-threaded C program will achieve maybe 1% of peak performance of the hardware, if that.

For legacy software, you could buy a 300e computer and it'd perform just as well.

CPU: 70e, mobo: 60e, case: 50e, RAM: 20e, SSD: 90e. Toss a 40e disk in there for good measure.

Or better yet, quit computers altogether and spend all the money on furniture.
