art with code

2011-01-27

Intel X25-V SSD

Bought an Intel X25-V 40GB SSD to do some random read benchmarking. Don't stick it into a PATA-controller-driven SATA port, as that caps the bandwidth at 70 MB/s and pushes random access times to 5 ms.

Sequential read speed on the unused disk was limited by the SATA bus. So I guess it's not actually reading anything for unused blocks, just generating a string of zeroes on the controller and sending it back. Would be pretty amusing to have a special disk driver that keeps a list of the unused blocks in system RAM and generates the data on the CPU for any read accesses to them. You could probably keep it all in L1 and get nice bogus 4k random read benchmark numbers. "1 ns access latency!?? 100 GB/s random read bandwidth??? Whaaaat?" (An allocation bitfield for 80 gigs in 512k blocks is about 20 kB. Increase blocksize or add RLE for bigger volumes.)
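
Just for scale, here's what that bookkeeping could look like as a user-space sketch (not a real block driver; the sizes are just the 80 gig / 512k example from above):

```c
/* Sketch of the "generate zeroes for unused blocks" idea, in user space.
 * Not a real block driver; sizes follow the 80 GB / 512 kB example above. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  (512 * 1024)
#define DEVICE_SIZE (80ULL * 1024 * 1024 * 1024)
#define NUM_BLOCKS  (DEVICE_SIZE / BLOCK_SIZE)

static uint8_t allocated[NUM_BLOCKS / 8]; /* 1 bit per block, ~20 kB total */

static int block_is_allocated(uint64_t block) {
    return (allocated[block / 8] >> (block % 8)) & 1;
}

/* Read path: unused blocks never touch the disk, they're zero-filled on the CPU. */
static void read_block(uint64_t block, uint8_t *buf) {
    if (!block_is_allocated(block))
        memset(buf, 0, BLOCK_SIZE);
    /* else: issue a real disk read here */
}

int main(void) {
    static uint8_t buf[BLOCK_SIZE];
    read_block(12345, buf);
    printf("bitmap: %zu bytes for %llu blocks\n",
           sizeof(allocated), (unsigned long long)NUM_BLOCKS);
    return 0;
}
```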

Read speed on actual data was around 200 MB/s. Average random read time for a 4k block was 0.038 ms with 128 threads and 0.31 ms with a single thread, so the controller can do about 8 reads in parallel (0.31 / 0.038 ≈ 8). Which is nice for a cheapo SSD.
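
A minimal single-threaded version of such a 4k random read test looks something like this (a sketch, not the exact benchmark behind the numbers above; the device path is a placeholder, and O_DIRECT keeps the page cache out of the way):

```c
/* Single-threaded 4k random read latency test. Run as root against the raw
 * device; the default path and read count below are placeholders. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define READ_SIZE 4096
#define NUM_READS 10000

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "/dev/sdb"; /* placeholder device */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    long num_blocks = lseek(fd, 0, SEEK_END) / READ_SIZE;
    if (num_blocks <= 0) { fprintf(stderr, "bad device size\n"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, READ_SIZE)) return 1; /* O_DIRECT needs alignment */

    srand(1234);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_READS; i++) {
        off_t block = rand() % num_blocks;
        if (pread(fd, buf, READ_SIZE, block * READ_SIZE) != READ_SIZE) {
            perror("pread");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d reads in %.2f s: %.3f ms avg, %.0f reads/s\n",
           NUM_READS, secs, secs * 1000 / NUM_READS, NUM_READS / secs);
    free(buf);
    close(fd);
    return 0;
}
```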

Didn't really test write performance apart from a cursory check. And yes, the bandwidth is low: 40 MB/s streaming writes. So it's best used as a random read drive.

Might make a nice random read array with 6 drives or somesuch. Hypothetically: 48 parallel reads and around 160 000 4k reads per second (6 × 1/0.038 ms ≈ 160k, or about 650 MB/s). And then all you need is software that can take advantage of that.

Getting flash chip latencies lower would be good.

2011-01-15

How to take advantage of that memory?

[edit] Here's an rc.d init script I'm using to prefetch the FS to page cache: usr-prefetch. Copy to /etc/init.d and run `update-rc.d usr-prefetch start 99 2 .` to add it to runlevel 2 init scripts. I currently have 4GB of RAM, and the total size of the prefetched stuff is 3.7GB, so it might actually help. Maybe I should go buy an extra 2GB stick. It takes about a minute to do the prefetch run with cold cache, so the average read speed is around 60 MB/s. That's pretty crappy compared to the 350 MB/s streaming read speed; maybe there's some way to make the prefetch a streaming read.
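
For reference, the gist of such a prefetch pass as a C sketch using nftw and readahead(). This isn't the actual usr-prefetch script, and the directory list is just an example:

```c
/* Sketch of a prefetch pass: walk a directory tree and ask the kernel to pull
 * each regular file into the page cache with readahead(). The directory list
 * is an example; the real usr-prefetch script may do this with shell tools. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <ftw.h>
#include <unistd.h>

static int prefetch_file(const char *path, const struct stat *sb,
                         int type, struct FTW *ftwbuf) {
    (void)ftwbuf;
    if (type != FTW_F) return 0;             /* regular files only */
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0) fd = open(path, O_RDONLY);   /* O_NOATIME may be refused */
    if (fd < 0) return 0;
    readahead(fd, 0, sb->st_size);           /* queue the whole file */
    close(fd);
    return 0;
}

int main(void) {
    const char *dirs[] = { "/usr", "/lib", "/bin", "/sbin", "/etc" };
    for (unsigned i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
        nftw(dirs[i], prefetch_file, 64, FTW_PHYS);
    return 0;
}
```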

Ubuntu's ureadahead reads in the files accessed during boot in disk order. If you add all your OS files to it, it should be possible to stream the whole root filesystem to page cache in about ten seconds. I dunno.

And now I'm reading through ureadahead's sources to figure out how it sorts the files in disk order. It uses the fiemap ioctl to get the physical extents for the file inodes, then sorts the files according to the first physical block used by each file. To read the files into the page cache, it uses readahead. If there were some way to readahead physical blocks instead of files, you could collapse the physical blocks into spans with > x percent occupancy, do a single streaming read per span (and keep only the blocks belonging to the files you want cached).
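
The fiemap part boils down to something like this (a minimal sketch, not ureadahead's actual code; one extent per file and barely any error handling):

```c
/* Sketch: print the first physical extent offset for each file argument.
 * That offset is the sorting key for reading files back in disk order.
 * Files with no extents (or any failure) print 0. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static unsigned long long first_physical_offset(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    /* fiemap is a variable-length struct: a header followed by an extent
     * array. Reserve room for exactly one extent. */
    struct {
        struct fiemap fm;
        struct fiemap_extent extents[1];
    } req;
    memset(&req, 0, sizeof(req));
    req.fm.fm_length = ~0ULL;    /* map the whole file */
    req.fm.fm_extent_count = 1;  /* we only care about the first extent */

    unsigned long long phys = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) == 0 && req.fm.fm_mapped_extents > 0)
        phys = req.fm.fm_extents[0].fe_physical;
    close(fd);
    return phys;
}

int main(int argc, char **argv) {
    for (int i = 1; i < argc; i++)
        printf("%llu\t%s\n", first_physical_offset(argv[i]), argv[i]);
    return 0;
}
```

Pipe the output through sort -n and you have the files in disk order, ready to be readahead()'d one after another.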

Another easy way to do fast caching would be to put all system files on a separate partition, then cache the entire partition with a streaming read. Or allocate a ramdisk, dd the root fs onto it at boot, and use it through UnionFS with writes redirected to the non-volatile fs. But whether that's worth the bother is another thing.
[/edit]

In the previous post I put together a hypothetical machine with 16GB RAM and 1.5GB/s random access IO bandwidth. Which is total overkill for applications that were built for machines with 100x fewer resources. What is one to do? Is this expensive hardware a complete waste of money?

The first idea I had on taking advantage of that amount of RAM was to preload the entire OS disk to page cache. If the OS+apps are around 10GB in total, it'll take less than 10s to load them up at 1.5GB/s. And you'll still have 6GB unused RAM left, so the page cache isn't going to get pushed out too easily. With 10s boot time, you'd effectively get a 20GB/s file system 20 seconds after pushing the power button.

Once you start tooling around with actual data (such as video files), that is going to push OS files out of page cache, which may not be nice. But what can you do? Set some sort of caching policy? Only cache libs and exes, use the 1.5GB/s slow path for data files? Dunno.

If you add spinning disks to the memory hierarchy, you could do something ZFS-like and have a two-level cache hierarchy for that there filesystem. 10GB L1 in system RAM, 240GB L2 on SSDs, a couple TB on spinning disks. But again, how sensible is it to cache large media files (which will likely be the primary use of the spinning disks)? IT IS A MYSTERY AARGH

It'd also be nice to have 1-2GB of faster RAM between the CPU caches and the system RAM (with GPU-style 150GB/s bandwidth), but maybe you can only have such things if you have a large number of cores and don't care about latency.

But hmm, the human tolerance for latency is around 0.2s for discrete stuff, 0.015-0.03s for continuous movement. In 0.2s you can do 4GB worth of memory reads and 0.3GB of IO; in 0.02s the numbers are 0.4GB and 0.03GB respectively. On a 3.5 GHz quad-core CPU, you can execute around 3 billion instructions in 0.2s, which'd give you a ratio of 0.75 ops per byte for scalar stuff and 6 ops per byte for 8-op vector instructions. On GPUs the ops/byte ratio is larger, at close to 20 ops/byte for vector ops, which makes them require more computation-heavy algorithms than CPUs do.
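
Spelled out as a back-of-the-envelope calculation (assuming 20 GB/s RAM bandwidth, the 1.5 GB/s IO from before, and the rounded 3 billion instructions per 0.2s):

```c
/* The latency-budget arithmetic from the paragraph above, spelled out.
 * Assumed numbers: 20 GB/s RAM bandwidth, 1.5 GB/s IO bandwidth, and the
 * rounded "3 billion instructions in 0.2s" for the 3.5 GHz quad-core. */
#include <stdio.h>

int main(void) {
    double ram_bw = 20e9;             /* bytes/s */
    double io_bw = 1.5e9;             /* bytes/s */
    double ops_per_s = 3e9 / 0.2;     /* instructions/s */
    double budgets[] = { 0.2, 0.02 }; /* discrete vs. continuous interaction */

    for (int i = 0; i < 2; i++) {
        double t = budgets[i];
        printf("%.2fs budget: %.1f GB from RAM, %.2f GB from IO, %.1f Gops\n",
               t, ram_bw * t / 1e9, io_bw * t / 1e9, ops_per_s * t / 1e9);
    }
    printf("ops per byte: %.2f scalar, %.1f with 8-op vectors\n",
           ops_per_s / ram_bw, 8 * ops_per_s / ram_bw);
    return 0;
}
```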

In 4GB you can fit 60 thousand uncompressed 128x128 images with some metadata (a 128x128 RGBA image is 64 kB). Or hmm, 6 thousand if each has ten animation frames.
