art with code

2011-01-15

How to take advantage of that memory?

[edit] Here's an rc.d init script I'm using to prefetch the FS to the page cache: usr-prefetch. Copy it to /etc/init.d and run `update-rc.d usr-prefetch start 99 2 .` to add it to the runlevel 2 init scripts. I currently have 4GB of RAM, and the total size of the prefetched stuff is 3.7GB, so it might actually help. Maybe I should go buy an extra 2GB stick. It takes about a minute to do the prefetch run with a cold cache, so the average read speed is around 60 MB/s. Which is pretty crappy compared to the 350 MB/s streaming read speed; maybe there's some way to make the prefetch a streaming read.
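For reference, the guts of such a prefetch pass can be sketched in C: walk the directories you want cached with nftw() and hand each regular file to readahead(2), which reads it into the page cache without copying anything to userspace. The directory list below is just an illustration, not the contents of the actual usr-prefetch script.

```c
/* Minimal prefetch sketch: walk a set of directories and pull every
 * regular file into the page cache with readahead(2). */
#define _GNU_SOURCE  /* for readahead() and O_NOATIME */
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

static int prefetch_file(const char *path, const struct stat *sb,
                         int typeflag, struct FTW *ftwbuf)
{
    (void)ftwbuf;
    if (typeflag != FTW_F)              /* only regular files */
        return 0;
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0)
        fd = open(path, O_RDONLY);      /* O_NOATIME is only allowed on files you own */
    if (fd < 0)
        return 0;                       /* unreadable file: skip it, keep walking */
    readahead(fd, 0, sb->st_size);      /* read the whole file into the page cache */
    close(fd);
    return 0;
}

int main(void)
{
    /* Example directory list -- adjust to whatever should stay cached. */
    const char *dirs[] = { "/usr", "/lib", "/bin", "/sbin", "/etc" };
    for (unsigned i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++)
        nftw(dirs[i], prefetch_file, 64, FTW_PHYS | FTW_MOUNT);
    return 0;
}
```

The slow part is that the files still come back in directory-walk order rather than disk order, which is what the ureadahead notes below are about.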

Ubuntu's ureadahead reads in the files accessed during boot in disk order. If you add all your OS files to it, it should be possible to stream the whole root filesystem to page cache in about ten seconds. I dunno.

And now I'm reading through ureadahead's sources to figure out how it sorts the files in disk order. It uses the FIEMAP ioctl to get the physical extents for the file inodes, then sorts the files by the first physical block each file occupies. To read the files into the page cache, it uses readahead. If there were some way to readahead physical blocks instead of files, you could collapse the physical blocks into spans with > x percent occupancy and do a single streaming read per span (caching only the blocks that belong to the files you want cached).
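Here's roughly what that first step looks like in C, pared down to the essentials: ask the kernel for a file's first extent via FIEMAP and print its physical byte offset, so a file list can be piped through `sort -n` to get disk order. This is my own stripped-down illustration of the mechanism, not ureadahead's actual code.

```c
/* Print the physical byte offset of each file's first extent,
 * i.e. the sort key needed for disk-order prefetching. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

static unsigned long long first_physical_offset(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    /* Room for the fiemap header plus a single extent record. */
    union {
        struct fiemap fm;
        char buf[sizeof(struct fiemap) + sizeof(struct fiemap_extent)];
    } req;
    memset(&req, 0, sizeof(req));
    req.fm.fm_start = 0;
    req.fm.fm_length = ~0ULL;     /* map the whole file */
    req.fm.fm_extent_count = 1;   /* we only care about the first extent */

    unsigned long long phys = 0;
    if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) == 0 && req.fm.fm_mapped_extents > 0)
        phys = req.fm.fm_extents[0].fe_physical;

    close(fd);
    return phys;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%llu\t%s\n", first_physical_offset(argv[i]), argv[i]);
    return 0;
}
```

Sort the output numerically, then readahead() the files in that order, and the disk mostly gets to do one long forward pass instead of seeking all over the place.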

Another easy way to do fast caching would be to put all system files on a separate partition, then cache the entire partition with a streaming read. Or allocate a ramdisk, dd the root fs onto it at boot, use it through UnionFS with writes redirected to the non-volatile fs. But whether that's worth the bother is another thing.
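The streaming-read half of that first variant is just one long sequential read of the block device. A sketch (the device path is a placeholder; note that this warms the cache with raw partition blocks rather than going through the filesystem's per-file caching, so it isn't a drop-in replacement for the per-file prefetch above):

```c
/* Stream a whole partition through the kernel cache in big sequential reads.
 * The default device path is a placeholder; pass the real one as argv[1]. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 << 20)   /* 4 MB per read keeps the disk streaming */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda2";   /* placeholder */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror(dev); return 1; }

    char *buf = malloc(CHUNK);
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)
        total += n;       /* pages land in the kernel's cache for the device */

    printf("read %lld MB from %s\n", total >> 20, dev);
    free(buf);
    close(fd);
    return 0;
}
```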
[/edit]

In the previous post I put together a hypothetical machine with 16GB RAM and 1.5GB/s random access IO bandwidth. Which is total overkill for applications that were built for machines with 100x fewer resources. What is one to do? Is this expensive hardware a complete waste of money?

The first idea I had for taking advantage of that amount of RAM was to preload the entire OS disk into the page cache. If the OS+apps are around 10GB in total, it'll take less than 10s to load them up at 1.5GB/s. And you'll still have 6GB of RAM left unused, so the page cache isn't going to get pushed out too easily. With a 10s boot time, you'd effectively get a 20GB/s file system 20 seconds after pushing the power button.

Once you start tooling around with actual data (such as video files), that is going to push OS files out of page cache, which may not be nice. But what can you do? Set some sort of caching policy? Only cache libs and exes, use the 1.5GB/s slow path for data files? Dunno.

If you add spinning disks to the memory hierarchy, you could do something ZFS-like and have a two-level cache hierarchy for that there filesystem: 10GB L1 in system RAM, 240GB L2 on SSDs, a couple TB on spinning disks. But again, how sensible is it to cache large media files (which will likely be the primary use of the spinning disks)? IT IS A MYSTERY AARGH

It'd also be nice to have 1-2GB of faster RAM between the CPU caches and the system RAM (with GPU-style 150GB/s bandwidth), but maybe you can only have such things if you have a large number of cores and don't care about latency.

But hmm, the human tolerance for latency is around 0.2s for discrete stuff and 0.015-0.03s for continuous movement. In 0.2s you can do 4GB worth of memory reads and 0.3GB of IO; in 0.02s the numbers are 0.4GB and 0.03GB respectively. On a 3.5 GHz quad-core CPU, you can execute around 3 billion instructions in 0.2s, which'd give you a ratio of 0.75 ops per byte for scalar stuff and 6 ops per byte for 8-op vector instructions. On GPUs the ops/byte ratio is larger, at close to 20 ops/byte for vector ops, which makes them require more computation-heavy algorithms than CPUs.
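Spelled out, the arithmetic looks like this (the 20 GB/s memory bandwidth and 1.5 GB/s IO bandwidth are the assumed figures from the previous post's hypothetical machine; computed without rounding, the ratios come out at roughly 0.7 and 5.6 ops per byte, same ballpark):

```c
/* Back-of-the-envelope byte and instruction budgets per latency window. */
#include <stdio.h>

int main(void)
{
    double mem_bw = 20e9;            /* bytes/s, assumed RAM bandwidth         */
    double io_bw  = 1.5e9;           /* bytes/s, assumed IO bandwidth          */
    double ips    = 3.5e9 * 4;       /* instr/s: 3.5 GHz quad-core, 1 op/cycle */
    double window[] = { 0.2, 0.02 }; /* discrete action vs. continuous motion  */

    for (int i = 0; i < 2; i++) {
        double t = window[i];
        printf("%.2fs window: %.1f GB from RAM, %.2f GB from IO, %.1f G instructions\n",
               t, mem_bw * t / 1e9, io_bw * t / 1e9, ips * t / 1e9);
    }
    /* Ops per byte of RAM traffic, scalar and with 8-op vector instructions. */
    printf("ops/byte: %.2f scalar, %.1f vectorized\n",
           ips / mem_bw, ips * 8.0 / mem_bw);
    return 0;
}
```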

In 4GB you can fit 60 thousand uncompressed 128x128 images with some metadata. Or hmm, 6 thousand if each has ten animation frames.
