art with code

2018-02-28

Building a storage pyramid

The office IT infrastructure plan goes something like this: build interconnected storage pyramids with compute on top. Each pyramid is compute hooked up to fast memory, then solid state memory to serve mid-IOPS, mid-bandwidth workloads, then big spinning disks as the archive layer. The layers of the pyramid are tied together by interconnects that are either machine-local or run over the network.

The Storage Pyramid

Each layer of the storage pyramid has different IOPS and bandwidth characteristics. Starting from the top, you've got GPUs with 500 GB/s memory, connected via a 16 GB/s PCIe 3.0 x16 bus to the CPU, which has 60 GB/s DRAM. The next layer is also on the PCIe bus: Optane-based NVMe SSDs, which can hit 3 GB/s on streaming workloads and 250 MB/s on random workloads (parallelizable to maybe 3x that). After Optane, you've got flash-based NVMe SSDs that push 2-3 GB/s on streaming accesses and 60 MB/s on random accesses. At the next level, you could have SAS/SATA SSDs, which the bus limits to 1200/600 MB/s streaming. And at the bottom lie the HDDs, which can do somewhere between 100 and 240 MB/s on streaming accesses and around 0.5-1 MB/s on random accesses.
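
As a quick back-of-the-envelope sketch, here's what those streaming numbers mean for scanning a working set at each layer (the bandwidths are the rough figures quoted above, not benchmarks of any specific device):

    # Rough streaming bandwidths in GB/s, taken from the figures above.
    STREAMING_GBPS = {
        "GPU memory":   500,
        "CPU DRAM":      60,
        "Optane NVMe":    3,
        "Flash NVMe":     2.5,
        "SAS/SATA SSD":   1.2,
        "HDD":            0.17,
    }

    def scan_seconds(dataset_gb):
        """Seconds to stream a dataset of dataset_gb GB through each layer."""
        return {layer: dataset_gb / gbps for layer, gbps in STREAMING_GBPS.items()}

    if __name__ == "__main__":
        for layer, secs in scan_seconds(64).items():  # e.g. a 64 GB working set
            print(f"{layer:13s} {secs:8.2f} s")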

The device speeds guide the choice of interconnects between them. Each HDD can fill a 120 MB/s GbE port. SAS/SATA SSDs, with their ~1 GB/s performance, plug into 10GbE ports. For PCIe SSDs and Optane, you'd go with either 40GbE or InfiniBand QDR and hit 3-4 GB/s. Above the SSD layer, the interconnect bottlenecks start rearing their ugly heads.
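
To make the matching concrete, here's a small sketch that pairs each device's streaming rate with the slowest standard link that still carries it at full speed. The link rates are nominal line rates; usable throughput after protocol overhead is a bit lower, which is why a GbE port is pegged at 120 MB/s rather than 125.

    # Nominal link rates in GB/s (line rate; usable throughput is a bit lower).
    LINKS = [
        ("GbE", 0.125),
        ("10GbE", 1.25),
        ("IB QDR", 4.0),
        ("40GbE", 5.0),
        ("100GbE", 12.5),
        ("200GbE / IB HDR", 25.0),
    ]

    # Streaming rates in GB/s (HDD pegged at the 120 MB/s single-drive figure).
    DEVICES = {
        "HDD": 0.12,
        "SATA SSD": 0.6,
        "SAS SSD": 1.2,
        "NVMe flash": 2.5,
        "Optane NVMe": 3.0,
        "DRAM channel": 20.0,
    }

    def smallest_sufficient_link(device_gbps):
        """Return the slowest link that won't bottleneck the device."""
        for name, link_gbps in LINKS:
            if link_gbps >= device_gbps:
                return name
        return "nothing on this list"

    for dev, gbps in DEVICES.items():
        print(f"{dev:13s} {gbps:5.2f} GB/s -> {smallest_sufficient_link(gbps)}")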

You could use 200Gb InfiniBand to carry a single DRAM channel's roughly 20 GB/s, but even then you're starting to get bottlenecked at high DDR4 frequencies. Plus you have to traverse the PCIe bus, which knocks you down further to 16 GB/s over PCIe 3.0 x16. It's still sort of feasible to hook up a cluster with a shared DRAM pool, but you're pushing the limits.
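
The arithmetic behind that, as a quick sanity check (DDR4-2666 picked as a representative frequency; the PCIe figure assumes the standard 128b/130b encoding of PCIe 3.0):

    # Back-of-the-envelope check for exporting a DRAM channel over the network.
    ddr4_2666_channel = 2666e6 * 8 / 1e9                   # MT/s * 8 bytes ~ 21.3 GB/s
    ib_200g           = 200e9 / 8 / 1e9                    # 200 Gb/s       ~ 25 GB/s
    pcie3_x16         = 8e9 * (128 / 130) * 16 / 8 / 1e9   # 8 GT/s/lane    ~ 15.75 GB/s

    print(f"DDR4-2666 channel:   {ddr4_2666_channel:5.1f} GB/s")
    print(f"200Gb InfiniBand:    {ib_200g:5.1f} GB/s")
    print(f"PCIe 3.0 x16:        {pcie3_x16:5.1f} GB/s")

    # The path DRAM -> PCIe 3.0 x16 -> NIC bottlenecks at the PCIe hop.
    print(f"remote DRAM ceiling: {min(ddr4_2666_channel, ib_200g, pcie3_x16):5.1f} GB/s")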

For DRAM-level performance, you're usually stuck inside the local node. The other storage layers you can run over the network without losing much performance.

The most unbalanced bottleneck in the system is the CPU-GPU interconnect. The GPU's 500 GB/s memory is hooked to the CPU's 60 GB/s memory via a 16 GB/s PCIe bus. Nvidia's NVLink can hook two GPUs together at 40 GB/s (up to 150 GB/s on a Tesla V100), but there's nothing to get faster GPU-to-DRAM access. This is changing with the advent of PCIe 4.0 and PCIe 5.0: PCIe 4.0 x16 should push around 32 GB/s per direction and PCIe 5.0 x16 around 64 GB/s (roughly 128 GB/s bidirectional), enough to create a proper DRAM interconnect between nodes and between the GPU and the CPU. The remaining piece of the puzzle would be some sort of 1 TB/s interconnect to link GPU memories together. [Edit] NVSwitch goes at 300 GB/s, which is way fast.
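
For a sense of scale, a small worked example: how long it takes to move the full 8 GB of GPU memory over each of the links mentioned above, at their peak rates (real transfers add latency and rarely sustain the peak):

    GPU_MEM_GB = 8
    # Peak link bandwidths in GB/s, as quoted in the text.
    LINK_GBPS = {
        "local GPU memory": 500,
        "NVSwitch":         300,
        "NVLink (V100)":    150,
        "NVLink (2 GPUs)":   40,
        "PCIe 3.0 x16":      16,
    }
    for name, gbps in LINK_GBPS.items():
        print(f"{name:17s} {GPU_MEM_GB / gbps * 1000:7.1f} ms")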

The Plan

Capacity-wise, my plan is to get 8 GB of GPU RAM, 64 GB of CPU RAM, 256 GB of Optane, 1 TB of NVMe flash, and 16 TB of HDDs. For a nicer-cleaner-more-satisfying progression, you could throw in a 4 TB SATA flash layer, but SATA flash is kind of DOA as long as you have NVMe and PCIe slots to use -- the price difference between NVMe flash and SATA flash is too small compared to the performance difference.
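
One way to see whether the pyramid is balanced is to pair each tier's planned capacity with its streaming bandwidth and ask how long a full scan of the tier would take. A rough sketch, using the capacities above and the single-stream bandwidth figures from earlier:

    # (capacity in GB, streaming bandwidth in GB/s) for each planned tier.
    TIERS = {
        "GPU RAM":    (8,     500),
        "CPU RAM":    (64,     60),
        "Optane":     (256,     3),
        "NVMe flash": (1000,    2.5),
        "HDD":        (16000,   0.2),   # per stream; parallel spindles scan faster
    }
    for tier, (capacity_gb, gbps) in TIERS.items():
        minutes = capacity_gb / gbps / 60
        print(f"{tier:10s} {capacity_gb:6d} GB   full scan ~ {minutes:7.1f} min")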

If I can score an InfiniBand interconnect or 40GbE, I'll stick everything from Optane on down into a storage server. It should perform at near-local speeds and simplify storage management: a shared pool of data that can be expanded and upgraded without having to touch the workstations. Would be cool to have a shared pool of DRAM too, but eh.

Now, our projects are actually small enough (half a gig each, maybe 2-3 of them under development at once) that I don't believe we will ever hit disk in daily use. All the daily reads and writes should be to client DRAM, which gets pushed to server DRAM and written down to flash / HDD at some point later. That said, those guys over there *points*, they're doing some video work now...

The HDDs are mirrored to an off-site location over GbE. A single HDD can already saturate a GbE link, so live mirroring would really want 2-3 GbE links; for an off-site backup that runs overnight, 1 GbE should be plenty.
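
The back-of-the-envelope numbers behind that, assuming ~120 MB/s per GbE link and, as a made-up example, a 50 GB nightly delta:

    def transfer_hours(gigabytes, links=1, link_mbps=120):
        """Hours to push `gigabytes` over `links` GbE links at ~120 MB/s each."""
        return gigabytes * 1000 / (links * link_mbps) / 3600

    print(f"full 16 TB re-mirror, 1x GbE: {transfer_hours(16000, 1):5.1f} h")
    print(f"full 16 TB re-mirror, 3x GbE: {transfer_hours(16000, 3):5.1f} h")
    print(f"50 GB nightly delta,  1x GbE: {transfer_hours(50) * 60:5.1f} min")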

In addition to the off-site storage mirror, there are some clouds and such for storing compiled projects, project code and documents. These either don't need to sync fast or are small enough that they do.

Business Value

Dubious. But it's fun. And it opens up uses that are either not doable in the cloud or way too expensive to maintain there. (As in, a single month of AWS costs more than what I paid for the server hardware...)
