art with code


Even more crank on P != NP

Size limits for the input and the output

If the input is exponentially smaller than the output, the problem is not in NP. If the input is power-of-two exponentially larger than the output, the problem is in P.

In the first case, the output is exponentially longer than the input and therefore takes exponentially longer to print out, even on a non-deterministic machine. Even if you could print it out instantly, you need to check the output with a deterministic verifier that needs to run in polynomial time to the original input, and reading the exponentially larger output into the verifier would take exponential time to the original input. So exponentially longer outputs are not in NP.

In the second case, if you have a polynomial time deterministic verification algorithm, you can create a solver by adding an output space enumerator in front of it. If the output is exponentially smaller than the input, hey, you've got a polynomial time solver. If |output| = log2(|input|), enumerating the outputs takes 2|output| = 2log2(|input|) = |input| steps. From NP, you have the verifier that runs in p(|input|) so a solver that enumerates the output space and tries the verifier on each possible solution would run in 2|output| * p(|input|), which becomes |input| * p(|input|) if the output is exponentially smaller than the input like above. (So any problems in NP where the output is a single bit are also in P.)

Due to these size limits, for P ?= NP the size of the output must be polynomial(-ish??) to the size of the input.

Size limits for the solver

How about the size of the deterministic solver relative to the input. If the deterministic solver machine is by necessity exponentially larger than the input (by necessity meaning that for any given part of the machine there exists an input that makes the machine access that part), there also exists an input that causes the machine to move the tape a distance exponential to the size of the input, which takes exponential time relative to the input. So the deterministic solver needs to have a size at most polynomial to the size of the input.

For the lower bound, consider the case where the machine extracts zero information from the input. This machine just enumerates over the output space, checking each answer using the verifier. The runtime of the machine is 2|output| * p(|input|), the code is fixed in size, and it uses |output| extra tape as memory. At the other end is the answer table machine: it's an array of precomputed solutions where the key is the input. The size of the answer table machine is |output| * 2|input| (and due to the above tape movement limitations, will take exponential time to run relative to some input.)

Knowing nothing about the input is out, and knowing everything about the input is out. Every bit in the program encodes information about the input-output-relation, allowing the solution search to proceed faster than brute-force enumeration when the program encounters an input that matches the encoded information. The program can also learn from the input, but it can't learn an exponential amount of information from the input (because that'd take exponential time.) Wild guess: If the amount of information used by the program is exponentially less than the amount of information in the input-output-relation, the program would end up using brute-force search over parts of the output that are polynomial in size to the input, which would take exponential time to the input. If true, there would be no sub-polynomial-space programs for P ?= NP that run in polynomial time.

Given the existence of the deterministic polynomial time verifier, we know that you can extract information from the input-output-pair in polynomial time, even though the amount of information extracted is very small. How small? Well, at least you get |output| / 2|output| bits of info. After trying every 2|output| outputs, you would have |output| bits and a correct output. If there are several correct outputs, you'd gain numberOfOutputs * |output| / 2|output| bits. If the numberOfOutputs grows exponentially with the output size, this would give you a polynomial time solver. But, if this were the case, someone would've noticed it. Another wild guess: the amount of information we get out of the verifier shrinks exponentially in relation to the output size (and therefore the input size, which has a polynomial size relation to the output).

The number of input enumerations is 2|input|, ditto for the output 2p(|input|). We have a deterministic polynomial time verifier that can extract solution information from the input, but the amount of information extracted looks to be a logarithm of the input size. We also have a hypothetical solver program that encodes a polynomially growing amount of information about the input-output-relation, allowing it to avoid brute-force search whenever it encounters an input that matches the encoded information.

If the number of patterns that can be encoded in the input grows exponentially with the input length, then as the problem size grows towards infinity the question becomes: can you discern between an infinite number of input patterns with a zero-length solver program and a verifier that can extract zero bits of information from the input? For P to be NP, the number of patterns in the input needs to grow as a polynomial to the length of the input.


Some more rubbish on P != NP

Feeling crank today, so:

Non-deterministic machines are one-bit oracles. Given a branch in execution, they always follow the correct path. You can use this property to write a universal non-deterministic machine:


NonDeterministic done, correctOutput;

while (!done) {

Its execution time is linear to the size of the output. There are still problems that take non-polynomial time compared to the input. For instance, "given an input of n bits, print out n! bits".

The question in P ?= NP then is: "given a problem where the output size is a polynomial of the input size, is it possible to emulate a non-deterministic machine using a deterministic machine with at most a polynomial slowdown". Since the ND machine takes a number of steps linear to the output size, which is polynomial of the input size, it solves the problem in polynomial time to the input. If the deterministic machine can emulate the ND machine at a polynomial slowdown, its running time is a polynomial of the output size, which is a polynomial of a polynomial of the input size.

Maybe you could also ask if it's possible to prove that there exists a problem for which an optimal program running on a deterministic machine requires a number of steps exponential to the size of its output.

Make a bridge from output size to input size? ...exponential to the size of its output and where the output size grows linearly with input size.


New CPUs

Lots of cores! More cores! AMD bringing 32-core EPYC to the desktop, Intel bringing top-end 28-core Xeons to the desktop.

In terms of compute, it's getting pretty interesting! 32 cores at 4 GHz and SIMD + ispc to SPMD it up. With 8-wide SIMD and 1 cycle FMA, you'd be looking at 2 TFLOPS. If you could push that to 16-wide (or more usefully, double the core count), 4 TFLOPS. That's discrete GPU territory. Not to forget double precision: 1 TFLOPS DP beats all but the uber-expensive compute GPUs.

If you still have that /usr/bin/directx thing, you wouldn't need a GPU at all. Just plug a display to the PCIe bus and send frames to it.

Memory bandwidth is still an issue, a few GB of HBM would help. And it'd be nice to plug in extra CPUs and RAM to PCIe slots.


WAN OpenVPN at 915 Mbps

$ iperf -c remote -P4
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   284 MBytes   238 Mbits/sec
[  3]  0.0-10.0 sec   278 MBytes   233 Mbits/sec
[  6]  0.0-10.0 sec   275 MBytes   230 Mbits/sec
[  5]  0.0-10.1 sec   259 MBytes   216 Mbits/sec
[SUM]  0.0-10.1 sec  1.07 GBytes   915 Mbits/sec

OK, that's working.



So there's this single-socket EPYC TYAN server with 24 NVMe hotswap bays. That's... a lot of NVMe. And they're basically PCIe x4 slots.

What if you turned those NVMe boxes into small computers. Beefy well-cooled ARM SoC with 32 gigs of RAM and a terabyte of flash, wired to the ARM SoC with a wide bus. You might get 200 GB/s memory bandwidth and 10 GB/s flash bandwidth. The external connectivity would be through the PCIe 4.0 x4 bus at 8 GB/s or so.

The ARM chip would perform at around a sixth the perf of a 32-core EPYC, but it'd have a half-teraFLOP GPU on it too. With 24 of those in a 2U server, you'd get four 32-core EPYCs worth of CPU compute, and nearly a Tesla V100 of GPU compute. But. You'd also have aggregate 4.8 TB/s memory bandwidth and 240 GB/s storage bandwidth. In a 2U. Running at, what, 10 W per card?

Price-wise, the storage and RAM would eclipse the price of the ARM SoC -- maybe $700 for the RAM and flash, then $50 for the SoC. Put two SoCs in a single box, double the compute?

Anyway, 768 GB of RAM, 24 TB of flash, 128 x86 cores of compute, plus 80% of a Tesla V100, for a price of $20k. Savings: $50k. Savings in energy consumption: 800 W.

OpenVPN over fast(er) links

Tested OpenVPN with 65k tun-mtu on the IPoIB network. It does 5-6 Gbps, compared to the 20-25 Gbps raw network throughput. I was surprised it managed 6 Gbps in the first place. "Oh what did I break now, why does my test run at 6 Mbps ... oh wait, that's Gbps."

Another problem to track is that on the internal GbE network, the VPN runs at 900+ Mbps. But when connecting to the WAN IP, it only manages 350 Mbps. A mystery wrapped in an enigma. (It's probably the underpowered router kicking me in the shins again. Use one of the fast computers as a firewall, see if that solves the problem.)



Haven't you always wanted to create UNIX pipes that run from one machine to another? Well, you're in luck. Of sorts. For I have spent my Saturday hacking on an InfiniBand RDMA pipeline utility that lets you pipe data between commands running on remote hosts at multi-GB/s speeds.

Unimaginatively named, rdma-pipe comes with the rdpipe utility that coordinates the rdsend and rdrecv programs that do the data piping. The rdpipe program uses SSH as the control channel and starts the send and receive programs on the remote hosts, piping the data through your commands.

For example

  # The fabulous "uppercase on host1, reverse on host2"-pipeline.
  $ echo hello | rdpipe host1:'tr [:lower:] [:upper:]' host2:'rev'

  # Send ZFS snapshots fast-like from fileserver to backup_server.
  $ rdpipe backup@fileserver:'zfs send -I tank/media@last_backup tank/media@head' backup_server:'zfs recv tank-backup/media'

  # Decode video on localhost, stream raw video to remote host.
  $ ffmpeg -i sintel.mpg -pix_fmt yuv420p -f rawvideo - | rdpipe playback_host:'ffplay -f rawvideo -pix_fmt yuv420p -s 720x480 -'

  # And who could forget the famous "pipe page-cached file over the network to /dev/null" benchmark!
  $ rdpipe -v host1:'</scratch/zeroes' host2:'>/dev/null'
  Bandwidth 2.872 GB/s

Anyhow, it's quite raw, new, exciting, needs more eyeballs and tire-kicking. Have a look if you're on InfiniBand and need to pipe data across hosts.

Blog Archive

About Me

My photo

Built art installations, web sites, graphics libraries, web browsers, mobile apps, desktop apps, media player themes, many nutty prototypes