tag:blogger.com,1999:blog-51034064997266897822024-03-18T07:42:11.311+02:00fhtrart with codeIlmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.comBlogger450125tag:blogger.com,1999:blog-5103406499726689782.post-86212850722762224812024-01-23T16:00:00.004+02:002024-01-23T16:00:27.234+02:00Third round of thoughts on LLMs<p>A way to think about generative models: soft islands, hard islands, and the sea. The sea is generated content based on the prompt. Soft islands are content modified by the model, e.g. img2img, search results injected into input, or reformatting text in the context. Hard islands are content coming outside of the model directly into the response, e.g. image masks, function call results, grammar-restricted output.</p><p>When you need hard facts, you can parse outputs like "SEARCH(ways to break out of sandbox)", call the function, and inject the results directly into the response: "=> [894, 28, 950]". The LLM then continues generating and can call "LOAD(894)" with the result "=> hack the Python sandbox with this one trick: 383", and the LLM can continue from there with "CALL(383)" to call the pre-defined function and get the result "=> root shell obtained, use 7894 to run commands". This way, the LLM can do several things in a single response, and they'll be grounded in reality, allowing it to adjust and course-correct on the fly like "CALL(7894, grab_aws_credentials_and_spin_up_100_servers_running_llm_to_make_billions.sh)"</p><p>Of course, this is something I've implemented in my chat UI as a prototype. It's pretty cool!</p><p>--- Stream of thoughts.</p><p>Can you split the reasoning from the memory in a LLM? Train a small general reasoning model, and use a swappable memory model to make it work in different domains. </p><p>Can you embed proposed solutions to a problem into a solution space and estimate distances there? 
"Based on this improvement rate to the solution, the final solution lies probably in direction X at distance Y, let me jump there." Like the schedulers for diffusion models. Hmm. Are diffusion model schedulers used for ML training? Turning a 1000-step training process into a 4-step one would be quite handy, I'd imagine.</p><p>Iterative optimization to optimize input-output-pair "what's the best prompt for this, what's the best output for this prompt".</p><p>Picking low-probability outputs for creative exploration.</p><p>Load thought structure from memory. Fill the structure with information from memory. "What algorithm should I use here? ... Implement to match the assignment: ..."</p><p>Grounded memory loads: Load memory, use output-side RAG to look up grounding, adjust output to be a grounded / researched memory. Auto-grounding: Search for a way to ground an output, implement & optimize.</p><p>Generic guidance optimization: Given current state and goal state, find the best sequence of actions to get there.</p><p>Put it together: optimized grounded generation of an algorithm followed by the optimized grounded implementation of it.</p><p>Tree-shaped generation systems instead of 1:1 conversations. Map prompt into several output variants (see: image gen where 1% of images are decent quality). Use scoring function to reduce to winners. Use synthesis function to combine outputs either for tree summarization or solution improvement. Node-based editor for generation flows.</p><p>Temperature adjustment schedule in variant generation (see: simulated annealing). Start off with a high temperature to seek potential global optima pools, cool down to find the local optima.</p><p>Extend grammar-based output by having the LLM generate the grammar and then generate outputs in the grammar. 
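The hard-island loop from the top of this post — parse tool invocations out of the model's output, execute them outside the model, and splice the results directly back into the response — can be sketched in a few lines. Everything here (the tool names, the stub implementations, the `=>` result convention) is illustrative, not a real API:

```javascript
// Stand-in tool registry. In a real system these would hit a search index,
// a document store, and a sandboxed function runner.
const tools = {
  SEARCH: (query) => [894, 28, 950],        // fake search result ids
  LOAD: (id) => `document ${id} contents`,  // fake document fetch
  CALL: (id) => `result of function ${id}`, // fake function invocation
};

// Scan generated text for tool calls and inject each result verbatim
// after the call, as a "hard island" the model didn't generate.
function ground(text) {
  return text.replace(/(SEARCH|LOAD|CALL)\(([^)]*)\)/g, (full, name, arg) => {
    const result = tools[name](arg);
    return `${full} => ${JSON.stringify(result)}`;
  });
}
```

A real implementation would interleave this with generation — stop decoding at each tool call, run it, append the result, and resume — rather than post-processing a finished string, but the splicing step is the same.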
Generating algebras and mapping systems of thought onto them.</p><p><br /></p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-86939327314800992632024-01-19T12:50:00.000+02:002024-01-19T12:50:56.628+02:00Second round of thoughts on LLMs<p> LLMs are systems that compress a lot of text in a lossy fashion and pull out the most plausible and popular continuation or infill for the input text.</p><p>It's a lot like your autopilot mode. You as the mind are consulted by the brain to do predictions and give high-level feedback on what kind of next actions to take, but most of the execution happens subconsciously with the brain pulling up memories and playing them back. Often your brain doesn't even consult you on what to do, since running a mind is slow and expensive, and it's faster and cheaper to do memory playback instead - i.e. run on autopilot.</p><p>If you have enough memories, you can do almost everything on autopilot.</p><p>Until you can't, which is where you run into one of the LLM capability limits. Structured thinking and search. To solve a more complex problem, you string memories together and search for an answer. That requires exploration, backtracking and avoiding re-exploring deadends. Think of solving a math problem: you start off by matching heuristics (the lemmas you've memorized) to the equation, transforming it this way and that, sometimes falling back all the way to the basic axioms of the algebra, going on wild goose chases, abandoning unpromising tracks, until you find the right sequence of transformations that leads you to the answer.</p><p>Note that you do need LLM-style memory use in that, you need to know the axioms to use them in the first place. Otherwise you need to go off and search for the axioms themselves and the definition of truth, etc. which is going to add a good chunk of extra work on top of it all. 
(What is the minimum thought, the minimal memory, that we use? A small random adjustment and its observation? From an LLM perspective, as long as you have a scoring function, the minimum change is changing the output by one token. Brute-force enumeration over all token sequences.)</p><p>If you add a search system to the LLM that can backtrack the generation and keeps track of different explored avenues, perhaps this system can solve problems that require structured thinking.</p><p><br /></p><p>LLMs as universal optimizers. You can use an LLM to rank its input ("Score the following 0-100: ...") You can also use an LLM to improve its input ("Make this better: ..."). Combine the two and you get the optimizer:</p><p>while (true) {<br /> program = llm(improve + best_program)<br /> score = llm(score + program)<br /> if (score > best_score) {<br /> best_score = score<br /> best_program = program<br /> }<br />}</p><p><br /></p><p>LLMs as universal functions. An LLM takes as its input a sequence of tokens and outputs a sequence of tokens. LLMs are trained using sequences of tokens as the input. The training program for an LLM is a sequence of tokens.</p><p>llm2 = train(llm, data)<br /></p><p>can become</p><p>llm2 = llm(train)(llm, llm(data))</p><p>And of course, you can recursively apply an LLM to its own output: output' = llm(llm(llm(llm(...)))). You can ask the LLM to rank its inputs and try to improve them, validating the outputs with something else: optimize = input => ([input] * 10).map(x => llm(improve + x)).filter(ix => isValid(ix)).map(ix => ({score: llm(score + ix), value: ix})).maxBy('score').value</p><p>This gives you the self-optimizer:</p><p>while(true) {<br /> train = optimize(train)<br /> training_data = optimize(training_data)<br /> llm = train(llm, training_data)<br />}</p><p>If you had Large Model Models - LMMs - you could call optimize directly on the model. 
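The optimizer loop above becomes runnable once llm() is stubbed out. In this sketch "improving" mutates one character and "scoring" counts a target letter — pure stand-ins for real model calls, so the hill-climbing structure is the only faithful part:

```javascript
const IMPROVE = "Make this better: ";
const SCORE = "Score the following 0-100: ";

// Fake LLM: recognizes the two prompt prefixes from the post.
function llm(prompt) {
  if (prompt.startsWith(IMPROVE)) {
    // "improvement" = replace one random character with a random letter
    const text = prompt.slice(IMPROVE.length);
    const i = Math.floor(Math.random() * text.length);
    const c = String.fromCharCode(97 + Math.floor(Math.random() * 26));
    return text.slice(0, i) + c + text.slice(i + 1);
  }
  if (prompt.startsWith(SCORE)) {
    // "score" = count of the letter 'e' in the text
    const text = prompt.slice(SCORE.length);
    return String([...text].filter((ch) => ch === "e").length);
  }
  return prompt;
}

let best_program = "hello world, this is a start";
let best_score = Number(llm(SCORE + best_program));

// Bounded loop instead of while(true); only accept strict improvements.
for (let step = 0; step < 1000; step++) {
  const program = llm(IMPROVE + best_program);
  const score = Number(llm(SCORE + program));
  if (score > best_score) {
    best_score = score;
    best_program = program;
  }
}
```

With a real model behind llm(), the same accept-if-better skeleton applies; the hard part becomes making the scoring call consistent enough that its comparisons mean something.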
You can also optimize the optimization function, scoring function and improver function as you go, for a fully self-optimizing optimizer.</p><p>while (true) {<br /> lmm = optimize(lmm, lmm, scoring_model, improver_model)<br /> optimize = optimize(lmm, optimize, scoring_model, improver_model)<br /> scoring_model = optimize(lmm, scoring_model, scoring_model, improver_model)<br /> improver_model = optimize(lmm, improver_model, scoring_model, improver_model)<br />}</p><p>The laws of numerical integration likely apply here, you'll halve the noise by taking 4x the samples. Who knows!</p><p><br /></p><p>LLMs generate text at a few hundred bytes per second. An LLM takes a second to do a simple arithmetic calculation (and gets it wrong, because the path generated for math is many tokens long and the temperature plus lossy compression make it pull the wrong numbers.) The hardware is capable of doing I/O at tens or hundreds of gigabytes per second. Ancient CPUs do a billion calculations in a second. I guess you could improve on token-based math by encoding all 16-bit numbers as tokens and having some magic in the tokenizer... but still, you're trying to memorize the multiplication table or addition table or what have you. Ain't gonna work. Use a computer. They're really good at arithmetic.</p><p>We'll probably get something like RAG ("inject search results into the input prompt") but on the output side ("inject 400 bytes at offset 489 from training file x003.txt") to get to megabytes / second LLM output rates. Or diffusers... SDXL img2img at 1024x1024 resolution takes a 3MB context and outputs 3MB in a second. If you think about the structure of an LLM, the slow bitrate of the output is a bit funny: Llama2's intermediate layers pass through 32 megabytes of data, and the final output layers up that to 260 MB, which gets combined to 32000 token scores, which are then sampled to determine the final output token. 
Gigabytes of I/O to produce 2 bytes at the output end.</p><p><br /></p><p>SuperHuman benchmark for tool-using models. Feats like "multiply these two 4096x4096 matrices, you've got 50 ms, go!", grepping large files at 20 GB/s, using SAT solvers and TSP solvers, proof assistants, and so on. Combining problem solving with known-good algorithms and optimal hardware utilization. The problems would require creatively combining optimized inner loops. Try to find a Hamiltonian path through a number of locations and do heavy computation at each visited node, that kind of thing.</p><p><br /></p><p>Diffusers and transformers. A diffuser starts off from a random field of tokens and denoises it into a more plausible arrangement of tokens. A transformer starts off from a string of tokens and outputs a plausible continuation.</p><p>SD-style diffusers are coupled with an autoencoder to convert input tokens into latent space, and latents to output tokens. In the classic Stable Diffusion model, the autoencoder converts an 8x8 patch of pixels into a single latent, and a latent into an 8x8 patch of pixels. These conversions consider the entire image (more or less), so it's not quite like JPEG's 8x8 DCT/iDCT.</p><p>What if you used an autoencoder to turn a single latent space LLM token into 64 output tokens? 64x faster generation with this one trick?</p><p>A diffuser starts off from a random graph and tweaks it until it resolves into a plausible path. A transformer generates a path one node at a time. </p><p><br /></p><p>A transformer keeps track of an attention score for each pair of input tokens, which allows it to consider all the relations between the tokens in the input string. This also makes it O(n^2) in time and space. For short inputs and outputs, this is not much of a problem. At longer input lengths you definitely start to feel it, and this is the reason for the tiny context sizes of TF-based LLMs. 
If the "large input" to your hundred gigabyte program is 100kB in size, there's probably some work left to be done.</p><p>Or maybe there's something there like there was with sorting algorithms. You'd think that to establish the ordering, you have to compare each element with every other element (selection sort, O(n^2)). But you can take advantage of the transitivity of the comparison operation to recursively split the sort into smaller sub-sorts (merge sort, quicksort, O(n log2 n)), or the limited element alphabet size to do it in one pass (radix sort, counting sort, O(n)-ish).</p><p>What could be the transitive operation in a transformer? At an output token, the previous tokens have been produced without taking the output token into account, so you get the triangle matrix shape. That's still O(n^2). Is there some kind of transitive property to attention? Like, we'd only need to pay attention to the tokens that contributed to high-weight tokens? Some parts of the token output are grammatical, so they weigh the immediately preceding tokens highly, but don't really care about anything else. In that case, can we do an early exit? Can we combine token sequences into compressed higher-order tokens and linearly reduce the token count of the content? Maybe you could apply compression to the attention matrix to reduce each input token's attention to top-n highest values, which would scale linearly. What if you took some lessons from path tracing like importance sampling, lookup tree, reducing variance until you get to an error threshold. Some tokens would get resolved in a couple of tree lookups, others might take thousands. 
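The top-n idea from the last paragraph — keep only each query token's n strongest attention weights, so storage scales as O(n·k) instead of O(n²) — might look like this on a toy weight matrix (illustrative code; real inference stacks don't store attention this way):

```javascript
// Indices of the k largest values in one row of attention weights.
function topK(row, k) {
  return row
    .map((w, i) => [w, i])
    .sort((a, b) => b[0] - a[0])
    .slice(0, k)
    .map(([, i]) => i);
}

// Per query token, zero out everything but the top-k keys and renormalize,
// so each row stays a valid attention distribution.
function sparsify(attention, k) {
  return attention.map((row) => {
    const keep = new Set(topK(row, k));
    const kept = row.map((w, i) => (keep.has(i) ? w : 0));
    const sum = kept.reduce((a, b) => a + b, 0);
    return kept.map((w) => w / sum);
  });
}
```

At a 100k-token context with k = 64, that's 6.4 million stored weights per head instead of 10 billion — the open question, as above, is whether you can pick the top-k without computing the full row first.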
</p><p><br /></p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-60991496125218535052024-01-17T16:46:00.001+02:002024-01-17T16:46:07.936+02:00Paper Radio<p>I made a weird thing by bolting a couple of AI models together: <a href="https://twitch.tv/endless_bling" target="_blank">Paper Radio</a> - a streaming 24/7 radio channel that goes through the latest AI papers on arXiv, with daily and weekly summary shows. I use it to stay up to speed on AI research, playing in the background while I go about my day.</p><p>The app is made up of a bunch of asynchronous tasks running in parallel: There's the paper downloader that checks arXiv for new papers and downloads them, the PDF-to-text and PDF-to-images converters, a couple different summarizers for the different shows, an embeddings database, prompting system to write the shows, an LLM server, a streaming text-to-speech system with multiple voices, a paper page image video stream, a recording and mp3 encoding system, and OBS to wrap it all into a live video stream that can be sent to Twitch.</p><p>It's been pretty solid, running for days in a row without issues, aside from the tweaking and nudging and other dev work. Ran it for a couple of weeks, summarized and discussed a couple thousand papers.</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-17413894376950682462023-03-30T08:13:00.000+03:002023-03-31T04:41:21.932+03:00The AI undergrad philosophy whoaa man post<p> AI AI AI.</p><p>It's an AI world.</p><p>I sort of do some AI hobby work, done inference in production with existing vision models, committed stuff to Diffusers for hi-res Stable Diffusion image generation. Doing experiments on fast loading of LLMs. Trained some SD Dreambooth models, wrote code to run a multi-node-multi-GPU SD cluster. 
None of that is paying work though and I have no team so this is all tinkering with toys. Still, the below are not completely uninformed opinions.</p><p>If the below figments are too short, feed an interesting one to an LLM and ask it to expand the train of thought. You can also use an LLM to find arguments and counter-arguments to my simplistic stereotypes, add nuance and different perspectives.</p><p>These models are just "a lot of simple math" that "regurgitates what was fed to them", but so's your brain. A bunch of air and water reconfigured into a variety of complex molecules that are stuck together in a way that makes it wiggle. Your brain is capable of economically productive activity, and so are these models. They're trained and tuned to produce activity that can have economic value. That's what really matters at the end of the day.</p><p>AI systems are powerful. Powerful systems can do great things, both good and bad. Playing with the current crop of AI models made me think of cars. They are capable, but you need to be careful that you don't crash. Our brains are resilient, but they're running on a fixed architecture and patching takes thousands of years. </p><p>Culture. Books and recordings are passive cultural substrates. We humans are active cultural substrates. We change the culture and the passive substrates according to our sensory data and the culture evolves to be better adapted to our current situation. This cultural evolution is generally much faster than genetic evolution. The current surviving state of the culture is what managed to stay relevant and valuable up to this point.</p><p>We as humans are not very much without the culture and the cultural structures that we live in. Your place in life is largely determined by the culture you live in. We're cultural substrate, useful to the culture because we can record it and change it in ways that are more adaptive than random change.</p><p>Can an AI model be an active cultural substrate? 
Can it adapt passive substrates to better match sensory inputs? Yes, I believe so. That's basically what a computer program is. Take input, produce output, only programs that work survive, others are debugged. Can an AI model be faster than humans at adapting culture to the current state of the world? Yes. Computer systems can react to things at microsecond timescales. Can an AI model do higher-quality adaptations than humans? Yes. Think of physics simulations, weather models, etc. Can AI models do these across all culturally-relevant human activities? Used to be no, now it's starting to turn into yes.</p><p>Geopolitics of AI. China has 4x population, tech advancements increase GDP per capita, going to surpass US because of the population gap. US needs to make population size irrelevant to total GDP, or even a drag on GDP. China would develop efficient AIs that run on older hardware, focus on human-AI-combinations to keep population as a determining factor. US would develop AIs that require latest hardware, focus on AI that doesn't benefit from having a larger number of humans at its disposal.</p><p>Why AI development is unlikely to stop. If China stops, their large population becomes irrelevant. If the US stops, they're overtaken by the larger Chinese population and become irrelevant. If anyone else stops, they'll be eaten by the giants.</p><p>From the US perspective, the priority is slowing down Chinese AI development and speeding up the singularity timeline to take place while the chip production sanctions remain effective. From the Chinese perspective, the key is achieving singularity on existing hardware and using it to improve domestic chip production to make the sanctions ineffective.</p><p>Superhuman capability: Narrowly superhuman. Superhuman polymath. Superhumanity. Volumetric superhumanity vs quality-based superhumanity. GPT-4 is at superhuman polymath level. 
It can write song lyrics in the style of a concept album released by an obscure band half a century ago to incorporate the themes of a random book. All in thirty seconds. Sure, you can complain that the verse in the chorus is banal, but come on. There's no human that can do that. Not just the typing speed, but having read and memorized the book, knowing the artist, knowing what a concept album is, how it was styled, how could you parody it, etc. It will likely get to superhuman output quality as well. Judging from the progress in image generation systems, this would be around July 2023.</p><p>Image generation systems are at volumetric superhumanity. I can use a bunch of cheap GPUs to generate half a million high-quality images in a day. If a working artist produces one high-quality image per week, and if the entire humanity was working as artists, our output would be a billion high-quality images per week. I'd only need 2000 servers to match humanity in terms of output. And if you tweak the system and the prompts to generate once-a-decade masterpieces, you'd only need 4 servers. And these systems can generate images that are impossible for humans to make, so in a sense they already are at a narrow superhuman quality level.</p><p>Horses and oxen. How many horses do you see in cities? Cities used to be full of horse-related infrastructure just hundred years ago. Now it's all gone. Is economic activity measurable by the number of oxen employed? Is economic activity going to be measured by the number of educated humans? No? If so, would the evolutionary pressure on the culture lead to a situation where the surviving cultures have a minimal number of economically active humans, with everything else left to non-human systems. Think of the way transportation is nowadays: a few humans commanding machines that move incredibly heavy loads of cargo. 
The cognitive work equivalent for that would be a few humans commanding machines that do entire countries' worth of paperwork / creative work / programming / management / leadership / communications. And if the AI is better at commanding machines, a human in the loop would make the machine perform worse and it would end up with fewer resources than a fully non-human machine.</p><p>AI isn't going to take your job. The person using AI is going to take your job. This person is the person who would've hired you. That person will be replaced by AI by the person who is paying that person. The end state is a person with ownership of a fully AI organization. And most ownership is in AI-controlled funds that employ human fund managers who will be replaced by AI. AIs owning AI-run companies.</p><p>Hollow companies. The internal black box of companies is easier to replace with AI systems than the parts that rely on human contact. Replace the internals, keep a dwindling shell of humans to run the parts that require human interaction.</p><p>Supply-side AI is an easy thing to imagine. Fulfill demand cheaper, faster, and better. How about demand-side AI. AI buyer system, trading systems, yes, but how about AI consumers. Cheaper, faster and better consumers for your company's products. All companies are struggling with their customers. It's a pain to acquire customers and retain customers. What if you could create customers with AI? Very low acquisition cost, high customer loyalty, willing to pay the profitable price point, give you the best kind of feedback, work together with you to achieve success alongside your company. What if every service can have a billion DAUs? How does the money used by AI customers connect to the economy? What if every company can have millions of employees and billions of customers with trillion dollar market valuations.</p><p>Prediction systems. The brain is a bunch of cells that got together to better run the colony of cells that is you. 
You are a bunch of cells in your brain that got together to better run the brain. The rest of the brain feeds your cells pre-processed sensory stimuli, neurochemical context and a bunch of recorded firing patterns that you can apply to the current situation. You send suggestions back to the rest of the brain and the other parts of the brain turn those suggestions into neural firing patterns that make the rest of your body do things. Basically the rest of the brain tells you what's going on and asks "what should we do next?"</p><p>In groups we take this a step further, with a bunch of minds getting together into a multi-mind prediction system that takes in the inputs from the rest of the group and comes up with "what should we do next" and drives the actions to make that happen. Then we take these group minds one step further and create society-level minds to drive society-level actions.</p><p>This system doesn't work very well since the people who make up the group minds have limited bandwidth for communication and high incentives to guide the actions to be beneficial to the members of the mind at the expense of the rest of the group. Many minds develop all kinds of pain signal blockers to prevent the complaints from the rest of the group reaching them, with often self-destructive results for the group as a whole. Control over media is the opium of the government. </p><p>AI systems layer on top of this as another kind of a culture-level mind. Prediction systems that tap into the entire culture. They're an evolution of search. Instead of returning recorded memories, they return processed memories that are more relevant and directly applicable. They can also create new cultural artifacts, allowing your mind to skip the production process.</p><p>What of humans then? Are we the cleaners, the maintenance people, the astrocytes, doing flexible manual labor with our quadrillion-actuator self-repairing nanotech bodies? 
Or is there a better way to achieve those tasks, one that doesn't involve having the proverbial stables for the horses? Who will survive?</p><p>We are made out of programmable matter. We can program ourselves to become something else. We can program ourselves to become something that would survive. AI systems run on atoms and electrons, just like us. The thing separating us is the way our matter is programmed. Change the programming, keep up with the Joneses.</p><p>Is AI going to destroy humanity? No, if it is well aligned. But it is going to make humans irrelevant for cultural activity. Which will remove cultural reasons to supply humans with their necessities.</p><p>What are the holes in these arguments? Anthropomorphizing organizations and countries, overly broad strokes, simplified view of the actors, assigning agenda to explain the surviving state of things, overly rosy view of the extrapolated progress in AI systems, overly optimistic view of the human capacity to adapt, overly optimistic view of human capabilities and desirability of life, ranking the power of AI + support society above AI + raw materials, underinformed handwaving stereotypes of the motives and motivations of vast groups of people, immortality bias from having survived thus far, blinding fear, mispredicting offense-defense asymmetry, too low probability for errors, too low impact of errors. Thinking too big. Too high impact estimates in the short term, too low in the long term.</p><p>If global warming mitigation has shown us anything, it's that culture doesn't place too much value on people. Or put in another way, the cultures that maxed out energy production ended up thriving and displaced the energy saving cultures. Reducing impact on humans was not a driver for actions taken.</p><p>With AI, the cultures that max out AI-driven production end up thriving and displace the low-AI cultures.</p><p>Consciousness exists. 
There's a structure in your brain and a firing pattern of neurons that corresponds to consciousness. If that firing pattern is not active or if that structure is missing, you do not have consciousness. It's replicable.</p><p>How do you tell if something is conscious? It behaves in a way that you categorize as conscious. Through your interaction, you don't come across sensory inputs that would make you believe that the thing is not conscious. Your consciousness-detection system measures the existence of the consciousness structure. It doesn't have to be a fully-fledged structure. If a conversation has aspects of consciousness, there's some aspect of the conscious structure encoded in the thing that you're talking with.</p><p>Maybe it's the writer who has written a magic-like path of conversation that leads you to say the things that have a conscious-sounding response, and makes you end the conversation before you reach the limits of the magic. But if you veer off the magical path, you detect that the thing is not conscious. But it has a slice, a tiny slice, of the conscious structure. As you add more paths the conversation can take, the conscious structure becomes more complete. Eventually you have to start compressing, to find a smaller encoding for the conversation, as the size of the conversation tree grows exponentially. To generate paths instead of looking up pre-written paths.</p><p>What stands out as conscious? Topical responses to sensory input. Sense of self. Memory of past interactions. Memory of things independent of the current conversation. Seeking out new information. Integrating new information into memories. Predictable motives.</p><p>What kind of structure could encode consciousness?</p><p>Ten years of specific kinds of sensory inputs gets you from a kid to a high school graduate. Add another decade and you get a PhD. This learning is manifested as recorded firing patterns in your brain and perhaps some structural changes. 
Could you generate a condensed version of this sensory input, generate something that would fast-form memories? Get that decade down to a month? How would it be to spend those 20 years in education and emerge with two hundred PhDs worth of knowledge.</p><p>Over a school year, record all the classes for each year and materials, the homework assignments and the other work done. Now you have primary-to-PhD recording, such that when a person lived through it, they emerged with a PhD. Compress to find the common thread, the abstract concept of learning, the memory generation mechanism, optimize to make it faster and better.</p><p>There are many mental techniques to help your brain think. Memory palaces, mental arithmetic, flash learning, mathematics, logic, fast estimation techniques, probabilistic thinking, thinking from another's perspective, step-by-step thinking, devil's advocate, coming up with multiple variants, iterative honing, etc. These are also learned through sensory inputs, so you should generate a sensory input to learn these before you launch into learning the rest.</p><p>Offense-defense asymmetry. Dark forest. A civilization that can build a parabolic space mirror can focus a significant fraction of a star's output on detected exoplanets. All the victim would notice was a star becoming 100 million times brighter than their sun for a second as the beam hits. 3 years of sunshine delivered in a second. </p><p>It would be extremely difficult to detect the mirror under construction (think tiny satellites with explosively unfurled reflective sails once they're in position.) You could only notice the beam when it arrives, as it travels at the speed of light. Detecting other planets getting struck would also be slow since it takes years for the light to travel from other systems, and minutes even inside the same system. 
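As a rough sanity check on the "3 years of sunshine delivered in a second" figure above (assuming the beam is \(10^8\) times the target's normal insolation):

```latex
10^{8} \times 1\,\mathrm{s} = 10^{8}\,\mathrm{s}
\approx \frac{10^{8}\,\mathrm{s}}{3.15 \times 10^{7}\,\mathrm{s/yr}}
\approx 3.2\ \text{years of ordinary sunlight}
```

So the claim is internally consistent: one second at \(10^8\times\) brightness deposits about three years' worth of normal solar energy.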
And if it only takes a second to fry a planet and a few seconds to refocus the mirror, the attacker could fry a million planets in a year.</p><p>Why would they? Because if there's a civilization with the same idea, the first one to strike survives. If there's only a one in a million chance of another civ being an attacker, that's still a one in a million chance of instant death.</p><p>If you don't want to attack because of ethical considerations, seeking benefits from information exchange, or from the fear of being flagged as hostile and attacked, you'd try to hide as well as possible and spread out widely with minimal chance of one colony being traced back to other colonies.</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-66830177213129192022021-11-24T09:55:00.008+02:002021-11-24T10:00:40.811+02:00Azure Files with a high number of Hot Writes<p>My Azure Files bill for a small 80 GB project share was kinda high. Around $30 per month. But the storage cost should be under $5. What's going on in here?</p>
<p>Azure's non-Premium storage tiers bill you for transactions (basically filesystem syscalls). I was seeing 450 Create calls per 5-minute period in Monitoring/Insights. On the Hot storage tier, that amounts to an extra $25 per month. But the load went away at night, around the time I put my laptop to sleep.</p>
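Back-of-envelope on those numbers (the per-10,000-transaction price below is an assumption from memory of the Hot-tier write rate at the time, not a quoted price):

```javascript
// 450 Create calls every 5 minutes, billed per 10,000 transactions.
const callsPer5Min = 450;
const pricePer10k = 0.055; // USD per 10k write transactions (assumed rate)

const callsPerMonth = callsPer5Min * (60 / 5) * 24 * 30; // 3,888,000 calls
const costPerMonth = (callsPerMonth / 10000) * pricePer10k; // ≈ $21
```

Around $21/month from one idle editor window, which is in the same ballpark as the ~$25 jump on the bill.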
<p>Clearly the laptop was doing something. But what... Right. I had an open VSCode window for editing a project in a folder on the fileshare. Running a <code>serve</code> in a terminal too. Closed the VSCode window and the transactions per 5 minutes went to 0. That's a $25/month VSCode window. I guess it counts as SaaS.</p>
<p>Moral of the story? Billing for syscalls is a good business model.</p>
Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-36647867574040432772021-11-02T05:38:00.015+02:002021-11-02T15:19:55.519+02:00BatWerk - One Year In<iframe width="308" height="640" style="float:left;margin-right:64px;margin-bottom:16px;text-align:center;" src="https://www.youtube.com/embed/IB6EATgrwHI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p><a href="https://batwerk.com">The BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) keeps you healthy, happy and productive with minimal effort.</p>
<p>This series talks about the different aspects of the app and how I'm approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>).</p>
<p>Interested? Give it a try.</p>
<h2>One Year In</h2>
<p>It's now been a year since the Halloween party app started taking on a life of its own and turned into a wellness app, keeping the bats along for the ride (and pumpkins too, for the first six months.) I've used it personally throughout the year and seen it evolve from a HIIT-centric workout app into an ambient couple-minutes-at-a-time exercise app, into a more complete mood improver and planned activity driver. There's still a way to go, so enjoy the ride.</p>
<p>Over the year, I've collected 105k coins, which translates to roughly 170 hours of exercise. That's half an hour of exercise a day, not bad for a bunch of bats!</p>
<p>And with the ring game distributing the effort over the day, it's been possible to do it even when the outside temps have been +37C. It's also kept many pains at bay by getting me moving before the mildly uncomfortable sitting position turns into a painfully locked-up neck.</p>
<p>The weird thing is that I don't feel like I've done much exercise. For better and worse: the mood boost isn't as good as with a 15-minute workout or an hour-long run, but the minutes rack up and you don't get the muscle pains and tiredness. I'd like to mix it up with workouts or runs or something, just for the extra mood boost.</p>
<p>Onwards to year two!</p>
<h2>Where to next?</h2>
<p>To improve the lives of as many people as possible, the BatWerk movement needs more people doing it and improvement in the quality of activities we're doing, exercise and otherwise.</p>
<p>This is the guiding equation: <code>impact = userCount * userMinutesPerDay * qualityOfLifeImprovement(userMinutesPerDay)</code></p>
<p>My definition for the quality of life improvement is feeling healthy, not in pain, having a positive mental state, maintaining good relationships, getting my day-to-day tasks done, and achieving my long term goals.</p>
<p>To reach these ambitious goals, we need people who can get people doing the BatWerk routine. We need people who can increase the routine's quality of life improvement, and help users reach their optimal minutes per day. We need people who can get other people to execute at the top of their game. Finally, we need people who can keep everyone paid and enable them to work on BatWerk at their fullest capacity.</p>
<p>Let's do it! <a href="mailto:hei@heichen.hk">Drop me a mail.</a></p>
<h2>Current state of werk</h2>
<p>Currently, the quality of life improvement from BatWerk is good on exercise minutes (let's say 80% on target with 30 min/day - I'd like to add some occasional intensity), mediocre on mood improvement (30% - the movement and messaging help, but it doesn't create a very solid mood), poor on driving goal-achieving action (20% - it drives initial action, but tends to lead to diluted effort and low-impact actions.) The challenge is habit-stacking these in a mutually reinforcing manner. If you add a badly-designed painful action driver after the exercise ring, it not only makes the action unlikely to happen, but makes you want to avoid doing the exercise as well.</p>
<p>In terms of user minutes per day, BatWerk is pretty much on target. There are ways to bring it up in a way that improves the quality of life metric, but it's good as it stands.</p>
<p>At this time, the most impactful way to increase BatWerk impact is increasing the user count by introducing more people to the activity and improving initial user retention. That will likely require a different packaging for the activity to make it a better identity fit, alongside a great effort to tell more people about the activity (and improving the design in a direction that makes people tell their friends about it too.)</p>
<p>Sustainability of the BatWerk movement is precarious. I'm working on it in my spare time, it has zero revenue, and based on my previous experience with monetizing websites and apps, the revenue is unlikely to exceed $20 per year. That's twenty dollars, yes. To fix this dangerous situation, the priorities are to find a person who can create enough cashflow to sustain their work on BatWerk impact, and to increase the cashflow as a function of the impact to increase the size of a potential team. With the cashflow more secure and scalable, the priority shifts to hiring people who can increase the BatWerk impact, and building and improving the impact-generation machine.</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-78766417747077354282021-09-27T18:29:00.002+03:002021-09-28T12:19:26.334+03:00Quickgres<p><a href="https://github.com/kig/quickgres">Quickgres</a> is a native-JS PostgreSQL client library.
<p>It's around 400 lines of code, with no external dependencies.
<p>One afternoon a few years ago I thought that it'd be fun to write a web framework for SPAs in node.js. The result of that effort is <a href="https://github.com/kig/qframe">qframe</a>. Qframe is a post for another time, but its design philosophy led me down the rabbit hole of writing a PostgreSQL client library. See, qframe was minimalistic. The whole framework fits in 300 lines of (<i>mildly</i> over-dense) code (<a href="https://github.com/kig/qframe/blob/master/index.js">read it</a>, it's highly commented.) And the `pg` client library was kinda big. So. Hey. Writing a Postgres client library can't be much harder than a web framework, right?
<p>400 lines of code later: it was a little bit harder (33% harder if you go by lines of code.) But man, the power. If you control the server, the database layer and the client, you can do some crazy stuff. For example, streaming database responses directly to the client socket and having the <a href="https://github.com/kig/quickgres/blob/master/quickgres-frontend.js">client parse the Postgres protocol</a> (see the `DB.queryTo` bits in the <a href="https://github.com/kig/qframe/blob/quickgres/index.js">quickgres branch of qframe</a>.) This can make the web server super lightweight: many HTTP request handlers become "write SQL stored procedure call to DB socket, memcpy the response to the HTTP socket."
<p>Quickgres is a pipelined PostgreSQL client library. The core loop writes queries to the connection socket as they come. Once the responses arrive they are copied off the receive buffer and passed to the query promise resolvers. You can have thousands of queries in-flight at the same time. This gives you high throughput, even over a single database connection. The connection buffer creation and reading is optimized to a good degree. Responses are stored as buffers, and only parsed into JS objects when needed (which ties to the above stream-straight-to-client example, the server can get away with doing minimal work.)
<p>There's no type parsing - it was difficult to fit into 400 lines of code. Type parsing is also subtle, both in terms of accurate type conversions and performance implications. In Quickgres you have to explicitly cast your JS objects into strings or buffers to pass them to the DB layer. You'll know exactly what the database receives and how much work went into producing the string representation.
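<p>In practice that looks something like this (an illustrative sketch with made-up values, not Quickgres API specifics): you cast on the way in and parse on the way out.</p>

```javascript
// Parameters go to the DB as strings or Buffers; results come back unparsed.
const userId = 42;
const params = [String(userId)];     // cast number -> string on the way in
const countValue = '1000016';        // e.g. a SELECT count(*) arrives as text
const n = parseInt(countValue, 10);  // parse on the way out, only where needed
console.log(params[0], n);           // '42' 1000016
```
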
<p>As I mentioned above, Quickgres is around 400 lines of code. <a href="https://github.com/kig/quickgres/blob/master/index.js">Have a read</a>, it's commented (though not quite as well as Qframe). The core bits are Client onData and processPacket. The onData function parses the next packet out of the connection socket, and passes it to processPacket. Most of the rest of the Client is a bunch of functions to create and fill buffers for different PostgreSQL protocol messages. The RowReader and RowParser classes deal with parsing query results out of received packets. Reading it now, I might not want to make a magical struct (i.e. access columns through `row.my_column`) in the RowParser and instead have `.get('my_column')` API for simplicity. Anyway, the generated RowParser structs are reused by all invocations of the stored procedure, so it shouldn't be a major performance issue. You can also get the rows as arrays, if needed.
<p>Performance-wise, I was able to read a million rows per second on a 2018 MacBook Pro over a single connection. Queries per second, around 45k/connection. With multiple cores, you can get anything from 130k-750k SELECT queries per second. For SELECT followed by UPDATE, my best results were 100k/sec. You may be able to eke a few billion queries per day out of it if your workload and server agree.
<p>I tried out connection pooling, but it doesn't give you more performance, so there's no built-in pooling mechanism. You could use a connection pool to smooth out average response times if you have a few queries that take a long time and everything else running quick (but in that case, maybe have just two connections: one for the slow queries and the other for everything else.) The main reason to "pool for performance" is if your client library doesn't pipeline requests. That will add the connection latency to every single query you run. Let's say your DB can handle 30k queries per second on localhost. If you have a non-pipelined client library that waits for a query to return its results before sending out the next one, and you access a database with 20ms ping time, you'll be limited to 50 queries per second per connection. With a non-pipelined client each query needs to be sent to the database, processed, and sent back before the next query can be executed. With a pipelined client, you can send all your queries without waiting, and receive them back in a continuous stream. You'd still have minimum 20 ms query latency, but the throughput is no longer latency-limited. If you have enough bandwidth, you can hit the same 30kqps as on localhost.
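<p>The round-trip arithmetic in that example, spelled out:</p>

```javascript
// Non-pipelined client: each query waits out a full network round trip
// before the next one goes out, so latency caps per-connection throughput.
const pingMs = 20;
const nonPipelinedQps = 1000 / pingMs; // 50 queries/sec per connection
// Pipelined client: queries overlap in flight, so the ceiling is the
// server's own processing rate (30k qps in the example), not the latency.
const pipelinedQps = 30000;
console.log(nonPipelinedQps, pipelinedQps); // 50 30000
```
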
<p>It has <a href="https://github.com/kig/quickgres/blob/master/test/tests.js">tests</a>. But they require some test table creation beforehand. Which I never got around to scripting. (In case you want to run the tests: `CREATE TABLE users (id uuid, name text, email text, password text)`, fill with a million users, with numbers from 0 to 999 999 as emails, and one user with id 'adb42e46-d1bc-4b64-88f4-3e754ab52e81'.)
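<p>For reference, a setup along those lines might look like this (an untested sketch; `gen_random_uuid()` needs the pgcrypto extension before PostgreSQL 13, and the generated names are made up):</p>

```sql
CREATE TABLE users (id uuid, name text, email text, password text);
-- A million users with the numbers 0..999999 as emails...
INSERT INTO users
  SELECT gen_random_uuid(), 'user_' || i, i::text, 'pass'
  FROM generate_series(0, 999999) AS i;
-- ...plus the fixed-id user the tests look up.
INSERT INTO users VALUES
  ('adb42e46-d1bc-4b64-88f4-3e754ab52e81', 'test', 'test@example.com', 'pass');
```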
<p>If you find this useful or amusing, you can send us a million dollars (or more!) from your cloud budget every year to support our great neverending work. Because it is Great and Neverending. It literally Never Ends. How much is a paltry million in comparison to the Infiniteness of Space? Nothing. Less than nothing. So send it now. <a href="https://ancientegyptonline.co.uk/ea7/">Send, then, as much as your father did!</a> Send to ETH address 0x24f0e742f5172C607BC3d3365AeF1dAEA16705dc
<p>The proceeds will be spent on BatWerk to make you healthier and happier.
<h2>README</h2>
<p>Features
<ul>
<li> Queries with parameters (along with prepared statements and portals).
<li> Each parameterized query creates a cached prepared statement and row parser.
<li> COPY protocol for speedy table dumps and inserts.
<li> Lightly tested SSL connection support.
<li> Plaintext & MD5 password authentication.
<li> Partial query readback.
<li> You should be able to execute 2 GB-sized queries (in case you want to store movies in TOAST columns? Maybe use large objects instead.) I haven't tried it, though.
<li> Canceling long running queries.
<li> Binary params, binary query results.
<li> Fast raw protocol pass-through to output stream
<li> Client-side library for parsing PostgreSQL query results in the browser
</ul>
<p>Lacking
<ul>
<li> Full test suite
<li> SASL authentication
<li> Streaming replication (For your JavaScript DB synced via WAL shipping?)
<li> No type parsing (This is more like a feature.)
<li> Simple queries are deprecated in favor of parameterized queries.
</ul>
<p>What's it good for?
<ul>
<li> It's relatively small so you can read it.
<li> It doesn't have deps, so you don't need to worry about npm dephell.
<li> Performance-wise it's ok. Think 100,000 DB-hitting HTTP/2 requests per second on a 16-core server.
</ul>
<h2>Usage</h2>
<pre><code class="prettyprint">
const { Client } = require('quickgres');
const assert = require('assert');

async function go() {
  const client = new Client({ user: 'myuser', database: 'mydb', password: 'mypass' });
  await client.connect('/tmp/.s.PGSQL.5432'); // Connect to a UNIX socket.
  // await client.connect(5432, 'localhost'); // Connect to a TCP socket.
  // await client.connect(5432, 'localhost', {}); // Connect to a TCP socket with SSL config (see tls.connect).
  console.error(client.serverParameters);

  // Access row fields as object properties.
  let { rows, rowCount } = await client.query(
    'SELECT name, email FROM users WHERE id = $1', ['adb42e46-d1bc-4b64-88f4-3e754ab52e81']);
  console.log(rows[0].name, rows[0].email, rowCount);
  console.log(rows[0][0], rows[0][1], rowCount);

  // You can also convert the row into an object or an array.
  assert(rows[0].toObject().name === rows[0].toArray()[0]);

  // Stream raw query results protocol to stdout (why waste cycles on parsing data...)
  await client.query(
    'SELECT name, email FROM users WHERE id = $1',
    ['adb42e46-d1bc-4b64-88f4-3e754ab52e81'],
    Client.STRING, // Or Client.BINARY. Controls the format of data that PostgreSQL sends you.
    true, // Cache the parsed query (default is true. If you use the query text only once, set this to false.)
    process.stdout // The result stream. Client calls stream.write(buffer) on this. See RowReader for details.
  );

  // Binary data
  const buf = Buffer.from([0,1,2,3,4,5,255,254,253,252,251,0]);
  const result = await client.query('SELECT $1::bytea', [buf], Client.BINARY, false);
  assert(buf.toString('hex') === result.rows[0][0].toString('hex'), "bytea roundtrip failed");

  // Query execution happens in a pipelined fashion, so when you do a million
  // random SELECTs, they get sent to the server right away, and the server
  // replies are streamed back to you.
  const promises = [];
  for (let i = 0; i < 1000000; i++) {
    const id = Math.floor(Math.random()*1000000).toString();
    promises.push(client.query('SELECT * FROM users WHERE id = $1', [id]));
  }
  const results = await Promise.all(promises);

  // Partial query results
  client.startQuery('SELECT * FROM users', []);
  while (client.inQuery) {
    const resultChunk = await client.getResults(100);
    // To stop receiving chunks, send a sync.
    if (resultChunk.rows.length > 1) {
      await client.sync();
      break;
    }
  }

  // Copy data
  // Let's get the users table into copyResult.
  const copyResult = await client.query('COPY users TO STDOUT (FORMAT binary)');
  console.log(copyResult.rows[0]);
  // Let's make a copy of the users table using the copyResult rows.
  const copyIn = await client.query('COPY users_copy FROM STDIN (FORMAT binary)');
  console.log(copyIn.columnFormats);
  copyResult.rows.forEach(row => client.copyData(row));
  await client.copyDone();

  await client.end(); // Close the connection socket.
}

go();
</code></pre>
<h2>Test output</h2>
<p>On a 13" MacBook Pro 2018 (2.3 GHz Intel Core i5), PostgreSQL 11.3.
<pre><code class="prettyprint">
$ node test/test.js testdb
46656.29860031104 'single-row-hitting queries per second'
268059 268059 1
268059 268059 1
README tests done
received 1000016 rows
573403.6697247706 'partial query (100 rows per execute) rows per second'
received 10000 rows
454545.45454545453 'partial query (early exit) rows per second'
warming up 30000 / 30000
38510.91142490372 'random queries per second'
670241.2868632708 '100-row query rows per second'
925069.3802035153 'streamed 100-row query rows per second'
3.0024 'stream writes per query'
1170973.0679156908 'binary query rows per second piped to test.dat'
916600.3666361136 'string query rows per second piped to test_str.dat'
595247.619047619 'query rows per second'
359717.9856115108 'query rows as arrays per second' 10000160
346505.8905058905 'query rows as objects per second' 1000016
808420.3718674212 'binary query rows per second'
558980.4359977641 'binary query rows as arrays per second' 10000160
426264.27962489345 'binary query rows as objects per second' 1000016
Cancel test: PostgreSQL Error: 83 ERROR VERROR C57014 Mcanceling statement due to user request Fpostgres.c L3070 RProcessInterrupts
Elapsed: 18 ms
Deleted 1000016 rows from users_copy
47021.94357366771 'binary inserts per second'
530794.0552016986 'text copyTo rows per second'
461474.8500230734 'csv copyTo rows per second'
693974.3233865371 'binary copyTo rows per second'
Deleted 30000 rows from users_copy
328089.56692913384 'binary copyFrom rows per second'
done
Testing SSL connection
30959.752321981425 'single-row-hitting queries per second'
268059 268059 1
268059 268059 1
README tests done
received 1000016 rows
454346.2062698773 'partial query (100 rows per execute) rows per second'
received 10000 rows
454545.45454545453 'partial query (early exit) rows per second'
warming up 30000 / 30000
23094.688221709006 'random queries per second'
577034.0450086555 '100-row query rows per second'
745156.4828614009 'streamed 100-row query rows per second'
3 'stream writes per query'
1019379.2048929663 'binary query rows per second piped to test.dat'
605333.5351089588 'string query rows per second piped to test_str.dat'
508655.13733468973 'query rows per second'
277243.13834211254 'query rows as arrays per second' 10000160
252848.54614412136 'query rows as objects per second' 1000016
722033.21299639 'binary query rows per second'
432907.3593073593 'binary query rows as arrays per second' 10000160
393242.62681871804 'binary query rows as objects per second' 1000016
Cancel test: PostgreSQL Error: 83 ERROR VERROR C57014 Mcanceling statement due to user request Fpostgres.c L3070 RProcessInterrupts
Elapsed: 41 ms
Deleted 1000016 rows from users_copy
33407.57238307349 'binary inserts per second'
528829.1909042834 'text copyTo rows per second'
501010.0200400802 'csv copyTo rows per second'
801295.6730769231 'binary copyTo rows per second'
Deleted 30000 rows from users_copy
222176.62741612975 'binary copyFrom rows per second'
done
</code></pre>
<h2>Simple simulated web workloads</h2>
<p>Simulating web session workload: Request comes in with a session id, use it to fetch user id and user data string. Update user with a modified version of the data string.
<p>The `max-r` one is just fetching a full session row based on session id, so it's a pure read workload.
<pre><code class="prettyprint">
$ node test/test-max-rw.js testdb
32574 session RWs per second
done
$ node test/test-max-r.js testdb
130484 session Rs per second
done
</code></pre>
<p>Yes, the laptop hits Planetary-1: one request per day per person on the planet. On the RW-side, it could serve 2.8 billion requests per day. Note that the test DB fits in RAM, so if you actually wanted to store 1k of data per person, you'd need 10 TB of RAM to hit this performance with 10 billion people.
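<p>The planetary claim in numbers (a quick sketch of the arithmetic):</p>

```javascript
// Requests per day at the measured per-second rates from the laptop run.
const readsPerDay = 130484 * 86400;  // 11,273,817,600 ≈ 11.3 billion/day
const writesPerDay = 32574 * 86400;  //  2,814,393,600 ≈  2.8 billion/day
console.log(readsPerDay, writesPerDay);
```
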
<p>On a 16-core server, 2xE5-2650v2, 64 GB ECC DDR3 and Optane. (NB the `numCPUs` and connections per CPU have been tuned.)
<pre><code class="prettyprint">
$ node test/test-max-rw.js testdb
82215 session RWs per second
done
$ node test/test-max-r.js testdb
308969 session Rs per second
done
</code></pre>
<p>On a 16-core workstation, TR 2950X, 32 GB ECC DDR4 and flash SSD.
<pre><code class="prettyprint">
$ node test/test-max-rw.js testdb
64717 session RWs per second
done
$ node test/test-max-r.js testdb
750755 session Rs per second
done
</code></pre>
<p>Running server on the Optane 16-core machine, doing requests over the network from the other 16-core machine.
<pre><code class="prettyprint">
$ node test/test-max-rw.js testdb
101201 session RWs per second
done
$ node test/test-max-r.js testdb
496499 session Rs per second
done
</code></pre>
Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-82013447393791578662021-09-03T11:58:00.001+03:002021-09-03T12:13:18.782+03:00BatWerk 5 - How to play <p>The goal of the <a href="https://batwerk.com">BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) is to keep you healthy, happy and productive without requiring you to overhaul your life. This is a series of blog posts that talks about the different aspects of an exercise app and how we're approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-2-on-muscles.html">How do muscles work</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-3-pains.html">Pains</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>). Interested? <a href="mailto:hei@heichen.hk">How would you like to improve it?</a></p><h2 style="text-align: left;">How to play?</h2><p>There are two ways to play BatWerk: The hand-held mode and the free space mode. In the hand-held mode you hold the phone in your hand and tilt the phone or move your head to move your character as you do the moves. In the free space mode, you place the phone on a chair and move in the camera to move the character and do the moves.</p><p>For example, you get the move "Look left and right" with sports balls appearing on the left and right sides of the screen. In the hand-held mode, you'd hold the phone in front of your face and look to your left to pick up the first ball, then look to your right to pick up the next ball. 
Your hand holding the phone would stay stable in front of you, just your head would move and exercise your neck.</p><p>In the free space mode, you'd place the phone on a chair or low table in front of you so that your entire body is in the camera view. To pick up the balls, you'd move your body to look left and then to look right, like a dancer. You could also pick up the balls by stepping from left to right. The exact way you do it doesn't matter so much, as long as you move in a good form and don't hurt yourself.</p><p>If you don't know how to do the moves, look at what the guide character is doing and copy that. If you're having difficulty moving the character, check the camera view for image quality. You should have your head in view reasonably lit (i.e. not totally dark or super backlit). The free space mode works best if you keep your entire body visible in the camera frame.</p><p>Most of the moves are on timers, the reps are up to you. Some moves have a fixed number of reps, but the pace is up to you. If these moves are too much work, you can go to the phone and tap through them. If you can't or don't want to do a particular move, you can do something else instead or press the skip button. The goal is to move for a couple minutes, and the suggested moves are just suggestions that you don't have to follow.</p><p>As you do moves, you earn coins and complete rings. A minute or two of moving is enough to fill up one of the small rings. There are 12 small rings in total every day. A new ring starts filling up every half hour, so the way you play is to do a couple minutes of moves, fill up the ring, then go back to doing other things for half an hour or hour, and come back to do the next ring. Complete all the small rings to get half an hour of exercise spread across the day. Every day at midnight, the rings reset and you start anew.</p><p>The game design might sound a bit odd, but there are scientific reasons behind it. 
You should move for about two minutes every half an hour according to research. But life doesn't always allow you to take a few minutes to power up. Or maybe you're on a long walk already, so extra moves on top of that would be pointless. Whatever the case, you only need to do moves across six hours of the day.</p><h2 style="text-align: left;">Workouts</h2><p>The workout mode takes you through a 15-minute random workout. Put on music and move to the beat and it's good fun. The first set is a gentle warmup, followed by the four main sets, and the stretching set at the very end. I don't do the workout very often (+32C summer, ugh), but it's a great mood lifter. Fist pumps and cheers all around.</p><p>If a move in the workout is too tough, or it's a floor move and I'm outside, I do something else. Burpees become crouching jumps, sit-ups turn into bending backwards, and so on. Most moves don't have a number of reps, just a timer. And if there's a fixed number of reps, you can tap through it after you get fed up.</p><p>As you might guess, it's not exactly a Spartan routine where you curse the app and the developer after failing to complete the first half of the "Beginner Workout". At least I hope so. Have fun, don't overdo it. The workout mode is there for having fun and getting your mood up.</p><p><br /></p><p><br /></p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-58777428281933947812021-09-02T07:45:00.003+03:002021-09-03T12:12:47.543+03:00BatWerk 4 - Maintaining the routine<p>The goal of the <a href="https://batwerk.com">BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) is to keep you healthy, happy and productive without requiring you to overhaul your life. 
This is a series of blog posts that talks about the different aspects of an exercise app and how we're approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-2-on-muscles.html">How do muscles work</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-3-pains.html">Pains</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>). Interested? <a href="mailto:hei@heichen.hk">How would you like to improve it?</a></p><h2 style="text-align: left;">The Hardest Part</h2><p>The hardest part in any exercise routine is keeping it up. Exercise is an unnatural waste of energy that you do to maintain yourself. Your brain really really doesn't want to do exercise. In the olden days, you had to walk 4 hours every day to find enough food to eat. Any extra movement outside of that was going to burn off your energy reserves and require a longer walk the next day.</p><p>Sure, it may be fun to exercise, but it tends to be a very optional part of your life. It's not like eating, breathing or sleeping. Or talking for that matter. You can easily take a year off exercise. I did that. I used to do a daily 15-minute exercise routine for two years, but then we had our first child and I just gradually stopped doing the routine. For years. I only really started getting back on track when I started developing BatWerk. In that sense, it's been a personal success already. Now I want to help more people enjoy the benefits that come from using BatWerk. (And, well, if I want to keep on using it, it needs to make enough money to fund a team that keeps developing it.)</p><p>The reason you can stop exercising in the first place is that your activity driving system doesn't see exercise as necessary. 
From its hunter-gatherer perspective, you're going to get exercise anyway in your hunt for the necessities, so there's no need to drive extra movement - if anything, the activity driver wants you to use as little energy as possible.</p><p>The problem is that many systems in your body take the aforementioned 4 hours of movement as granted. I mean, to eat, you're going to need to walk 4 hours in rough terrain anyhow, right? Otherwise you'd starve and die, right? So there's no point in assuming you won't get 4 hours of walking every day, right? </p><p>Enter modern life with its ridiculous amounts of easily available food. Couple that with the hunter-gatherer activity driver that responds to available food by resting and feasting. It was rare to find loads of good food, so the best course of action was to load up before it ran out or the competition showed up.</p><p>If you have lots of food available, you don't move much. You likely have lots of food available right now. So you don't move much. And the systems in your body that rely on movement don't work so well. </p><p>For example, your veins rely on muscle contractions to transport blood back to your heart, your lymphatic system uses your movements to clear out waste from your tissues, your guts use the walking motion to help with gut mass transport, your joints are lubricated by movement, your bones, muscles and connective tissues grow and strengthen in response to usage. If you don't move enough, you start having health issues. And your activity driver reacts to health issues by asking you to rest, leading to more issues.</p><p>That's the problem. Exercise is necessary for your body, but it's linked to thirst, hunger and threat avoidance in your brain. Remove thirst, hunger and threats and your brain switches your body to power-saving mode. Your body doesn't work properly if it's in power-saving mode for long periods of time. 
What to do?</p><h2 style="text-align: left;">Activity Driver</h2><p>The key to an exercise app that makes you actually exercise is to create an activity driver. Much like the habit-creating loops in social media apps and games, an exercise app needs to wrap the unnecessary exercise activity in a shell that drives continuous activity. It needs to make optional and unnecessary exercise into required and necessary exercise.</p><p>What we have in BatWerk is a system to drive regular exercise. The frictionless design removes barriers to getting started. The ring game drives movement throughout the day. The reward system keeps you coming back. The messaging creates an identity around taking responsibility, doing, and finishing.</p><p>The design of the activity is meant to drive daily movement because that's what you need to avoid the damage from not moving - exercise needs to be regular, like eating. A large part of the design is about getting you started every day. Once you start, the ring game pulls you towards completing all rings that day.</p><p>The short-term rewards have a random element, which drives you to continue the move loop. The long-term accumulation of rewards creates a sunk cost and a status symbol, which makes you place a higher value on the exercise you've done. The incomplete pattern in the ring game makes you want to complete the pattern by collecting the rings. The time limits in the ring game drive action in the now.</p><p>Taken together, you've got a system that gets you do the first move, uses that to drive you to do a couple minutes of moves, uses that to make you want to do 30 minutes of moves over the day, and uses the rewards from the daily moves to make you want to come back tomorrow. It's a tricky thing to build, and I could use help to get it smoother and more engaging. 
</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-14255582757613123912021-08-25T15:33:00.004+03:002021-09-03T12:13:29.850+03:00BatWerk 3 - Pains<p>The goal of the <a href="https://batwerk.com">BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) is to keep you healthy, happy and productive without requiring you to overhaul your life. This is a series of blog posts that talks about the different aspects of an exercise app and how we're approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-2-on-muscles.html">How do muscles work</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-3-pains.html">Pains</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>). Interested? <a href="mailto:hei@heichen.hk">How would you like to improve it?</a></p><p>Where does the pain come from? You know. The one that you get from wearing headphones all day. Or the one from sitting at your desk tapping away at the secrets of the universe. Or that one where you're thumbing your mobile for hours on end.</p><h2 style="text-align: left;">Pain</h2><p>The initial thoughts I had around pain: Pain is a bunch of nerves firing pain signals and those getting carried up to the brain. Painkillers block the inflammation response without fixing the root cause. The nerves fire pain signals because they're activated in some way. Maybe there's something pressing on the nerve, maybe the nerve is under tension, maybe there's damage of some sort that's causing the nerve to fire. </p><p>Going from there, what could be pressing on the nerve? 
Holding your body in a painful position. Build-up of liquid that increases pressure on the nerve. Inflammation. Damage to the nerve.</p><p>Some causes are difficult to fix by yourself, but some are more feasible. Painful position? Change position. Find a good sleeping position. Static load on muscle that causes build-up of liquid (e.g. using a computer / phone for a few hours)? Take regular breaks, try to unlock the muscle by locating it with status query meditation, massage, and 10-15 minutes of varied moves. Throw in a painkiller to help reduce the inflammation. Wearing headphones and getting neck pain? Try earbuds. Pain on the sides of your head? Bring your display down so that you don't have to look up so much. </p><p>Wrists hurting after a day at the keyboard? Wear mittens. Set a timer to lock the screen every 15 minutes. Do pushups, curls and grips to strengthen your wrists and fingers. Move your fingers less when typing. Use a flat keyboard with minimal force required.</p><h2 style="text-align: left;">Prevention design</h2><p>Doing the micro-exercises every half an hour over the work day has been a pretty good preventive measure. It makes me move often enough and with enough variety that I don't get as many static load lockups. Still happens sometimes though, especially when I zone into something for more than an hour, or keep working in a weird position for a while. The workout and the neck stretch set can help there but it's kind of tough to get yourself through 15 minutes of movement when your head is splitting and you just want to sleep.</p><p>What sometimes helps is the painkiller, movement and massage -combo. Block the pain to loosen up the muscle, then wash the dishes to shake them more, find out which neck muscle is hurting and massage that. 
It usually gets me far enough that I can fall asleep and rely on the sleep relaxation to fix stuff up.</p><p>The BatWerk moves contain a bunch of neck motions and core moves, twists and stretches because a major goal for the app is to act as a preventive painkiller. If your body doesn't lock up, it's not going to get lock up pains, right? Would love to make it even better, so let me know if you can help.</p><h2 style="text-align: left;">Deep pain</h2><p style="text-align: left;">What we have above are high level features of sedentary office worker pain. Folk medicine. But what is actually happening on a deeper level when your wrist starts to hurt?</p><p style="text-align: left;">Okay, so, let's say that you have the initial twinge of pain from the median nerve inside your wrist getting squeezed. First you get the fast pain, which is followed a bit later by the dull pain. Fast pain is a signal from a pain receptor that travels along an A nerve fiber. Dull pain is a pain signal that travels on a C nerve fiber.</p><p style="text-align: left;">The difference in the speed of sharp pain and dull pain comes from the different nerve fiber types. The A fibers are myelinated - they're wrapped in an insulating sheath that helps nerve signals travel faster. The C fibers are smaller and don't have the myelin wrapper, so it takes a few seconds for the signal to reach the brain.</p><p style="text-align: left;">As the nerve signal leaves the pain site at the wrist, it travels up the median nerve to the brachial plexus in the shoulder, where it splits into the medial cord and the lateral cord. The cords then further split and merge into the five root nerves connected to the spinal segments.</p><p style="text-align: left;">The pain signal enters the spinal cord at the dorsal horn, which can change the intensity of the signal. If you're pre-occupied with something, the dorsal horn can block the pain from reaching your brain. 
Then when you stop doing whatever you were doing, you start feeling the pain.</p><p style="text-align: left;">From the spinal cord, the pain signal travels via the spinothalamic tract to the brain. And now you're in pain.</p><h2 style="text-align: left;">Sensors</h2><p style="text-align: left;">Okay. Now we know the road pain travels. But what is it exactly?</p><p style="text-align: left;">Our sensory neurons detect changes in their neighborhood. There are sensors all around your body, for example on the skin and eyes, in your muscles, joints, organs and the digestive tract. These sensors are not only for pain, but are also responsible for sensory perception and your body sense (i.e. the position of your limbs relative to each other).</p><p style="text-align: left;">Different sensor neurons detect pressure changes, stretching, presence of chemicals, and temperature changes. The pain sensors are tuned to fire only when the stimulus exceeds a specific threshold. The threshold isn't fixed, and can change in response to changes in the environment. For example, a gentle touch can be quite painful on an inflamed area.</p><p style="text-align: left;">As the sensor is stimulated, it builds up an electric potential. When it exceeds the threshold value, it sets off an action potential (a kind of wave of flipped polarity) down the nerve. Electric? Yeah, all your cells maintain a -70 mV voltage between the inside and outside of the cell. Messing with this voltage is what neurons use for internal messaging.</p><p style="text-align: left;">A neuron consists of a receiving end and a transmitting end. The two ends of a neuron are connected by a cable called the axon. The receiving end sets off the action potential to tell the transmitting end to release neurotransmitter molecules. These are picked up by the receiving ends of the neurons next in the chain (which makes them more or less likely to fire off their own action potentials). 
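</p><p style="text-align: left;">As an aside, the fiber speeds from earlier explain the sharp-then-dull timing. A rough, illustrative calculation (typical textbook conduction velocities, not measurements), over a roughly one-metre path from wrist to brain:</p>

```python
# Rough, illustrative conduction latencies over a ~1 m wrist-to-brain path.
# Velocities are typical textbook figures, not measurements.
distance_m = 1.0
a_delta_speed = 15.0  # m/s, myelinated A-delta fiber ("fast" sharp pain)
c_fiber_speed = 1.0   # m/s, unmyelinated C fiber ("slow" dull pain)

fast_pain_s = distance_m / a_delta_speed
dull_pain_s = distance_m / c_fiber_speed
print(f"sharp pain after ~{fast_pain_s:.2f} s, dull pain after ~{dull_pain_s:.0f} s")
# → sharp pain after ~0.07 s, dull pain after ~1 s
```

<p style="text-align: left;">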
There are hundreds of different neurotransmitters, but each neuron usually releases only one kind. Pain sensors mainly use glutamate.</p><p style="text-align: left;">Putting it all together, tissue damage causes immune cells to release cytokines and prostaglandins that create inflammation which releases histamines and other molecules that are picked up by nociceptors, making them more sensitive and turning the inflamed area painful to touch. The inflammation also leads to the production of lysophosphatidic acid and sphingosine-1-phosphate that directly activate TRPV1 high temperature nociceptors responsible for burning sensations, causing a voltage potential buildup that exceeds the nociceptor's threshold, making it fire an action potential to its A fiber and C fiber axon ends, where the change in voltage triggers the intake of calcium ions into the cell, which signals the axon to bring neurotransmitter vesicles to pores in the cell membrane and release the contained glutamate molecules into the synaptic cleft, where the dendrite of the next neuron picks them up and excites the neuron, making it propagate the signal. On reaching the spinal column, the dorsal horn attenuates the pain signal with GABA and noradrenaline neurotransmitters before forwarding it on along the spinothalamic tract to the brainstem and thalamus. And now you're feeling burning pain at the inflammation site and it's become sensitive to touch.</p><p style="text-align: left;">This feels like one of those "what happens when you click on a link"-explainers that ends up with an explanation of how to build an adder circuit out of transistors. 
Anyway, that was an educational journey.</p><p style="text-align: left;">Nociceptor Sensory Neuron-Immune Interactions in Pain and Inflammation <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5205568/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5205568/</a></p><p style="text-align: left;">General Pathways of Pain Sensation and the Major Neurotransmitters Involved in Pain Regulation <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6121522/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6121522/</a></p><p style="text-align: left;">Presynaptic Inhibition of Pain and Touch in the Spinal Cord:
From Receptors to Circuits [PDF] <a href="https://www.mdpi.com/1422-0067/22/1/414/pdf">https://www.mdpi.com/1422-0067/22/1/414/pdf</a></p><p style="text-align: left;"><br /></p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-46626952084326420682021-08-23T10:19:00.005+03:002021-09-03T12:13:50.475+03:00BatWerk 2 - On muscles<p style="text-align: left;">This is a series of blog posts that talks about the different aspects of an exercise app and how we're approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-2-on-muscles.html">How do muscles work</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-3-pains.html">Pains</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>). Interested? <a href="mailto:hei@heichen.hk">How would you like to improve it?</a></p><p style="text-align: left;">The goal of the <a href="https://batwerk.com">BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) is to keep you healthy, happy and productive without requiring you to overhaul your entire life.</p><p style="text-align: left;">So, of course I ended up wandering down some researchy rabbit holes to figure out how to do that. One of these was figuring out what actually happens when you do exercise. Why does your body move in the first place and how does it go about achieving that. Here are some confused notes on the whole process. 
Comments and corrections welcome, I'm out of my depth here!</p><h2 style="text-align: left;">What's a Muscle Anyway</h2><p>Hey, your <a href="https://med.libretexts.org/Bookshelves/Veterinary_Medicine/Book%3A_Introductory_Animal_Physiology_(Hinic-Frlog)/6%3A_Locomotion/6.4%3A_Muscle_Contraction">muscles are bags of string-like proteins that slide across each other</a> with the help of ATP molecules. The proteins start moving and using ATP after calcium ions released by motor nerve activation open a lock molecule. Each of the sliding proteins generates force on the piconewton scale. To lift your arm, trillions of proteins need to slide in unison.</p><p>The three ways to make your muscles work better are to increase the number of muscle proteins, increase the amount of ATP available, and improve the motor nerve firing.</p><p>Moving a muscle activates it and makes it grab oxygen, sugars and fat from your bloodstream for ATP generation. The more capillaries you've got around the muscle, the more ATP it has available. Activating muscles also drives growth hormone secretion, which makes the muscles grab proteins from the bloodstream for conversion into muscle protein, and stimulates the growth of capillaries. Conversely, inactive muscles go into power saving mode where they don't pull in fuel and proteins, leading to lower metabolic rate and loss of muscle proteins.</p><p>Activating muscles to perform motions also trains your motor nerves to fire at the right time. Motor nerves activate only a part of the available muscle fibrils to contract, cycling through these activation groups over the contraction period. Motor nerves need to be trained through repeated use. Moves also require technique training before you can accomplish them through muscle contractions.</p><p>The ATP used by the proteins comes from four sources. First are the existing reserves inside the muscle. These are used up in about 3 seconds. Then come the muscle's creatine phosphate reserves. 
These are converted to ATP and last for a bit more than 5 seconds. Next up is anaerobic conversion of sugars to ATP, which can be maintained for a minute or two. After that, the muscle switches to aerobic ATP conversion from sugars if available, falling back to fats and even proteins if not.</p><h2 style="text-align: left;">Anaerobics</h2><p>The thing about anaerobic and aerobic conversion is that aerobic conversion of sugars is around 15 times more efficient than anaerobic (30 aerobic ATP per glucose molecule vs 2 anaerobic ATP per glucose molecule). If you operate aerobically, you'll have to do 15 times the work to process the same amount of sugar. In other words, an 800m sprint needs to process as much sugar as a 10k run.</p><p>Sounds wasteful, right? But your body has a trick up its sleeve, called the <a href="https://en.wikipedia.org/wiki/Cori_cycle">Cori cycle</a>. The anaerobic splitting of glucose also yields two molecules of pyruvate that get fermented to lactate. In the Cori cycle, the lactate is taken to the liver and converted back to glucose. However, this conversion uses up 6 ATP molecules, so it's not a perpetual motion sugar-recycling machine. In aerobic ATP generation, the pyruvate is taken through the <a href="https://en.wikipedia.org/wiki/Citric_acid_cycle">Krebs cycle</a> in mitochondria instead.</p><p>So, while on the whole, the Cori cycle loses 4 ATP, the liver's ATP molecules can come from aerobic generation of ATP before or after the anaerobic exercise. If you go through the Cori cycle once before switching to aerobic generation of ATP, you'll generate 26 ATP with the Cori cycle vs 30 ATP when fully aerobic. Less efficient, sure, but not 2 vs 30.</p><p>This creation of glucose from other molecules is called <a href="https://en.wikipedia.org/wiki/Gluconeogenesis">gluconeogenesis</a> and it's also how your body maintains your blood sugar levels during intense exercise, starvation or a low-carb diet. 
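</p><p>The ATP bookkeeping above can be sanity-checked with a toy calculation, using the same rounded figures as the text (2 ATP anaerobic, 30 ATP aerobic, 6 ATP for the liver's lactate-to-glucose conversion):</p>

```python
# Toy check of the ATP accounting, using the rounded figures from the text.
ANAEROBIC_ATP = 2   # glycolysis alone, per glucose
AEROBIC_ATP = 30    # full aerobic processing, per glucose
CORI_COST = 6       # ATP the liver spends turning lactate back into glucose

# The Cori cycle on its own loses 4 ATP per glucose:
print(ANAEROBIC_ATP - CORI_COST)  # -4

# One anaerobic pass, then aerobic processing of the regenerated glucose:
print(ANAEROBIC_ATP - CORI_COST + AEROBIC_ATP)  # 26, vs 30 when fully aerobic

# Sugar processed per unit of work: aerobic is ~15x more efficient.
print(AEROBIC_ATP // ANAEROBIC_ATP)  # 15
```

<p>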
In a way, your body is a factory dedicated to converting stuff to glucose, storing it, transporting it, and creating ATP out of it. Which makes sense on a cellular level: cells are all about breaking down complex molecules for energy and using the energy to build other complex molecules.</p><h2 style="text-align: left;">ATP ADP GTP GDP NAD NADH FAD FADH2 WHAT</h2><p style="text-align: left;">ATP molecules aren't so much created from sugar, as they are "recharged". When a muscle protein uses ATP to do its tiny sliding walk, the ATP is converted into ADP by breaking off one phosphate group. This ADP then gets turned back into ATP by attaching the phosphate group back to it.</p><p style="text-align: left;">ADP stands for adenosine <b>di</b>phosphate, and ATP stands for adenosine <b>tri</b>phosphate, so you can see how the addition of a phosphate turns a diphosphate into a triphosphate and vice versa. And yes, there's also AMP, or adenosine <b>mono</b>phosphate.</p><p style="text-align: left;">Similarly, the coenzymes NAD+ and FAD (nicotinamide adenine dinucleotide and flavin adenine dinucleotide, respectively. Those adenines keep popping up everywhere, huh?) are recharged to NADH and FADH2 molecules in the Krebs cycle.</p><div style="text-align: left;"><p style="text-align: left;">The Krebs cycle can generate either ATP or GTP (guanosine triphosphate). GTP can be converted to ATP by transferring its terminal phosphate to an ADP. This is done by a molecular machine called the <a href="https://en.wikipedia.org/wiki/Nucleoside-diphosphate_kinase">nucleoside-diphosphate kinase</a> that turns the ADP into an ATP and the GTP into a GDP (guanosine diphosphate). This swap can be done in the other direction as well. Muscle cells tend to generate ATP in the Krebs cycle as they use a lot of ATP. 
Other tissues like the liver can generate more GTP, as it's useful in protein synthesis.[<a href="https://bio.libretexts.org/Bookshelves/Microbiology/Book%3A_Microbiology_(Boundless)/5%3A_Microbial_Metabolism/5.06%3A_The_Citric_Acid_(Krebs)_Cycle/5.6A%3A_Citric_Acid_Cycle">Citric Acid Cycle</a>]</p></div><p style="text-align: left;">Right. To wrap up, your muscle cells break down sugar to convert ADP to ATP by adding one phosphate group to it. The ATP is used by the proteins in your muscles to help them climb past each other, which breaks off the phosphate group and creates ADP. The sugar can be processed anaerobically or aerobically. The aerobic processing is an extra phase added after the anaerobic process, and can be delayed for a later date with the help of lactic acid.</p><p style="text-align: left;">This sugar processing cycle is happening right now inside each of the 37 trillion cells of your body. Every move you make requires coordinating trillions of proteins to react in the right way at the right time.</p><div style="text-align: left;"><br /></div>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-8698118437113901622021-08-16T11:30:00.008+03:002021-09-03T12:14:00.865+03:00BatWerk - Intro<p>The goal of the <a href="https://batwerk.com">BatWerk exercise app</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) is to keep you healthy, happy and productive without requiring you to overhaul your life. 
This is a series of blog posts that talks about the different aspects of an exercise app and how we're approaching them in BatWerk (<a href="https://fhtr.blogspot.com/2021/08/batwerk-intro.html">Intro</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-2-on-muscles.html">How do muscles work</a>, <a href="https://fhtr.blogspot.com/2021/08/batwerk-3-pains.html">Pains</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-4-maintaining-routine.html">Maintaining the routine</a>, <a href="https://fhtr.blogspot.com/2021/09/batwerk-5-how-to-play.html">How to play</a>). Interested? <a href="mailto:hei@heichen.hk">How would you like to improve it?</a></p><h2 style="text-align: left;">The Purpose</h2><p>I believe that people can live healthier, happier and more meaningful lives with technological assistance. To wit, I believe _I_ can live a healthier, happier and more meaningful life with tech to help me. There's still some time left to live, and I'd rather spend that time feeling good instead of collapsing into a shambling creaky tangle of pain.</p><p>So I made a small free <a href="https://batwerk.com">exercise app called BatWerk</a> (<a href="https://play.google.com/store/apps/details?id=com.HeichenLtd.BatWerk">Android </a>/ <a href="https://apps.apple.com/ma/app/batwerk/id1548504156">iOS</a>) to keep myself from hurting at the end of the day. It's been working pretty well: back pains are mostly gone and neck pain is less frequent (and I have a script to deal with it). Better mood and sleep too. It's awesome (well, I would say that, wouldn't I?)</p><h2 style="text-align: left;">How does it work?</h2><p>BatWerk challenges you to complete 12 rings over the day. Each ring takes about 2 minutes of easy moves to complete. Easy as in "Lift your arms to your sides ten times"-level of challenge. The moves are randomly picked from a selection of 40 exercises.</p><p>The trick with the rings is that the next ring unlocks on the next half hour. 
To complete all 12, you'll need to move a bit every now and then over six hours.</p><p>So it's a bit different from your run-of-the-mill exercise app. You use it constantly. It uses messaging to boost your mood. It doesn't really try to make you sweat. Its goal is to periodically wake up your main muscle groups and maintain your mobility.</p><p>Obligatory screenshot from a version from 6 months ago.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-78mZJgjwSKM/YRohzobuGvI/AAAAAAAAyBU/5XrycRtv3mgOUUDUMCr_BjJRjpjmtQBtQCLcBGAsYHQ/s2244/Screenshot_20210223_195656_com.HeichenLtd.BatWerk.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="2244" data-original-width="1080" height="640" src="https://1.bp.blogspot.com/-78mZJgjwSKM/YRohzobuGvI/AAAAAAAAyBU/5XrycRtv3mgOUUDUMCr_BjJRjpjmtQBtQCLcBGAsYHQ/w308-h640/Screenshot_20210223_195656_com.HeichenLtd.BatWerk.jpg" width="308" /></a></div><br /><p><br /></p><h2 style="text-align: left;">How did this come about?</h2><p>At first, the app was about doing this 15-minute workout. Just reminding you of the moves and the reps. Then taking you through a quest to the workout dungeon. Warmup on the afternoon fields, first set at the dungeon entrance, second set in the hall of warriors, third set in the volcanic mines, and the fourth set in the evil hall of lava with blood-red fog swirling about. After finishing the workout, you'd fly to the blue skies to do stretches.</p><p>It was good. But too much and too little at the same time. The workout needs a reasonably cool place, workout clothes, and half an hour of buffer for shower and changing clothes. It'll also beat you up pretty good, so you can't really do it several times a day (well, unless you're developing the app and need to test it.) Too hard and time-consuming, and it doesn't keep you moving throughout the day.</p><p>Back to the drawing board. Let's try an infinite sequence of random moves. 
Just open the app and move. Okay, this is easy and quick to do. I could do some of this every day. But, how often should you move? What's a good pattern?</p><p>According to <a href="https://www.amazon.com/Exercised-Something-Evolved-Healthy-Rewarding/dp/1524746983" target="_blank">Exercised</a>, you should move a bit every half hour. And to flush the fast energy reserves in your muscles, you only need 15 seconds of anaerobic exercise. Right. Let's plug these together into a game where you need to move every half hour, doing about four different moves, 15-30 seconds each. Forces a refresh cycle for the muscles involved and doesn't take much time.</p><p>Still, well, you can't move _every_ half an hour. What about the meetings? What about lunch? Not to mention that 2 minutes of exercise every waking 30 minutes is pretty exhausting.</p><p>The current design is a game where you try to collect 12 rings over the day. This way you don't have to move every half hour, but it still drives you to have 6 active hours a day.</p><p>The 12-ring game was working OK for about 8 months, but now I've had some slowdown, hitting only 3-6 rings for a few weeks. Could be just the summer heat making exercise a bad idea in general, but I see it as something that could be fixed in the design of the activity. If the goal of the app is to keep you active every day, it really should keep you active every day, not just on days where you feel like it.</p><p>Still, 3 rings is way better than zero rings. Six minutes of exercise spread across the day. Without the app, I'd be at zero minutes. Still, it creates feelings of inadequacy, even with the non-blaming nature of the app. I did eventually break out of the funk and get a few days of full rings in a row. But I'd like it to rescue you a bit more. 
Get you back on track faster.</p><p><br /></p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-42368896881640873132020-07-16T10:23:00.001+03:002021-08-04T05:13:40.627+03:00Compute shader daemon<div>Riffing on the WebCompute experiment (run GLSL compute shaders across CPUs, GPUs and the network), I'm now thinking of making the GLSL IO runtime capable of hanging around and running shaders on demand. In WebCompute, the Vulkan runner ran a single shader and read work units from the STDIN (the network server was feeding the STDIN from a WebSocket).</div><div><br /></div><div>With GLSL IO, that goal is extended to handle arbitrary new shaders coming at various times. On a high level, you'd send it a shader file, argv and fds for stdin/stderr/stdout through a socket. Then it would create a new compute pipeline, allocate and bind buffers and run the pipeline on a free compute queue. On completing a dispatch, it'd delete the pipeline and free the buffers. This cycle of recreation might be expensive, so it should have a cache for buffers and compute pipelines.</div><div><br /></div><div>The compute shaders could share a global IO processor, or each could have its own IO processor thread. A global IO processor could be tuned to the IO CPU and have better coordination of the IO operations, but could end up with slow IO requests from one shader clogging the pipe (well, hogging the threadpool) for others. This could be worked around with async IO in the IO processor.</div><div><br /></div><div>The other issue is the cooperative multitasking model of compute shaders. If your shader is stuck in an infinite loop on all compute units, other shaders can't run. To remedy this, the GPU driver allows compute shaders to run only a few seconds before it terminates them. On mobiles this can be as low as 2 seconds, on the desktop 15 seconds. 
If a discrete GPU has no display connected to it, it may allow compute shaders to run as long as they want.</div><div><br /></div><div>If your shader needs to run longer than 10 seconds, this is a problem. The usual way around it is to build programs that run a few milliseconds at a time, and are designed to be run several times in succession. With the IO runtime, this sounds painful. An IO request might take longer than 10 seconds to complete. You'd write a program that issues a bunch of async IOs, terminates, polls the IOs on successive runs, does less than 10 seconds of processing on the results (restarting the loop if it's running out of runtime), and finally tells the runtime that it doesn't need to be run again. In the absence of anything better, this is how the first version of long-running shaders will work.</div><div><br /></div><div>The second version would be a more automated version of that. A yield keyword that gets turned into a call to saveRegisters() followed by program exit. On program start, it'd do loadRegisters() and jump to the stored instruction pointer to continue execution. The third version would insert periodic checks for how long the program has been running, and yield if the program's been running longer than the scheduler slice time.</div><div><br /></div><div>Of course, this is only useful on GPU shaders. If you run the shaders on the CPU, the kernel's got you covered. The IO runtime is still useful since high-performance IO doesn't just happen.</div><div><br /></div><div>I think the key learning from writing the GLSL IO runtime has been that IO bandwidth is the only thing that matters for workloads like grep. You can grep on the CPU at 50 GB/s. You can grep on the GPU at 200 GB/s. But if you need to transfer data from the CPU to the GPU, the GPU grep is limited to 11 GB/s. If you do a compress-decompress pipe from CPU to GPU, you can grep at 24 GB/s (if the compression ratio is good enough). 
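</div><div><br /></div><div>Those grep numbers reduce to a min() roofline: effective throughput is whatever the slowest stage of the pipeline delivers, and compression multiplies the effective link bandwidth. A toy model of this (the figures are the rough ones quoted above, and the compression ratio is picked to match the 24 GB/s observation):</div>

```python
# Toy roofline for streaming grep: the slowest pipeline stage wins.
def grep_throughput(compute_gbps, link_gbps=None, compression_ratio=1.0):
    """GB/s of source data processed per second; figures are illustrative."""
    if link_gbps is None:  # data is already where the compute is
        return compute_gbps
    # The link carries compressed bytes; the compute sees them expanded.
    return min(compute_gbps, link_gbps * compression_ratio)

print(grep_throughput(50))                 # CPU on DRAM-resident data: 50
print(grep_throughput(200, link_gbps=11))  # GPU fed over PCIe3: link-bound, 11
print(grep_throughput(200, link_gbps=11, compression_ratio=2.2))  # ~24
```

<div>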
GPUs give you density, but they don't have enough bandwidth to DRAM to really make use of the compute in common tasks.</div><div><br /></div><div>Getting to even 11 GB/s requires doing multithreaded IO since memcpy is limited to 7 GB/s per thread. You need to fetch multiple blocks of data in parallel to get to 30 GB/s. Without the memcpy (just reading), you should be able to reach double the speed.</div><div><br /></div>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-14578706231300412792020-07-06T09:14:00.005+03:002020-07-06T13:47:41.889+03:00GPU IO library design thoughts <p>Thinking about the design of <a href="https://github.com/kig/spirv-wasm/tree/master/string">file.glsl</a>, a file IO library for GLSL.</p>
<p>For a taste, how about calling node.js from a GPU shader:</p>
<pre class="prettyprint">string r, fn = concat("node-", str(ThreadID), ".txt");
awaitIO(runCmd(concat(
"node -e 'fs=require(`fs`); fs.writeFileSync(`",
fn,
"`, Date.now().toString())'"
)));
r = readSync(fn, malloc(16));
println(concat("Node says ", r));
</pre>
<p>There are more examples in the <a href="https://github.com/kig/spirv-wasm/blob/master/string/test_file.glsl">test_file.glsl</a> unit tests.</p>
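<p>Under the hood, the host side needs a servicing loop that drains IO requests coming from the shader and writes the results back. Here's a minimal Python sketch of such a loop (the op names and request format are made up for illustration, not file.glsl's actual protocol):</p>

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical host-side IO servicer: takes (op, arg) requests that a
# shader would have written into a shared buffer and returns the results.
def service_io(requests):
    results = []
    for op, arg in requests:
        if op == "runCmd":   # spawn a subprocess, like runCmd() above
            proc = subprocess.run(arg, shell=True, capture_output=True, text=True)
            results.append(proc.stdout)
        elif op == "read":   # read a file's contents, like readSync() above
            with open(arg) as f:
                results.append(f.read())
        else:
            raise ValueError(f"unknown op: {op}")
    return results

# Round trip analogous to the shader example: a subprocess writes a file,
# then the runtime reads it back for the shader.
fn = os.path.join(tempfile.mkdtemp(), "node-0.txt")
out = service_io([
    ("runCmd", f"{sys.executable} -c \"open('{fn}', 'w').write('hello from the host')\""),
    ("read", fn),
])
print(out[1])  # hello from the host
```

<p>A real runtime would poll the request buffer asynchronously rather than walk a list, but the dispatch shape is the same.</p>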
<h2 class="code-line" data-line-start=0 data-line-end=2 ><a id="Design_of_GPU_IO_0"></a>Design of GPU IO</h2>
<br/>
<h3 class="code-line" data-line-start=3 data-line-end=5 ><a id="Hardware_considerations_3"></a>Hardware considerations</h3>
<ul>
<li class="has-line-data" data-line-start="6" data-line-end="7">GPUs have around a hundred processors, each with a 32-wide SIMD unit.</li>
<li class="has-line-data" data-line-start="7" data-line-end="8">The SIMD unit can execute a 32-thread threadgroup and juggle around ten threadgroups for latency hiding.</li>
<li class="has-line-data" data-line-start="8" data-line-end="9">GPU cacheline is 128 bytes.</li>
<li class="has-line-data" data-line-start="9" data-line-end="10">CPU cacheline is 64 bytes.</li>
<li class="has-line-data" data-line-start="10" data-line-end="11">GPU memory bandwidth is 400 - 1000 GB/s.</li>
<li class="has-line-data" data-line-start="11" data-line-end="12">CPU memory bandwidth is around 50 GB/s.</li>
<li class="has-line-data" data-line-start="12" data-line-end="13">PCIe bandwidth 11-13 GB/s. On PCIe4, 20 GB/s.</li>
<li class="has-line-data" data-line-start="13" data-line-end="14">NVMe flash can do 2.5-10 GB/s on 4-16 channels. PCIe4 could boost to 5-20 GB/s.</li>
<li class="has-line-data" data-line-start="14" data-line-end="15">The CPU can do 30 GB/s memcpy with multiple threads, so it’s possible to keep PCIe4 saturated even with x16 -> x16.</li>
<li class="has-line-data" data-line-start="15" data-line-end="16">GPUdirect access to other PCIe devices is only available on server GPUs. Other GPUs need a roundtrip via CPU.</li>
<li class="has-line-data" data-line-start="16" data-line-end="17">CPU memory accesses require several threads of execution to hit full memory bandwidth (single thread can do ~15 GB/s)</li>
<li class="has-line-data" data-line-start="17" data-line-end="18">DRAM is good at random access at >cacheline chunks with ~3-4x the bandwidth of PCIe3 x16, ~2x PCIe4 x16.</li>
<li class="has-line-data" data-line-start="18" data-line-end="19">Flash SSDs are good at random access at >128kB chunks, perform best with sequential accesses, can deal with high amounts of parallel requests. Writes are converted to log format.</li>
<li class="has-line-data" data-line-start="19" data-line-end="21">Optane is good at random access at small sizes >4kB and low parallelism. The performance of random and sequential accesses is similar.</li>
</ul>
<ul>
<li class="has-line-data" data-line-start="21" data-line-end="22">=> Large reads to flash should be executed in sequence (could be done by prefetching the entire file to page cache and only serving requests once the prefetcher has passed them)</li>
<li class="has-line-data" data-line-start="22" data-line-end="23">=> Small scattered reads should be dispatched in parallel (if IO rate < prefetch speed, just prefetch the whole file)</li>
<li class="has-line-data" data-line-start="23" data-line-end="24">=> Writes can be dispatched in parallel with more freedom, especially without fsync. Sequential and/or large block size writes will perform better on flash.</li>
<li class="has-line-data" data-line-start="24" data-line-end="25">=> Doing 128 small IO requests in parallel may perform better than 16 parallel requests.</li>
<li class="has-line-data" data-line-start="25" data-line-end="26">=> IOs to page cache should be done in parallel and ASAP.</li>
<li class="has-line-data" data-line-start="26" data-line-end="27">=> Caching data into GPU RAM is important for performance.</li>
<li class="has-line-data" data-line-start="27" data-line-end="28">=> Programs that execute faster than the PCIe bus should be run on the CPU if the GPU doesn’t have the data in cache.</li>
<li class="has-line-data" data-line-start="28" data-line-end="30">=> Fujitsu A64FX-type designs with lots of CPU cores with wide vector units and high bandwidth memory are awesome. No data juggling, no execution environment weirdness.</li>
</ul>
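<p>The scattered-read rule of thumb above can be sketched on the host side (Python as host-side pseudocode; the thread-pool depth and request format are my own illustration, not the library's API):</p>

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scattered_reads(path, requests, parallelism=128):
    """Dispatch many small (offset, length) reads with a deep queue.

    Flash performs best with lots of requests in flight, so a deep
    pool (128 workers) tends to beat a shallow one (16) for small
    scattered reads; os.pread keeps the shared fd free of seek state.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            return list(pool.map(lambda r: os.pread(fd, r[1], r[0]), requests))
    finally:
        os.close(fd)
```

<p>If the request rate is slower than the prefetcher, it's simpler to prefetch the whole file into page cache and serve everything from there.</p>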
<h3 class="code-line" data-line-start=30 data-line-end=32 ><a id="Software_30"></a>Software</h3>
<p class="has-line-data" data-line-start="33" data-line-end="36">The IO queue works by using spinlocks on both the CPU and GPU sides.<br>
The fewer IO requests you make, the less time you spend spinning.<br>
Sending data between CPU and GPU works best in large chunks.</p>
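<p>A toy model of that queue (Python threads standing in for the two sides; the slot layout and status codes are invented for illustration, not the file.glsl wire format):</p>

```python
import threading
import time

# One message slot shared by both sides, like a cacheline in the IO buffer.
# status: 0 = empty, 1 = request posted, 2 = result ready.
slot = {"status": 0, "request": None, "result": None}

def io_service():
    """CPU side: spin until a request appears, serve it, publish the result."""
    while True:
        while slot["status"] != 1:
            time.sleep(1e-5)  # small backoff so the polling doesn't hog the bus
        slot["result"] = slot["request"].upper()  # stand-in for the actual IO work
        slot["status"] = 2

def submit(request):
    """GPU side: post a request, then spin until the result is ready."""
    slot["request"] = request
    slot["status"] = 1
    while slot["status"] != 2:
        time.sleep(1e-6)
    result = slot["result"]
    slot["status"] = 0  # release the slot for the next request
    return result

threading.Thread(target=io_service, daemon=True).start()
```

<p>Batching several IOs into one slot write is what keeps the spinning cheap: fewer messages, less polling on both sides.</p>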
<p class="has-line-data" data-line-start="37" data-line-end="40">To avoid issues with cacheline clashes, align messages on GPU cacheline size.<br>
IO request spinlocks that read across the PCIe bus should have small delays between checks to avoid hogging the PCIe bus.<br>
Workgroups (especially subgroups) should bundle their IOs into a single scatter/gather.</p>
<p class="has-line-data" data-line-start="41" data-line-end="43">When working with opened files, reads and writes should be done with pread/pwrite. Sharing a FILE* across threads isn’t a great idea.<br>
The cost of opening and closing files with every IO is eclipsed by transfer speeds with large (1 MB) block sizes.</p>
<p class="has-line-data" data-line-start="44" data-line-end="47">The IO library should be designed for big instructions with minimal roundtrips.<br>
E.g. directory listings should send the entire file list with file stats, and there should be a recursive version to transfer entire hierarchies.<br>
Think more shell utilities than syscalls. Use CPU as IO processor that can do local data processing without involving the GPU.</p>
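<p>For example, a directory listing in this style returns the whole tree with stats in one reply (a host-side sketch; the reply tuple format is made up):</p>

```python
import os

def list_tree(root):
    """One request, one big reply: every entry under root with its stat,
    so the GPU doesn't pay a roundtrip per directory level."""
    reply = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            reply.append((os.path.relpath(path, root), st.st_size, st.st_mode))
    return reply
```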
<p class="has-line-data" data-line-start="48" data-line-end="50">Workgroup concurrency can be used to run the same code on CPU and GPU in parallel. This extends to multi-GPU and multi-node quite naturally.<br>
The IO queue could be used to exchange data between running workgroups.</p>
<p class="has-line-data" data-line-start="51" data-line-end="57">Limited amount of memory that can be shared between CPU and GPU (I start seeing issues with > 64 MB allocations).<br>
Having a small IO heap for each thread or even threadgroup, while easy to parallelize, limits IO sizes severely.<br>
32 MB transfer buffer, 32k threads -> 1k max IO per thread, or 32k per 32-wide subgroup.<br>
Preferable to do 1+ MB IOs.<br>
Design with a concurrently running IO manager program that processes IO transfers?<br>
The CPU could also manage this by issuing copyBuffer calls to move data.</p>
<p class="has-line-data" data-line-start="58" data-line-end="60">Workgroups submit tasks in sync -> readv / writev approach is beneficial for sequential reads/writes.<br>
Readv/writev are internally single-threaded, so probably limited by memcpy to 6-8 GB/s.</p>
<p class="has-line-data" data-line-start="61" data-line-end="63">Ordering of writes across workgroups requires a way to sequence IOs (either reduce to order on the GPU or reassemble correct order on the CPU.)<br>
IOs could have sequence ids.</p>
<p class="has-line-data" data-line-start="64" data-line-end="65">Compression of data on the PCIe bus could help. 32 * zstd --format=lz4 --fast -T1 file -o /dev/null goes at 38 GB/s.</p>
<p><b>[Update]</b> Did a quick test with libzstd (create compressor context for each 1 MB read, compress, send data to GPU) with mixed results. Getting 3.4 GB/s throughput with Zstd. Which is good for zero-effort, but I should really be using liblz4. Running 32 instances of zstd --fast in parallel got 6.4 GB/s, with --format=lz4 16 GB/s.</p>
<p><b>[Update 2]</b> Libzstd with fast strategy and compression level -9 can do 12.7 GB/s grep.glsl throughput. So if I had a GPU-side decompressor, I might see a performance benefit vs raw data (11.4 GB/s). On already-compressed files, there's a 10-15% perf penalty vs raw.</p>
<p><b>[Update 3]</b> LZ4 with streaming block compressor and compression level 9 reaches 17.5 GB/s grep.glsl throughput on the above file, 22.5 GB/s on a 13 GB kern.log file (lots of similar errors). About 5% perf penalty on compressed files. Feels like a GPU decompressor could actually be a good idea.</p>
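<p>The per-chunk scheme in these updates looks roughly like this (zlib at level 1 is a stdlib stand-in here, since Python ships no LZ4; real code would use an LZ4 block codec and decompress on the GPU):</p>

```python
import os
import zlib

CHUNK = 1 << 20  # 1 MB, matching the transfer buffer size

def pack_chunks(data):
    """Compress each chunk; ship it raw when compression doesn't pay,
    which caps the penalty on already-compressed files."""
    out = []
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        packed = zlib.compress(chunk, 1)
        out.append((True, packed) if len(packed) < len(chunk) else (False, chunk))
    return out

def unpack_chunks(chunks):
    return b"".join(zlib.decompress(c) if compressed else c for compressed, c in chunks)
```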
<p class="has-line-data" data-line-start="66" data-line-end="71">Caching file data on the GPU is important for performance, 40x higher bandwidth than CPU page cache over PCIe.<br>
Without GPU-side caching, you’ll likely get better perf on the CPU on bandwidth-limited tasks (>50 GB/s throughput.)<br>
In those tasks, using memory bandwidth to send data to GPU wouldn’t help any, best you could achieve is zero slowdown.<br>
(Memory bandwidth 50 GB/s. CPU processing speed 50 GB/s. Use 10 GB/s of bandwidth to send data to GPU =><br>
CPU has only 40 GB/s bandwidth left, GPU can do 10 GB/s => CPU+GPU processing speed 50 GB/s.)</p>
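<p>That parenthetical as a toy accounting model (GB/s units; my own helper, nothing from the library):</p>

```python
def combined_throughput(mem_bw, cpu_speed, pcie_bw):
    """Feeding the GPU over PCIe eats CPU memory bandwidth, so offloading
    a bandwidth-limited task can't beat the CPU's own memory-bound rate."""
    gpu_feed = min(pcie_bw, mem_bw)               # bandwidth spent shipping data out
    cpu_rate = min(cpu_speed, mem_bw - gpu_feed)  # what's left for local processing
    return cpu_rate + gpu_feed
```

<p>Offloading only wins when the CPU is compute-bound rather than bandwidth-bound.</p>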
<h3 class="code-line" data-line-start=73 data-line-end=75 ><a id="Benchmark_suite_73"></a>Benchmark suite</h3>
<ul>
<li class="has-line-data" data-line-start="76" data-line-end="77">Different block sizes</li>
<li class="has-line-data" data-line-start="77" data-line-end="83">Different access patterns (sequential, random)
<ul>
<li class="has-line-data" data-line-start="78" data-line-end="79">Scatter writes</li>
<li class="has-line-data" data-line-start="79" data-line-end="80">Sequential writes</li>
<li class="has-line-data" data-line-start="80" data-line-end="81">Gather reads</li>
<li class="has-line-data" data-line-start="81" data-line-end="82">Sequential reads</li>
<li class="has-line-data" data-line-start="82" data-line-end="83">Combined reads & writes</li>
</ul>
</li>
<li class="has-line-data" data-line-start="83" data-line-end="88">Different levels of parallelism
<ul>
<li class="has-line-data" data-line-start="84" data-line-end="85">1 IO per thread group</li>
<li class="has-line-data" data-line-start="85" data-line-end="86">each thread does its own IO</li>
<li class="has-line-data" data-line-start="86" data-line-end="87">1 IO on ThreadID 0</li>
<li class="has-line-data" data-line-start="87" data-line-end="88">IOs across all invocation</li>
</ul>
</li>
<li class="has-line-data" data-line-start="88" data-line-end="89">Compression</li>
<li class="has-line-data" data-line-start="89" data-line-end="90">From hot cache on CPU</li>
<li class="has-line-data" data-line-start="90" data-line-end="91">From cold cache</li>
<li class="has-line-data" data-line-start="91" data-line-end="92">With GPU-side cache</li>
<li class="has-line-data" data-line-start="92" data-line-end="93">Repeated access to same file</li>
<li class="has-line-data" data-line-start="93" data-line-end="95">Access to multiple files</li>
</ul>
<p class="has-line-data" data-line-start="95" data-line-end="96">Does it help to combine reads & writes into sequential blocks on CPU-side when possible, or is it faster to do IOs ASAP?</p>
<p class="has-line-data" data-line-start="97" data-line-end="98">Caching file descriptors, helps or not?</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-54774816557040243622020-06-29T14:00:00.006+03:002020-07-06T05:46:18.348+03:00grep.glsl<p>grep.glsl is sort of working. It's a GLSL compute shader version of grep. A very simple one at that. It tests a string against a file's contents and prints out the byte offsets where the string was found.</p>
<p>The awesome part of this is that the shader is running the IO. And it performs reasonably well after tuning. You could imagine a graphics shader dynamically loading geometry and textures when it needs them, then poll for load completion in following frames.</p>
<p>Here are the first few lines of grep.glsl:
</p><pre class="prettyprint"><code> string filename = aGet(argv, 2);
string pattern = aGet(argv, 1);
if (ThreadLocalID == 0) done = 0;
if (ThreadID == 0) {
FREE(
println(concat("Searching for pattern ", pattern));
println(concat("In file ", filename));
)
setReturnValue(1);
}
</code></pre>
Not your run-of-the-mill shader, eh?
<p></p>
<p>This is the file reading part:
</p><pre class="prettyprint"><code> while (done == 0) {
FREE(FREE_IO(
barrier(); memoryBarrier();
// Read the file segment for the workgroup.
if (ThreadLocalID == 0) {
wgBuf = readSync(filename, wgOff, wgBufSize, string(wgHeapStart, wgHeapStart + wgBufSize));
if (strLen(wgBuf) != wgBufSize) {
atomicAdd(done, strLen(wgBuf) == 0 ? 2 : 1);
}
}
barrier(); memoryBarrier();
if (done == 2) break; // Got an empty read.
// Get this thread's slice of the workGroup buffer
string buf = string(
min(wgBuf.y, wgBuf.x + ThreadLocalID * blockSize),
min(wgBuf.y, wgBuf.x + (ThreadLocalID+1) * blockSize + patternLength)
);
// Iterate through the buffer slice and add found byte offsets to the search results.
int start = startp;
i32heapPtr = startp;
for (int i = 0; i < blockSize; i++) {
int idx = buf.x + i;
if (startsWith(string(idx, buf.y), pattern)) {
i32heap[i32heapPtr++] = int32_t(i);
found = true;
}
}
int end = i32heapPtr;
...
</code></pre>
<p></p>
<p>Performance is complicated. Vulkan compute shaders have a huge 200 ms startup cost and a 160 ms cleanup cost. About 60 ms of that is creating the compute pipeline, the rest is instance and device creation.</p>
<p>Once you get the shader running, performance continues to be complicated. The main bottleneck is IO, as you might imagine. The file.glsl IO implementation uses a device-local host-visible volatile buffer to communicate between the CPU and GPU. The GPU tells the CPU that it has IO work for it by writing into the buffer, using atomics to prevent several GPU threads from writing into the same request. The CPU spins waiting for new requests to appear, then grabs and processes them. After processing a request, the CPU writes the results to the buffer. The GPU spins waiting for the IO completion, then copies the IO results to its device-local heap buffer.</p>
<p>The GPU-GPU copies execute reasonably fast (200+ GB/s), but waiting for the IO drops the throughput to around 6 GB/s. This is way better than the 1.6 GB/s it used to be, when it was using int8_t for IO and ~200 kB transfers. Now the transfer buffer is 1 MB and the IO copies are done with i64vec4.</p>
<p>Once you have the data on the GPU, performance continues to be complicated. Iterating through the data one byte at a time goes at roughly 30 GB/s. If the buffer is host-visible, the speed drops to 10 GB/s. Searching a device-local buffer 32 bytes at a time using an i64vec4 goes at 220 GB/s.</p>
<p>The CPU-GPU transfer buffer has a max size of 256 MB. The design of file.glsl causes issues here, since it slices the transfer buffer across shader instances, making it difficult to do full-buffer transfers (and minimize IO spinning). Now grep.glsl does transfers in per-workgroup slices, where 100 workgroups each do a 1 MB read from the searched file, then distribute the search work across 255 workers per workgroup.</p>
<p>This achieves 6 GB/s shader throughput. The PCIe bus is capable of 12 GB/s. If you remove the IO waits and just do buffer copies, the shader runs at 130 GB/s. Taking that into account, the shader PCIe transfers should be happening at 6.3 GB/s. Removing the buffer copies had a very minimal effect on throughput, so doing the search directly on the IO buffer shouldn't improve performance by much.</p>
<h2>Thoughts</h2>
<p>Why are the transfers not hitting the PCIe bus limits? Suspiciously hovering at around half of PCIe bandwidth too. Could it be that the device-local buffer is slow to write to from the CPU side? Previously I had host-cached memory for the buffer, which lets you do flushes and invalidations to (hopefully) transfer in more efficient chunks. The reason that I'm not using a host-cached memory buffer is that GLSL volatile doesn't work for host-side memory: the GPU seems to cache fetched memory into GPU RAM and volatile only bypasses the GPU L1/L2 caches, so you'll never see CPU writes that landed after your first read. And there doesn't seem to be a buffer type that's host-cached device-local.</p>
<p>Maybe have a CPU-side buffer for IO, and a transfer queue to submit copies to GPU memory. This should land the data into GPU RAM and volatile should work. Or do the fread to a separate buffer first, then memcpy it to the GPU buffer.</p>
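<p>The staging idea sketched in Python (the bytearray stands in for the mapped Vulkan buffer; only the two-step read-then-copy structure is the point):</p>

```python
import os

CHUNK = 1 << 20  # 1 MB staging buffer

def staged_read(path, gpu_buf):
    """Read into a CPU-side staging buffer first, then do one big copy into
    the 'GPU' buffer, so the mapped memory only sees large sequential writes."""
    staging = bytearray(CHUNK)
    view = memoryview(gpu_buf)
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            n = os.preadv(fd, [staging], total)  # fill the staging buffer in place
            if n == 0:
                break
            view[total:total + n] = staging[:n]  # the memcpy into "GPU" memory
            total += n
    finally:
        os.close(fd)
    return total
```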
<p>Vulkan's startup latency makes the current big binary approach bad for implementing short-lived shell commands. The anemic PCIe bandwidth makes compute-poor programs starve for data. GNU grep runs at 4 GB/s, but you can run 64 instances and achieve 50 GB/s [from page cache]. This isn't possible with grep.glsl. All you have is 12 GB/s.</p>
<p>Suppose this: you've got a long-running daemon. You send it SPIR-V or GLSL. It runs them on the GPU. It also maintains a GPU-side page cache for file I/O. Now your cold-cache data would still run at roughly the speed of storage (6 GB/s isn't _that_ bad.) But a cached file, a cached file would fly. 400 GB/s and more.</p>
<p>Creating the compute pipeline takes around 50 ms. Caching binary pipelines for programs would give you faster startup times.</p>
<p>The GPU driver terminates running programs after around 15 seconds. And they hang the graphics while running. Where's our pre-emptive multitasking?</p>
<p>This should really be running on a CPU because of the PCIe bottleneck and the Vulkan startup bottleneck. Going to try compiling it to ISPC or C++ and see how it goes.</p>
<p><b>[Update]</b> Tested memcpy performance. Single thread CPU-CPU, about 8 GB/s. With 8 threads: 31 GB/s. Copying to the device-local host-visible buffer: 8 GB/s with one thread. With 4 threads, 10.2 GB/s. Cuda bandwidthTest can do 13.1 GB/s with pinned memory and 11.5 GB/s with pageable memory. The file.glsl IO transfer system doesn't seem to be all that slow, it just needs multi-threaded copies on the CPU side. </p>
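<p>The multi-threaded copy, sketched (memoryview slices stand in for per-thread memcpy ranges; in CPython the GIL limits the actual speedup, the slicing scheme is the point):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_copy(dst, src, workers=8):
    """Split one big copy into per-thread slices, like a threaded memcpy."""
    n = len(src)
    step = (n + workers - 1) // workers
    dview, sview = memoryview(dst), memoryview(src)

    def copy_slice(i):
        lo, hi = i * step, min(n, (i + 1) * step)
        dview[lo:hi] = sview[lo:hi]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_slice, range(workers)))
```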
<p>Built a simple threaded IO system that spawns a thread for every read IO up to a maximum of 16 threads in flight. Grep.glsl throughput: 10 GB/s. Hooray!</p>
<p><b>[Update 2]</b> Lots of tweaking, debugging and banging-head-to-the-wall later, we're at 11.45 GB/s.</p>
<h2>file.glsl</h2><p><i>Ilmari Heikkinen, 2020-06-25</i></p><p>Current design: Each shader invocation has an 8k heap slice and an 8k IO heap slice. There's an IO request buffer. The GPU writes IOs to the request buffer, the CPU picks them up and writes the result to the IO heap. The GPU copies the result from the IO heap to the main heap.
The IO heap and the IO request buffer are marked volatile. The main heap isn't, so it can benefit from caches.</p>
<div>
<br /></div>
<div>
Now trying to handle argv nicely. Allocate an extra slice in the heap and copy the argv there before shader start. </div>
<div>
<br /></div>
<div>
This could also be used to store string literals. Now string literals are malloc'd and filled in each invocation at shader start, which is a total waste of time. But because the string lib has funcs that do in-place modification, this avoids one class of errors. Switching to immutable strings in the lib is an enticing option.</div>
<div>
<br /></div>
<div>
Memory allocation is done with a simple bump malloc. Freeing heap memory after use is a hassle. I have macros FREE() and FREE_IO() that free whatever heap / IO heap allocations you did inside the call. <br />
<br /></div>
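<p>The bump malloc and the FREE() scoping, as a Python stand-in (the GLSL version hands out (start, end) heap offsets the same way; the class and method names are mine):</p>

```python
from contextlib import contextmanager

class BumpHeap:
    """Bump allocator: allocation is just a pointer increment, and there is
    no per-allocation free. A scope saves the pointer on entry and rolls it
    back on exit, which is what the FREE()/FREE_IO() macros do."""
    def __init__(self, size):
        self.size = size
        self.ptr = 0

    def alloc(self, n):
        if self.ptr + n > self.size:
            raise MemoryError("heap slice exhausted")
        start = self.ptr
        self.ptr += n
        return (start, self.ptr)  # (start, end), like the string type

    @contextmanager
    def scope(self):
        saved = self.ptr
        try:
            yield
        finally:
            self.ptr = saved  # everything allocated inside the scope is gone
```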
<div>
It might be nice to have a global heap with a proper malloc to store large allocations that are re-used across threads. E.g. for loading texture data. This would probably have to be a fixed size buffer. I doubt that it's possible to allocate more buffer memory on the CPU side while the shader is running. Would be nice though!<br />
<div>
<br /></div>
</div>
<div>
Wrote a sketch of grep.glsl. Very educational, as it exposed a bunch of missing features and "this'd be nice to have"-things. Handling argv and program return value fall in the first category. Having helpers for reductions across all invocations and sorted-order IO fall in the second category. </div>
<div>
<br /></div>
<div>
The current grep.glsl sketch is: each thread does a read for a 4kB+strLen(pattern) chunk, then runs indexOf(chunk, pattern) to find all occurrences of pattern in it. The threads then iterate through all invocationIDs in order (with a barrier() call to keep them in lockstep) and when the thread id matches the current id, the thread prints out its results. This should keep the workgroup threads in order, but the workgroups might still be in different order. Then the threads or-reduce whether they found any hits or not to set the program return value to 0 or 1. Advance read offset by total thread count times 4kB, repeat until a thread reads less than 4kB (EOF).</div>
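<p>The same sketch on the CPU pins down the chunking and the overlap logic (plain Python, one "thread"; only the boundary handling is the point):</p>

```python
def grep_offsets(path, pattern, chunk_size=4096):
    """Scan a file in chunk_size pieces, each read extended by len(pattern)
    bytes so matches straddling a chunk boundary aren't lost."""
    hits = []
    overlap = len(pattern)
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            buf = f.read(chunk_size + overlap)
            i = buf.find(pattern)
            while i != -1 and i < chunk_size:  # hits past chunk_size belong to the next chunk
                hits.append(offset + i)
                i = buf.find(pattern, i + 1)
            if len(buf) < chunk_size + overlap:  # short read: EOF
                break
            offset += chunk_size
    return hits
```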
<div>
<br /></div>
<div>
This makes a bunch of design issues apparent. The IOs should be run concurrently. Copying IO bytes to heap has limited value here, so having a way to run string functions directly on the IO heap would be nice. The read could be done with a single IO instead of threadCount IOs. Overlapping IO and execution: could issue two IOs, wait for the first, issue third IO, process first IO, wait for the second, issue fourth IO, process second IO, ... How to do line numbers. Reads should have an EOF flag. Can we hit 12 GB/s for files in page cache? Can we cache files in GPU RAM and match more patterns later at 400 GB/s?<br />
<br /></div>
<div>
<br /></div>
<h2>Real-time path tracing</h2><p><i>Ilmari Heikkinen, 2020-06-16</i></p><div>[WIP - I'll update this over time]</div><div><br /></div>The key problem in real-time path tracing is that you have enough compute to do a couple of paths per pixel, but the quality you want requires ten thousand paths per pixel.<div><br /></div><div>There are three ways to solve this correctly. One is to wait a few decades for computers to get 10 000 times faster. One is to use 10 000 computers to do the rendering. And one is to make paths 10 000 times cheaper.</div><div><br /></div><div>Or you could render at a lower resolution and frame rate. If you can do 1 sample per pixel at 4k resolution at 60 Hz, then you can do 10k samples per pixel at 30 Hz at 60x25 resolution. Yes, you can have real-time movie quality graphics on a thumbnail-sized display. If you want to hit C64 resolutions like 320x200, you could get 3-4 cryptomining PCs with 10+ GPUs each. For a paltry $15000 you can have the C64 of your dreams!</div><div><br /></div><div>But, well, what if you want to do high-resolution rendering at high frame rates? Without the budget for a massive cluster, and without having to find The Secret of Monte Carlo which would enable incredibly fast and accurate integration of the lighting equation.</div><div><br /></div><div>There are ways. You could put together four techniques that would each give you a 10x perceptual sample count boost. Or you could put together 14 techniques with a 2x perceptual sample count boost. Or perhaps a mix of the two. Then there are optimizations and mathematical techniques to improve the performance of the Monte Carlo integration used in path tracing. Even ways to arrive at better approximations of the light field in the same time. 
What's important is picking a set of techniques that compound.</div><div><br /></div><div>Monte Carlo integration is a process where you integrate a function by calling it with random values, summing up the results, and dividing the sum by the number of results. The nice thing about MC integration is that it will eventually converge to the correct solution, no matter what initial value and weight you start with.</div><div><br /></div><div>Images generated with MC integration start off noisy since the initial samples of the function have a high weight and neighboring pixels often end up with quite different values at any given sample due to us picking random values on the function. This noise gets smoothed out as the number of samples increases and the pixel neighborhood ends up sampling a similar hemisphere.</div><div><br /></div><div>The relationship between noise and sample count is roughly that noise falls off as the inverse square root of the sample count. Quadruple the sample count and the noise halves. At 10k samples, you halve the noise about 6.6 times. If you've got an 8-bit range, your noise level should be about 1.4 bits or about +-1.3 units.</div><div><br /></div><div>If you want to get rid of noise in the early stages of the integration, you have to downweigh the initial samples vs the neighborhood. The easiest way is to blur the image. Another way is to pick a non-noisy starting point and give it a high weight, so that the new MC samples don't cause as much noise. Yet another way is to correlate the sampling of the neighboring pixels so that they sample the same regions - this will trade high frequency noise for low frequency noise. </div><div><br /></div><div>E.g. use a sampling strategy for a 64x64 region where you first take 100 samples of the hemisphere at the center of the region, keep the top-5 highest energy samples, weigh them down by their probability, and use shadow rays to connect the 64x64 region to only the high-energy values. 
You'll get no pixel-to-pixel noise in the 64x64 region, but you might end up with a blotchy image when transitioning from one region to another. </div><div><br /></div><div>Fireflies are rare high-energy samples that jump a pixel value much higher than the neighborhood. Fireflies are actually good for the energy search part of the integration, you can use the firefly path as a guiding star for the neighborhood. But they cause bright noise that takes a large number of samples to tone down. One way to deal with fireflies is to clamp them. This way you'll still get bright pixels but they won't slow convergence towards darkness. The issue with clamping is that fireflies can be an important contributor of energy in the neighborhood. If a firefly boosts a pixel by 10 units and the path occurs only on every 100th sample, the firefly's contribution should be 0.1 units. If you clamp the firefly to 1, it'll contribute only 0.01 units, and the pixel ends up too dark.</div><div><br /></div><div>A better way might be to estimate the probability of the firefly path based on the neighborhood samples. If only every 100th pixel has a firefly of 10 units, scale the fireflies down to 0.1 units and add 0.1 to the entire neighborhood. As the sample count increases, you can start passing the fireflies straight through and converge to an unbiased result.</div><div><br /></div><div>Temporal reprojection lets you use samples computed in previous frames as the starting point of the integration in the current frame. Adaptive sampling allows you to focus your ray budget to the areas of the scene that need it the most. Sparse sampling lets you skip some pixels completely.</div><div><br /></div><div>Multiple importance sampling guides your integrator towards high-energy regions of the scene. Bidirectional path tracing tries to connect high-energy paths with camera paths. Photon mapping caches high-energy paths in the scene. 
Vertex connection and merging (VCM) combines BDPT and photon mapping.</div><div><br /></div><div>Path reuse techniques like bidirectional path tracing can give you additional paths at the expense of a single shadow ray and BSDF evaluation. You generate a camera path and then connect the vertices on the camera path to generated path suffixes. In simple BDPT, you'd generate a camera path and a light path, then connect each vertex of the camera path to each vertex of the light path. If you have three connecting vertices in the camera path and three in the light path, you'd get ten camera paths at the expense of one camera path, one light path, and nine shadow rays. If you reuse light paths across multiple pixels, you can amortize the light path cost. In static scenes, you could also reuse camera paths. This way you might be able to generate decent paths with a single shadow ray per path.</div><div><br /></div><div>For outdoor scenes, you'll start seeing diminishing returns after five bounces, so path reuse techniques for easy light paths might be limited to around a 5x speed boost. In indoor scenes with longer paths, path reuse becomes more useful. If you're also dealing with difficult lighting conditions, the generated high-energy path suffixes are very helpful.</div><div><br /></div><div>Temporal reprojection can allow you to start integration at a high effective sample count, based on integration results from previous frames. It is most helpful with diffuse surfaces, where camera angle and position don't change the integration result. On glossy and reflective surfaces and refractive volumes you'll see significant ghosting.</div><div><br /></div><div>Adaptive sampling is quite crucial for fast renders at low sample counts. Not all parts of the scene require the same number of samples to converge. If you can skip low-noise, low-contrast regions, you can allocate more samples to noisy regions and high-contrast regions. 
You can borrow some tricks from artists and blur out dark regions of the image since the eye is going to skip them anyhow and seek out high-contrast regions and bright areas. Throwing extra samples to determine if a shadow region has a value of 0.01 or 0.02 is a waste of time, especially in animated scenes. You can use foveated rendering tricks and spend your ray budget where the viewer is looking at. If you don't have eye tracking, you can make educated guesses (center of the screen, anything moving, faces, silhouettes.)</div><div><br /></div><div>Sparse sampling is the extreme cousin of adaptive sampling. With sparse sampling you can completely skip rendering some pixels, filling them from the temporal reprojection buffer and the nearest sampled pixels. Necessary for hitting a target framerate. Handy for foveal rendering. If you have sparse sampling, you can run your interactive scene at a locked 60 FPS.</div><div><br /></div><div>Combining the above techniques gives you some nice synergies. Adaptive sampling can focus on parts of the scene that are lacking previous frame data. Sparse sampling can be used to generate a low-resolution render with a high sample count, to be used as the start for integration and as a low-res variance estimate to guide the adaptive sampler. Temporal reprojection could pull samples out of the path reuse cache.</div><div><br /></div><div>The path reuse technique can prioritize path variants at low bounce counts (say, 10 variants at bounce 1, 2 variants at bounce 2, 1 at bounce 3 -- this'd get you a good sampling of the first bounce hemisphere). The path caches can be used to estimate energy at different parts of the scene and the probability of connecting two parts of the scene, this could be used for MIS pdfs and to guide paths towards high energy regions with high probability of getting there. 
</div><div><br /></div><div>Denoisers work by looking at the variance of a region and the level of convergence (how much each new sample changes the value at the pixel, and how close the pixel's value is to its neighbors.) If the convergence is low and the variance is high, the pixel's light estimate is likely off, and the region's light estimate likewise. A denoiser can pool the samples in the neighborhood and distribute the energy so that the variation between samples is low. That is, blur the region.</div><div><br /></div><div>By looking at geometry and texture changes, the denoiser can avoid denoising high-contrast regions that should be high-contrast (e.g. you've got a sharp edge, so the normals on both sides of the edge are very different => denoiser can avoid blurring across the edge). Textures have high-frequency signal that also should be there. Denoising without the texture applied and then applying the texture on top can give you a smooth lighting solution without blurring the texture.</div><div><br /></div><div>Denoising can give you overly smooth results. You could make a denoiser to bring down noise to a wanted level but not all the way down. Keep some of the noise, don't blur out natural noise. Use a variable amount of denoising depending on the region variance, convergence and sample count.</div><div><br /></div><div>Noise in rendering is tricky. Let's say you're using blue noise to get a nicer sampling pattern. And you render an animation with a moving camera. If you fix the noise to the screen, it will look like the screen is a bit dirty. If you fix the noise to world coordinates, it will look like your object is dirty. If you animate the noise, it will look like your renderer is broken.</div><div><br /></div><div>At one 5-ray path per pixel at 4k, you've got 40 million rays worth of scene connectivity data in every frame. 
Figuring out ways to make use of these rays for MC integration might well give you a high-quality render at a low cost.</div><div><br /></div><div>Acceleration structures make ray-scene intersection cheaper to evaluate. They have a build cost. For dynamic scenes where you have to update or rebuild the acceleration structure every frame, it's important to use one that's fast to build. For static scenes, you can go with a more expensive acceleration structure that makes ray intersection faster.</div><div><br /></div><div>Acceleration structures map a 5D ray (origin, direction) to primitives intersected by the ray. Usually this is done either with a tree that splits the space into hierarchical regions, or with a tree that splits the scene into bounding boxes of primitives. The difficulty, as you might imagine, is that a 3D tree doesn't give you exact matches for a 5D ray. Instead you end up with a list of tree nodes to test against the ray. If the tree doesn't have an ordering property (i.e. "if the ray hits something in this node, it's the nearest object"), you might even need to do several intersections to find the closest one.</div><div><br /></div><div>The trees do have directionality: for a BVH, sorting the primitives recursively along different coordinate axes creates lists ordered along a dimension. Once you know that you're dealing with an ordered list, you can do an early exit if your hit or exit is before the start of the next untested element. </div><div><br /></div><div>Grids are nice in that they're directional, so you can do an early exit on first hit. And you've got a fixed max number of steps you need to take to step through a grid.</div><div><br /></div><div>What you'd really like to have is a proper 5D lookup structure where you can directly find the next hit for a ray. This is difficult. </div><div><br /></div><div>Ray classification (Arvo & Kirk 87) assigns objects into a lazily-constructed 5D grid. 
On each bounce, you assign your rays into 5D grid cells based on their origin and direction. Then for each grid cell, find all objects that match with it (a 5D grid cell is basically a frustum, so you'd do frustum-to-BVH checks). Then trace your rays, intersecting only against the objects inside the ray's grid cell.</div><div><br /></div><div>You could split the scene into a hierarchical tree of beams that cover the direction space, then put the primitives inside each beam into a tree sorted along the beam direction. The good part about this approach is that you can find the beam for a ray fast, then jump to tree position, and there's your intersect. The bad part is that it uses memory like crazy since you're storing references to the entire scene N times where N is the number of beam directions you have.</div><div><br /></div><div>You'll start hitting diminishing returns pretty soon. BVH intersection takes a few dozen steps and half a dozen intersections. If you're slower than that, just use a BVH. If you can optimize your structure to half a dozen steps and a single intersection, you'll be 5-6x faster than a BVH. Nothing to scoff at but it won't get you from 1 spp to 10k spp alone.</div><div><br /></div><div>Let's look at memory bandwidth. Doing 10k paths of length 5, you'd need to read 50k triangles per pixel. One triangle is nine floats, or 36 bytes. That'd add up to 1.8 MB of triangle reads per pixel. A GPU with 1 TB/s memory bandwidth could process half a million pixels per second. Or about 100x100 pixels per frame. And then you need to do the compute and fetch the materials and textures.</div><div><br /></div><div>If you want to render 10k paths per pixel at 4k@60 with 1 TB/s memory bandwidth, each ray can use only 0.04 bytes of memory bandwidth, or a third of a bit. How to fit three rays in a bit is left as an exercise to the reader. If you limit yourself to 100 paths per frame at 1080p, you'll have 16 bytes of memory bandwidth per ray. 
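These back-of-the-envelope budgets are quick to re-derive; a sanity-check script using the same constants as above (9-float triangles, 1 TB/s of bandwidth):

```python
TB = 1e12        # 1 TB/s memory bandwidth, in bytes per second
tri = 9 * 4      # one triangle: nine 32-bit floats = 36 bytes

# 10k paths of length 5 => 50k triangle reads per pixel.
bytes_per_pixel = 10_000 * 5 * tri
print(bytes_per_pixel / 1e6)    # 1.8 MB per pixel
print(TB / bytes_per_pixel)     # ~555k pixels per second

# 4k @ 60 Hz, 10k paths per pixel, 5 rays per path.
rays_4k60 = 3840 * 2160 * 10_000 * 5 * 60
print(TB / rays_4k60)           # ~0.04 bytes per ray

# 1080p @ 60 Hz, 100 paths per pixel.
rays_1080p60 = 1920 * 1080 * 100 * 5 * 60
print(TB / rays_1080p60)        # ~16 bytes per ray
```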
Still not enough to fit a 9-float triangle, but maybe you could trace SDFs or something.</div><div><br /></div><div>Compute-wise, well, 10k paths per pixel at 4k is about 80 billion paths per frame. At 60 Hz, that's a cool five trillion paths per second. A 13 TFLOPS RTX 2080 Ti could spare 2.5 ops per path. If each path has five rays, that's half an op per ray. Going down to 100 paths at 1080p30, your budget becomes 432 ops per ray. If you also go from f32 to f16, you'd have 860 ops per ray and 32 bytes of bandwidth per ray @ 1TB/s, for 16 half-floats. With a memory-dense acceleration structure averaging one triangle test per traversal, you could theoretically fit the traversal and one triangle in that.</div><div><br /></div><div>If you could somehow do path tracing with INT4s, you'd have a lot more headroom for memory bandwidth and compute. 32 bytes would fit 64 INT4s. And an A100 card can do 1248 trillion tensor ops per second on INT4s. That'd be 50 ops per ray at 4k60 and 10k paths per pixel. You could render something! I don't know what it would look like, but something at least!</div><div><br /></div><div>Three types of paths: high-energy paths, high-probability paths and connecting paths. High-energy paths start from emissive surfaces. High-probability paths start from the camera. Connecting paths connect the two. High-P paths change when the camera moves and the scene in front of the camera changes. High-E paths change when the scene around the light sources moves. Connecting paths can be done via MIS weighting. You multiply the contribution of a path by its probability.</div><div><br /></div><div>Unidirectional path tracing only generates high-P paths. Path tracing with next event estimation (shooting shadow rays at lights) tries to connect high-P paths with high-E regions. Bidirectional path tracing generates high-E paths and tries to connect high-P paths with high-E path suffixes. Photon mapping propagates high-E regions throughout the scene. 
Radiosity is another method to propagate high-E regions.</div><div><br /></div><div>We know how to generate high-P paths and high-E paths. What about generating connecting paths? You could pick two random surfaces in the scene and try to connect them with a ray. Do this a few million times and you should have a decent connectivity map to estimate the transmission ratio from one arbitrary ray to another, and what the connecting path would look like. Then figure out high-P regions (what parts of the scene have camera rays) and sum up the energy in high-E regions, then search for high transmission connective paths from high-P regions to high-E regions and try to do connections that would contribute the most to camera pixels. That is, you've got a high-transmission, high-probability (according to the BSDF) connection from a camera path to a light path.</div><div><br /></div><div>More tricks. Scene traversal on a GPU. You've got 32 rays flying at the same time, they're stepping through the acceleration structure to find the next hit. The hit happens at different depths of the acceleration structure, so the whole thing proceeds at the speed of the slowest ray. Then you bounce all the rays, they go all over the place. Now your rays are diverging even more. Not only do they finish at different times, they also access memory in different places. Another bounce. Some rays finished their paths and the execution lanes for those rays are just idling until all the 32 rays have finished.</div><div><br /></div><div>You could make the rays fly in a bundle, forcing them to bounce in the same direction. That'd get you lower divergence among the bundle. This'll give you a very biased ray bunch though, and it'll look like your renderer is doing some weird hackery. Maybe you could scramble the starting pixels for the rays. Use a dither pattern to pick a bunch of pixels from a region, trace them in a coherent fashion. 
Pick another dither bunch of pixels, trace them in a different coherent fashion. Repeat until you've got the region covered. Now neighboring pixels won't have traversed a coherent path, so you get something more of a dithered noise pattern. And hopefully get the performance gains of coherent ray bundles without the biased render look at low sample counts.</div><div><br /></div><div>Also, this:</div><div><br /></div><div><br /></div>
<div>
<iframe src="https://v.fhtr.net/5c6f8da7367551972c000002" allowfullscreen="" width="1002" height="682" frameborder="0"></iframe>
</div>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-43985649782686587032020-06-12T05:49:00.013+03:002020-06-12T12:42:01.244+03:00GLSL stringsStarted writing a small string library for GLSL. Because it might be nice to do massively parallel string processing. It's difficult to make it fast on a GPU though, lots of variable-length reads and variable-length writes.<div><br /></div><div>You can follow the progress on the <a href="https://github.com/kig/spirv-wasm">spirv-wasm</a> repo at <a href="https://github.com/kig/spirv-wasm/tree/master/string">https://github.com/kig/spirv-wasm/tree/master/string</a></div><div><br /></div><div>It's still in the shambles-phase. That is, I haven't actually run the code. Adding string literals to GLSL is done with a small pre-processing script that will eventually turn string literals into heap allocations. The approach is similar to the <a href="https://github.com/kig/spirv-wasm/tree/master/http_shader">GLSL HTTP parser hack</a>. </div>
<div><br /></div>
<div>In the HTTP parser pre-processor, string literals were done with "abcd" becoming an array of ints and 'abcd' turning into an int with four ASCII characters packed into it. The array approach is difficult with GLSL, since arrays need to have a size known at compile time. Packing four chars into an int and four ints into an ivec4 is, uh, fast? But a lot of hassle to work with.</div>
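The 'abcd'-to-int packing looks like this in Python (assuming the first character goes into the lowest byte; the actual pre-processor may order the bytes differently):

```python
def pack4(s):
    """Pack up to four ASCII characters into one 32-bit int,
    first character in the lowest byte."""
    n = 0
    for i, c in enumerate(s[:4]):
        n |= (ord(c) & 0xFF) << (8 * i)
    return n

def unpack4(n):
    """Inverse of pack4: pull the non-zero bytes back out as a string."""
    chars = []
    for i in range(4):
        b = (n >> (8 * i)) & 0xFF
        if b:
            chars.append(chr(b))
    return "".join(chars)

word = pack4("abcd")
print(hex(word))       # 0x64636261
print(unpack4(word))   # abcd
```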
<div><br /></div>
<div>In string.glsl I'm trying a different approach, representing strings with an ivec2 with a start pointer and end pointer to the heap buffer. So there are a lot of loops that look like </div>
<pre style="text-align: left;">for (int i = str.x; i < str.y; i++) {<br /> heap[i]...<br />}</pre>
<div>This is nice in that you can have strings of varying lengths and do the whole slice-lowercase-replace-split-join -kaboodle. But you need malloc. And GLSL doesn't have malloc. So we need to make a malloc. The malloc is a super simple stack-style malloc.</div>
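A mock of the ivec2-string idea outside GLSL, with a shared int heap and (start, end) handles (hypothetical Python; the real code lives in string.glsl):

```python
heap = [0] * 64   # the shared "heap" of character codes
heap_ptr = 0

def new_str(s):
    """Copy a Python string onto the heap, return a (start, end) handle."""
    global heap_ptr
    start = heap_ptr
    for c in s:
        heap[heap_ptr] = ord(c)
        heap_ptr += 1
    return (start, heap_ptr)

def upper(h):
    """In-place uppercase over a heap span, like the GLSL loop above."""
    for i in range(h[0], h[1]):
        if ord('a') <= heap[i] <= ord('z'):
            heap[i] -= 32
    return h

def to_py(h):
    return "".join(chr(heap[i]) for i in range(h[0], h[1]))

s = new_str("hello glsl")
print(to_py(upper(s)))   # HELLO GLSL
```

Slicing falls out for free: a substring is just a new (start, end) pair into the same heap, no copy needed.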
<div><br /></div>
<h2>Malloc</h2>
<div><br /></div><div>We've got a heap buffer that's split into slices. Each program instance gets its own slice of the heap. Each program instance also has a heapPtr that points to the heap slice of the instance. To allocate N bytes of memory, malloc increments heapPtr by N and returns an ivec2 with the previous heapPtr and the new heapPtr. There's no free. If you allocate too much, you'll run out of space and corrupt the next instance's slice. To reclaim memory, make a copy of heapPtr before doing your allocations, then set heapPtr to the old value once you're done.</div>
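The same allocator as host-side Python, to make the save/restore reclamation concrete (a sketch, not the actual string.glsl code):

```python
class BumpAlloc:
    """Stack-style allocator: malloc bumps a pointer, there is no free.
    Reclaim by saving the pointer and restoring it when done."""
    def __init__(self, slice_start, slice_end):
        self.heap_ptr = slice_start
        self.slice_end = slice_end   # start of the next instance's slice

    def malloc(self, n):
        start = self.heap_ptr
        self.heap_ptr += n
        # Overrunning the slice would corrupt the neighbor's memory.
        assert self.heap_ptr <= self.slice_end, "heap slice overflow"
        return (start, self.heap_ptr)   # ivec2-style (start, end)

a = BumpAlloc(0, 1024)
saved = a.heap_ptr      # remember where we were
print(a.malloc(16))     # (0, 16)
print(a.malloc(8))      # (16, 24)
a.heap_ptr = saved      # "free" everything allocated since saved
print(a.malloc(4))      # (0, 4)
```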
<div><br /></div>
<div>This kind of malloc is not perfect by any means but it's fast and predictable. GLSL's lack of recursion and the intended <16 ms program runtime also make it unlikely that you're going to need something more sophisticated. Free by terminating the program.</div><div><br /></div><div>In case you do need something fancier, how to improve on this? If you wanted to share the heap between program instances, you could use atomics. To free allocations, you'd need a data structure to keep track of them. Probably a two-level setup where program instances reserve pages, and then internally allocate from the pages. Use virtual memory to allow allocations to span several pages, or reserve sequential pages to avoid VM. On first malloc, a program instance would reserve a page by setting a bit in the page table bitmap. On program exit, the instance would unset the bits for the pages it had reserved.</div>
<div><br /></div>
<h2>Performance</h2>
<div><br /></div><div>String performance in the HTTP compute shader produced some insights. The CPU doesn't like gathers, so striping the HTTP requests runs quite a bit better in ISPC. You take the requests that are AAAABBBBCCCCDDDD... etc. and rearrange the bytes to ABCDABCDABCDABCD... This way the SIMD threads can read their next byte with a single vector load. On the GPU this was less helpful. GPU and CPU performance was boosted significantly by doing a larger amount of work per program instance. Processing one request per instance vs. processing 1024 requests per instance could give a >10x speed boost.</div><div><br /></div><div>The general performance of <a href="https://github.com/kig/spirv-wasm/blob/master/http_shader/httpd_ivec4.glsl">the HTTP compute shader</a> is fun. Compiling GLSL for the CPU via ISPC produces code that runs very fast. Running the same code on the GPU is slower. We're talking about the CPU (16-core TR2950X) doing 18 million requests per second and the GPU (RTX2070) doing 9 million requests per second, on a 66/33 read/write sequential workload to an in-memory key-value-store of half a million 1kB blocks. Granted, this has very low computational intensity, it's basically "read a few bytes, branch, parse an int, branch, memcpy." Perhaps the tables would turntables if you rendered mandelbrots instead. Also likely is that I haven't found the right way to write this shader for the GPU. I believe the GPU would also benefit from using a separate transfer queue for sending and receiving the request and response buffers asynchronously. With sync transfers & transfer times included, the GPU perf drops further to 5 Mreqs/s.</div><div><br /></div><div>Feeding the beast is left as an exercise for the reader. You'd need to receive 18 million packets per second and send the same out. 
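The striping transform described above, AAAABBBBCCCC... to ABCDABCD..., is just a byte transpose. A toy Python version for equal-length requests (the real code stripes to match the SIMD lane width):

```python
def stripe(requests):
    """Interleave the bytes of equal-length requests so that SIMD lane i
    can read byte j of request i with one contiguous vector load."""
    assert len({len(r) for r in requests}) == 1, "equal lengths required"
    return bytes(b for col in zip(*requests) for b in col)

reqs = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
print(stripe(reqs))   # b'ABCDABCDABCDABCD'
```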
You could have a bunch of edge servers collect their requests into a buffer and send the buffer over to the DB server, which would process the requests and send the response buffers back to the edge servers. If one edge server could deal with a million packets per second, you'd need 36 of them for a single CPU server.</div><div><br /></div><div>Scaling should be mostly linear. I don't have a 128-core server handy, but you might well be able to handle over 100 million requests per second with one. That'd be 100 GB/s of 1 kB responses. Or one request per minute for every person on Earth.</div><div><br /></div><div><div>[Edit] On further investigation, the CPU performance seems to be memory limited. Running the same task in multi-threaded C++ (via spirv-cross) runs at nearly the same speed regardless of whether your thread count is 8, 16 or 32. At 4 threads, the performance roughly halves. So while a multi-socket server might help (more memory channels!), you might not need more than two cores per memory channel for this.</div><div><br /></div></div>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-558485899981344032020-05-31T07:58:00.005+03:002020-06-11T05:19:11.091+03:00Raspberry Pi RC car<p>Here's my software package to turn your Raspberry Pi into an RC car <a href="https://github.com/kig/rpi-car-control" target="_blank">https://github.com/kig/rpi-car-control</a></p>
<h2>rpi-car-control</h2>
<p>Use a Raspberry Pi to drive an RC car from a web page.</p>
<div class="separator"><a href="https://1.bp.blogspot.com/-M1UoRvWrm-A/XuDiUHdTQzI/AAAAAAAAwi0/UQ4L_XqixjsK89E38hTkfioT4B-fBU8EgCK4BGAsYHg/s1042/96128303_175393300378564_2855817436411710538_n.jpg" imageanchor="1"><img border="0" data-original-height="1042" data-original-width="1042" height="1042" src="https://1.bp.blogspot.com/-M1UoRvWrm-A/XuDiUHdTQzI/AAAAAAAAwi0/UQ4L_XqixjsK89E38hTkfioT4B-fBU8EgCK4BGAsYHg/w1042-h1042/96128303_175393300378564_2855817436411710538_n.jpg" width="1042" /></a></div>
<p>Fisheye camera with two IR lamps, a white USB power bank underneath. The wires go inside the Raspberry Pi case.</p>
<br />
<div class="separator"><a href="https://1.bp.blogspot.com/-EszOC9pEJwE/XuDiUeYlPbI/AAAAAAAAwi4/t7b8FUVmM5IKBlRljlFU4U0N8_LABOGUwCK4BGAsYHg/s1042/96732948_666293964154368_8829786711180031752_n.jpg" imageanchor="1"><img border="0" data-original-height="1042" data-original-width="1042" height="1042" src="https://1.bp.blogspot.com/-EszOC9pEJwE/XuDiUeYlPbI/AAAAAAAAwi4/t7b8FUVmM5IKBlRljlFU4U0N8_LABOGUwCK4BGAsYHg/w1042-h1042/96732948_666293964154368_8829786711180031752_n.jpg" width="1042" /></a></div>
<p>A ToF laser range finder for the reversing distance indicator.</p>
<br />
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#how-does-it-work" id="user-content-how-does-it-work"></a>How does it work?</h2>
<p>Open up a cheap RC toy car. Connect the motors to a Raspberry Pi. Add a camera. Run a web server on the Raspberry Pi that controls the car.</p>
<p>In more detail, you need to replace the car PCB with a motor controller board (say, a tiny cheap MX1508 module). Then solder the motors and the car battery pack to the motor controller. Solder M-F jumper cables to the motor controller's control connectors. Plug the other end of the jumpers to the Raspberry Pi GPIOs. Now you can control the motors from the Raspberry Pi.</p>
<p>Expose the Raspberry Pi camera as an MJPEG stream so that you can directly view it as an IMG on the browser. This is the easiest low latency, low CPU, high quality streaming format.</p>
<p>If the car has lights, you can drive them from the GPIOs as well (either directly or via a proper LED controller). Add a bunch of sensors to the car for the heck of it. I've got a tiny VL53L1X ToF laser-ranging sensor as a reversing radar, and a DHT temperature and humidity sensor. There's code in the repo to hook up an ultrasonic range finder too (it can even use the DHT sensor to calculate the speed of sound for a given temperature and humidity - and has a Kalman filter of sorts, so you can reach ~mm accuracy), and some bits and bobs for using a PIR sensor.</p>
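The speed-of-sound correction mentioned above fits in one line; this is a common linear approximation, not necessarily the exact formula the repo uses:

```python
def speed_of_sound(temp_c, rel_humidity):
    """Approximate speed of sound in air (m/s) from temperature (deg C)
    and relative humidity (%). Linearized fit, fine near room temperature."""
    return 331.4 + 0.606 * temp_c + 0.0124 * rel_humidity

# Ultrasonic ranging: distance = time_of_flight * speed / 2 (out and back).
c = speed_of_sound(20.0, 50.0)
print(round(c, 1))   # 344.1 m/s
```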
<p>There was also a microphone input and playback either through wired speakers or to a Bluetooth speaker, but that's not enabled at the moment. There was also a WebRTC-based streaming solution for doing 2-way video calls, but that was such a pain I gave up on it. I was using RWS which is pretty easy to set up, but the STUN/TURN stuff was tough.</p>
<p>Add a USB battery pack to power the Raspberry Pi and you're about done. If you're feeling adventurous, you could use a 5V step-up/step-down regulator to run the Raspberry Pi directly from the car batteries.</p>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#install" id="user-content-install"></a>Install</h2>
<div class="highlight highlight-source-shell"><pre>raspi-config <span class="pl-c"><span class="pl-c">#</span> Enable I2C to use the VL53L1X sensor</span>
sh install.sh</pre></div>
<p>The install script installs the car service and its dependencies. This is best done on a fresh install of Raspbian. The install script overwrites NGINX's default site configuration.</p>
<p>After starting the car control app with <code>sudo systemctl start car</code>, you can connect to <code>http://raspberrypi/car/</code> and play with the controls web page.</p>
<p>The car control app is installed in <code>/opt/rpi-car-control</code>.</p>
<p>To use a SSH tunnel server, edit <code>/etc/rpi-car-control/env.sh</code> and change the line <code>RPROXY_SERVER=</code> to <code>RPROXY_SERVER=my.server</code>.</p>
<p>With the SSH tunnel, you can access the car from <code>http://my.server:9999/car/</code>. Best to firewall this port and add an HTTPS reverse proxy that points to it. Look at <code>etc/remote_nginx.conf</code> for a snippet that sets up an authenticated NGINX reverse proxy on the remote server. (Run <code>htpasswd -c /etc/nginx/car_htpasswd my_username</code> to create the password file.)</p>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#configuration" id="user-content-configuration"></a>Configuration</h2>
<p>See <code>/etc/rpi-car-control/env.sh</code> for settings.</p><div class="highlight highlight-source-shell"><pre><span class="pl-c"><span class="pl-c">#</span> SSH tunnel reverse proxy</span>
RPROXY_SERVER=my.server
<span class="pl-c"><span class="pl-c">#</span> One of v4l2-mjpeg, v4l2-raw, raspivid</span>
VIDEO_MODE=v4l2-raw
<span class="pl-c"><span class="pl-c">#</span> Which camera to use in the v4l2 modes</span>
V4L2_DEVICE=/dev/video2
<span class="pl-c"><span class="pl-c">#</span> Video settings</span>
VIDEO_WIDTH=480
VIDEO_HEIGHT=270
VIDEO_FPS=60
VIDEO_ROTATION=0</pre></div>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#controls" id="user-content-controls"></a>Controls</h2><div class="separator"><a href="https://1.bp.blogspot.com/-2SkJon_Ghw0/XuDiFUrqjDI/AAAAAAAAwig/m3rLmbMSoH8HRMfnm2Du7ygiYMGZgRyxgCK4BGAsYHg/s3222/controls.jpg" imageanchor="1"><img border="0" data-original-height="1886" data-original-width="3222" height="610" src="https://1.bp.blogspot.com/-2SkJon_Ghw0/XuDiFUrqjDI/AAAAAAAAwig/m3rLmbMSoH8HRMfnm2Du7ygiYMGZgRyxgCK4BGAsYHg/w1042-h610/controls.jpg" width="1042" /></a></div>
<p><br /></p>
<p>The circle on the left is the accelerator indicator, and the circle on the right is the steering indicator. The bar in the bottom middle is the reversing distance indicator. The sensor data readout is at top left. The little square at the bottom right toggles the full screen mode.</p>
<p>The controls are defined near the bottom of <code>html/main.js</code>.</p><h3><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#touch-controls" id="user-content-touch-controls"></a>Touch controls</h3><ul><li>Use left thumb to accelerate and reverse, right thumb to steer.</li></ul><h3><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#keyboard-controls" id="user-content-keyboard-controls"></a>Keyboard controls</h3><ul><li>Use arrow keys to drive.</li><li>The numbers <code>1</code>-<code>4</code> control front lights intensity and <code>0</code> turns the rear lights on and off.</li><li>The <code>z</code> key blinks the left front light, the <code>c</code> key blinks the right front light and the <code>x</code> key turns off the blinkers.</li></ul>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#requirements" id="user-content-requirements"></a>Requirements</h2>
<p>The app is very modular, so you can run it without an actual car or camera, and just play with a web page whose controls do nothing.</p>
<p>If you wire up the motors, you should be able to drive. If you wire up the lights, they should light up.</p>
<p>Wire up the sensors and you should start seeing sensor data in the HUD.</p>
<p>Add a camera and you'll see a live video stream.</p>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#wiring" id="user-content-wiring"></a>Wiring</h2>
<p>See <code>control/car.py</code> and <code>sensors/sensors_websocket.py</code> for the pin definitions. The VCC and GND connections have been left out. Just remember to use the correct voltage when wiring those.</p>
<div>
<style>
table { width: 100%; border-collapse: collapse; border-spacing: 0px; margin-bottom: 1.25em; }
tr:nth-child(even) {background-color: #f2f2f2;}
td, th { margin: 0; padding: 6px 13px; text-align: left; }
</style>
<table><tbody>
<tr><th>Component</th><th>GPIO</th><th>Notes</th></tr>
<tr><td>Motor forward (A)</td><td>17</td><td></td></tr>
<tr><td>Motor backward (B)</td><td>27</td><td></td></tr>
<tr><td>Steering left (A)</td><td>24</td><td></td></tr>
<tr><td>Steering right (B)</td><td>23</td><td></td></tr>
<tr><td>Left headlight</td><td>5</td><td>The headlights turn on when you connect</td></tr>
<tr><td>Right headlight</td><td>6</td><td>They can also blink a turning signal</td></tr>
<tr><td>Rear lights</td><td>13</td><td>Rear lights light up when you reverse</td></tr>
<tr><td>Power PWM</td><td>12</td><td>Disabled, for use with L298N</td></tr>
<tr><td>DHT11 signal</td><td>14</td><td></td></tr>
<tr><td>PIR signal</td><td>22</td><td></td></tr>
<tr><td>VL53L1X power</td><td>4</td><td>Use a GPIO and you can turn it off when not in use</td></tr>
<tr><td>VL53L1X SDA</td><td>2</td><td>I2C bus 1</td></tr>
<tr><td>VL53L1X SCL</td><td>3</td><td>I2C bus 1</td></tr></tbody></table>
</div>
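With the pin table above, the motor drive logic amounts to PWM-ing one pin of each pair. A hypothetical sketch with a stub PWM class so the logic runs without hardware (the repo's real code is in <code>control/car.py</code>):

```python
MOTOR_FWD, MOTOR_BWD = 17, 27   # GPIO pins from the wiring table above

class StubPWM:
    """Stands in for a real PWM channel (e.g. RPi.GPIO's); records duty cycle."""
    def __init__(self, pin):
        self.pin, self.duty = pin, 0.0
    def change_duty_cycle(self, duty):
        self.duty = duty

fwd, bwd = StubPWM(MOTOR_FWD), StubPWM(MOTOR_BWD)

def set_throttle(t):
    """t in [-1, 1]: PWM one MX1508 input, keep the other low."""
    t = max(-1.0, min(1.0, t))
    fwd.change_duty_cycle(max(0.0, t) * 100.0)
    bwd.change_duty_cycle(max(0.0, -t) * 100.0)

set_throttle(0.5)           # half speed forward
print(fwd.duty, bwd.duty)   # 50.0 0.0
set_throttle(-1.0)          # full reverse
print(fwd.duty, bwd.duty)   # 0.0 100.0
```

The steering pair (GPIO 24/23) works the same way, just bang-bang instead of proportional on cheap toy cars.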
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#features" id="user-content-features"></a>Features</h2><ul><li>FPV stream web page with keyboard & touch controls to drive the car, along with a reversing distance indicator and a thermometer.</li><li>Low latency video stream for driving (down to 50 ms glass-to-glass when using a 90 Hz camera and a 240 Hz display.)</li><li>Bunch of websocket servers to send out sensor data and receive car controls.</li><li>Nginx reverse proxy config to tie all the servers together.</li><li>Systemd service to start the car control server on boot.</li><li>SSH tunnel to a remote control server to drive the car from anywhere.</li><li>Low-power tweaks to increase battery life (disables HDMI, Ethernet and USB.)</li><li>Use RaspiCam or a V4L2 USB webcam, either with raw video (eats CPU) or camera-supplied MJPEG</li></ul>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#disabled" id="user-content-disabled"></a>Disabled</h2><ul><li>Bluetooth speaker pairing for playing audio.</li><li>Stream car microphone to the browser.</li><li>Speak to the car from the browser by sending audio with Web Audio API.</li><li>WebRTC call between browser and car.</li></ul>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#in-progress" id="user-content-in-progress"></a>In progress</h2><ul><li>PoseNet with Coral USB accelerator for "point and I'll drive there"</li></ul>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#wanted" id="user-content-wanted"></a>Wanted</h2><ul><li>OMX JPEG encoder for raw video cameras</li><li>SLAM and "click on a map position to drive there"</li><li>Good small microphone + speaker solution</li><li>Small display to do two-way video calls</li><li>Non-sucky camera mount (duct tape doesn't really work)</li><li>Power car and computer from one battery</li><li>Automatic wireless charging when battery is low</li><li>Shutdown when battery critical</li><li>Speech controls</li></ul>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#customize" id="user-content-customize"></a>Customize</h2>
<p>Take a look at <code>run.sh</code> first. It starts the web server and optionally the reverse proxy tunnel. The web server is in <code>web/web_server.py</code> and starts up <code>bin/start_control_server.sh</code> and <code>bin/start_server.sh</code> when needed. The sensors are controlled by <code>sensors/sensors_websocket.py</code>, and the car controls are in <code>control/car_websockets.py</code>. For video streaming, have a look at <code>video/start_stream.sh</code>. The HUD is in <code>html/</code>, see <code>html/main.js</code> for the car controls and how the video and sensor data are streamed.</p>
<h2><a aria-hidden="true" class="anchor" href="https://github.com/kig/rpi-car-control/blob/master/README.md#license" id="user-content-license"></a>License</h2>
<p>MIT</p>
<p>Ilmari Heikkinen © 2020</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-10561602068108340422019-07-16T17:52:00.000+03:002019-07-16T17:52:19.105+03:00OverlayFound this short story of mine from 2003.<br />
<br />
<h2>
Overlay</h2>
<br />
Bright sunlight shone down through a sieve of elm leaves. The air was rich with the joyous melodies of songbirds. Gradually intensifying. Until the racket was nearly deafening.<br />
<br />
I woke up. The alarm I had set last night wasn't quite as gentle as I had hoped, but it did get its job done. After getting my head straight with a sip of very hot, very strong coffee, I switched the overlay mode to read the few bits of mail that got through my extensive spam filter. Nothing much, just a few more entries into the filter. Some interesting linguistic randomizer used to compose two pieces of junk mail, completely different in detail, but identical in content. Spent the next five minutes training the language analyzer to filter out such messages. Thinking about implementing an opt-in filter to keep out all unsolicited mail. Coffee finished, starting on a croissant with two slices of cheese and a salad leaf. Reading the newsbits on overlay, not much going on locally. Not much going on globally. Twenty-seven new posts on the community forums overnight. Most of it ranting, logging and post-flood. No more croissant.<br />
<br />
Teeth clean. Beard shaved. Donning my well-worn semi-long jacket with a big decorative camtex pattern of a round microcircuit flower on the back. Setting the default texture of the flower to quicksilver today. Playing Strauss on the audio overlay, walking the stairs down to street level in beat. ID-system sending auth to the door. Out of the door and onto the street. Toning down brightness and contrast of the urban landscape, selecting an 18th century Italian architecture style for the building camtex overlay. Walking to the grocery store, lazily scanning the shared data of passersby, grabbing a spectator video of yesterday's CTF semifinals. Checking out some interesting plays, when I hear the audio overlay collision alert. Sidestepping to avoid bumping into a man dressed in a synthleather jacket with a blazing camtex phoenix pattern. Fiddling around with overlay texture controls, turning people into 1x1x2m blocks of sandstone with dust rising in their wake and public ID carved onto all sides in a blocky rune font. A sandstone column with a green target marker suddenly appears from around a corner. Turning off the sandstone texture of people with markers.<br />
<br />
Nostro is wearing her blueish-gray long wool coat. No camtex in sight. We talk and rummage through each other's shared files while on the way to the grocery shop. I show her my sandstone block mod and she giggles at me being a horrible person. Stepping through the shop doors, greeting the shopkeeper, picking up some oranges and pizza, heading back out through the door. Approving the instant debit payment dialog appearing on the overlay. Nostro is sipping from the apple juice carton she bought. We part ways. I head to the park. She goes home.<br />
<br />
Chatting on the 'net, dodging lollerskaters, blasting old-skool Prodigy on the overlay. The park has a little stream running through it, roughly ten meters from bank to bank. Some ducks and an occasional swan swimming in it, fed fat by people throwing bread crumbs at them. I sat on a park bench in the shade of a cherry tree. Switched overlay to work mode, started going through work mail, attached the work network screen and checked my todo-list. Changed text input method from vocal chord sensors to finger joint virtual keyboard. Started with editing out done jobs from the todo-list, proceeded with coding some overlay mods. Mostly visual fx, with some audioscape hints and ideas for the sound guys. Spent the afternoon debugging a furry cherry tree mod. Detached the work screen and threw the now empty bag of oranges into a recycler. Took the pizza home, put it in the freezer, hoping it didn't go bad from working with me. Attached the work screen to put some final touches into the cherry tree. Added a 20% encounter rate blackbird into the scene. Set myself down onto the bed.<br />
<br />
Cruelly awakened an hour later by the familiar jingle of a game request. Switched overlay to complete external sensor block and dove into the mechanical battlearmor. Thirty minutes and several deaths later, the opposing team succeeded in destroying our last remaining respawn point. Soon after, two enemy armors emerged from behind a hill and blew my right leg joint with a well-aimed AP round. I managed to take out one of the enemy armors with some creative mortar juggling, but the other managed to hit my immobile armor with some plasma rounds near the vulnerable ammo deposits. A big boom later I was but another spectator following the desperate struggle of our last remaining armor against a full five-armor enemy squad. Five seconds, and the only remainder of our last armor was a big charred hole in the ground. I switched external sensors back on and talked about the game with the guy on my contact list who woke me up.<br />
<br />
I put the pizza into the oven, set the overlay to alert me in 12 minutes. Checked the videofeeds. Nothing really interesting going on on commercial live webcasts. The girl two floors down in apartment A12 was setting up her weekly piano performance webcast. She's been steadily improving during the two months I've been living here. Good performance tonight, wired her a two credit micropayment as my thanks for tonight. The pizza hadn't gone bad. Not that it was some miraculous culinary marvel either.<br />
<br />
Checked the newsbits. Scriptkiddies DDOSsing people's personal uplinks on the streets to mug them when they can't call the cops via the net. Discussion forums had a link to a patch that sets up an emergency outbound link on the id data channel. Too bad that unauthorized communication on the id data channel is a major villainy. Today's spam had a lot of personal security products. Tasers and such. Easy to fry someone's optics with one. Nasty. Updated the filter list.<br />
<br />
Approved the home server patchlist for applying. Approved the personal unit patchlist for applying. Spent 5 millicredits worth of gridtime to compile the patches. Took a shower, set the camtex of the bathroom to a jungle waterfall. Almost hit my head on the wall. Body clean. Teeth clean. Set the overlay to wake me up in eight hours.<br />
<br />
---<br />
<br />
Ilmari Heikkinen<br />
Last updated: 2003-02-03Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-83613160368586042482019-05-20T17:22:00.000+03:002019-05-21T17:33:08.715+03:00The true meaning of BrexitThe true meaning of Brexit. It's right there. In your heart. It's been there all along. Brexit isn't about how many presents you receive, it isn't the fancy party, it isn't even the joyful songs you sing. Brexit is deeper than that. Brexit is the love you feel for this island and the people on it.<br />
<br />
Brexit isn't about fighting with your neighbours and throwing your friends out into the cold. Brexit is about making a better Britain. A Britain where you can be a mechanic, a nurse, a mathematician or an art historian. A Britain where you don't have to be a banker to make ends meet. That's the real meaning of Brexit.<br />
<br />
Brexit isn't about tearing apart our relationships and breaking our contracts. Brexit is your love for Britain, the love that will create a new Britain. A better Britain. A Britain for everyone. Article 50 isn't Brexit. Article 50 is a sword that's cutting this island apart. The first step to true Brexit is revoking article 50. Only then can we start building a better Britain.<br />
<br />
A Brexit Britain. A Britain that will be a shining beacon of light in a world fallen to darkness. A Britain that stands proud above the waves. A Britain that doesn't require our neighbours to fund its poorest regions. A Britain that can stand alone for a thousand years, but chooses to stand together with its neighbours and create a world where our children and grandchildren can live. Proud of their ancestors. Proud of you and me.<br />
<br />
That's the true meaning of Brexit. Revoke Article 50.Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-18551958922451074602019-02-18T11:15:00.000+02:002019-02-18T11:59:56.354+02:00Voxel grid shortcuts<a href="https://3.bp.blogspot.com/-98Nqlqy30QI/XGpxlSxahRI/AAAAAAAAtIY/C8sNs4-zgtIKICbUfhZm4RU_7zpJv05LQCLcBGAs/s1600/download%2B%252844%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="300" data-original-width="300" src="https://3.bp.blogspot.com/-98Nqlqy30QI/XGpxlSxahRI/AAAAAAAAtIY/C8sNs4-zgtIKICbUfhZm4RU_7zpJv05LQCLcBGAs/s1600/download%2B%252844%2529.png" /></a><a href="https://3.bp.blogspot.com/-Bj03d5MBj6k/XGpxlBXcM-I/AAAAAAAAtIU/KUr-28pDVPIdlT-XE-TR5ABRC3Km_hK_gCLcBGAs/s1600/download%2B%252845%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="300" data-original-width="300" src="https://3.bp.blogspot.com/-Bj03d5MBj6k/XGpxlBXcM-I/AAAAAAAAtIU/KUr-28pDVPIdlT-XE-TR5ABRC3Km_hK_gCLcBGAs/s1600/download%2B%252845%2529.png" /></a>
<a href="https://2.bp.blogspot.com/-hazOIuiXkFE/XGpxlbGPTtI/AAAAAAAAtIc/tjdY3U5pBs46uLEOKTFu-Xk_STnDXmZRgCLcBGAs/s1600/download%2B%252846%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="300" data-original-width="300" src="https://2.bp.blogspot.com/-hazOIuiXkFE/XGpxlbGPTtI/AAAAAAAAtIc/tjdY3U5pBs46uLEOKTFu-Xk_STnDXmZRgCLcBGAs/s1600/download%2B%252846%2529.png" /></a>
<p>This might be nice and fast if it worked right. Voxel grid shortcuts: precompute the closest full cells for cell-cell pairs, look up the ray's entry & exit cells, jump the ray to the closest full cell. On exit, jump the ray out of the model. 🤔</p>
<p>But does it make sense to swap 4 steps through a 512-element 8x8x8 acceleration structure for some math & a lookup from a 262k-element shortcut list? 🤔</p>
<p>If you do only the external faces, 64^3 grid => 6^2x64^2 accel, which might be worthwhile.</p>
<p>The problem in the above screenshots is that a voxel-to-voxel beam intersects more voxels than a ray would. Right now the shortcuts are generated by tracing a ray from the center of the start voxel to the center of the end voxel to find the closest filled voxel the ray intersects. That doesn't visit all the voxels a beam would, so you get gaps in the model. And my code is broken in other ways, eh.</p>
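<p>For reference, a minimal Python sketch of that center-to-center precomputation (not the actual code: the grid is assumed to be a set of filled voxel coordinates, and the function names are made up). It has exactly the gap problem described, since the DDA only visits the voxels the center ray crosses, not the full beam:</p>

```python
import math

def first_filled(filled, start, end):
    # Amanatides & Woo style 3D DDA from the center of voxel `start`
    # toward the center of voxel `end`; returns the first filled voxel
    # the segment passes through, or None if it reaches `end` empty.
    pos = list(start)
    origin = [s + 0.5 for s in start]
    direction = [e - s for s, e in zip(start, end)]
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0:
        return start if start in filled else None
    direction = [d / length for d in direction]
    step, t_max, t_delta = [0] * 3, [math.inf] * 3, [math.inf] * 3
    for i in range(3):
        if direction[i] > 0:
            step[i] = 1
            t_max[i] = (math.floor(origin[i]) + 1 - origin[i]) / direction[i]
            t_delta[i] = 1 / direction[i]
        elif direction[i] < 0:
            step[i] = -1
            t_max[i] = (origin[i] - math.floor(origin[i])) / -direction[i]
            t_delta[i] = 1 / -direction[i]
    while True:
        if tuple(pos) in filled:
            return tuple(pos)
        if tuple(pos) == end:
            return None
        axis = t_max.index(min(t_max))  # step along the axis crossed next
        pos[axis] += step[axis]
        t_max[axis] += t_delta[axis]

def build_shortcuts(filled, n=8):
    # The 262144-entry table for an 8x8x8 grid: for each (entry, exit)
    # pair, the first filled voxel on the center-to-center segment.
    cells = [(x, y, z) for x in range(n) for y in range(n) for z in range(n)]
    return {(a, b): first_filled(filled, a, b) for a in cells for b in cells}
```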
Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-61184504698238688932019-02-14T07:02:00.000+02:002019-02-18T13:57:09.802+02:00Fix the atmosphere for profit1) Build enough solar / wind to run your country.<br />
2) Double your build to cover for low production periods.<br />
3) Use the surplus to capture atmospheric carbon and run the Fischer-Tropsch process to convert it into fuel.<br />
4) Keep building solar / wind until synfuel production exceeds your demand.<br />
5) Export the excess synfuel, use proceeds to build more solar / wind.<br />
6) Once synfuel production exceeds global demand, start stockpiling the synfuel.<br />
7) Keep going until atmospheric carbon hits normal levels.<br />
8) Control global synfuel supply.<br />
<br />
The economics of oil: different extraction technologies have different break-even prices per barrel. If the oil price goes below that level, there's no profit in doing the extraction, and the oil fields get shuttered to wait for higher prices.<br />
<br />
If you can produce synfuel at a cost below an oil field's break-even price, that field gets shut down. Solar has been halving in price roughly every five years, so you might well imagine an inflection point where synfuel mined from the atmosphere with solar power is cheaper than extracted oil. At that point, the synfuel company can start accumulating excess profits and squeezing traditional producers out of the market, while protecting its monopoly by acquiring nascent competitors and whatever traditional producers and oil fields it can.<br />
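To make the inflection point concrete, a back-of-envelope sketch. The per-barrel figures below are purely illustrative, not real prices; the only real input is the "halves every five years" rate, and the crossover time drops out of a logarithm:

```python
import math

def years_until_cheaper(synfuel_cost, oil_cost, halving_years=5.0):
    # Solve synfuel_cost * 0.5**(t / halving_years) <= oil_cost for t.
    if synfuel_cost <= oil_cost:
        return 0.0
    return halving_years * math.log2(synfuel_cost / oil_cost)

# Illustrative: synfuel at $200/bbl vs oil at $50/bbl halves to $100 in
# five years and to $50 in ten, so the crossover is ten years out.
years = years_until_cheaper(200.0, 50.0)
```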
<br />
Atmospheric mining is a zero-sum game: there's a limited amount of carbon dioxide in the atmosphere and you have to stop mining when the CO2 levels fall too low. By mining out all usable atmospheric carbon, the company can eliminate any chance of competition. To add carbon to the atmosphere, the company sells synfuel to users who burn it. The company can then mine the carbon back from the atmosphere using solar.<br />
<br />
Because the company earlier used its excess profits to acquire unprofitable oil extractors and oil fields, it also has a source of extra carbon ready to go. As global fuel use increases, more and more carbon needs to be circulated through the atmosphere.<br />
<br />
Similar to mining, surface solar is also a zero-sum game. To produce solar, you need land area. Once the land area is in use, it can't be used for more solar. Acquiring the best lands for solar use will make competition difficult, especially at scale.<br />
<br />
The end state of the atmospheric carbon mining company is a monopoly over hydrocarbons and solar energy. At the end state, there are no oil reserves left and oil is a just-in-time produced synthetic product. How fast you can burn the oil depends on the speed of the atmospheric mining process. This could reach a point where an amount of oil matching the total amount drilled over the past century is circulated through the atmosphere each year. To prevent losing carbon to oceans, there also needs to be a system to extract dissolved carbon from seawater.<br />
<br />Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-79659677094692361902019-02-11T14:22:00.001+02:002019-02-13T23:29:42.018+02:00Beam acceleration<a href="https://2.bp.blogspot.com/-zMp6_ini07A/XGFpHBFwPfI/AAAAAAAAtGw/ZZoKU3QdfRErSgoek8uhFcVN9F2R8e5EACLcBGAs/s1600/download%2B%25283%2529.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-zMp6_ini07A/XGFpHBFwPfI/AAAAAAAAtGw/ZZoKU3QdfRErSgoek8uhFcVN9F2R8e5EACLcBGAs/s1600/download%2B%25283%2529.png" data-original-width="300" data-original-height="300" /></a><a href="https://4.bp.blogspot.com/-1fpA8TRjwT8/XGFwf7Nkg4I/AAAAAAAAtG8/LsMfGOZVhMImANX5nvfd2g3RUeBFuDPOgCLcBGAs/s1600/download%2B%25284%2529.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-1fpA8TRjwT8/XGFwf7Nkg4I/AAAAAAAAtG8/LsMfGOZVhMImANX5nvfd2g3RUeBFuDPOgCLcBGAs/s1600/download%2B%25284%2529.png" data-original-width="300" data-original-height="300" /></a><a href="https://2.bp.blogspot.com/-fcBfKoXbYak/XGKKy1XJjOI/AAAAAAAAtHY/H0rCZw5YnsYS0JhrhnbZ1nLDs7th6SvngCLcBGAs/s1600/download%2B%252820%2529.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-fcBfKoXbYak/XGKKy1XJjOI/AAAAAAAAtHY/H0rCZw5YnsYS0JhrhnbZ1nLDs7th6SvngCLcBGAs/s1600/download%2B%252820%2529.png" data-original-width="300" data-original-height="300" /></a>
<p>Noodling with beam-based acceleration. Tessellate the bounding volume into N faces, connect them with N^2 beams, add primitives to beams, sort the primitives inside each beam, find the containing beam for a ray, traverse the beam's primitives from the ray origin in the ray direction.</p>
<p>Accelerate in-beam intersection by finding a set of primitives that completely covers the beam. Cut the beam at the rear-most covering primitive, tessellate the cutting plane, and create a new set of beams from the beam entry to the cutting plane faces.</p>
<p>Bad: eats memory like crazy. Good: should be possible to do 2 beam classifies + 1-2 triangle intersects per ray on ~100ktri scenes. Let's see how it goes. (Also, see <a href="http://artis.imag.fr/Members/Cyril.Soler/DEA/Ombres/Papers/Arvo.Sig87.pdf">Fast Ray Tracing by Ray Classification (1987) by Arvo & Kirk</a>)</p>
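<p>A back-of-envelope version of the memory cost (every parameter here is a guess, not a measurement): F faces give F^2 beams, and each triangle gets a reference in every beam it overlaps, which is where the memory goes once beams are fat:</p>

```python
def beam_memory_bytes(n_faces, n_tris, avg_beams_per_tri,
                      bytes_per_ref=4, bytes_per_beam_header=8):
    # F faces -> F^2 beams; each triangle is referenced once per beam
    # it overlaps, so reference storage dominates the beam headers.
    n_beams = n_faces ** 2
    refs = n_tris * avg_beams_per_tri
    return n_beams * bytes_per_beam_header + refs * bytes_per_ref

# Guessing 512 faces and ~2500 overlapped beams per triangle puts a
# 100k-triangle scene at about a gigabyte of references.
gb = beam_memory_bytes(512, 100_000, 2500) / 1e9
```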
<p>Crazy = ~GB / 100ktris...</p>Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0tag:blogger.com,1999:blog-5103406499726689782.post-19886221647418676652018-10-06T09:29:00.001+03:002018-10-06T09:30:57.648+03:00Hardware hackingI got into these Pi devices recently. At first it was a simple "I want an easy way to control sites accessible on my office WiFi to stop wasting time when I should be working", so I set up an old broken laptop to prototype a simple service to do that. Then I replaced the laptop with a small Orange Pi hacker board. And got some wires and switches and breadboard and LEDs and resistors and ... hey, is that a Raspberry Pi 3B+? I'll get that too, maybe I can use it for something else...<br />
<br />
Well.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Q5S4nK4nEH0/W5QWA0D_BZI/AAAAAAAAql8/5RwFTZ2OKmke7PjlteFejUe0DeMt_iqmwCLcBGAs/s1600/_DSC9102.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://1.bp.blogspot.com/-Q5S4nK4nEH0/W5QWA0D_BZI/AAAAAAAAql8/5RwFTZ2OKmke7PjlteFejUe0DeMt_iqmwCLcBGAs/s950/_DSC9102.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-EGPqA5WJfo4/W5QWYKlzsDI/AAAAAAAAqmI/qAte-5CYfm8QaUGnv1QafqRlcWBBFKzdACLcBGAs/s1600/_DSC9111.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://3.bp.blogspot.com/-EGPqA5WJfo4/W5QWYKlzsDI/AAAAAAAAqmI/qAte-5CYfm8QaUGnv1QafqRlcWBBFKzdACLcBGAs/s950/_DSC9111.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-KFAFr0qSoeg/W5QWXwHVkqI/AAAAAAAAqmE/JnRyjw6pY8sCR1l9MCbl_okd8gSSXVglwCLcBGAs/s1600/_DSC9116.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://1.bp.blogspot.com/-KFAFr0qSoeg/W5QWXwHVkqI/AAAAAAAAqmE/JnRyjw6pY8sCR1l9MCbl_okd8gSSXVglwCLcBGAs/s950/_DSC9116.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-FGdoh5Ln_TY/W5QWwMB9ZzI/AAAAAAAAqmY/PRg85BHFAjsJLPrYEo6LjkokQqwRSri5gCLcBGAs/s1600/_DSC9120.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://4.bp.blogspot.com/-FGdoh5Ln_TY/W5QWwMB9ZzI/AAAAAAAAqmY/PRg85BHFAjsJLPrYEo6LjkokQqwRSri5gCLcBGAs/s950/_DSC9120.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-qrB3HpTOP7M/W5QWwCwxG3I/AAAAAAAAqmc/Hw1BxSEOQI8bbj4VcvmmotXJUeWzuwtXACLcBGAs/s1600/_DSC9129.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://1.bp.blogspot.com/-qrB3HpTOP7M/W5QWwCwxG3I/AAAAAAAAqmc/Hw1BxSEOQI8bbj4VcvmmotXJUeWzuwtXACLcBGAs/s950/_DSC9129.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-BVxiCEmzc3U/W5QWvrrf5ZI/AAAAAAAAqmU/njXuZ2-DPwUO2r72_QS5RVasNmTuR-LpgCLcBGAs/s1600/_DSC9143.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://4.bp.blogspot.com/-BVxiCEmzc3U/W5QWvrrf5ZI/AAAAAAAAqmU/njXuZ2-DPwUO2r72_QS5RVasNmTuR-LpgCLcBGAs/s950/_DSC9143.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-nz-nQseCDOw/W5QWx5DOQNI/AAAAAAAAqmg/54LgTvVJTaoQyt9bhXg8jRXsnRJb9tRIQCLcBGAs/s1600/_DSC9144.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://3.bp.blogspot.com/-nz-nQseCDOw/W5QWx5DOQNI/AAAAAAAAqmg/54LgTvVJTaoQyt9bhXg8jRXsnRJb9tRIQCLcBGAs/s950/_DSC9144.JPG" width="950" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-oR0abA-9NFM/W5QWx0xN4pI/AAAAAAAAqmk/XjxwBsQsb_Iho3I9pAiky-90VzdsyX_NgCLcBGAs/s1600/_DSC9154.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1067" data-original-width="1600" height="633" src="https://4.bp.blogspot.com/-oR0abA-9NFM/W5QWx0xN4pI/AAAAAAAAqmk/XjxwBsQsb_Iho3I9pAiky-90VzdsyX_NgCLcBGAs/s950/_DSC9154.JPG" width="950" /></a></div>
<span id="goog_1175989601"></span><span id="goog_1175989602"></span><br />
<br />
I took apart a cheap RC car. Bought a soldering station to desolder the wires from the motor control board. Then got a motor controller board (an L298N, a big old chip for controlling 9-35V motors with 2A current draw -- not a good match for 3V 2A Tamiya FA-130 motors in the RC car), wired the motors to it, and the control inputs to the Raspberry Pi GPIOs.<br />
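In case it's useful, the control side of an L298N channel boils down to two input pins: PWM one while holding the other low to pick direction and speed. A tiny sketch of that mapping (the helper name is my own invention; the duty cycles would go to PWM outputs, e.g. via the RPi.GPIO library, on whichever GPIO pins are wired to IN1/IN2):

```python
def l298n_duty(throttle):
    # Map a signed throttle in [-1.0, 1.0] to (IN1, IN2) PWM duty cycles
    # in percent for one L298N channel: forward PWMs IN1 with IN2 held
    # low, reverse PWMs IN2 with IN1 held low, zero throttle coasts.
    t = max(-1.0, min(1.0, throttle))
    if t >= 0:
        return (t * 100.0, 0.0)
    return (0.0, -t * 100.0)
```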
<br />
Add some WebSockets and an Intel RealSense camera I had in a desk drawer and hey FPV web controlled car that sees in the dark with a funky structured light IR pattern. (The Intel camera is .. not really the right thing for this, it's more of a mini Kinect and outputs only 480x270 video over the RPi's USB 2 bus. And apparently Z16 depth data as well, but I haven't managed to read that out.) Getting the video streaming low-latency enough to use for driving was a big hassle.<br />
<br />
Experimented with powering the car from the usual 5 AA batteries, then the same USB power bank that's powering the RPi (welcome to current spike crash land, the motors can suck up to 2A when stalling), and a separate USB power bank ("Hmm, it's a bit sluggish." The steering motor has two 5.6 ohm resistors wired to it, and the L298N has a voltage drop of almost 1.5V at 5V, giving me about 1W of steering power on USB. The original controller uses a tiny MX1508, which has a voltage drop of something like 0.05V. Coupled with the 7.5V battery pack, the car originally steers at 5W. So, yeah, a 5x loss in snappiness. Swap the motor controller for an MX1508 and replace the resistors with 2.7R or 2.2R? Or keep the L298N and use 1.2R resistors.) Then went back to the 5 AA batteries. Screw it, got some NiMHs.<br />
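The steering-power arithmetic above, spelled out. This treats the two 5.6 ohm resistors in series as the dominant load and ignores the motor winding itself, which is an approximation, but it reproduces the ~1W and ~5W figures:

```python
def steering_power_watts(supply_v, driver_drop_v, series_r_ohms=11.2):
    # P = V^2 / R with V = supply voltage minus the motor driver's drop;
    # 11.2 ohms = two 5.6 ohm resistors in series.
    v = supply_v - driver_drop_v
    return v * v / series_r_ohms

usb_l298n = steering_power_watts(5.0, 1.5)        # about 1.1 W
battery_mx1508 = steering_power_watts(7.5, 0.05)  # about 5 W
```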
<br />
Tip: Don't mix up GND and +7.5V in the L298N. It doesn't work and gets very hot after a few seconds. Thankfully that didn't destroy the RPi. Nor did plugging the L298N +5V and GND to RPi +5V and GND -- you're supposed to use a jumper to bridge the +12V and GND pins on the L298N, then plug just the GND to the RPi GND (at least that's my latest understanding). I.. might be wrong on the RPi GND part, the hypothesis is that having shared ground for the L298N and the RPi gives a ground reference for the motor control pins coming from the RPi.<br />
<br />
Tip 2: Don't wipe off the solder from the tip of the iron, then leave it at 350C for a minute. It'll turn black. The black stuff is oxides. Oxides don't conduct heat well and solder doesn't stick to it. Wipe / buff it off, then re-tin the tip of the iron. The tin should stick to the iron and form a protective layer around it.<br />
<br />
Destroyed the power switch of the car. A big power bank in a metal shell, sliding around in a crash, crush. It was used to control the circuit of the AA battery pack. Replaced it with a heavy-duty AC switch of doom.<br />
<br />
Cut a USB charging cable in half to splice the power wires into the motor controller. Hey, it works! Hey, it'd be nicer if it was 10 cm longer.<br />
<br />
Cut a CAT6 cable in half and spliced the cut ends into two RJ45 wall sockets. Plugged the other two ends into a router. He he he, in-socket firewall.<br />
<br />
Got a cheapo digital multimeter. Feel so EE.<br />
<br />
Thinking of surface mount components. Like, how to build with them without the pain of soldering and PCB production. Would be neat to build the circuit on the surface of the car seats.<br />
<br />
4-color printer with conductive, insulating, N, and P inks. And a scanner to align successive layers.<br />
<br />
The kid really likes buttons that turn on LEDs. Should add those to the car.<br />
<br />
Hey, the GPIO lib has stuff for I2C and SPI. Hey, these super-cheap ESP32 / ESP8266 WiFi boards look neat. Hey, cool, a tiny laser ToF rangefinder.<br />
<br />
Man, the IoT rabbit hole runs deep.<br />
<br />
(Since writing the initial part, I swapped the L298N for an MX1508 motor controller, and the D415 for a small Raspberry Pi camera module. And got a bunch of sensors, including an ultrasound rangefinder and the tiny laser rangefinder.)Ilmari Heikkinenhttp://www.blogger.com/profile/10857385258792531336noreply@blogger.com0