about::code - my personal blog, April 2018
Optimizing my Virtual Machine written in C++
After applying all kinds of macro optimizations to my virtual machine, I really wanted to push the limits and see how far I could get. This writeup contains some fundamental optimization techniques and some advanced strategies I investigated, which are built upon those principles. Note that this post is rather long and a deep dive! Feel free to skip to the next post if you do not care about nanoseconds in your current project! Before diving deep, let's start with a super important disclaimer that should precede every writeup about performance:
Before trying to apply any kind of optimization, make sure you understand the problem you are trying to solve. Also make sure that you have the right set of tools to measure all assumptions (especially as they will most likely turn out to be false!). Always prefer clean code unless you can prove that you found a bottleneck in your program in which it is spending most of its time. Since optimizing code usually means complicating and obfuscating it, it is only worth it when it leads to a significant speedup! Optimizing code is very hard and micro-optimizations usually have an unacceptable "performance gain vs. time invested" ratio unless you apply them to such hotspots! That being said, I hope you learn something new in the next paragraphs!
This post is mainly about the final (micro) optimizations which shave off some nanoseconds (or less) for a specific task. But I thought it might help much more to write about the macro optimizations first, since those are fundamental and might yield a much higher speedup with far less effort.
There are many good writeups out there and thus I'll just give a short summary. Feel free to skip it if you already know all about memory latency, caches, vectorization and how to prevent the pipeline from starving! Otherwise this is most likely the most important takeaway of this writeup:
Optimization tip #1 is to make your memory access linear and predictable. This allows the hardware to prefetch and cache the data efficiently. Always pay attention to the alignment of your data and try to use every bit of a loaded cache line to its full extent. The speedup clearly depends on your previous software/data architecture, but if you did not pay attention to any of those details it could very well be a factor of 10x or beyond.
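To make that concrete, here is a minimal sketch (not code from the VM, just a generic illustration): the same reduction written over an array-of-structures layout versus a structure-of-arrays layout. The SoA version streams linearly over exactly the bytes it needs, so every fetched cache line is fully used.

```cpp
#include <vector>

// Array-of-structures: each Particle is 32 bytes, but the loop below only
// needs 4 of them, wasting most of every cache line it pulls in.
struct Particle {
    float x, y, z, mass;
    float vx, vy, vz, padding;
};

float totalMassAoS(const std::vector<Particle>& particles) {
    float sum = 0.0f;
    for (const Particle& p : particles) sum += p.mass;
    return sum;
}

// Structure-of-arrays: the masses are contiguous, so the same loop streams
// through memory linearly and the prefetcher can keep up easily.
struct Particles {
    std::vector<float> x, y, z, mass;
    std::vector<float> vx, vy, vz;
};

float totalMassSoA(const Particles& particles) {
    float sum = 0.0f;
    for (float m : particles.mass) sum += m;
    return sum;
}
```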
Optimization tip #2 is to vectorize your calculations. Basically all modern processors have wide registers that can apply the same operation to multiple chunks of data at once, which is officially referred to as SIMD. The width of those registers (and thus the amount of data that an operation can be applied to) keeps increasing; at the time of writing this article, modern processors are capable of AVX-512, which can process 64 bytes (e.g. 16 float values) per instruction. The speedup depends on your previous solution, but if you are compute-bound (e.g. after applying tip #1), it could be a factor of 4x (for SSE) or above (depending on your problem and the hardware's capabilities).
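As a small illustration (assuming an x86 target with SSE support; this is not code from the VM), here is the classic element-wise addition of two float arrays, four lanes at a time. A modern compiler will often auto-vectorize the plain scalar loop, but the intrinsics make the 128-bit data path explicit:

```cpp
#include <immintrin.h>
#include <cstddef>

void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 additions in one instruction
    }
    for (; i < n; ++i)                              // scalar tail for the remainder
        out[i] = a[i] + b[i];
}
```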
Optimization tip #3 is to use multiple cores. The reason this tip is ranked third is its complexity. Writing correct parallel code is hard - even with all the abstractions (languages and frameworks) we build around it. Try tip #1 and #2 and measure whether it is still needed before thinking about applying it! If you have many sequential sections in your code (usually guarded by something like a mutex) you might not even gain any performance benefits. There are also problems like false sharing that might hurt performance so badly that they actually make your program slower! Don't confuse this with multiple threads having their own responsibilities, which is usually fine. What I mean here is that multiple cores work on the very same problem simultaneously. In my experience the best way to accomplish a good speedup is to split the data into large, equal chunks and have each of the available cores process a chunk in isolation until you're done. This mostly eliminates the need for heavy synchronization primitives (a few atomics are usually enough in that case) while also preventing invisible problems like false sharing. The speedup depends entirely on the number of physical cores and your architecture (hardware and software); a very rough estimate would be 4x and beyond on a decent PC, and expect this number to grow as more cores hit the mainstream, which is currently happening.
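Here is a minimal sketch of that chunking strategy using plain std::thread (a generic example, not the VM's code): each core reduces its own contiguous slice into a thread-local value and writes a single result at the end, so no locks or atomics are needed and false sharing is avoided.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

double parallelSum(const std::vector<double>& data) {
    const std::size_t numThreads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunk = (data.size() + numThreads - 1) / numThreads;

    std::vector<double> partial(numThreads, 0.0);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end   = std::min(begin + chunk, data.size());
            // Accumulate into a local first so threads do not repeatedly
            // write into neighbouring slots of the shared vector
            // (which would cause false sharing).
            const double local = std::accumulate(data.begin() + begin,
                                                 data.begin() + end, 0.0);
            partial[t] = local;
        });
    }
    for (std::thread& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```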
General tips: compilers need to follow strict aliasing rules so that the resulting binaries always produce the intended results. Pointer aliasing is very harmful, so try to be super supportive there by following a few rules: passing something by value into a function helps the compiler prove that it is only visible in that scope, so it can reason about reordering instructions. In cases where you know that certain inputs do not alias but your compiler cannot figure it out, you can use extensions like restrict as a last resort. Passing by reference is usually not a great idea since references are similar to pointers and other threads might use the same object as well, so the compiler must be extra careful to (re)load values that might change and loses a lot of opportunities to apply all sorts of fancy magic.
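A toy illustration of that aliasing point (generic code, not from the VM; __restrict is a common compiler extension, not standard C++): when the scale factor arrives through a pointer, the compiler has to assume it might alias the output and reload it every iteration. Passing it by value, or promising no aliasing via restrict, removes that obligation.

```cpp
#include <cstddef>

void scaleAdd(float* out, const float* in, const float* factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] += in[i] * (*factor);   // *factor may change if it aliases out[i]
}

void scaleAddByValue(float* out, const float* in, float factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] += in[i] * factor;      // factor is provably loop-invariant
}

#if defined(__GNUC__) || defined(__clang__)
void scaleAddRestrict(float* __restrict out, const float* __restrict in,
                      const float* __restrict factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] += in[i] * (*factor);   // we promise these pointers never alias
}
#endif
```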
The neat thing is that those techniques stack, which means that by applying data-oriented design to solve caching and memory issues, you might see a considerable speedup because the hardware was basically starving before. At this point you might get lucky and be compute-bound. Using SIMD now might give you another noticeable speedup, and at this point it is probably easy to apply since data should already be a key part of your design. The same might therefore also be true for using multiple cores. This means that all those speedup factors can be multiplied together to reach a throughput that can be orders of magnitude higher than the unoptimized design. Remember to measure the deployed system in both cases since there might be many effects that prevent you from achieving the theoretical speedup (e.g. other processes running on the same physical machine that interfere with the cache).
Now with all that being said, let's take a step towards micro-optimizations - the kind of optimizations that are only relevant if you have already applied tip #1 and tip #2 to your problem (as well as the general tips, of course). The optimizations discussed here are very specific to my problem, which was designing a fast and secure virtual machine. Also, the impact of those micro-optimizations might not seem big (around a few percent each, hence the name), but keep in mind that the measurements were taken against the already optimized code (all tips above applied). If I had applied them even before #1, they would probably not speed up the program in any way (but they would probably make it starve faster, I suppose).
With all those disclaimers given, let's investigate what those micro-optimizations applied to my virtual machine look like:
All my opcodes were designed to be 4 bytes large. A key part of writing a fast virtual machine is to interpret them efficiently. Opcodes are basically just numbers, and depending on some bits they might mean something like: add the values of register 4 and register 3 using int arithmetic and store the result in register 2. A different opcode might tell the VM to load a certain number into a specific register or to push it on the stack. I'll give you a few more example categories: jump to a certain address (which might be part of the opcode or maybe read from a register or the top of the stack) if certain flags are set, or divide the value of register 8 by the value of register 4 using float arithmetic and store the result in register 3 (which means the bits of the numbers are interpreted differently, e.g. a half value that gets decompressed into a float value to calculate the result, which is then converted and stored as a half again). Finally it could be something like reading an external source (like a gamepad or clock) and of course writing to an external source, like drawing something or playing audio.
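The post does not spell out the exact encoding, so the following is purely a hypothetical layout to illustrate the idea of a 4-byte opcode with a category nibble, register indices and an immediate low word:

```cpp
#include <cstdint>

using Opcode = std::uint32_t;

// Assumed bit layout, for illustration only:
//   [31..28] category nibble    [27..24] destination register
//   [23..20] source register A  [19..16] source register B
//   [15..0]  immediate low word
constexpr std::uint32_t category(Opcode op)  { return (op >> 28) & 0xFu; }
constexpr std::uint32_t dst(Opcode op)       { return (op >> 24) & 0xFu; }
constexpr std::uint32_t srcA(Opcode op)      { return (op >> 20) & 0xFu; }
constexpr std::uint32_t srcB(Opcode op)      { return (op >> 16) & 0xFu; }
constexpr std::uint16_t immediate(Opcode op) { return static_cast<std::uint16_t>(op & 0xFFFFu); }
```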
The naive way to accomplish this would be to write all those cases as a chain of if (that one condition and that other condition and those flags) do this, else if (this and that) do something else. But yeah, that would be very slow and tedious to write. So my first approach was to split on the category of the opcode using a switch statement with 16 cases (one for each category like arithmetic, bitwise operations, calls, loading and storing values, external sources and so on). There was a 4-bit nibble in the opcode that allowed me to extract this category efficiently (using a shift followed by a bitwise and). Inside such a handler (I called them opcode executors) I would use parts of the opcode directly, e.g. to index into some specific register or to push the low word on the stack. That way the code became pretty much branch-free, except for the 'large' switch/case block at the beginning - and here is where I applied my first micro-optimization: the compiler failed to generate efficient code here and I also thought that using such a giant switch/case statement is ugly. So I used a static constexpr array of member-function pointers (that all had the same signature), each pointing to a specific opcode executor. Needless to say, I ordered all 16 of them in ascending order and used the opcode nibble to index into that array to call the specified opcode executor directly. This allowed the compiler to generate very neat code while I could still guarantee that I'll never access the array out of bounds and that all those function pointers are valid (of course I did automated testing to ensure that I would not mess up any entries). This kind of dispatching (with otherwise equal code and an identical test binary for the VM that was just doing some intense math and jumping) led to a speedup of about 1ns per executed opcode.
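A sketch of that dispatch, with hypothetical names and only a couple of executors filled in: a static constexpr table of member-function pointers indexed by the category nibble, so dispatching boils down to a shift, a mask and one indirect call.

```cpp
#include <array>
#include <cstdint>

class VirtualMachine {
public:
    void execute(std::uint32_t opcode) {
        const std::uint32_t category = (opcode >> 28) & 0xFu; // always in [0, 15]
        (this->*kExecutors[category])(opcode);                // one indirect call, no switch
    }

    std::uint64_t executedArithmetic = 0;
    std::uint64_t executedBitwise    = 0;

private:
    using Executor = void (VirtualMachine::*)(std::uint32_t);

    // Illustrative executors; the real VM has one per category, all sharing
    // this exact signature so they fit into a single table.
    void executeArithmetic(std::uint32_t)    { ++executedArithmetic; }
    void executeBitwise(std::uint32_t)       { ++executedBitwise; }
    void executeUnimplemented(std::uint32_t) { /* reserved / unused category */ }

    // Ordered by category nibble so the nibble indexes the table directly.
    static constexpr std::array<Executor, 16> kExecutors = {
        &VirtualMachine::executeArithmetic,
        &VirtualMachine::executeBitwise,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
        &VirtualMachine::executeUnimplemented, &VirtualMachine::executeUnimplemented,
    };
};
```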
I think this is the point where you might think "One nanosecond, that's ridiculous, I'm wasting my time here!" but keep in mind that the code was already optimized before. This means that by executing one million opcodes my time went from an average of 6.3ns to 5.5ns per opcode on my machine (which is a laptop). That is a 14.5% speedup. Note that an iteration in that case was not simply done by using a raw loop with a million iterations but by benchmarking the official API: it allows you to run the VM for a maximal specified amount of cycles or until either an error occurs or a frame is ready. An error might either be a stack error (over- or underflow) or an unknown opcode/extension (not all categories are needed right now, and there were a few opcodes that did not require every bit, so those bits are expected to be 0, with the possibility of indicating a later extension which is not known to the current implementation). So you can safely call this in a production environment since the VM will halt if there was an error, or you can run it until a frame is ready without having to fear that it will block your program forever (e.g. if there was an endless loop in the binary, which is proven to be impossible to detect in the general case, so I thought this might be a good option). Since the binary for testing did all kinds of arithmetic using registers, no push or pop was required. Jumping to an address and executing code from there was also always valid, so there were basically no operations that could raise any kind of error and it always executed the whole 1'000'000 cycles.
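For context, here is a hedged sketch of how a host program could drive such an interface - all names and types here are my guesses for illustration, not the VM's actual API:

```cpp
#include <cstdint>
#include <cstdio>

enum class StopReason { CyclesExhausted, FrameReady, StackError, UnknownOpcode };

struct RunResult {
    StopReason    reason;
    std::uint64_t executedCycles;
};

// Stand-in for the real VM: pretends every cycle succeeds and a frame is
// ready after the full budget, just so the host loop below is runnable.
struct FakeVm {
    RunResult run(std::uint64_t maxCycles) {
        return RunResult{StopReason::FrameReady, maxCycles};
    }
};

int main() {
    FakeVm vm;
    for (int frame = 0; frame < 3; ++frame) {
        const RunResult result = vm.run(1'000'000);
        if (result.reason == StopReason::StackError ||
            result.reason == StopReason::UnknownOpcode) {
            std::puts("VM halted with an error");
            break;
        }
        std::printf("frame %d ready after %llu cycles\n", frame,
                    static_cast<unsigned long long>(result.executedCycles));
    }
}
```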
Speaking of errors: the second slowdown was harder to track down. The API supports two ways of executing a cycle. The first will return the executed opcode or throw a std::runtime_error when an error occurred. The second version returns a std::optional of the opcode. The idea was that this interface can be used in an environment where exceptions are not allowed. But the second version was much slower at first, although almost all of the code was identical except for the error cases. After rolling my own fine-tuned implementation of an optional (for simple POD types) I noticed that the problem was related to the ternary operator which I used to construct the optional with. I rolled my own branch-free ctor implementation and it ran equally fast as the version using exceptions (I only benchmarked cases where no exceptions occurred since this is the general case that I wanted to optimize for). I knew I was on the right track, but rolling your own version of something that is part of the STL is a very bad idea. To cut it short, replacing the ternary operator with a branch (for the error case that was basically never taken) made the version using a std::optional as fast as the version using exceptions. That might seem counter-intuitive - I thought the ternary operator would compile into a cmov instruction that would easily outperform a branch, but that was not the case on my hardware. At this point I can only speculate why this code is better (successful speculative execution and correct branch prediction? cmov blowing the instruction cache?) but the real takeaway here is that you should always measure instead of trusting your gut, since some results might even be counterintuitive.
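A simplified sketch of the two formulations (hypothetical signatures, not the real cycle function): both return an empty optional on error, but the second spells the rarely taken error path out as an explicit branch instead of folding it into the ternary that constructs the optional.

```cpp
#include <cstdint>
#include <optional>

using Opcode = std::uint32_t;

std::optional<Opcode> cycleTernary(bool error, Opcode executed) {
    // The error flag feeds straight into the ternary that builds the optional.
    return error ? std::nullopt : std::optional<Opcode>(executed);
}

std::optional<Opcode> cycleBranch(bool error, Opcode executed) {
    // The almost-never-taken error path becomes an easily predicted branch.
    if (error)
        return std::nullopt;
    return executed;
}
```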
Optimizing is always a trade-off, like trading memory for compute or trading time and code quality for speedups. The important part is to be fully aware of all constraints to make the right decisions within them!
Constraints differ across platforms: accessing memory on a modern desktop PC has relatively high latency, so it might sometimes be cheaper to compute a value from the available data each time it is needed rather than storing it (especially in memory-bound environments where storing that extra information would increase memory pressure or destroy the alignment). On a microcontroller, memory access is relatively fast, so it might be worth storing those extra bits of data to save precious CPU cycles.
The takeaway here is that you need information to optimize within constraints! If I know that my virtual machine will execute a class of opcodes regularly, I can try to speed them up and every percent that I gain will eventually show up. If I blindly optimize some exotic opcodes that are never used in real code, the speedup will be zero even if I managed to make executing that opcode 10x faster. Make sure your time for optimizing is well spent! Be skeptical about micro-benchmarks as they might not necessarily tell you about the performance gains in the deployed application. Collect as much data as you can to identify the current bottlenecks so that your time spent makes a real impact. Hope that helped!