Parallelizing your application
This is the famous no free lunch conundrum: if we want our program to run faster, it's not enough to buy a faster processor—we have to restructure it to be parallel and to use more processor cores! We'll have a closer look at these techniques in Chapter 5, An In-Depth Guide to Concurrency and Multithreading, when we'll discuss multithreading.
As for vector processing, we have to notice that, for SIMD instructions to be performant, the loaded data usually has to be aligned. Apart from that, as of today, any decent modern compiler will try to vectorize the code, so normally, we don't have to bother. However, the compiler will generate code for a specific architecture, but maybe you'd like to run your program on several processor and look for the more advanced extended instruction sets available? This is possible, and techniques for doing that are known as CPU dispatching.
The second theme related to parallelism to be mentioned here is the out-of-order execution. First, sometimes, we can encounter advice to break too long dependency chains so as to allow reordering, as was shown in a previous diagram. Arguably, this is a low-level technique, but sometimes every nanosecond may count.
Another theme is that there could be a time where we would like to disable instruction reordering. What could it be? Right—when synchronizing among threads, we have to know exactly in which order which variables were read or written. On the processor level, this can be forced with memory barriers, but this will prevent the possible reordering optimizations. That is the reason synchronization in a multithreaded program is already expensive on the processor level.