As I usually say, in my field the good book (or site) is the one that inspires you to develop, that fires you up with new ideas for your projects. Here is one book that does exactly that.

"Optimizing software in C++" is not what we could call a real book. It has no ISBN, the unique reference I had to it was on internet and I couldn't find any hard cover version. Anyway it is an electronic format reading that really deserves a hard cover and any honor that is given to a book.

While developing a project with strict performance requirements, I started looking on the internet for ways to improve my software even further. I had already balanced the multi-threading design for the best throughput, network bottlenecks were no longer an issue, and the I/O configuration of the Linux servers was already working for me after some adjustments. That left only the CPU to work on in order to improve performance and reach my target number of channels.

Well, as I said in a previous post, I found lots of partial information on the subject, but nothing that explained all the reasons well. Until I found the PDF "Optimizing software in C++" here.

This book goes deep into software optimization, considering every clock cycle that could be saved in your implementation. Since I was already used to firmware development, that was not a problem for me, but most people would say it is not worth caring about a few clock cycles. Well, depending on the project you are working on, a few clock cycles can make a big difference.

As an example, my project was designed to generate one network packet per channel every 20 milliseconds, and the same software had to handle one incoming network packet per channel every 20 milliseconds. That still does not seem like much, right? But imagine my case, where I was trying to handle 4000 channels on a 2 GHz, 8-core CPU.
That many channels required processing at least 8000 packets (generated and received) every 20 milliseconds. Doing the math, that leaves about 5000 clock cycles to process each packet; does that seem like enough? Well, considering that I had to compress/decompress data, protect/unprotect data using OpenSSL, move buffers and pack/unpack them for other data layers before sending/receiving, and apply digital signal processing algorithms to the received information... it really is not many clock cycles. I felt like I was developing embedded firmware again (despite the huge amount of available memory, which is not common in embedded systems). It should also not be forgotten that many threads were competing for processor time, generating task-switch overhead, and that other parts of the software were not directly related to packet handling. Thinking about all of that, 5000 cycles were definitely too few, and I had to improve the CPU usage!
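
Just to make the arithmetic explicit, here is the back-of-the-envelope calculation behind that 5000 figure (my own rough numbers, assuming pessimistically that a single 2 GHz core has to absorb the whole packet load):

```cpp
#include <cstdio>

int main() {
    // One 2 GHz core delivers 2e9 * 0.020 = 40 million cycles per 20 ms window.
    const double cycles_per_window  = 2.0e9 * 0.020;
    // 4000 channels, one packet out and one packet in each = 8000 packets.
    const double packets_per_window = 8000.0;
    // ~5000 cycles available for each packet on that core.
    std::printf("%.0f cycles per packet\n", cycles_per_window / packets_per_window);
}
```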

I already had a good understanding of many things regarding performance improvement, but reading this book helped me organize my ideas, focus on what is really relevant, and many "whys" were much better explained. After all, many performance concepts were already in my mind, but if you cannot translate a thought into words, you do not really master it, and so you cannot use it to your advantage.

Finally, how did I apply some of the lessons learned from this book in my project?

I started by implementing the memory considerations described in the book. Since memory limitations were not an issue for my project, I decided to create pools, allocating on the heap all the memory used in the critical areas of the code, so that the parts of the code that must have a real-time response never allocate memory (there is a small sketch of this idea a bit further below). Although memory was not an issue, I also had to minimize structure sizes in order to take advantage of the cache memory. If I wasted memory on irrelevant things, I would lose cache efficiency. As the book explains:

"Reading or writing to a variable in memory takes only 2-3 clock cycles if it is cached, but several hundred clock cycles if it is not cached"

You can feel this cache problem very clearly when programming for a GPU, where there is an entire mechanism designed just for memory access and cache efficiency.
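
Just to illustrate the pool idea mentioned above, here is a minimal sketch (names and sizes are mine, not from the real project): every buffer is allocated once at initialization, and the packet-handling code only takes buffers from and returns buffers to a free list, so the real-time path never touches new/delete.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size buffer pool: all memory is allocated up front,
// so the packet-handling (real-time) path never calls new/malloc.
class PacketPool {
public:
    PacketPool(std::size_t count, std::size_t packet_size)
        : storage_(count * packet_size) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * packet_size);
    }

    // Called from the critical path: O(1), no allocation.
    unsigned char* acquire() {
        if (free_.empty()) return nullptr;   // pool exhausted; caller decides
        unsigned char* p = free_.back();
        free_.pop_back();
        return p;
    }

    // O(1) as well: just put the buffer back on the free list.
    void release(unsigned char* p) { free_.push_back(p); }

private:
    std::vector<unsigned char>  storage_;    // one big block, allocated once
    std::vector<unsigned char*> free_;       // buffers currently available
};
```

In a multi-threaded design you would of course need one pool per thread, or some protection around it, but that is a separate discussion.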

And then comes the question: "Who would waste memory on irrelevant things?"

Well, if your memory structures are not well organized, some bytes are left unused due to alignment, so I started reordering my structures to make them more cache friendly. This is all well explained in the book, and it makes a huge difference in how you organize your structures.
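
A tiny (hypothetical) illustration of the alignment point: both structs below carry exactly the same fields, but the member order changes how much padding the compiler inserts, and therefore how many of these fit in a cache line.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical channel descriptor, just to show the effect of member order.
struct BadLayout {
    std::uint8_t  flags;      // 1 byte + 7 bytes padding before the next field
    std::uint64_t timestamp;  // 8 bytes
    std::uint16_t seq;        // 2 bytes + 6 bytes padding
    std::uint64_t bytes_sent; // 8 bytes
};                            // typically 32 bytes on x86-64

struct GoodLayout {
    std::uint64_t timestamp;  // 8 bytes
    std::uint64_t bytes_sent; // 8 bytes
    std::uint16_t seq;        // 2 bytes
    std::uint8_t  flags;      // 1 byte + 5 bytes padding at the end
};                            // typically 24 bytes on x86-64

int main() {
    std::cout << sizeof(BadLayout) << " vs " << sizeof(GoodLayout) << '\n';
}
```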

After that, I checked whether I could remove some decision points from the critical areas of the code. I removed any "if" statement that could be evaluated during channel initialization instead of in the critical code (packet handling), in order to avoid branch misprediction penalties. Of course, I could not eliminate all of them, but where it was possible I achieved some improvement.
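
A hedged sketch of what I mean by moving the decision out of the critical path (the names are illustrative, not my real code): the question "is protection enabled for this channel?" is answered once at channel initialization and stored as a function pointer, so the per-packet code has no "if" to mispredict.

```cpp
#include <cstdio>

struct Packet { unsigned char data[64]; };

typedef void (*ProtectFn)(Packet&);

void protect_encrypt(Packet& p) { /* would call into OpenSSL here */ (void)p; }
void protect_plain(Packet& p)   { /* nothing to do */ (void)p; }

struct Channel {
    ProtectFn protect;  // chosen once at initialization
};

void channel_init(Channel& ch, bool encryption_enabled) {
    // The "if" lives here, executed once per channel...
    ch.protect = encryption_enabled ? protect_encrypt : protect_plain;
}

void handle_packet(Channel& ch, Packet& p) {
    ch.protect(p);      // ...so the hot path is a plain indirect call
}

int main() {
    Channel ch;
    Packet  p{};
    channel_init(ch, true);
    handle_packet(ch, p);
    std::printf("done\n");
}
```

An indirect call has its own cost and blocks inlining, so it is not a free lunch; the point is simply that the decision is made once per channel instead of once per packet.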

Some other minor improvements were implemented as well, and many others I didn't apply since I had already reached my goal. Some of them are really complex, mainly the ones regarding cache and memory-page organization. The book also covers compiler characteristics, but it would be too much to describe here in a post that is already too long.

Perhaps you would relate all these things more with pure C than with C++ programming. That was my feeling too, and my project was actually not using C++, but the concepts covered still apply to C++ as well.

Well, some design-pattern and OOP fanatics would go crazy with some of what is written in the book; I prefer to adopt a balanced approach according to the requirements of the project. Take a look at the excerpt below:

"University courses in programming nowadays stress the importance of structured and object-oriented programming, modularity, reusability and systematization of the software development process. These requirements are often conflicting with the requirements of optimizing the software for speed or size.
Today, it is not uncommon for software teachers to recommend that no function or method should be longer than a few lines. A few decades ago, the recommendation was the opposite: Don't put something in a separate subroutine if it is only called once. The reasons for this shift in software writing style are that software projects have become bigger and more complex, that there is more focus on the costs of software development, and that computers have become more powerful."

That is really true, but we don't need to go too far to either side.

A book that deals very well with that, and takes a more balanced approach to OOP with C++, design patterns, and performance, is "Efficient C++ Performance Programming Techniques", which I am reading right now and which seems wonderful. I will talk about it in a later post. As an appetizer from there: "Software perfection means you compute what you need, all of what you need, and nothing but what you need."

You can find lots of resources about optimization on the author's site, including assembly optimization if you need it: www.agner.org/optimize

It's worth taking a look there.