GPerfTools CPU Profiler
I’ve recently had reason to try to profile some C++ and C code (living in the same program), most of which was to be loaded by a call to dlopen. The first attempt was a simple use of gprof. That failed. Badly. I swung over to StackOverflow, looked for some experiences with gprof, and walked away with the sentiment, “Just don’t use it.” Even when I got it working, the dlopen-loaded library was ignored. Very frustrating. Just don’t use gprof. At least not today.
A coworker noted that gperftools from Google existed, but didn’t know if it was simply a reroll of another profiler (perhaps gprof?) or if it solved the dlopen issue I was having. Happily, it does seem to produce profiling results for everything loaded into the program image, provided that the main program and the dlopen-loaded library are all linked with -lprofiler.
It also appears that the perftools libraries avoid absolute timing, taking instead a statistical approach to capturing profiling data. The approach works like this.
How Does it Work?
Every unit of time (assume it’s a millisecond), the perftools library stops the program, snapshots the execution stack in each thread, stores that information, and resumes normal execution. The resultant data is a statistical sample of what was executing over the lifetime of the program. The trouble spots will be those functions that (typically) are at the top of the call stack most often.
In other cases the problematic function may be in the call stack but is not itself often found at the top. It is delegating its work to an expensive function, or spreading it across several expensive functions. Or you are simply computing a hard result and optimization isn’t an option.
Regardless, sampling in this manner allows you to compute the probability that you are in function f() at any point during the program’s execution. Multiply that probability by the runtime of the program and you get the wall-clock time that the function took up. Interestingly, we don’t really care about the absolute time that the hypothetical function f() takes; we care that it constantly shows up on our call stack when we randomly stop our program.
Another benefit of capturing performance data in this manner is that we can construct call graphs with probabilities to imply program flow. This allows us to optimize code paths, or to observe unexpected code paths if we can only produce a call graph. Happily, the pprof program provides SVG files as one of its outputs (among others).
Finally, we can tweak the sample rate up and down to get finer or coarser profiling data. This is not something that would push me to another technology, but it is a nice-to-have.
How Do You Use It?
As mentioned above, link everything with -lprofiler. If you are using autoconf, then ./configure LDFLAGS=-lprofiler will take care of things for you.
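If you are not using autoconf, manual linking might look like the following (hypothetical file names; assumes gperftools is installed where the linker can find it):

```shell
# Link the main program against the profiler (and libdl for dlopen).
g++ -o my_program my_program.cpp -ldl -lprofiler

# The dlopen-loaded library gets linked with the profiler too.
g++ -shared -fPIC -o my_plugin.so my_plugin.cpp -lprofiler
```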
Then, do not run your program as root; this is a protection put in by our security-minded Googlers. Do run your program with the environment variable CPUPROFILE=/tmp/my_program.prof set. This causes the program to write its profiling information out to /tmp/my_program.prof on a clean exit.
Run your program and have it exit cleanly. Killing it results in an empty file. Don’t do this. Be nice to your program.
Now you can extract the collected profiling information by running something like pprof --text ./my_program /tmp/my_program.prof. Or you can replace --text with --svg and get an SVG file printed to standard out. Redirect that to a file and you can open it in your web browser and read the call graph.
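Putting the extraction commands together (assumes pprof is on your PATH and the paths used above):

```shell
# Text report to the terminal.
pprof --text ./my_program /tmp/my_program.prof

# Call graph as SVG, redirected to a file for the browser.
pprof --svg ./my_program /tmp/my_program.prof > my_program.svg
```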
Very useful stuff. Very easy to wire into your build systems (link, run with the environment variable set, extract results). I recommend having your continuous integration system run this and check the 5 longest-running functions. If that list changes, a human should confirm that it’s OK and adapt the list. But that’s me.
How Do You Read the Results?
One last short section. You’ll notice, most obviously in the --text output from pprof, that there are 2 counts in the output (the whole output is documented here). The first count is the number of times that the listed function was at the top of the call stack. The second count is the number of times that the function was in the call stack but had called out to another function.
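For orientation, the --text output is shaped roughly like this (hypothetical functions and counts; the columns are samples at the top of the stack, that as a percentage, a running cumulative percentage, samples anywhere in the stack, that as a percentage, and the function name):

```
Total: 1200 samples
     300  25.0%  25.0%      900  75.0% f
     120  10.0%  35.0%      120  10.0% g
```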
Percentages are provided for global context with other function calls. If we’re talking about a function that only executes 1% of the time, you probably don’t care that much about it.
Only you, the developer, can really tell if a large intermediate number is valuable or a red herring. Likewise, only you can make a good guess as to whether a large drop in the second number from function f() to some function higher on the call stack matters or not. It’s a valuable clue if f() appears 100 times but g(), which it calls, appears only 10 times. That would seem to indicate that g() is not the big time consumer under the call graph from f(). But who knows? It’s a clue you have to explore.
Like most advanced tools, you need to do a little learning and a little hacking to make the best use of this. This tool, however, is invaluable for speeding up programs. I’ve used the equivalent in Java and can’t tell you how happy I am that something close exists for C and C++! Enjoy!