Missing/Moved version.h?
Upgrade to Fedora 18 today and found that /usr/include/linux/version.h was not included in the /usr/src/kernels/`uname -r`/includes/linux directory. The result was VMware 8.0.5 couldn’t determine if it should build or install its modules.
Copying the above existing file into the source tree worked, though.
Guice + Jersey + Shiro With No Web.xml
The habit or practice of this blog has been to publish to the Internet non-obvious bit of technology. You would think that combining Jersey, Shiro, and Guice would be a bit more strait forward, and to be fair, it was very strait forward, but in my judgement still makes for a good blog post.
So, how do you wire up a Jersey JSON service using Guice and protect it with Shiro? Both Jersey and Shiro support Guice.
The process I chose to take was to first setup a web service with Guice and Jersey, and get that running in Apache Tomcat 7. One of my constraints was to not use a web.xml and use only the Servlet 3.0 annotations to start the process.
Guice and Jersey
Aside from constantly putting the “J” from jersey in the place of the “G” in Guice, this part of the process was very strait forward. There is some documentation in the Jersey manual, but there is better documentation in the package documentation for the java doc for the jersey-guice maven package found here. Why is this very useful documentation buried so? I don’t know.
First, click that last link and follow along. You will have to make an empty class that extends com GuiceFilter so you can annotate it with @WebFilter(“/my/url/*”) from the new servlet 3.0 annotation. Notice that the GuiceFilter class is in the Guice 3.0 code base.
Next extend the class GuiceServletContextListener, also a Guice code base member. This class is our entry point, if you will. You will have to override getInjector(). Inside of getInjector() the documentation shows that you create an anonymous instance of JerseyServletModule(), a Guice Module, and with that create the Guice Injector. We will do much the same thing with Shiro in a moment.
I had to add the line
params.put("com.sun.jersey.api.json.POJOMappingFeature", "true");
to enable conversion to JSON before my calls to bind() and serve(“/my/url/*”).with(GuiceContainer.class, params).
Mixing in Shiro
As of 1.2 you can add the shiro-guice maven dependency. This is where I spent a bit of time trying to think through what was happening. My natural inclination was to try and setup a filter, but Shiro will do this for you, and you have to trust that it will get it right.
We will have to implement a ShiroWebModule just as we implemented a JerseyServletModule. I chose to put this in a separate class file so as to isolate my authentication code a bit more. Before looking into that class, in the original JerseyServletModule I put the line ShiroWebModule.bindGuiceFilter(“/my/url/*”, binder()) per the Shiro Guice integration documentation. When you read that link, be sure to read to the end of the page. There is Guice integration and then there is Web Integration. We are looking at the Web integration.
Inside my child class of ShiroWebModule I implemented configureShiroWeb and I also made a constructor that takes a ServletContext and passes that to the super constructor. No magic yet.
In the configureShiroWeb method I built a SimpleAccountRealm and then called bindRealm().toInstance(simpleAccountRealm) and then called addFilterChain(“/my/url/*”, AUTHC_BASIC). That is all you need to do to the define the module. Now you have to instantiate it and add it to the list of modules that create the Guice Injector back in the Context Listener class we were just working on.
Back in the child class of GuiceServletContextListener, you’ll notice that there is no great way to get the ServletContext. I chose to override the contextInitialized method and, before calling super.contextInitialized, I copied the ServletContext inside the ServletContextEvent into a member field. When getInjector eventually fires the servletContext is populated and I can instantiate the ShiroWebModule child class we just wrote.
Unexpected Extras
For whatever reason, commons logging was not installed and is necessary to instantiate some of the beans associated with Shiro. If you see a stack trace in your Tomcat log, perhaps that’s the issue, and you just need to add the commons-logging package.
hash(key, bucket) or just hash(key)?
Currently I am picking up Riak by the folks over at Basho.com. Put briefly, Riak is a key/value storage engine that distributes its values by hashing the keys. By Riak hashing the keys in a consistent way (unlike something like Cassandra which allows the user to pick how a key is distributed around the cluster) it avoids hot-spots. One of the frequent complaints I’ve observed about Cassandra deployments is that as the data in that cluster grows, certain nodes in the cluster become hotspots and the keyspace needs to be re-balanced across the nodes. Riak largely avoids this in not letting you change how data is distributed.
A downside to this, though, is that in those situations where you want bits of data to live together on the same node, they can’t. If you want to run a map/reduce job that combines records, you are simply going to have a lot of network traffic loading and storing to other Riak nodes.
I thought I had a clever algorithm worked up where I could store the same keys in different buckets and have a good statistical chance that they would be on the same node, but I read here, “Riak computes a 160-bit binary hash of the bucket/key pair.”
Bummer.
Two lessons from this. First, subtleties of whether you hash the bucket+key or just the key matter in subtle ways. Second, Riak seems to have a high impedance to high-throughput operations on data that is, what I’m going to call, “bite-sized.” To be fair, the transaction rate, specifically the read-rate, in early experimenting, looks great.
GeoIP 1.4.8 and Fedora 17
I’m endlessly impressed with the speed of the GeoIP library from MaxMind. Having said something nice, let me be a little critical: It trivially fails to compile on Fedora 17. How do I mean trivially?
./configure make Making all in libGeoIP make[1]: Entering directory `/mnt/archive/sbaskinger/Downloads/GeoIP-1.4.8/libGeoIP' /bin/sh ../libtool --tag=CC --mode=compile gcc -DPACKAGE_NAME=\"GeoIP\" -DPACKAGE_TARNAME=\"GeoIP\" -DPACKAGE_VERSION=\"1.4.8\" -DPACKAGE_STRING=\"GeoIP\ 1.4.8\" -DPACKAGE_BUGREPORT=\"support@maxmind.com\" -DPACKAGE_URL=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -D__EXTENSIONS__=1 -D_ALL_SOURCE=1 -D_GNU_SOURCE=1 -D_POSIX_PTHREAD_SEMANTICS=1 -D_TANDEM_SOURCE=1 -DPACKAGE=\"GeoIP\" -DVERSION=\"1.4.8\" -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_USHORT_TYPEDEF=1 -DHAVE_ULONG_TYPEDEF=1 -DLITTLE_ENDIAN_HOST=1 -DHAVE_STDINT_H=1 -DHAVE_ZLIB_H=1 -DHAVE_GETTIMEOFDAY=1 -DHAVE_VASPRINTF=1 -DHAVE_VSNPRINTF=1 -DHAVE_VSPRINTF=1 -DHAVE_GETHOSTBYNAME=1 -DHAVE_GETHOSTBYNAME_R=1 -DGETHOSTBYNAME_R_RETURNS_INT=1 -I. -DGEOIPDATADIR=\"/usr/local/share/GeoIP\" -Wall -g -O2 -MT GeoIP.lo -MD -MP -MF .deps/GeoIP.Tpo -c -o GeoIP.lo GeoIP.c mv -f .deps/GeoIP.Tpo .deps/GeoIP.Plo mv: cannot stat `.deps/GeoIP.Tpo': No such file or directory make[1]: *** [GeoIP.lo] Error 1 make[1]: Leaving directory `/mnt/archive/sbaskinger/Downloads/GeoIP-1.4.8/libGeoIP' make: *** [all-recursive] Error 1
Sigh. What is interesting is that all the .Plo files already exist inside of libGeoIp/.deps but contains the text #dummy.
In digging a bit deeper, the #dummy line is expected. Beyond that, though, I didn’t find much in the way of reasons for the failure. While I’m not sure what the fundamental failing is, I do know that GeoIP is a pretty simple project, from a compilation standpoint, so re-running libtoolize and ./configure seemed a good stab-in-the-dark solution. It works!
In checking out the libtool command, it is identical. The change, whatever it was, is deep in libtool or one of the generated Makefiles. Mystery solved, though.
GPerfTools CPU Profiler
I’ve recently had reason to try and profile some C++ and C code (living in the same program), most of which was to be loaded by a call to dlopen. The first attempt was a simple use of gprof. That failed. Badly. I swung over to StackOverflow and looked for some experiences with gprof and walked away with the sentiment, “Just don’t use it.” Event when I got it working the dlopen loaded library was ignored. Very frustrating. Just don’t use gprof. At least not today.
A coworker noted that gperftools from Google existed, but didn’t know if it was simply a reroll of another profiler (perhaps gprof?) or if it solved the dlopen issue I was having. Happily, it does seem to produce profiling results for everything loaded into the program image, provided that the main program and the dlopen-loaded library are all linked with -lprofiled.
It also appears that the perftools libraries avoid absolute timing, but rather take a statistical approach to capturing profiling image. The approach works like this.
How Does it Work?
Every unit of time (assume it’s a millisecond) the perftools library will stop the program, snapshot the execution stacks in each thread, store that information, and resume the normal execution. The resultant data is statistical sampling of what was executing over the lifetime of the program. The trouble spots will be those functions who (typically) are at the top of the call stack most often.
In other cases the problematic function may be in the call stack, but is not itself often found to be at the top. It is poorly delegating its work to expensive function or delegating its work to expensive functions. Or, you are simply computing a hard result and optimization isn’t an option.
Regardless, sampling in this manner allows you to compute the probability that you are in function f() at any time during the program execution. Multiply that probability by the runtime of the program and you will get the wall-clock time that the function took up from your execution. Interestingly, we don’t really care about the absolute time that the hypothetical function f() takes, we care that it constantly shows up on our call stack when we randomly stop our program.
Another benefit of capturing performance data in this manner is that we can construct call graphs with probabilities to imply program flow. This allows us to optimize code paths or observe unexpected code paths if we can only produce a call graph. Happily, the pprof program provides as one of its outputs SVG files (and some others).
Finally, we can tweak up and down the sample rate to get a finer or more coarse resolution of profiling data. This is not something that would push me to another technology, but it is a nice-to-have.
How Do You Use It?
As mentioned above, link everything with -lprofiler. If you are using autoconf, then ./configure LDFLAGS=-lprofiler will take care of things for you.
Then, do not run your program as root. This is a protection put in by our security-minded Googlers. Do run your program with the environment variable CPUPROFILE=/tmp/my_program.prof set. This will cause the program to write its profiling information out to /tmp/my_program.prof at a clean exit.
Run your program and have it exit cleanly. Killing it results in an empty file. Don’t do this. Be nice to your program.
Now you can extract the collected profiling information by running something like pprof --text ./my_program /tmp/my_program.prof. Or you could replace --text with --svg and get an SVG file printed to standard out. Redirect this to a file and you can open it in your web browser and read the call graph.
Very useful stuff. Very easy to wire into your build systems (link, run with ENV set, extract results). I recommend you have your continuous integration system run this and check the 5 longest running functions. If that list changes, a human should confirm that it’s OK and adapt the list. But that’s me.
How do you Read the Results?
One last short section. You’ll notice, most obviously in the –text output from pprof, that there are 2 counts in the output (the whole output is documented here). The first count are the number of times that the function listed was at the top of the call stack. The second count is the number of times that the function was in the call stack, but it had called out to another function.
Percentages are provided for global context with other function calls. If we’re talking about a function that only executes 1% of the time, you probably don’t care that much about it.
Only you, the developer, can really tell if a large intermediate number is valuable or a red herring. Likewise, only you, the developer, can make a good guess if a large drop in one of the second number from function f() to some function higher on the call stack matters or not. It’s a valuable clue if f() appears 100 times but when it calls g(), g() only appears 10 times. That would seem to indicate that g() is not the big time consumer under the call graph from f(). But who knows? It’s a clue you have to explore.
Like most advanced tools, you need to do a little learning and a little hacking to make the best use of this. This tool, however, is invaluable for speeding up programs. I’ve used the equivalent in Java and can’t tell you how happy I am that something close exists for C and C++! Enjoy!
resize2fs Impressed
I’ve been using a 2TB drive for storing some research databases, but with a new eSATA connector wanted to put a Linux distribution on it.
The bad news was that Fedora 16 didn’t want to install on my near-empty, but not totally empty, and not expendable data partition. However, being laid out on the drive with LVM I could move things around, and I have to say I was very impressed with how smoothly things went.
First, we resize my ext4 partition down to 100GB:
resize2fs /dev/mapper/greendrive-lvol0 100G
Easy, and much faster than moving that data to another SATA drive.
Now, we resize the logical volume that’s holding the partition.
lvresize -L 100G /dev/mapper/greendrive-lvol0
Answer “yes” to the really scary question about losing all my data, and we’re done!
The installer is, as I type this, putting my Fedora 16 data on a new logical volume sliced out of the same physical volume that houses my data. I may retire the data partition and fold that space into the new /dev/mapper/greendrive-lvol1 volume that is soon-to-be my new workstation image.
It’s sadly rare that technology just works, but when it does, it’s amazing. Well done!
Playing With Assembly
We were talking at my work about machine instructions available on new Intel chips, specifically the cache miss counters. Why we were talking about cache miss counters is a longer tangent, but for the purposes of this tangent, we started toying with assembly and accidentally learned a few things.
First, we wrote a trivial main.c to drive our program:
#include
// Promise C that we will link against externally defined functions (f and g)
// with these signatures.
extern int f(int i);
extern int g(int i);
// Simple main.
int main(int args, char** argv)
{
int i = 1;
printf("i=%d\n", i);
printf("f=%d\n", f(i));
printf("g=%d\n", g(i));
return 0;
}
If you hadn’t guessed, we’re now going to define int f(int) and int g(int). Both of these functions take an integer, increment it, and return it. First, here is int g(int) as generated by an unoptimized gcc call we put in the file g.S:
.global g g: push %rbp # Push the current base pointer on to the stack. mov %rsp, %rbp # Store the stack pointer value into the base pointer. mov %edi,-0x4(%rbp) # Put argument 1 into the local variable. mov -0x4(%rbp),%eax # Put the local variable into the output reg add $0x1,%eax # Increment the value in eax by 1. pop %rbp # Restore the base pointer of the prev frame. retq
Looks a little verbose. Here is our hand-optimized version called int f(int) that we put in a file named f.S:
.global f f: mov %edi,%eax # Put input value into output register. add $0x1,%eax # Increment the value in eax by 1. retq
They do (roughly) the same thing. Add 1 to the input and return the input.
So, what can we learn from this? Well, notice how we pluck the input from the %edi register and place it in the %eax register? Those calling conventions might differ if you are in another language, like, say, C++. This is why we use extern "C" { #include }. Remember that the compiler generates the machine code from the header, not the linked machine code.
It’s also instructive to notice that the unoptimized gcc call pushes %rbp (the base pointer) to the stack and then points %rbp at its old value. It then accesses a local variable 4 bytes in size above the current top of the stack (it’s -0×4 because execution stacks grow from the top-down).
If we had used a local variable and made another function call the stack pointer (%rsp) would have been decremented so that then ext call to push %rbp wouldn’t clobber our existing data. We would also see a leaveq which quickly pops all of our data off the stack by taking the old %rbp stored in the current %rbp and restoring the old value.
Nothing too earth-shattering, but revisiting the metal-level helps keep perspective and helps me keep one eye on what the stack machine is doing under the higher level language I’m usually using, such as C++.