Scala Anger AspectJ – An Intervention was in Order

November 6, 2009 Sam Baskinger 3 comments

Scala is going into production at work! Very exciting, even if the details of the deployment put Scala in a humble introductory role. Amidst all this excitement a huge collection of stack traces slipped by my attention, most of which look akin to:

java.lang.ClassCastException: org.aspectj.weaver.MissingResolvedTypeWithKnownSignature cannot be cast to org.aspectj.weaver.ReferenceType

and the line:

java.lang.RuntimeException: Can't find type scala.collection.mutable.ListBuffer$$anon

Every time one of these exceptions shows up, it is related to an inner Java class. Of particular interest was this message I dug up in the aspectj-users mailing list, which suggests that AspectJ isn’t particularly intelligent about how it determines the outer class of an inner class, and Scala generates many inner classes!

The solution for this particular problem was to make a custom META-INF/aop.xml file, the details of which are in the AspectJ developer’s guide! Simple, really… just a little elusive when staring at a giant stack trace in your QA system. Specifically, adding an exclude for scala..* in the <weaver> section did the trick (in aop.xml this is written <exclude within="scala..*"/>). Notice that there are supposed to be 2 dots in the pattern. In an AspectJ type pattern the double dot matches any number of sub-packages, so scala..* covers every type in the scala package and everything beneath it. It is not regular expression syntax.

Now for those wondering, AspectJ is being used in this instance for load-time weaving of Java classes (hence the computation of inner or outer classes as they are loaded). What is very exciting is wiring up monitoring code and performance code around classes based on what package they are a part of, including the new Scala code! We just have to be aware that Java’s inner-outer class assumptions don’t quite hold for classes compiled from Scala. We also need to keep an eye on what performance impact this has in our Spring Application Context, as we need AspectJ around for some transaction management.

Remote Java Debugging

October 20, 2009 Sam Baskinger Leave a comment

Never a day goes by without me learning something that I am both fascinated by and a little embarrassed that I didn’t know already. Today’s instance is remote Java debugging, particularly in Eclipse. The man page on jdb says that I can add the arguments:

-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y

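to my call to java, and port 8000 on my machine will be opened up to debugger connections. (On Java 5 and later, -agentlib:jdwp=transport=dt_socket,address=8000,server=y,suspend=y is the preferred spelling of the same options.) Configuring a new debug configuration in Eclipse then lets you “start” your debugger by connecting to port 8000.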

You’ll need to have the source code local to your system to interact with the running objects, but this is invaluable for those complex bugs buried in code that has to execute in a particular environment, etc, etc, etc.

Hadoop Unit Testing – Kinda Tough

October 3, 2009 Sam Baskinger Leave a comment

I’ve been looking into getting some more system-wide JUnit tests in place for Hadoop, but have found a few challenges. Interestingly, the fine folks working on Hadoop do have some unit tests of their own, and from the code-skimming I’ve done, they do seem to spin up an embedded HDFS, JobTracker, and TaskTrackers (the job tracker being that bit of software that tracks the overall job you are executing, while a task tracker handles the specifics of a particular map or reduce execution).

Back to the point at hand, I almost had the hadoop-test.jar code for the MiniMRCluster and MiniDFSCluster running perfectly, except that when I submitted a job to the mini cluster I got an exception saying that the private class org.apache.hadoop.mapred.Child could not be loaded because the class was not found. Digging more, it seems that the class does indeed exist in the jars I’m using, specifically the hadoop-core jar, but the class loader simply will not pick it up, yet it finds other classes such as my Mapper, JobConf, etc. Very, very strange. Googling hasn’t panned out and I’m not really at the point where I think I should engage MRUnit, mostly because it ships in the context of Cloudera’s Hadoop code base.

Until I learn more about what the differences between vanilla Hadoop and Cloudera’s Hadoop are, I’ve resorted to building unit tests around my Mappers for the moment, with only a few awkward tweaks to get it all working.

  1. Always set the system property (not a Configuration property) hadoop.log.dir. If you do not, strange failures occur, mostly when dealing with the TaskTracker but I make it a point to be on the safe side.
  2. Set the system property javax.xml.parsers.DocumentBuilderFactory to com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl if you are using hadoop config files. I have no idea how this got changed in my Maven setup, but in my unit tests I have to set this.
  3. I also set in the JobConf the value fs.default.class to org.apache.hadoop.fs.RawLocalFileSystem and the value fs.default.name to file:/tmp. I’m not convinced both of these are absolutely necessary for me to read my input files off disk out of src/test/resources, but there it is.

After all that, I can simply call the mapper’s map method and measure the results. No cluster needed.
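The tweaks listed above can be gathered into one small helper. This is only a sketch: the class name MapperTestSetup and its methods are made up for illustration, and the JobConf values are returned as plain key/value pairs so the snippet runs without Hadoop on the classpath. The property names and values are the ones from the list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): collects the test tweaks in one place.
final class MapperTestSetup {
    private MapperTestSetup() {}

    /** Sets the two JVM system properties from items 1 and 2. */
    static void applySystemProperties(String logDir) {
        System.setProperty("hadoop.log.dir", logDir);
        System.setProperty(
            "javax.xml.parsers.DocumentBuilderFactory",
            "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
    }

    /** Returns the JobConf entries from item 3 as plain key/value pairs. */
    static Map<String, String> jobConfOverrides() {
        Map<String, String> overrides = new LinkedHashMap<String, String>();
        overrides.put("fs.default.class", "org.apache.hadoop.fs.RawLocalFileSystem");
        overrides.put("fs.default.name", "file:/tmp");
        return overrides;
    }

    public static void main(String[] args) {
        applySystemProperties(System.getProperty("java.io.tmpdir"));
        // In a real test each pair would be handed to JobConf.set(key, value).
        for (Map.Entry<String, String> entry : jobConfOverrides().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Calling applySystemProperties from a JUnit setup method, before any Hadoop class is touched, is probably the safest place for it.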

Rails 2.3 Upgrade

September 28, 2009 Sam Baskinger 2 comments

I just moved my “toy” knowledge base Rails application from Rails 2.2 to Rails 2.3, and you know, I should have read the release notes. It seems that application.rb has changed to application_controller.rb, which is a good thing, but it sure was confusing for the first 10 minutes.

I also learned that rake rails:update existed. Live and learn. :)

New Lucene Notes

September 24, 2009 Sam Baskinger Leave a comment

Some notes I picked up listening to a presentation on Lucene 2.9 from Lucid Imagination:

  1. Number Fields
    1. Use lower precision to reduce index sizes.
    2. Precision is supported in the NumericField
    3. Big speed boost to range queries.
  2. Query Parsing
    1. Query Rewriting – takes symbolic tokens (like *) and expands them to a large query that the back end processes. Should improve wild-card queries.
    2. There is a new Query Parser framework in the contrib section.
    3. Developers can more easily create a query parser.
    4. There is an Advanced Query Parser – look to the unit tests currently for what it is able to do.
    5. Payload Queries
      1. Byte arrays associated with a term in the index.
      2. For example, associate “Noun” with the word fox.
      3. You can then query on things like this.
      4. Payload parsing can be used to index NLP information at index time for later searching.
    6. Flexible Indexing – JIRA 1458
      1. Due in v3.0.
      2. Adds strongly-typed token streams to assist in indexing.
    7. Reverse string filters. Leading ‘*’.
    8. Arabic support is coming. (Light 8 stemmer).
    9. Persian support.
  3. Collectors
    1. HitCollector is deprecated. Basic Collector is given.
  4. Geospatial support is coming.
  5. New Term Vector
    1. Term Vectors are used (or computed on the fly).
    2. Term Vectors use loads of disk space.
  6. FieldCache
    1. Has been hacked by others to do joins like a database.
    2. There are more validation checks in 2.9. Better introspection.
  7. N-Gram Spell Check
  8. Bottlenecks Removed – General Improvements
    1. Lockless String “interning”

Custom Data Sinks in Cascading (for Hadoop)

September 12, 2009 Sam Baskinger 1 comment

You would think that Hadoop or Cascading would have a nice robust output/export method. I mean, really, when you think about it, data in your cluster isn’t very useful unless you can operate on it. Now, it is true you can write some Java code that would process an FSDataInputStream but what we would all really prefer is to simply drop an XML file into HDFS and let someone go pick it up via the HDFS web interface, stream down the file and SAX parse it. If you are considering using DOM style parsing, you are brave! If you are using Hadoop, chances are that the file sizes produced will be far more than you would like to fit into your DOM parser.

A solution is to extend the cascading Hfs tap:

public class MyXmlSink extends Hfs
{ ... }

There is more to this story, but thankfully not much. There is also some very subtle hoop-jumping we are going to employ. The way Hadoop works is to serialize the objects it is going to use to run the job, which means that your Hfs child class (called MyXmlSink above) may not contain any unserializable field such as a Logger object or the FSDataOutputStream object we need to write out our custom XML file.
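To see why such a field hurts, here is a small sketch that needs neither Hadoop nor Cascading. It assumes plain Java serialization is what is at work, and every class name in it (EagerSink, LazySink, SerializationCheck) is a stand-in invented for the example. A ByteArrayOutputStream plays the part of the unserializable stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Stand-in for a sink that keeps its stream in an ordinary field.
class EagerSink implements Serializable {
    private static final long serialVersionUID = 1L;
    // ByteArrayOutputStream is not Serializable, much like an open file stream.
    private final OutputStream out = new ByteArrayOutputStream();
}

// Stand-in for a sink that carries only serializable state (a path) and
// opens its stream later, on whichever node runs the task.
class LazySink implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String path;
    private transient OutputStream out; // transient fields are skipped when serializing

    LazySink(String path) { this.path = path; }

    OutputStream open() {
        if (out == null) {
            out = new ByteArrayOutputStream(); // a real sink would open 'path' here
        }
        return out;
    }
}

final class SerializationCheck {
    /** Returns true if the object survives Java serialization. */
    static boolean canSerialize(Object candidate) {
        try {
            ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
            oos.writeObject(candidate);
            oos.close();
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("EagerSink: " + canSerialize(new EagerSink()));
        System.out.println("LazySink:  " + canSerialize(new LazySink("/out/data.xml")));
    }
}
```

Creating the stream lazily, after the object has been shipped to the worker, is the same idea as building it inside a hook that only runs on the processing node.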

Thankfully Cascading gives us some great hooks which are executed on the running processing node, specifically the openForWrite(JobConf conf) method. JobConf, by the way, seems to be deprecated in Hadoop 0.20.0. Expect the signature of this method to change in the future. Back to openForWrite: in it you should create your very non-serializable TupleEntryCollector that contains your FSDataOutputStream.

Of note, you want to have your TupleEntryCollector also implement OutputCollector from Hadoop. I discovered this via class cast exceptions, not through documentation.

class XmlWritingTupleEntryCollector
   extends TupleEntryCollector
   implements OutputCollector
{ ... }

I typically put the TupleEntryCollector in a static nested class of the Tap I’m creating, but feel free to put this anywhere. To cause Cascading to use your new TupleEntryCollector you will need to call setWriteDirect(true) on your Sink Tap. I typically put this in the constructor so it is not missed. Because it is a method of the parent class, the parent class will be fully initialized by the time you reach your constructor.

One final implementation detail about the TupleEntryCollector. I mentioned that it should implement OutputCollector. That interface has one method, collect(Object, Object). The only thing that method should do is check that the second object passed in is a Tuple or TupleEntry and pass that argument off to the collect(Tuple) method, which will write the contents to the FSDataOutputStream.
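The delegation described above has roughly the following shape. To keep the sketch runnable on its own, every type in it is a stand-in invented for illustration (SketchTuple, SketchTupleEntry, SketchOutputCollector, SketchXmlCollector). In real code these would be Cascading’s Tuple and TupleEntry and Hadoop’s OutputCollector, and the list would be the output stream.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for illustration only, not the real Cascading or Hadoop types.
class SketchTuple {
    final String text;
    SketchTuple(String text) { this.text = text; }
}

class SketchTupleEntry {
    private final SketchTuple tuple;
    SketchTupleEntry(SketchTuple tuple) { this.tuple = tuple; }
    SketchTuple getTuple() { return tuple; }
}

interface SketchOutputCollector {
    void collect(Object key, Object value);
}

class SketchXmlCollector implements SketchOutputCollector {
    // Stands in for the stream the real collector would write to.
    final List<String> written = new ArrayList<String>();

    // Two-argument entry point: ignore the key, check the value's type, delegate.
    public void collect(Object key, Object value) {
        if (value instanceof SketchTuple) {
            collect((SketchTuple) value);
        } else if (value instanceof SketchTupleEntry) {
            collect(((SketchTupleEntry) value).getTuple());
        } else {
            throw new IllegalArgumentException("Expected a tuple, got: " + value);
        }
    }

    // Where the actual writing happens.
    void collect(SketchTuple tuple) {
        written.add("<row>" + tuple.text + "</row>");
    }
}

final class CollectorDemo {
    public static void main(String[] args) {
        SketchXmlCollector collector = new SketchXmlCollector();
        collector.collect(null, new SketchTuple("first"));
        collector.collect(null, new SketchTupleEntry(new SketchTuple("second")));
        System.out.println(collector.written);
    }
}
```

Failing loudly on an unexpected value type is a choice made for the sketch; it at least turns a silent data loss into a visible error.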

The rest is basic filling-in (or color-by-number-coding, as I’ve grown to call it) the empty methods with whatever you need it to do.

One final note. You may very well want to set the total number of Reducers to 1 so that you get one large file. You will spend a long time collecting the data to one node and you will tax a single node’s HDFS store (not to mention replication) so use some caution with how you do this. It may make more sense for the consuming external application to do the aggregation itself.

JRuby, Rails, JBoss, and Jfrustration – Fixing Warble 0.9.4’s Standard Includes

August 16, 2009 Sam Baskinger Leave a comment

Work has been busy. Scratch that: work has been absolutely insane and confusing and, at times, baffling, but I have to say that I wouldn’t trade the experience for the world! On the bright side, I’ve been distracted during what were the two worst weeks of the Brewers’ season (right now they are beating up on Houston). Now that I have some time to myself I have had dinner w/ the Mrs., read some of the Bourne Identity, and have gotten back to porting my Ruby on Rails Knowledge Base application to JRuby on Rails on JBoss. The past few sessions I’ve spent with the technology have been plagued with the error message:

2009-08-16 00:27:28,101 ERROR [STDERR] (main) Warning: JRuby home "/home/sam/usr/jboss-5.1.0.GA/server/default/deploy/railskb.war/WEB-INF/lib/jruby-complete-1.3.0RC1.jar/META-INF/jruby.home" does not exist, using /tmp

2009-08-16 00:27:28,507 ERROR [STDERR] (main) Rails requires RubyGems >= . Please install RubyGems and try again: http://rubygems.rubyforge.org

2009-08-16 00:27:28,512 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[localhost].[/railskb]] (main) unable to create shared application instance org.jruby.rack.RackInitializationException: exit

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:38:in `run'

from /home/sam/usr/jboss-5.1.0.GA/server/default/deploy/railskb.war/WEB-INF/lib/jruby-rack-0.9.4.jar/jruby/rack/rails_boot.rb:20:in `run'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:11:in `boot!'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:109

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:20:in `require'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/environment.rb:20

Notice the odd line “Rails requires RubyGems >= .”. Eh? I did a lot of digging on Google and found about 4 sets of forum posts that identify this problem and correlate it with an upgrade to Rack 0.9.4 from 0.9.3. I also noticed that Warbler had included a copy of jruby-complete-1.3.0RC1.jar and jruby-rack-0.9.4.jar, when the version I would like it to use is JRuby 1.3.1.

A little convincing failed: JBoss kept showing that, although the application had JRuby 1.3.1 included in it, it was choosing to use the 1.3.0RC1 JRuby jar mentioned in the log above, and failing to find RubyGems there.

Finally, I bit the bullet and decided to punt on Warbler’s automagically included jars and include them manually. To do this I:

  1. created lib/java in my rails application directory.
  2. copied into it jruby-complete-1.3.1.jar
  3. copied into it jruby-rack-0.9.jar
  4. added the line config.java_libs = FileList["lib/java/*.jar"] to config/warble.rb

Loading this into JBoss the application loads! When I access it, it explodes in another fashion, but that’s fine! I’m still learning and I’m past my deployment problems for now!

Scala Gotcha – The Instance Type

August 11, 2009 Sam Baskinger Leave a comment

Scala has got to be one of the most fun to code in languages I’ve used in quite a while. It, like any other language out there, has its quirks that buck against your own preconceptions. This isn’t necessarily bad or good — simply part of the game of learning another language.

Tonight I would like to share the odd little bit of code:

class A {
   class B
}

Now, in Java-land I was accustomed to being able to say:

A.B uninstantiatedObject = null;

and later get an instantiated object out of an instance of A somehow. What is fascinating, and I think very cool, in Scala is that you may not even create a variable typed as A.B; you must instantiate A (let’s say you assigned it to the variable a) and then may create a variable of type a.B. Your code would look like:

val a = new A()
val b: a.B = null  // the type a.B is tied to the instance a

Notice how we do not create an instance of a.B; we are only referencing the type a.B. In this sense Scala is a bit more rigid in its typing than Java. If you want to be able to talk about the class B outside of the context of an instantiation of A, then you have to put B into an object (or spell the type as the projection A#B).

This shaved about 3 – 5 minutes off my core-coding time today as I stared at the error message before I made that interesting connection. What surprised me most, perhaps, was that what I bumped into in Scala (that the type definition of B is only valid in the context of an instance of A) does appear in Java, in the rule that an instance of B can only be created in the context of an instance of A, but I had never noticed it before.
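For comparison, here is the Java side of that observation as a minimal sketch. The classes A and B mirror the Scala snippet; the owner() method and InnerClassDemo are additions made only for the demonstration.

```java
class A {
    class B {
        // Every B carries a hidden reference to the A that created it.
        A owner() { return A.this; }
    }
}

final class InnerClassDemo {
    public static void main(String[] args) {
        A a = new A();
        A.B unassigned = null;  // naming the type A.B needs no instance of A
        A.B b = a.new B();      // creating a B does need one
        System.out.println(b.owner() == a);
        // A.B broken = new A.B();  // rejected by the compiler: no enclosing instance of A
    }
}
```

So Java ties the instance of B to an instance of A, while Scala goes one step further and ties the type to it as well.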

Interesting…

Try Scala!


For those wondering, the prompt for such heavy use of enclosed classes was an implementation of an External Domain Specific Language using Scala’s Combinator Parser libraries. An external DSL is a DSL that is totally independent of the host language. It is its own language, not a specialized use of the host language. While I’m on the subject, you can check out a very excellent introduction to the Scala parser API at Ruminations of a Programmer.

Some Java NLP Solutions

While I haven’t had time to check out any of these solutions for doing Natural Language Processing (not Neuro-Linguistic Programming), I have noticed that every one of these except TreeTagger seems to use a maximum entropy model for, essentially, randomly tagging a sentence and seeing what order of words maps best (scores highest) using a beam search. Yes, that is a very, very simplified statement of the algorithm, but you could click that nice link and read the Wikipedia entry for it if you are really curious.

Fascinating stuff which I really would love to spend more time in. Ironically my time is being taken up as of late by computer language programming in Scala using their parsing.combinator API (which, if you haven’t had time to check out, spend a day or two building a small external DSL with it — it’s not bad at all).

Regardless, here is a list of Java NLP code bases. One final comment: MorphAdorner looks absolutely fascinating. It may only be in the packaging or presentation, but I would love an excuse to work with their tool set. Enough delay. The list!

  1. MorphAdorner – adorn text with information about it.
  2. Monk – targeting humanities researchers, offering pattern-finding tools for collections of texts.
  3. TreeTagger – fast, just fast (at least according to a post by Matthew Wilkins).
  4. Log-Linear Part of Speech Tagger (Stanford)
  5. LingPipe – looks interesting, especially with some of the details/features given attention like re-learning on the fly.
  6. OpenNLP

Joda Time vs Java DateTime

Here’s an annoying incompatibility. It seems that Joda-Time’s standard date toString() specifies time zone offsets with a colon, like -05:00.

Java’s SimpleDateFormat parser cannot handle that colon. It expects time zone offsets specified without a colon, like -0500. So, I’m no longer using Java’s SimpleDateFormat class and am instead parsing my dates through DateTimeFormat from Joda-Time.
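The mismatch is easy to reproduce with nothing but the JDK. In this sketch the OffsetParsing class and the sample timestamps are made up for illustration; the pattern letter Z is SimpleDateFormat’s RFC 822 time zone, which is the colon-free form.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

final class OffsetParsing {
    /** Returns the parsed date, or null when the text does not match the pattern. */
    static Date tryParse(String pattern, String text) {
        try {
            return new SimpleDateFormat(pattern).parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String rfc822 = "yyyy-MM-dd'T'HH:mm:ssZ";
        // Offset without a colon: accepted.
        System.out.println(tryParse(rfc822, "2009-07-01T12:00:00-0500"));
        // Offset with a colon, as Joda-Time prints it: rejected, so this prints null.
        System.out.println(tryParse(rfc822, "2009-07-01T12:00:00-05:00"));
    }
}
```

On Java 7 and later, SimpleDateFormat also understands the pattern letter X, and yyyy-MM-dd'T'HH:mm:ssXXX accepts the colon form directly.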

So very close with my clustering project to parsing all the documents I need and shooting them over the wires to another system. So close, so very close!
