Structural Typing

November 25, 2009 Sam Baskinger

My excellent colleague, Luke Forehand, pointed out to me recently that Scala has structural typing. Specifically, a blog post that walks through building a list that accepts only objects having a certain method explains the concept and gives a compelling example of when and where to use it.

This is similar to duck typing in that the call is dispatched on the run-time object (Scala uses reflection for this) rather than through a declared interface. However, whereas duck typing simply tries to dispatch a message to an object, which may fail, structural typing is a tighter contract: the compiler checks that the required method exists before any call is made. To steal a form of the Scala example used in the linked blog article above, consider the statement:

val myListOfThings = List[{ def f(x: Int): Int }]()

The above says: declare a list that only takes objects that have a method named f that takes an Int and returns an Int. The beauty here is that you do not have to maintain a tree of related interfaces to bind together objects of similar or identical functionality. You can also build APIs that do not require your users to implement an interface or abstract class that, perhaps, is surrounded with loads of other method signatures and method definitions that the user really doesn’t care about.
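For contrast, here is a minimal Java sketch of the "check that the call exists before it is called" idea. Java has no structural types, so this only approximates the contract at run time with reflection; it is not how Scala implements it. The class names CheckedFList, Doubler, and Unrelated are made up for illustration.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class CheckedFList {
    private final List<Object> items = new ArrayList<>();

    /** Returns the public method "int f(int)" of the object's class, or null if there is none. */
    static Method findF(Object o) {
        try {
            Method m = o.getClass().getMethod("f", int.class);
            return m.getReturnType() == int.class ? m : null;
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    /** Accepts only objects that expose int f(int); everything else is rejected up front. */
    public void add(Object o) {
        if (findF(o) == null) {
            throw new IllegalArgumentException("no int f(int) on " + o.getClass().getName());
        }
        items.add(o);
    }

    /** Calls f(x) on the stored object; the existence check already happened in add(). */
    public int callF(int index, int x) {
        Object o = items.get(index);
        try {
            return (Integer) findF(o).invoke(o, x);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Two unrelated classes that share no interface (hypothetical examples).
    public static class Doubler { public int f(int x) { return x * 2; } }
    public static class Unrelated { public String g() { return "no f here"; } }

    public static void main(String[] args) {
        CheckedFList list = new CheckedFList();
        list.add(new Doubler());
        System.out.println(list.callF(0, 21));
    }
}
```

The difference from the Scala version is when the check happens: here a bad object is only rejected when add() runs, whereas the Scala compiler refuses to compile the offending call.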

I’m now looking for every excuse I can find to use this.

New Ruby VMs and Distributed Object Caching coming in MagLev?

November 23, 2009 Sam Baskinger

I wish that I had more time to post some of the interesting work I’m doing with building a k-means clustering tool for blogs in Ruby, but for now the best I can offer is an absolutely refreshing article on Ruby VMs. What got me most excited in this was the video linked at PivotalLabs in which Martin McClure from GemStone Systems talks a little about porting Ruby onto their Smalltalk VM which offers, if I’m reading this right, distributed object persistence. Imagine magically sharing your Ruby VM’s state across 30 machines, so long as your storage substrate keeps up with the scale. Very excellent!

I have to read through the MagLev page to absorb some more, but I invite you, my dear reader, to beat me to the punch.

There is an impish part of me that hopes this will give rise to technologies akin to Plan 9, though perhaps with a few settings or, perhaps, preferences?


Scala Angers AspectJ – An Intervention was in Order

November 6, 2009 Sam Baskinger 3 comments

Scala is going into production at work! Very exciting, even if the details of the deployment put Scala in a humble introductory role. Amidst all this excitement a huge collection of stack traces slipped by my attention, most of which look akin to:

java.lang.ClassCastException: org.aspectj.weaver.MissingResolvedTypeWithKnownSignature cannot be cast to org.aspectj.weaver.ReferenceType

and the line:

java.lang.RuntimeException: Can't find type scala.collection.mutable.ListBuffer$$anon

Every time one of these exceptions shows up, it is related to an inner Java class. Of particular interest was this message I dug up in the aspectj-users mailing list, which suggests that AspectJ isn’t particularly intelligent about how it determines the outer class of an inner class, and Scala generates many of those!

The solution for this particular problem was to make a custom META-INF/aop.xml file, the details of which are in the AspectJ developer’s guide! Simple, really… just a little elusive when staring at a giant stack trace in your QA system. Specifically, adding an <exclude within="scala..*"/> line to the <weaver> section did the trick. Notice that there are supposed to be 2 dots in the pattern. This is an AspectJ type pattern, not a regular expression: ".." matches any sequence of sub-packages, so the pattern covers every type in the scala package and everything beneath it.

Now for those wondering, AspectJ is being used in this instance for load-time weaving of Java classes (hence the computation of inner or outer classes). What is very exciting is wiring up monitoring code and performance code around classes based on what package they are a part of, including the new Scala code! We just have to be aware that Java’s inner-outer class assumptions don’t quite hold for Scala. We also need to keep an eye on what possible performance impact this has in our Spring Application Context, as we need AspectJ around for some transaction management.


Remote Java Debugging

October 20, 2009 Sam Baskinger

Never a day goes by without me learning something that I am both fascinated by and a little embarrassed that I didn’t know already. Today’s instance is remote Java debugging, particularly in Eclipse. The man page for jdb says that I can add the arguments:

-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y

to my call to java, and port 8000 on my machine will be opened up to debugger connections. Setting up a new debug configuration in Eclipse then lets you “start” your debugger by connecting to port 8000.

You’ll need to have the source code local to your system to interact with the running objects, but this is invaluable for those complex bugs buried in code that has to execute in a particular environment, etc, etc, etc.


Hadoop Unit Testing – Kinda Tough

October 3, 2009 Sam Baskinger
Pondering...

I’ve been looking into getting some more system-wide JUnit tests in place for Hadoop, but have found a few challenges. Interestingly, the fine folks working on Hadoop do have some unit tests of their own, and from the code-skimming I’ve done, they do seem to spin up an embedded HDFS, JobTracker, and TaskTrackers (the JobTracker being that bit of software that tracks the overall job you are executing, while a TaskTracker handles the specifics of a particular map or reduce execution).

Back to the point at hand: I almost had the hadoop-test jar code for the MiniMRCluster and MiniDFSCluster running perfectly, except that when I submitted a job to the mini cluster I got an exception saying that the class org.apache.hadoop.mapred.Child could not be loaded because it was not found. Digging more, it seems that the class does indeed exist in the jars I’m using, specifically the hadoop-core jar, but the class loader simply will not pick it up, yet it finds other classes such as my Mapper, JobConf, etc. Very, very strange. Googling hasn’t panned out and I’m not really at the point where I think I should engage MRUnit, mostly because it ships in the context of Cloudera’s Hadoop code base.

Until I learn more about the differences between vanilla Hadoop and Cloudera’s Hadoop, I’ve resorted to building unit tests around my Mappers for the moment, with only a few awkward tweaks to get it all working.

  1. Always set the system property (not a Configuration property) hadoop.log.dir. If you do not, strange failures occur, mostly when dealing with the TaskTracker but I make it a point to be on the safe side.
  2. Set the system property javax.xml.parsers.DocumentBuilderFactory to com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl if you are using hadoop config files. I have no idea how this got changed in my Maven setup, but in my unit tests I have to set this.
  3. I also set in the JobConf the value fs.default.class to org.apache.hadoop.fs.RawLocalFileSystem and the value fs.default.name to file:/tmp. I’m not convinced both of these are absolutely necessary for me to read my input files off disk out of src/test/resources, but there it is.

After all that, I can simply call the mapper’s map method and measure the results. No cluster needed.
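The three tweaks above can be sketched roughly as follows. This is only a sketch under assumptions: the class name MapperTestSetup and the log directory are made up, and the step 3 values are kept in a plain map as a stand-in for the JobConf.set(...) calls so that the example compiles without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapperTestSetup {

    /** Steps 1 and 2: JVM-wide system properties (not Configuration properties). */
    static void applySystemProperties(String logDir) {
        System.setProperty("hadoop.log.dir", logDir);
        System.setProperty(
            "javax.xml.parsers.DocumentBuilderFactory",
            "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
    }

    /**
     * Step 3: the values I put into the JobConf. A plain map stands in for
     * JobConf here; in a real test each entry would become a JobConf.set(key, value) call.
     */
    static Map<String, String> jobConfOverrides() {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("fs.default.class", "org.apache.hadoop.fs.RawLocalFileSystem");
        overrides.put("fs.default.name", "file:/tmp");
        return overrides;
    }

    public static void main(String[] args) {
        // The log directory is an arbitrary choice for illustration.
        applySystemProperties(System.getProperty("java.io.tmpdir"));
        System.out.println("hadoop.log.dir = " + System.getProperty("hadoop.log.dir"));
        jobConfOverrides().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Calling applySystemProperties from a JUnit setup method, before any Hadoop class is touched, keeps the properties in place for every test in the class.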


Rails 2.3 Upgrade

September 28, 2009 Sam Baskinger 2 comments

I just moved my “toy” knowledge base Rails application from Rails 2.2 to Rails 2.3, and you know, I should have read the release notes. It seems that application.rb has changed to application_controller.rb, which is a good thing, but it sure was confusing for the first 10 minutes.

I also learned that rake rails:update existed. Live and learn. :)


New Lucene Notes

September 24, 2009 Sam Baskinger

Some notes I picked up listening to a presentation on Lucene 2.9 from Lucid Imagination:

  1. Number Fields
    1. Use lower precision to reduce index sizes.
    2. Precision is supported in the NumericField
    3. Big speed boost to range queries.
  2. Query Parsing
    1. Query Rewriting – takes symbolic tokens (like *) and expands them to a large query that the back end processes. Should improve wild-card queries.
    2. There is a new Query Parser framework in the contrib section.
    3. Developers can more easily create a query parser.
    4. There is an Advanced Query Parser – look to the unit tests currently for what it is able to do.
    5. Payload Queries
      1. Byte arrays associated with a term in the index.
      2. For example, associate “Noun” with the word fox.
      3. You can then query on things like this.
      4. Payload parsing can be used to index NLP information at index time for later searching.
    6. Flexible Indexing – JIRA 1458
      1. Due in v3.0.
      2. Puts strongly typed token streams in place to assist in indexing.
    7. Reverse string filters. Leading ‘*’.
    8. Arabic support is coming. (Light 8 stemmer).
    9. Persian support.
  3. Collectors
    1. HitCollector is deprecated. Basic Collector is given.
  4. Geospatial support is coming.
  5. New Term Vector
    1. Term Vectors are used (or computed on the fly).
    2. Term Vectors use loads of disk space.
  6. FieldCache
    1. Has been hacked by others to do joins like a database.
    2. There are more validation checks in 2.9. Better introspection.
  7. N-Gram Spell Check
  8. Bottlenecks Removed – General Improvements
    1. Lockless String “interning

Custom Data Sinks in Cascading (for Hadoop)

September 12, 2009 Sam Baskinger 1 comment

A sink. Get it?

You would think that Hadoop or Cascading would have a nice robust output/export method. I mean, really, when you think about it, data in your cluster isn’t very useful unless you can operate on it. Now, it is true you can write some Java code that would process an FSDataInputStream but what we would all really prefer is to simply drop an XML file into HDFS and let someone go pick it up via the HDFS web interface, stream down the file and SAX parse it. If you are considering using DOM style parsing, you are brave! If you are using Hadoop, chances are that the file sizes produced will be far more than you would like to fit into your DOM parser.

A solution is to extend the Cascading Hfs tap:

public class MyXmlSink extends Hfs
{ ... }

There is more to this story, but thankfully not much. There is also some very subtle hoop-jumping we are going to employ. The way Hadoop works is to serialize the objects it is going to use to run the job, which means that your Hfs child class (called MyXmlSink above) may not contain any unserializable field such as a Logger object or the FSDataOutputStream object we need to write out our custom XML file.

Thankfully Cascading gives us some great hooks which are executed on the running processing node, specifically the openForWrite(JobConf conf) method. JobConf, btw, seems to be deprecated in Hadoop 0.20.0. Expect the signature of this method to change in the future. Back to openForWrite: in it you should create your very-nonserializable TupleEntryCollector that contains your FSDataOutputStream.

Of note, you want to have your TupleEntryCollector also implement OutputCollector from Hadoop. I discovered this via class cast exceptions, not through documentation.

class XmlWritingTupleEntryCollector
   extends TupleEntryCollector
   implements OutputCollector
{ ... }

I typically put the TupleEntryCollector as a static inner class of the Tap I’m creating, but feel free to put this anywhere. To cause Cascading to use your new TupleEntryCollector you will need to call setWriteDirect(true) on your Sink Tap. I typically put this in the constructor so it is not missed. Because it is a method of the parent class, and the parent class is fully initialized by the time your constructor body runs, the call is safe there.

One final implementation detail about the TupleEntryCollector. I mentioned that it should implement OutputCollector. That interface has one method, collect(Object, Object). The only thing that method should do is check that the second object passed in is a Tuple or TupleEntry and pass that argument off to the collect(Tuple) method, which will write the contents to the FSDataOutputStream.
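The delegation described above looks roughly like this. To keep the sketch compilable without Hadoop or Cascading, Tuple, TupleEntry, and XmlCollector below are simplified stand-ins I made up, not the real classes, and the output goes to a StringBuilder instead of an FSDataOutputStream. No XML escaping is done.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectDelegationSketch {

    /** Stand-in for Cascading's Tuple; NOT the real class. */
    static class Tuple {
        final List<Object> values = new ArrayList<>();
        Tuple(Object... vs) {
            for (Object v : vs) values.add(v);
        }
    }

    /** Stand-in for Cascading's TupleEntry, reduced to "wraps a Tuple". */
    static class TupleEntry {
        final Tuple tuple;
        TupleEntry(Tuple tuple) { this.tuple = tuple; }
    }

    /** Stand-in collector; a StringBuilder replaces the FSDataOutputStream. */
    static class XmlCollector {
        final StringBuilder out = new StringBuilder();

        // Same shape as OutputCollector.collect(key, value): only the value is inspected.
        void collect(Object key, Object value) {
            if (value instanceof Tuple) {
                collect((Tuple) value);
            } else if (value instanceof TupleEntry) {
                collect(((TupleEntry) value).tuple);
            } else {
                throw new IllegalArgumentException("unexpected value type: " + value);
            }
        }

        // The method that does the actual writing (values are not XML-escaped here).
        void collect(Tuple tuple) {
            out.append("<row>");
            for (Object v : tuple.values) {
                out.append("<col>").append(v).append("</col>");
            }
            out.append("</row>");
        }
    }

    public static void main(String[] args) {
        XmlCollector collector = new XmlCollector();
        collector.collect(null, new Tuple("fox", 1));
        collector.collect(null, new TupleEntry(new Tuple("dog", 2)));
        System.out.println(collector.out);
    }
}
```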

The rest is basic filling-in (or color-by-number-coding, as I’ve grown to call it) the empty methods with whatever you need it to do.

One final note. You may very well want to set the total number of Reducers to 1 so that you get one large file. You will spend a long time collecting the data to one node and you will tax a single node’s HDFS store (not to mention replication) so use some caution with how you do this. It may make more sense for the consuming external application to do the aggregation itself.


JRuby, Rails, JBoss, and Jfrustration – Fixing Warble 0.9.4’s Standard Includes

August 16, 2009 Sam Baskinger

Work has been busy. Scratch that: work has been absolutely insane and confusing and, at times, baffling, but I have to say that I wouldn’t trade the experience for the world! On the bright side, I’ve been distracted during what were the two worst weeks of the Brewers’ season (right now they are beating up on Houston). Now that I have some time to myself I have had dinner w/ the Mrs., read some of the Bourne Identity, and have gotten back to porting my Ruby on Rails Knowledge Base application to JRuby on Rails on JBoss. The past few sessions I’ve spent with the technology have been plagued with the error message:

2009-08-16 00:27:28,101 ERROR [STDERR] (main) Warning: JRuby home "/home/sam/usr/jboss-5.1.0.GA/server/default/deploy/railskb.war/WEB-INF/lib/jruby-complete-1.3.0RC1.jar/META-INF/jruby.home" does not exist, using /tmp

2009-08-16 00:27:28,507 ERROR [STDERR] (main) Rails requires RubyGems >= . Please install RubyGems and try again: http://rubygems.rubyforge.org

2009-08-16 00:27:28,512 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[localhost].[/railskb]] (main) unable to create shared application instance org.jruby.rack.RackInitializationException: exit

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:38:in `run'

from /home/sam/usr/jboss-5.1.0.GA/server/default/deploy/railskb.war/WEB-INF/lib/jruby-rack-0.9.4.jar/jruby/rack/rails_boot.rb:20:in `run'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:11:in `boot!'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:109

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/boot.rb:20:in `require'

from /home/sam/usr/jboss-5.1.0.GA/server/default/tmp/3j001-au64oi-fyfc3ruh-1-fyfc4xzj-9p/railskb.war/WEB-INF/config/environment.rb:20

Notice the odd line “Rails requires RubyGems >= .”. Eh? I did a lot of digging on Google and found about 4 sets of forum posts that identify this problem and correlate it with an upgrade to Rack 0.9.4 from 0.9.3. I also noticed that Warbler had included a copy of jruby-complete-1.3.0RC1.jar and jruby-rack-0.9.4.jar when the version I would like it to use is JRuby 1.3.1.

After a little failed convincing, JBoss kept showing that, although the WAR included JRuby 1.3.1, it was choosing to use the 1.3.0RC1 JRuby jar mentioned in the log above, and failing to find RubyGems there.

Finally, I bit the bullet and decided to punt on Warbler’s automagically included jars and include them manually. To do this I:

  1. created lib/java in my rails application directory.
  2. copied into it jruby-complete-1.3.1.jar
  3. copied into it jruby-rack-0.9.jar
  4. added the line config.java_libs = FileList["lib/java/*.jar"] to config/warble.rb

With this loaded into JBoss, the application starts! When I access it, it explodes in another fashion, but that’s fine! I’m still learning and I’m past my deployment problems for now!


Scala Gotcha – The Instance Type

August 11, 2009 Sam Baskinger

Scala has got to be one of the most fun languages to code in that I’ve used in quite a while. It, like any other language out there, has its quirks that buck against your own preconceptions. This isn’t necessarily bad or good, simply part of the game of learning another language.

Tonight I would like to share the odd little bit of code:

class A {
   class B
}

Now, in Java-land I was accustomed to being able to say:

A.B uninstantiatedObject = null;

and later get an instantiated object out of an instance of A somehow. What is fascinating, and I think very cool, in Scala is that you may not even create a typed variable of type A.B; you must instantiate A (let’s say you assigned it to the variable a) and then you may create a variable of type a.B. Your code would look like:

val a = new A()
val b = null : a.B

Notice how we do not create an instance of a.B; we are only referencing the type a.B. In this sense Scala is a bit more rigid in its typing than Java: each instance of A carries its own distinct type B. If you want to be able to talk about the class B outside of the context of an instantiation of A, you can write the type projection A#B, or put B into an object.

This shaved about 3 – 5 minutes off my core-coding time today as I stared at the error message before I made that interesting connection. What surprised me most, perhaps, was that what I bumped into in Scala (the type definition of B is only valid in the context of an instance of A) has a counterpart in Java, where an instance of B can only be created in the context of an instance of A, but I had never noticed it before.
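The Java side of that observation, as a small sketch (the owner() method is my addition, only there to make the link to the enclosing instance visible):

```java
public class A {
    // A non-static inner class: every B is tied to the A instance that created it.
    public class B {
        public A owner() {
            return A.this;
        }
    }

    public static void main(String[] args) {
        A.B declaredOnly = null;   // Java accepts the type A.B without any instance of A

        A first = new A();
        A second = new A();
        A.B b = first.new B();     // but creating a B requires an enclosing instance of A

        System.out.println(b.owner() == first);
        System.out.println(b.owner() == second);
    }
}
```

Java ties each B object to one A object but keeps a single type A.B for all of them; Scala goes one step further and ties the type itself to the instance.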

Interesting…

Try Scala!


For those wondering, the prompt for such heavy use of enclosed classes was an implementation of an External Domain Specific Language using Scala’s Combinator Parser libraries. An external DSL is a DSL that is totally independent of the host language. It is its own language, not a specialized use of the host language. While I’m on the subject, you can check out a very excellent introduction to the Scala parser API at Ruminations of a Programmer.
