Thursday, July 30, 2009

XML with Scala and Java

The web application I build at work stores user permissions in XML. Each component of the application can be assigned a set of permissions like view, create, update and delete. Our documentation includes an Excel spreadsheet version of the same data.

The spreadsheet looks like this:

And the corresponding XML looks like this:

I discovered that my co-workers were updating this spreadsheet manually whenever the XML changed and at the end of each software release. So I wrote a simple utility in Java to convert the XML into the spreadsheet format. I used SAX to process the XML and wrote the results to a tab-delimited text file that could be copied and pasted into a spreadsheet. You can see the Java code here.
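The Java source is only linked, not reproduced here, so what follows is a minimal sketch of the SAX approach described above rather than the original utility. The class name `PermissionSaxSketch`, the helper `attr`, and the sample XML are hypothetical; the element and attribute names (`Resource`, `Permission`, `resourceName`, `description`, `permissionName`) are assumed to match the ones used in the Scala excerpt later in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects one tab-delimited row per <Resource> element.
// Element and attribute names are assumptions, not taken from the real schema.
class PermissionSaxSketch extends DefaultHandler {
    private final List<String> rows = new ArrayList<>();
    private StringBuilder row;        // null while outside a <Resource>
    private boolean firstPermission;  // controls the ", " separator

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Resource".equals(qName)) {
            // Start a new row: name, tab, description, tab
            row = new StringBuilder()
                .append(attr(atts, "resourceName")).append('\t')
                .append(attr(atts, "description")).append('\t');
            firstPermission = true;
        } else if ("Permission".equals(qName) && row != null) {
            // Append permissions as a comma-separated list in the third column
            if (!firstPermission) {
                row.append(", ");
            }
            row.append(attr(atts, "permissionName"));
            firstPermission = false;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Resource".equals(qName) && row != null) {
            rows.add(row.toString());
            row = null;
        }
    }

    // Returns the attribute value, or an empty string when it is absent.
    private static String attr(Attributes atts, String name) {
        String value = atts.getValue(name);
        return value == null ? "" : value;
    }

    static List<String> parse(InputStream in) {
        PermissionSaxSketch handler = new PermissionSaxSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        // Made-up sample document, for illustration only
        String xml = "<Application>"
            + "<Resource resourceName=\"Reports\" description=\"Reporting module\">"
            + "<Permission permissionName=\"view\"/>"
            + "<Permission permissionName=\"create\"/>"
            + "</Resource>"
            + "</Application>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String line : parse(in)) {
            System.out.println(line);
        }
    }
}
```

Each row can be written straight to a text file; because the columns are separated by tabs, pasting the file into a spreadsheet splits them into cells.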

Then I finished reading Programming in Scala and wanted to write some Scala code. So I translated my Java utility into Scala, and you can see the Scala code here. I couldn't find any examples that used the scala.xml.pull package, so I had to figure it out using the API documentation. One segment of the Scala code is below:

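import java.io.File
import scala.collection.mutable.ListBuffer
import scala.xml.pull._

// Appends one tab-delimited line per <Resource> element to results.
// getAttributeValue is a helper from the full source (not shown here).
def getPermissions(file: File, results: ListBuffer[String]) {
  val er = new XMLEventReader()
  er.initialize(io.Source.fromFile(file))
  val sb = new StringBuilder
  var atEnd: Boolean = false
  // Loop until the closing </Application> tag is seen
  while (!atEnd) {
    val next = er.next
    next match {
      // Start of a resource: write its name and description columns
      case EvElemStart(_, "Resource", _, _) => {
        sb.append(getAttributeValue(next, "resourceName", "", "\t"))
        sb.append(getAttributeValue(next, "description", "", "\t"))
      }
      // Permission inside a resource: add it to the comma-separated list
      case EvElemStart(_, "Permission", _, _) => {
        if (!sb.isEmpty) {
          sb.append(getAttributeValue(next, "permissionName", if (sb.endsWith("\t")) "" else ", ", ""))
        }
      }
      // End of a resource: emit the finished line and reset the buffer
      case EvElemEnd(_, "Resource") => {
        results += sb.toString
        sb.clear
      }
      // End of the document: stop the reader so its thread terminates
      case EvElemEnd(_, "Application") => {
        atEnd = true
        er.stop
      }
      case _ =>
    }
  }
}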

I learned the following about Scala's pull parser:
  • If you don't call XMLEventReader.stop when you are finished parsing a file, then the thread stays alive and your application never exits.
  • XMLEventReader.hasNext always returns true (in version 2.7.5.final), so I couldn't use it for the while() loop above. Instead, I had to create the atEnd Boolean variable and look for the ending XML element.
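  • It's much slower than using SAX in Java: about seven times slower overall in my test run, and from about 6 to more than 14 times slower per file.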
On this last point, both versions write timing information to the console.

Scala:

parsing app1.xml took 422 milliseconds.
parsing app2.xml took 156 milliseconds.
parsing app3.xml took 68 milliseconds.
parsing app4.xml took 203 milliseconds.
writeResults took 8 milliseconds.
Completed in 888 milliseconds.

Java:

parsing app1.xml took 68 milliseconds.
parsing app2.xml took 14 milliseconds.
parsing app3.xml took 5 milliseconds.
parsing app4.xml took 14 milliseconds.
writeResults took 5 milliseconds.
Completed in 127 milliseconds.
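As a quick check on how large the gap is, the slowdown can be computed directly from the two console listings above. This is only an arithmetic sketch: the class name `SlowdownSketch` and the method `ratio` are hypothetical, and the millisecond figures are copied from the output shown.

```java
import java.util.Locale;

// Hypothetical helper: divides the Scala timings by the Java timings.
class SlowdownSketch {
    static final String[] FILES = {"app1.xml", "app2.xml", "app3.xml", "app4.xml"};
    // Per-file parse times in milliseconds, copied from the console output
    static final int[] SCALA_MS = {422, 156, 68, 203};
    static final int[] JAVA_MS = {68, 14, 5, 14};
    // "Completed in ..." totals in milliseconds
    static final int SCALA_TOTAL_MS = 888;
    static final int JAVA_TOTAL_MS = 127;

    // How many times slower the Scala run was than the Java run
    static double ratio(int scalaMs, int javaMs) {
        return (double) scalaMs / javaMs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < FILES.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%s: %.1fx slower",
                FILES[i], ratio(SCALA_MS[i], JAVA_MS[i])));
        }
        System.out.println(String.format(Locale.ROOT, "overall: %.1fx slower",
            ratio(SCALA_TOTAL_MS, JAVA_TOTAL_MS)));
    }
}
```

By these figures the per-file slowdown ranges from roughly 6x (app1.xml) to roughly 14.5x (app4.xml), and the end-to-end run is about 7x slower. These are single runs, so the numbers are indicative rather than a careful benchmark.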

8 comments:

  1. This is a bit bothersome: I am using the simpler XML.loadFile, but looking at the code I don't see where the file stream it creates in the background is closed.
    Did you post a question to the mailing list about it?

  2. What did you work out to be the bottleneck there?

  3. Apparently the problem with XMLEventReader.hasNext always returning true was fixed on April 28: http://www.nabble.com/forum/ViewPost.jtp?post=23288836&framed=y

  4. Ricky Clarkson - I don't know exactly where the bottleneck is because I don't have a Scala profiler. I thought it might be caused by my use of String concatenation instead of StringBuilder in the getAttributeValue method, but I changed the code to use StringBuilder and it had no measurable performance impact.

  5. Scala 2.7's pull parser should have been labeled as alpha quality. The one in the 2.8 nightly builds is better. (Not sure about performance, though.)

  6. It would be really helpful if you could try out the version in the nightly builds and see how it does against the issues you list. It was a total rewrite and I did labor quite a while trying to get the concurrency aspect right (it uses a producer and consumer thread.)

  7. Hi John,

    For such a task I would implement something less performance-concerned:


    import scala.xml.XML._
    import java.io._

    val root = loadFile("1_registration.xml")

    val out = root.\("Resource").map(resource =>
      Array(
        resource.attribute("resourceName").getOrElse(""), "\t",
        resource.attribute("description").getOrElse(""), "\t",
        resource.\\("@permissionName").toArray.deepMkString(", "),
        "\n")).toArray.deepMkString("")

    print(out)

  8. Hi, I have a real-world application for this project. I own a site called Http://CloseToTheBeach.com, and we too built a solution (a scraper) 3 years ago that crashed when Homeaway bought VRBO. Is anyone interested in solving this problem in a scalable environment? I'd like to get all my clients, the beach property owners, working again. I run a dedicated SQL server, and each property record already has a JavaScript box that I can just drop code into.
