Friday, August 7, 2009

VRBO Web Scraper using Scala and TagSoup

When we go on vacation, my wife and I like to stay in houses with kitchens and bedrooms with doors we can close. We have two young daughters, and it's not fun for anyone in the family to sleep together in one room, especially when one of the girls is screaming "THAT'S TOO LOUD!" at the other one.

There are lots of properties on Vacation Rentals by Owner (VRBO), but VRBO doesn't have an advanced search engine that lets you specify your travel dates and number of bedrooms, as HomeAway does. It's tedious to click through the VRBO listings individually and check the availability calendar for each.

So I created a web scraper in Scala for VRBO that builds a database of property descriptions and availability dates. I had recently finished reading Programming in Scala and wanted to build something in Scala. Plus, Scala has a seductively simple XML syntax.

My first attempt, which used Scala's built-in XML parser, failed because the calendar HTML could not be parsed as XML. I then found a blog post, Processing real world HTML as if it were XML in scala, with a JAR that I imported into my Scala project along with this JAR from TagSoup.
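The failure is easy to reproduce with any strict XML parser. The sketch below uses the JDK's DocumentBuilder rather than scala.xml, purely to stay dependency-free; the object name StrictParseSketch and the HTML fragments are invented for illustration. The post doesn't say which defect the calendar page had, and an unclosed tag is only one common cause.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

object StrictParseSketch {
  // Returns true only if the markup is well-formed XML.
  def parsesAsXml(markup: String): Boolean = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler throws on fatal errors instead of logging them to the console.
    builder.setErrorHandler(new DefaultHandler)
    try {
      builder.parse(new InputSource(new StringReader(markup)))
      true
    } catch {
      case _: SAXParseException => false
    }
  }

  def main(args: Array[String]): Unit = {
    // Browsers accept an unclosed <br>; a strict XML parser does not.
    println(parsesAsXml("<table><tr><td>1<br></td></tr></table>"))
    println(parsesAsXml("<table><tr><td>1<br/></td></tr></table>"))
  }
}
```

TagSoup sits in front of the XML layer and repairs this kind of markup, which is why the scraper can treat the page as an ordinary XML tree afterwards.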

To use the scraper, supply the URL of a VRBO page and the name of a folder on your hard drive for the tab-delimited results. The scraper writes property data to one file and calendar data to another; both are named after the last segment of the URL, with -prop.txt and -cal.txt suffixes. To load the tab-delimited text files into Oracle, I used this DDL and these (1, 2) control files. Then I could query the data using SQL.

It works great, and the Scala code is compact and readable (at least to me):


import xml._
import java.io._
import de.hars.scalaxml._

object VrboParser {
  // args(0): URL of a VRBO listing page; args(1): output folder
  def main(args:Array[String]) {
    val url = args(0)
    // Output files are named after the last segment of the URL
    val path = args(1) + File.separator + url.substring(url.lastIndexOf('/') + 1)
    val ele = new TagSoupFactoryAdapter load url
    // Each property is an <li> whose class is "property-row clearfix" or "property-alt-row clearfix"
    val props = (ele \\ "li") filter (attribute(_, "class").matches("property-(alt-)?row clearfix"))
    val propSb = new StringBuilder
    val writer = createWriter(path + "-cal.txt")
    for (prop <- props) {
      val propInfo = propertyInfo(prop)
      propSb append propInfo._2
      writer.write(loadCalendar(propInfo._1))
    }
    writer.close
    writeResults(path + "-prop.txt", propSb.toString)
  }

  // Returns the value of the named attribute, or "" if the node doesn't have it
  def attribute(node: Node, label: String) =
    node.attribute(label) match {
      case Some(res) => res.first.toString
      case None => ""
    }

  // Fetches the availability calendar for one property and returns its rows.
  // Only tables whose id starts with "calMonthAvail2009" are read, so the year is hard-coded.
  def loadCalendar(propertyId: String) = {
    val sb = new StringBuilder
    val calURL = "http://www.homeawayconnect.com/calendar.aspx?propertyid=" + propertyId + "&cid=5"
    val calNode = new TagSoupFactoryAdapter load calURL
    val cals = (calNode \\ "table") filter (attribute(_, "id") startsWith "calMonthAvail2009")
    for (month <- cals) sb append calInfo(month, propertyId)
    sb.toString
  }

  // Writes one row per day cell: property id, date (yyyyMMdd), cell class
  def calInfo(month: Node, propertyId: String) = {
    val sb = new StringBuilder
    val id = attribute(month, "id")
    // The last six characters of the table id are the year and month
    val yearMonth = id.substring(id.length - 6)
    val dates = (month \\ "td") filter (attribute(_, "class") matches "AC.DV")
    for (date <- dates) sb.append(propertyId + "\t" + yearMonth + zeroPad(date) + "\t" + attribute(date, "class") + "\r\n")
    sb.toString
  }

  // Pads a one-digit day of the month to two digits
  def zeroPad(date: Node) = (if (date.text.length < 2) "0" else "") + date.text

  // Returns (property id, tab-delimited row of id, location and details)
  def propertyInfo(prop: Node) = {
    val titles = (prop \\ "a") filter (attribute(_, "class") == "property-title")
    val items = (titles.first \\ "span").first.text.substring(2).split("\\[")
    val propertyId = items(1).substring(1, items(1).length - 1)
    val location = items(0).trim()
    val details = (prop \\ "li") filter (attribute(_, "class") == "property-details")
    (propertyId, propertyId + "\t" + location + "\t" + details.first.text + "\r\n")
  }

  def createWriter(path: String) = new BufferedWriter(new FileWriter(new File(path)))

  def writeResults(path: String, result: String) {
    val writer = createWriter(path)
    writer.write(result)
    writer.close
  }
}
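The string handling in calInfo and zeroPad is easier to follow on plain strings. Here is a minimal sketch that takes the cell text and table id directly instead of XML nodes. The object name DateKeySketch, the property id 12345, the table id calMonthAvail200908 and the class value ACADV are illustrative assumptions, not values captured from VRBO.

```scala
object DateKeySketch {
  // Same rule as zeroPad in the listing, applied to the cell text itself.
  def zeroPad(day: String): String = (if (day.length < 2) "0" else "") + day

  // Same rule as calInfo: the last six characters of the table id are yyyyMM.
  def yearMonth(tableId: String): String = tableId.substring(tableId.length - 6)

  // One tab-delimited calendar row: property id, yyyyMMdd, cell class.
  def calendarRow(propertyId: String, tableId: String, day: String, cellClass: String): String =
    propertyId + "\t" + yearMonth(tableId) + zeroPad(day) + "\t" + cellClass + "\r\n"

  // String.matches must match the whole string, so the optional group
  // accepts exactly the two row classes and nothing longer.
  def isPropertyRow(cssClass: String): Boolean =
    cssClass.matches("property-(alt-)?row clearfix")

  def main(args: Array[String]): Unit = {
    print(calendarRow("12345", "calMonthAvail200908", "7", "ACADV"))
    println(isPropertyRow("property-alt-row clearfix"))
  }
}
```

Because the date column sorts correctly as text in yyyyMMdd form, a date range can be selected in SQL with a simple BETWEEN on that column.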

Thursday, July 30, 2009

XML with Scala and Java

The web application I build at work stores user permissions in XML. Each component of the application can be assigned a set of permissions like view, create, update and delete. Our documentation includes an Excel spreadsheet version of the same data.

The spreadsheet looks like this:

And the corresponding XML looks like this:

I discovered that my co-workers were updating this spreadsheet manually whenever the XML changed and at the end of each software release. So I wrote a simple utility in Java to convert the XML into the spreadsheet format. I used SAX to process the XML and wrote the results to a tab-delimited text file that could be copied and pasted into a spreadsheet. You can see the Java code here.

Then I finished reading Programming in Scala and wanted to write some Scala code. So I translated my Java utility into Scala, and you can see the Scala code here. I couldn't find any examples that used the scala.xml.pull package, so I had to figure it out using the API documentation. One segment of the Scala code is below:

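// getAttributeValue is a helper from the full listing (not shown here). Judging by
// these calls, it takes the event, the attribute name, a prefix and a suffix.
def getPermissions(file: File, results: ListBuffer[String]) {
  val er = new XMLEventReader()
  er.initialize(io.Source.fromFile(file))
  val sb = new StringBuilder
  var atEnd = false
  while (!atEnd) {
    val next = er.next
    next match {
      // A Resource starts a new row: name and description, each followed by a tab
      case EvElemStart(_, "Resource", _, _) => {
        sb.append(getAttributeValue(next, "resourceName", "", "\t"))
        sb.append(getAttributeValue(next, "description", "", "\t"))
      }
      // Permissions are appended to the current row, comma-separated after the first
      case EvElemStart(_, "Permission", _, _) => {
        if (!sb.isEmpty) {
          sb.append(getAttributeValue(next, "permissionName", if (sb.endsWith("\t")) "" else ", ", ""))
        }
      }
      // The end of a Resource completes the row
      case EvElemEnd(_, "Resource") => {
        results += sb.toString
        sb.clear
      }
      // The end of the root element is the signal to stop the reader
      case EvElemEnd(_, "Application") => {
        atEnd = true
        er.stop
      }
      case _ =>
    }
  }
}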
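For comparison, the same flattening can be sketched with the JDK's StAX pull parser (javax.xml.stream) called from Scala. This is a substitute for scala.xml.pull, not the API used in the segment above. The object name, the attr helper and the sample XML are hypothetical; only the element and attribute names are borrowed from the segment above.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamReader}
import scala.collection.mutable.ListBuffer

object StaxPermissionsSketch {
  // Hypothetical sample; the real permission files are not shown in the post.
  val sample: String =
    """<Application>
      |  <Resource resourceName="Orders" description="Order screen">
      |    <Permission permissionName="view"/>
      |    <Permission permissionName="update"/>
      |  </Resource>
      |  <Resource resourceName="Reports" description="Report list">
      |    <Permission permissionName="view"/>
      |  </Resource>
      |</Application>""".stripMargin

  // Returns the attribute value, or "" when the attribute is absent.
  private def attr(reader: XMLStreamReader, name: String): String =
    Option(reader.getAttributeValue(null, name)).getOrElse("")

  // One row per Resource: name, description, then comma-separated permissions.
  def getPermissions(xml: String): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val results = new ListBuffer[String]
    val sb = new StringBuilder
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            reader.getLocalName match {
              case "Resource" =>
                sb.append(attr(reader, "resourceName")).append("\t")
                sb.append(attr(reader, "description")).append("\t")
              case "Permission" if sb.nonEmpty =>
                // The first permission follows the tab; later ones get a comma.
                if (sb.last != '\t') sb.append(", ")
                sb.append(attr(reader, "permissionName"))
              case _ =>
            }
          case XMLStreamConstants.END_ELEMENT if reader.getLocalName == "Resource" =>
            results += sb.toString
            sb.clear()
          case _ =>
        }
      }
    } finally {
      reader.close()
    }
    results.toList
  }

  def main(args: Array[String]): Unit =
    getPermissions(sample).foreach(println)
}
```

Here hasNext returns false once the document ends, so the loop needs neither a sentinel flag nor a check for the closing root element.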

I learned the following about Scala's pull parser:
  • If you don't call XMLEventReader.stop when you are finished parsing a file, then the thread stays alive and your application never exits.
  • XMLEventReader.hasNext always returns true (in version 2.7.5.final), so I couldn't use it for the while() loop above. Instead, I had to create the atEnd Boolean variable and look for the ending XML element.
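  • It's roughly ten times slower at parsing than SAX in Java (about 6 to 14 times slower per file in the timings below, and about 7 times slower end to end).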
On this last point, both versions write timing information to the console.

Scala:

parsing app1.xml took 422 milliseconds.
parsing app2.xml took 156 milliseconds.
parsing app3.xml took 68 milliseconds.
parsing app4.xml took 203 milliseconds.
writeResults took 8 milliseconds.
Completed in 888 milliseconds.

Java:

parsing app1.xml took 68 milliseconds.
parsing app2.xml took 14 milliseconds.
parsing app3.xml took 5 milliseconds.
parsing app4.xml took 14 milliseconds.
writeResults took 5 milliseconds.
Completed in 127 milliseconds.
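Timing lines in this format can be produced with a small wrapper around each step. This is a sketch of one way to do it, not the code from either utility; the name timed and the summing example are assumptions for illustration.

```scala
object TimingSketch {
  // Runs the block, prints the elapsed time in the format shown above,
  // and returns the block's result.
  def timed[A](label: String)(block: => A): A = {
    val start = System.nanoTime()
    val result = block
    val millis = (System.nanoTime() - start) / 1000000L
    println(label + " took " + millis + " milliseconds.")
    result
  }

  def main(args: Array[String]): Unit = {
    val total = timed("summing") { (1 to 1000000).map(_.toLong).sum }
    println(total)
  }
}
```

Single cold runs like the ones above also include class loading and JIT warm-up, so the ratio between the two versions may look different over repeated runs.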