Friday, August 7, 2009

VRBO Web Scraper using Scala and TagSoup

When we go on vacation, my wife and I like to stay in houses with kitchens and bedrooms with doors we can close. We have two young daughters, and it's not fun for anyone in the family to sleep together in one room, especially when one of the girls is screaming "THAT'S TOO LOUD!" at the other one.

There are lots of properties on Vacation Rentals by Owner, but VRBO doesn't have an advanced search engine that lets you specify your travel dates and number of bedrooms the way HomeAway does. It's tedious to click through the VRBO listings individually and check the availability calendar for each.

So I created a web scraper in Scala for VRBO that builds a database of property descriptions and availability dates. I recently finished reading Programming in Scala, and I wanted to build something in Scala. Plus, Scala has a seductively simple XML syntax.

My first attempt to use Scala's built-in XML parser failed because the calendar HTML could not be parsed as XML. I found a blog post, Processing real world HTML as if it were XML in scala, with a JAR that I imported into my Scala project along with this JAR from TagSoup.

To use the scraper, supply the URL of a VRBO page and the name of a folder on your hard drive for the tab-delimited results. The scraper collects property data in one file and calendar data in another. To load the tab-delimited text files into Oracle, I used this DDL and these (1, 2) control files. Then I could query the data with SQL.
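The same kind of question can be asked of the calendar file without a database. What follows is a minimal sketch and not the SQL run against Oracle: it assumes calendar rows in the shape the scraper writes (property id, a yyyyMMdd date, and the CSS class of the calendar cell, separated by tabs) and returns the properties whose cells carry a given class on every night of a stay. The names `CalendarQuery`, `CalRow`, `parseRow`, and `availableFor` are hypothetical, and the class that marks an open night is passed in as a parameter because it has to be checked against the real calendar markup.

```scala
object CalendarQuery {
  // One row of the calendar file: property id, yyyyMMdd date, CSS class of the cell.
  final case class CalRow(propertyId: String, date: String, cssClass: String)

  // Parse one tab-delimited line; lines without exactly three fields are skipped.
  def parseRow(line: String): Option[CalRow] =
    line.trim.split("\t") match {
      case Array(id, date, cls) => Some(CalRow(id, date, cls))
      case _                    => None
    }

  // Ids of properties whose cell class equals `openClass` on every requested night.
  // `openClass` is an assumption supplied by the caller, not a value taken from VRBO.
  def availableFor(rows: Seq[CalRow], nights: Set[String], openClass: String): Set[String] =
    rows
      .filter(r => nights.contains(r.date) && r.cssClass == openClass)
      .groupBy(_.propertyId)
      .collect { case (id, hits) if hits.map(_.date).toSet == nights => id }
      .toSet
}
```

To use it, read the lines of the `-cal.txt` file, pass them through `parseRow`, and call `availableFor` with the set of nights you need. Doing the filtering in SQL has the advantage that the property descriptions can be joined in with the same query.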

It works great, and the Scala code is compact and readable (at least to me):


import xml._
import java.io._
import de.hars.scalaxml._

object VrboParser {
  // args(0): URL of a VRBO listing page; args(1): folder for the output files
  def main(args: Array[String]): Unit = {
    val url = args(0)
    // Output files are named after the last segment of the URL
    val path = args(1) + File.separator + url.substring(url.lastIndexOf('/') + 1)
    val ele = new TagSoupFactoryAdapter load url
    // Each property is an <li> whose class is "property-row clearfix" or "property-alt-row clearfix"
    val props = (ele \\ "li") filter (attribute(_, "class").matches("property-(alt-)?row clearfix"))
    val propSb = new StringBuilder
    val writer = createWriter(path + "-cal.txt")
    for (prop <- props) {
      val propInfo = propertyInfo(prop)
      propSb append propInfo._2
      writer.write(loadCalendar(propInfo._1))
    }
    writer.close()
    writeResults(path + "-prop.txt", propSb.toString)
  }

  // First value of the named attribute, or an empty string if the node lacks it
  def attribute(node: Node, label: String) =
    node.attribute(label) match {
      case Some(res) => res.head.toString
      case None => ""
    }

  // Fetches the availability calendar for one property and returns its rows as text.
  // Only month tables whose id starts with "calMonthAvail2009" are read, so the year is hard-coded.
  def loadCalendar(propertyId: String) = {
    val sb = new StringBuilder
    val calURL = "http://www.homeawayconnect.com/calendar.aspx?propertyid=" + propertyId + "&cid=5"
    val calNode = new TagSoupFactoryAdapter load calURL
    val cals = (calNode \\ "table") filter (attribute(_, "id") startsWith "calMonthAvail2009")
    for (month <- cals) sb append calInfo(month, propertyId)
    sb.toString
  }

  // One row per day cell: property id, date, and the cell's class, separated by tabs
  def calInfo(month: Node, propertyId: String) = {
    val sb = new StringBuilder
    val id = attribute(month, "id")
    // The last six characters of the table id hold the year and month
    val yearMonth = id.substring(id.length - 6)
    val dates = (month \\ "td") filter (attribute(_, "class") matches "AC.DV")
    for (date <- dates) sb.append(propertyId + "\t" + yearMonth + zeroPad(date) + "\t" + attribute(date, "class") + "\r\n")
    sb.toString
  }

  // Pads a one-digit day of the month to two digits
  def zeroPad(date: Node) = (if (date.text.length < 2) "0" else "") + date.text

  // Returns the property id and a tab-delimited row of id, location, and details
  def propertyInfo(prop: Node) = {
    val titles = (prop \\ "a") filter (attribute(_, "class") == "property-title")
    val items = (titles.head \\ "span").head.text.substring(2).split("\\[")
    val propertyId = items(1).substring(1, items(1).length - 1)
    val location = items(0).trim()
    val details = (prop \\ "li") filter (attribute(_, "class") == "property-details")
    (propertyId, propertyId + "\t" + location + "\t" + details.head.text + "\r\n")
  }

  def createWriter(path: String) = new BufferedWriter(new FileWriter(new File(path)))

  def writeResults(path: String, result: String): Unit = {
    val writer = createWriter(path)
    writer.write(result)
    writer.close()
  }
}
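
The string handling in propertyInfo and calInfo is the least obvious part of the listing, so here it is pulled out into plain functions that run without the XML libraries. This is only a sketch under assumptions: the object and function names (`ScraperStrings`, `splitTitle`, `dateKey`) are hypothetical, the guard that returns `None` is an addition, and the sample values in the comments are made up to fit the shape the code implies rather than copied from a real VRBO page.

```scala
object ScraperStrings {
  // Same steps as propertyInfo, plus a guard: drop two leading characters,
  // split at '[', then strip one character from each end of the bracketed
  // part to get the id. Returns (propertyId, location).
  def splitTitle(text: String): Option[(String, String)] =
    text.drop(2).split("\\[") match {
      case Array(location, rest) if rest.length >= 2 =>
        Some((rest.substring(1, rest.length - 1), location.trim))
      case _ => None // title is not in the assumed shape
    }

  // Same steps as calInfo and zeroPad: the last six characters of the table id
  // (year and month) followed by the day of the month padded to two digits.
  def dateKey(tableId: String, day: String): String =
    tableId.takeRight(6) + (if (day.length < 2) "0" else "") + day
}

// Made-up inputs, for illustration only:
// ScraperStrings.splitTitle("1 Sample Town [#12345]")  returns Some(("12345", "Sample Town"))
// ScraperStrings.dateKey("calMonthAvail200908", "7")   returns "20090807"
```

Keeping these steps as pure functions makes it easy to check them against a few sample strings whenever the page layout changes.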