Good Habits: VRBO Web Scraper using Scala and TagSoup

When we go on vacation my wife and I like to stay in houses with kitchens and bedrooms with doors we can close. We have two young daughters, and it's not fun for anyone in the family to sleep together in one room, especially when one of the girls is screaming "THAT'S TOO LOUD!" at the other one.

There are lots of properties on Vacation Rentals by Owner, but VRBO doesn't have an advanced search engine that lets you specify your travel dates and number of bedrooms, as HomeAway does. It's tedious to click through the VRBO listings individually and check the availability calendar for each.

So I created a web scraper in Scala for VRBO that builds a database of property descriptions and availability dates. I recently finished reading Programming in Scala, and I wanted to build something in Scala. Plus Scala has a seductively simple XML syntax.

My first attempt to use Scala's built-in XML parser failed because the calendar HTML could not be parsed as XML. I found a blog post, "Processing real world HTML as if it were XML in scala", with a JAR that I imported into my Scala project along with the JAR from TagSoup.
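
The failure is easy to reproduce. The sketch below is only an illustration: it uses the JDK's strict XML parser rather than Scala's own loader or TagSoup, the HTML fragment is a made-up stand-in for the real calendar markup, and `parsesAsXml` is a hypothetical helper name.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

object StrictParseDemo {
  // Hypothetical helper: true if the JDK's strict XML parser accepts the text.
  def parsesAsXml(text: String): Boolean = Try {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler rethrows fatal errors instead of printing them to stderr.
    builder.setErrorHandler(new DefaultHandler)
    builder.parse(new InputSource(new StringReader(text)))
  }.isSuccess

  def main(args: Array[String]): Unit = {
    // Made-up fragments, not copied from the real calendar page.
    val wellFormed = "<table><tr><td>7</td></tr></table>"
    val realWorld = "<table><tr><td class=AC>7<br></table>"
    println(parsesAsXml(wellFormed)) // true
    println(parsesAsXml(realWorld))  // false: unquoted attribute, unclosed tags
  }
}
```

A tolerant parser such as TagSoup repairs this kind of markup before handing it on, which is why the scraper can then use Scala's XML operators on the result.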

To use the scraper, supply the URL of a VRBO page and the name of a folder on your hard drive for the tab-delimited results. The scraper writes property data to one file and calendar data to another. To load the tab-delimited text files into Oracle, I used a DDL script and two control files. Then I could query the data using SQL.
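
For readers without Oracle handy, here is a rough in-memory sketch of the kind of query the calendar file supports. It assumes each row is property ID, `yyyyMMdd` date, and CSS class, separated by tabs. Which class value means "available" is not stated here, so `availableStatus` is a parameter and the `ACADV` / `ACBDV` values in the sample are hypothetical, as are all the names in the sketch.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object AvailabilityQuery {
  private val keyFormat = DateTimeFormatter.BASIC_ISO_DATE // yyyyMMdd

  // One row of the calendar file: propertyId <TAB> yyyyMMdd <TAB> CSS class.
  case class CalRow(propertyId: String, date: LocalDate, status: String)

  def parse(line: String): CalRow = {
    val fields = line.trim.split("\t") // trim drops the trailing \r\n
    CalRow(fields(0), LocalDate.parse(fields(1), keyFormat), fields(2))
  }

  // Properties that have a row marked `availableStatus` for every night
  // from `from` (inclusive) to `to` (exclusive).
  def available(rows: Seq[CalRow], from: LocalDate, to: LocalDate,
                availableStatus: String): Set[String] = {
    val nights = Iterator.iterate(from)(_.plusDays(1)).takeWhile(_.isBefore(to)).toSet
    rows
      .filter(r => r.status == availableStatus && nights(r.date))
      .groupBy(_.propertyId)
      .collect { case (id, rs) if rs.map(_.date).toSet == nights => id }
      .toSet
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows; the status codes are placeholders.
    val rows = Seq(
      "101\t20090807\tACADV\r\n",
      "101\t20090808\tACADV\r\n",
      "202\t20090807\tACADV\r\n",
      "202\t20090808\tACBDV\r\n"
    ).map(parse)
    println(available(rows, LocalDate.of(2009, 8, 7), LocalDate.of(2009, 8, 9), "ACADV"))
  }
}
```

With the sample rows, only property 101 is marked with the chosen status on both nights, so it is the only ID returned.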

It works great, and the Scala code is compact and readable (at least to me):


import xml._
import java.io._
import de.hars.scalaxml._

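object VrboParser {
  // Usage: VrboParser <VRBO listing URL> <output folder>
  def main(args: Array[String]): Unit = {
    val url = args(0)
    // Output files are named after the last path segment of the URL.
    val path = args(1) + File.separator + url.substring(url.lastIndexOf('/') + 1)
    val ele = new TagSoupFactoryAdapter load url
    // Each listing is an <li> whose class is "property-row clearfix" or "property-alt-row clearfix".
    val props = (ele \\ "li") filter (attribute(_, "class").matches("property-(alt-)?row clearfix"))
    val propSb = new StringBuilder
    val writer = createWriter(path + "-cal.txt")
    // Close the calendar file even if a page fails to load or parse.
    try {
      for (prop <- props) {
        val (propertyId, propertyLine) = propertyInfo(prop)
        propSb append propertyLine
        writer.write(loadCalendar(propertyId))
      }
    } finally {
      writer.close()
    }
    writeResults(path + "-prop.txt", propSb.toString)
  }

  // Returns the value of the named attribute, or "" if the node does not have it.
  // (head replaces first, which was deprecated in Scala 2.8.)
  def attribute(node: Node, label: String) =
    node.attribute(label) match {
      case Some(res) => res.head.toString
      case None => ""
    }

  // Fetches the availability calendar for one property and returns its rows as tab-delimited text.
  def loadCalendar(propertyId: String) = {
    val sb = new StringBuilder
    val calURL = "http://www.homeawayconnect.com/calendar.aspx?propertyid=" + propertyId + "&cid=5"
    val calNode = new TagSoupFactoryAdapter load calURL
    // Only the 2009 month tables are kept; change the prefix to collect another year.
    val cals = (calNode \\ "table") filter (attribute(_, "id") startsWith "calMonthAvail2009")
    for (month <- cals) sb append calInfo(month, propertyId)
    sb.toString
  }

  // Emits one line per matching day cell: propertyId, yyyyMMdd, CSS class.
  def calInfo(month: Node, propertyId: String) = {
    val sb = new StringBuilder
    val id = attribute(month, "id")
    // The last six characters of the table id are the year and month (yyyyMM).
    val yearMonth = id.substring(id.length - 6)
    val dates = (month \\ "td") filter (attribute(_, "class") matches "AC.DV")
    for (date <- dates) sb.append(propertyId + "\t" + yearMonth + zeroPad(date) + "\t" + attribute(date, "class") + "\r\n")
    sb.toString
  }

  // Pads a one-digit day of the month to two digits.
  def zeroPad(date: Node) = (if (date.text.length < 2) "0" else "") + date.text

  // Returns (propertyId, tab-delimited property line) for one listing.
  def propertyInfo(prop: Node) = {
    val titles = (prop \\ "a") filter (attribute(_, "class") == "property-title")
    // The title span holds the location, then "[", then the ID wrapped in one leading and one trailing character.
    val items = (titles.head \\ "span").head.text.substring(2).split("\\[")
    val propertyId = items(1).substring(1, items(1).length - 1)
    val location = items(0).trim()
    val details = (prop \\ "li") filter (attribute(_, "class") == "property-details")
    (propertyId, propertyId + "\t" + location + "\t" + details.head.text + "\r\n")
  }

  def createWriter(path: String) = new BufferedWriter(new FileWriter(new File(path)))

  // Writes the whole result to a file, closing it even if the write fails.
  def writeResults(path: String, result: String): Unit = {
    val writer = createWriter(path)
    try {
      writer.write(result)
    } finally {
      writer.close()
    }
  }
}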
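
The date column in the calendar file is built in two steps, in calInfo and zeroPad. Here is a small sketch of the same logic on plain strings. The table id `calMonthAvail200908` is an assumed example, inferred from the `calMonthAvail2009` prefix and the six-character suffix the code relies on, and `dateKey` is a hypothetical name.

```scala
object DateKeyDemo {
  // Mirrors calInfo and zeroPad on plain strings: the last six characters of
  // the table id are taken as yyyyMM, and the day is padded to two digits.
  def dateKey(tableId: String, dayText: String): String = {
    val yearMonth = tableId.substring(tableId.length - 6)
    val day = (if (dayText.length < 2) "0" else "") + dayText
    yearMonth + day
  }

  def main(args: Array[String]): Unit = {
    println(dateKey("calMonthAvail200908", "7"))  // 20090807
    println(dateKey("calMonthAvail200912", "25")) // 20091225
  }
}
```

Because the key is a fixed-width yyyyMMdd string, it sorts chronologically as text, which keeps range queries simple once the file is loaded into a database.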

4 comments:

nonhovoglia, October 31, 2009 at 3:46 AM
I will give it a try. Do you know if there is something similar to Mechanize for Ruby?
Unknown, November 1, 2009 at 7:50 PM
Great post on web scrapers, with some well thought out points, I use python for simple web scrapers, data extraction can be a time consuming process but for other projects that include documents, files, or the web i tried "web scrapers" which worked great, they build quick custom screen scrapers, web scrapers, and data parsing programs
marvinavilez, January 21, 2010 at 10:32 AM
Thanks for the post....has opened my world to Scala...I don't have much programming experience and I was having a problem with the JAR files...your code calls for "import de.hars.scalaxml._". I downloaded the JAR files from Hars, "scalaxml.jar" & "tagsoup-1.2.jar", and placed them into the same folder as the "vrbo.scala" file, but I still get an error when I go to compile....the error points to the "import de.hars.scalaxml._", giving an error at the "d"....should I change the name of the jar file? unzip the jars? change the name in the code to the names of the JARs? any pointers would be great! thanks
M
andy, May 24, 2010 at 12:13 PM
I'm guessing you don't have your classpath set correctly.

scalac -cp scalaxml.jar:tagsoup-1.2.jar vrbo.scala

Friday, August 7, 2009
