Vacation Rentals by Owner (VRBO) lists lots of properties, but unlike HomeAway it doesn't have an advanced search engine that lets you specify your travel dates and number of bedrooms. It's tedious to click through the VRBO listings one by one and check the availability calendar for each.
So I created a web scraper in Scala for VRBO that builds a database of property descriptions and availability dates. I recently finished reading Programming in Scala, and I wanted to build something in Scala. Plus Scala has a seductively simple XML syntax.
My first attempt to use Scala's built-in XML parser failed because the calendar HTML could not be parsed as XML. I found a blog post, "Processing real world HTML as if it were XML in scala", with a JAR that I imported into my Scala project along with this JAR from TagSoup.
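To illustrate why a strict XML parser gives up on ordinary HTML, here is a minimal sketch that uses only the JDK's built-in parser. The object name `StrictXmlDemo`, the helper `parsesAsXml`, and the two markup snippets are made up for this illustration; they are not taken from the VRBO pages or from the scraper.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

// Hypothetical demo object, not part of the scraper
object StrictXmlDemo {
  // True when the markup is well-formed XML, false when the strict parser rejects it
  def parsesAsXml(markup: String): Boolean = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler rethrows fatal errors without printing them to stderr
    builder.setErrorHandler(new DefaultHandler)
    Try(builder.parse(new InputSource(new StringReader(markup)))).isSuccess
  }

  def main(args: Array[String]): Unit = {
    // Every tag is closed, so this is well-formed XML
    println(parsesAsXml("<table><tr><td>1</td></tr></table>"))
    // The unclosed <br> is fine in HTML but is a fatal error in XML
    println(parsesAsXml("<table><tr><td>1<br></td></tr></table>"))
  }
}
```

The first call prints `true` and the second prints `false`. TagSoup is designed to accept the second kind of markup and report it as a well-formed event stream, which is what lets the XML-style queries further down work on real pages.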
To use the scraper, supply the URL of a VRBO page and a folder name on your hard drive for the tab-delimited results. The scraper collects property data in one file and calendar data in another. To load the tab-delimited text files into Oracle, I used this DDL and these (1, 2) control files. And then I could query using SQL.
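If you don't have Oracle handy, the same kind of question can be asked of the calendar file in plain Scala. This is only a rough sketch, not what I ran: it assumes each row is property id, yyyyMMdd date, and calendar cell class, separated by tabs. The names `CalendarQuery`, `CalRow`, `load`, and `availableFor` are hypothetical, and the status values `ACADV` and `ACUDV` are placeholders that merely fit the `AC.DV` pattern; check your own output to see which class actually means "available".

```scala
import scala.io.Source

// Hypothetical helper, not part of the scraper
object CalendarQuery {
  // One row of the "-cal.txt" file
  final case class CalRow(propertyId: String, date: String, status: String)

  // Parses "id<TAB>yyyyMMdd<TAB>status"; rows with any other shape are skipped
  def parseRow(line: String): Option[CalRow] = line.split("\t") match {
    case Array(id, date, status) => Some(CalRow(id, date, status.trim))
    case _ => None
  }

  // Reads every well-formed row from a calendar file
  def load(path: String): List[CalRow] = {
    val source = Source.fromFile(path)
    try source.getLines().flatMap(line => parseRow(line)).toList
    finally source.close()
  }

  // Ids of properties whose rows between from and to (inclusive) all carry availableStatus.
  // yyyyMMdd strings sort chronologically, so plain string comparison is enough.
  def availableFor(rows: Seq[CalRow], from: String, to: String, availableStatus: String): Set[String] =
    rows.filter(r => r.date >= from && r.date <= to)
      .groupBy(_.propertyId)
      .collect { case (id, nights) if nights.forall(_.status == availableStatus) => id }
      .toSet

  def main(args: Array[String]): Unit = {
    // Made-up sample rows; the status codes are placeholders
    val rows = Seq(
      CalRow("1001", "20090704", "ACADV"),
      CalRow("1001", "20090705", "ACADV"),
      CalRow("1002", "20090704", "ACADV"),
      CalRow("1002", "20090705", "ACUDV"))
    println(availableFor(rows, "20090704", "20090705", "ACADV")) // Set(1001)
  }
}
```

Note that a property with no rows at all for some night in the range still passes this check, so a real query would also want to count the nights.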
It works great, and the Scala code is compact and readable (at least to me):
import scala.xml._
import java.io._
import de.hars.scalaxml._

object VrboParser {
  // args(0): URL of a VRBO listing page; args(1): folder for the tab-delimited output
  def main(args: Array[String]): Unit = {
    val url = args(0)
    // Output files are named after the last segment of the URL
    val path = args(1) + File.separator + url.substring(url.lastIndexOf('/') + 1)
    val ele = new TagSoupFactoryAdapter load url
    // Each property is an <li> whose class is "property-row clearfix" or "property-alt-row clearfix"
    val props = (ele \\ "li") filter (attribute(_, "class").matches("property-(alt-)?row clearfix"))
    val propSb = new StringBuilder
    val writer = createWriter(path + "-cal.txt")
    try {
      for (prop <- props) {
        val propInfo = propertyInfo(prop)
        propSb append propInfo._2
        writer.write(loadCalendar(propInfo._1))
      }
    } finally writer.close()
    writeResults(path + "-prop.txt", propSb.toString)
  }

  // Value of the named attribute, or "" when the node does not have it
  def attribute(node: Node, label: String) =
    node.attribute(label) match {
      case Some(res) => res.head.toString
      case None => ""
    }

  // Fetches the calendar page for one property and returns its rows as tab-delimited text
  def loadCalendar(propertyId: String) = {
    val sb = new StringBuilder
    val calURL = "http://www.homeawayconnect.com/calendar.aspx?propertyid=" + propertyId + "&cid=5"
    val calNode = new TagSoupFactoryAdapter load calURL
    // Only the month tables for 2009 are collected
    val cals = (calNode \\ "table") filter (attribute(_, "id") startsWith "calMonthAvail2009")
    for (month <- cals) sb append calInfo(month, propertyId)
    sb.toString
  }

  // One row per matching day cell: property id, yyyyMMdd date, CSS class of the cell
  def calInfo(month: Node, propertyId: String) = {
    val sb = new StringBuilder
    val id = attribute(month, "id")
    // The last six characters of the table id are the year and month
    val yearMonth = id.substring(id.length - 6)
    val dates = (month \\ "td") filter (attribute(_, "class") matches "AC.DV")
    for (date <- dates) sb.append(propertyId + "\t" + yearMonth + zeroPad(date) + "\t" + attribute(date, "class") + "\r\n")
    sb.toString
  }

  // Day of the month as two digits
  def zeroPad(date: Node) = (if (date.text.length < 2) "0" else "") + date.text

  // Returns (property id, tab-delimited row of id, location, and details)
  def propertyInfo(prop: Node) = {
    val titles = (prop \\ "a") filter (attribute(_, "class") == "property-title")
    // The title text is split at '[' into the location and the bracketed property id
    val items = (titles.head \\ "span").head.text.substring(2).split("\\[")
    val propertyId = items(1).substring(1, items(1).length - 1)
    val location = items(0).trim()
    val details = (prop \\ "li") filter (attribute(_, "class") == "property-details")
    (propertyId, propertyId + "\t" + location + "\t" + details.head.text + "\r\n")
  }

  def createWriter(path: String) = new BufferedWriter(new FileWriter(new File(path)))

  def writeResults(path: String, result: String): Unit = {
    val writer = createWriter(path)
    try writer.write(result)
    finally writer.close()
  }
}
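To make the date column concrete, here is a small sketch of how calInfo and zeroPad combine a month table's id with a day cell, rewritten to work on plain strings so it runs without any JARs. The object `DateKeyDemo` and the sample ids are assumptions for illustration; the only things taken from the scraper are that the id starts with `calMonthAvail2009` and that its last six characters are used as the year and month.

```scala
// Hypothetical demo object, not part of the scraper
object DateKeyDemo {
  // Last six characters of the table id, e.g. "200907"
  def yearMonth(tableId: String): String = tableId.substring(tableId.length - 6)

  // Pads a one-digit day with a leading zero
  def zeroPad(day: String): String = (if (day.length < 2) "0" else "") + day

  // yyyyMMdd key as written to the calendar file
  def dateKey(tableId: String, day: String): String = yearMonth(tableId) + zeroPad(day)

  def main(args: Array[String]): Unit = {
    println(dateKey("calMonthAvail200907", "5"))  // 20090705
    println(dateKey("calMonthAvail200912", "25")) // 20091225
  }
}
```

Because the key is a fixed-width yyyyMMdd string, it sorts chronologically as text, which keeps date-range queries simple once the file is loaded.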
I will give it a try. Do you know if there is something similar to mechanize for Ruby?
Great post on web scrapers, with some well-thought-out points. I use Python for simple web scrapers. Data extraction can be a time-consuming process, but for other projects that include documents, files, or the web I tried "web scrapers", which worked great; they build quick custom screen scrapers, web scrapers, and data parsing programs.
Thanks for the post... it has opened my world to Scala. I don't have much programming experience, and I was having a problem with the JAR files. Your code calls for "import de.hars.scalaxml._". I downloaded the JAR files from Hars, "scalaxml.jar" and "tagsoup-1.2.jar", and placed them in the same folder as the "vrbo.scala" file, but I still get an error when I go to compile. The error points to the "import de.hars.scalaxml._" line, at the "d". Should I change the name of the JAR file? Unzip the JARs? Change the name in the code to the names of the JARs? Any pointers would be great! Thanks
M
I'm guessing you don't have your classpath set correctly.
scalac -cp scalaxml.jar:tagsoup-1.2.jar vrbo.scala