I used Groovy's XmlSlurper to do the heavy lifting. XmlSlurper uses SAX underneath and, importantly, lets you choose a different SAXParser. As I wanted to parse HTML and not XML, I used TagSoup as my SAXParser.
slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
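As a minimal sketch of what that buys you, the snippet below feeds deliberately sloppy HTML through a TagSoup-backed XmlSlurper. The Grape coordinates, the TagSoup version, the `groovy.xml.XmlSlurper` import (Groovy 3 or later; older versions have the class in `groovy.util`) and the sample markup are all assumptions made for illustration, not part of the original script.

```groovy
// Assumption: TagSoup 1.2.1 can be fetched from Maven Central via Grape.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper          // Groovy 3+; Groovy 2.x uses groovy.util.XmlSlurper
import org.ccil.cowan.tagsoup.Parser

// Hypothetical sample: no <html> or <body>, and the <td> tags are never closed.
def messyHtml = '<table><tr><td>1 Jun<td>6.2</table>'

def slurper = new XmlSlurper(new Parser())
def html = slurper.parseText(messyHtml)

// Depth-first search, so the result does not depend on the exact nesting TagSoup produces.
def cells = html.'**'.findAll { it.name() == 'td' }.collect { it.text().trim() }
println cells
```

A plain `new XmlSlurper()` would be expected to reject this input as malformed XML, which is the reason for swapping in TagSoup.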
Every time I run the script (with cron, overnight), I want the latest data, so I created a URL object pointing at the UV Index data for London and used Groovy's "withReader" enhancement to read and parse it:
url = new URL("http://blah/blah?blah")
url.withReader { reader ->
    html = slurper.parse(reader)
    // we should now have a parsed file
    ... scraping code ...
}
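The withReader pattern can be tried offline with a rough sketch like the one below. It assumes a temporary file exposed as a `file:` URL in place of the real page, and a plain XmlSlurper because the sample markup is well-formed; the file name and contents are made up for the example.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; Groovy 2.x uses groovy.util.XmlSlurper

// Stand-in for the remote page (assumption): a temp file reachable through a file: URL.
def page = File.createTempFile('uvindex', '.html')
page.deleteOnExit()
page.text = '<html><body><p>UV index</p></body></html>'
def url = page.toURI().toURL()

def slurper = new XmlSlurper()
// withReader opens a reader on the URL, passes it to the closure, and closes it afterwards.
def heading = url.withReader { reader ->
    def html = slurper.parse(reader)
    html.body.p.text()         // the closure's value becomes withReader's return value
}
println heading
```

Because withReader closes the reader when the closure finishes, there is no explicit cleanup code to write.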
As the data I was looking for was conveniently located in a table, all I had to do was find the path to the table (Firebug comes in really handy here):
tbl = html.body.table.tr.td.dl.dd.table
That gives us a table, and we can use a closure to iterate over the rows:
tbl.tr.list().each { row ->
    ... row parsing code ...
}
Each row has a td list, so any particular cell of a row can be accessed as row.td[X]. To get a cell as a string, you'll need to call toString (so the data of the first cell, as a string, is row.td[0].toString()).
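A small self-contained sketch of row and cell access follows. The table contents are invented for illustration, and a plain XmlSlurper is assumed because this sample is well-formed.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; Groovy 2.x uses groovy.util.XmlSlurper

// Hypothetical sample data, not taken from the real page.
def table = new XmlSlurper().parseText('''
<table>
  <tr><td>1 Jun 2010</td><td>6.2</td><td>340.1</td></tr>
  <tr><td>2 Jun 2010</td><td>5.8</td><td>352.7</td></tr>
</table>
''')

def firstCells = []
table.tr.list().each { row ->
    // row.td[0] is a GPath node; toString() yields its text content
    firstCells << row.td[0].toString()
}
println firstCells
```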
I came across an interesting issue with the trim function when I was trying to parse the first column into a DateTime (using the Joda Time library). There were some non-breaking spaces in the String, and trim doesn't remove non-breaking spaces, so I had to run a quick regular expression on the String to get rid of them:
ds = row.td[0].toString().replaceAll(/\xA0/ , {""})
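The difference is easy to see on a hand-made string. In this sketch the sample value is an assumption; the behaviour shown is that of the standard `trim`, which only strips characters up to U+0020 and so leaves U+00A0 in place.

```groovy
// Hypothetical cell text: non-breaking space, the date, non-breaking space, ordinary space.
def raw = '\u00A01 Jun 2010\u00A0 '

// trim() removes the trailing ordinary space but not the non-breaking spaces.
def trimmedOnly = raw.trim()

// Strip the non-breaking spaces first, then trim whatever ordinary whitespace is left.
def cleaned = raw.replaceAll(/\xA0/, '').trim()

println trimmedOnly.length()
println cleaned.length()
```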
So, putting it all together (though without the Google API code to do the uploading):
slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
url = new URL('http://www.temis.nl/uvradiation/nrt/uvindex.php?lon=-0.07&lat=51.30')
url.withReader { reader ->
    html = slurper.parse(reader)
    tbl = html.body.table.tr.td.dl.dd.table
    tbl.tr.list().each { row ->
        if (row.td.size() == 3) {
            // trim doesn't work on a non-breaking space
            ds = row.td[0].toString().replaceAll(/\xA0/, {""}).trim()
            uvi = row.td[1].toString().toFloat()
            // now do something with the date and UV index
        }
    }
}
There is one thing to note here: the script does a quick check to make sure that there are 3 columns (date, UV index and ozone column). This is because there will be an extra row at the start of the table containing the city name, if TEMIS knows the city name for a set of co-ordinates.
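The effect of that column check can be sketched on a made-up table with a leading city-name row. The markup, the values and the `readings` map are hypothetical, and a plain XmlSlurper is assumed since the sample is well-formed.

```groovy
import groovy.xml.XmlSlurper   // Groovy 3+; Groovy 2.x uses groovy.util.XmlSlurper

// Hypothetical sample: one single-cell city row followed by two data rows.
def table = new XmlSlurper().parseText('''
<table>
  <tr><td>London</td></tr>
  <tr><td>1 Jun 2010</td><td>6.2</td><td>340.1</td></tr>
  <tr><td>2 Jun 2010</td><td>5.8</td><td>352.7</td></tr>
</table>
''')

def readings = [:]
table.tr.list().each { row ->
    // Only rows with exactly three cells carry data; the city row is skipped.
    if (row.td.size() == 3) {
        readings[row.td[0].toString().trim()] = row.td[1].toString().toFloat()
    }
}
println readings
```

Without the size check, the city row would reach `toFloat()` with an empty string, since it has no second cell, and the conversion would fail.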
I ran into a couple of niggles with the Google side of things, but that's probably best left to a different post.
8 comments:
can you please describe in detail how did you write the code
I also want to parse the contents from a html table
So that your code is easier to read, and to add syntax highlighting, check this out.
Cheers!
Thanks for this; I keep coming back to it as a refresher.
One thing that bit me and at least one other person I know was that firebug often (always?) adds a nonexistent <tbody> in between <table> and <tr>. If you try to use gpath like "dom.body.table.tbody.tr.td" it's likely to not work... you have to ignore the tbody and just go with "dom.body.table.tr.td" and you're good to go.
Thanks again!
Just wanted to say thanks for this post - I used it to help me write my own scraper. I've put example code from my scraper up on my blog, I hope it's useful.
Can you explain in detail how to write code for you
Suffice it to say, thanks for this post - I use it to help me write my own scraper. I have put my blog on the example code, I hope this is useful, from my blade had.
this code was really helpful. thanks a lot
Adrian fromrakeback
http://half-wit4u.blogspot.com/2011/01/web-scraping-using-java-api.html