Wednesday 4 July 2007

Australian product recalls feed

I'm a subscriber to Choice magazine (Australia) and a supporter of their campaigns.
Towards the back of each issue there are a couple of pages listing product recalls. I typically scan through them, but there are a lot of them and the format is not great.

The same information is provided online by the Australian government, here.
It's not the most attractive website, but it probably supports every browser and, let's face it, how slick does it need to be?

If the style doesn't date it, the lack of RSS does, and that's what I was after: a low-volume feed.
I emailed the address on the contact page but haven't heard back, so I thought I'd just scrape the pages for data and fabricate my own feed.

I tried using feed43.com, a service specifically designed to scrape content from HTML to produce RSS feeds. However, it has no facility to pull content from multiple pages, so the feed it produced was not much more than a teaser, which is no good for my offline reader.

I decided to write my own, using Python.
A quick Google search for a Python RSS module turned up PyRSS2Gen.

Essentially, this is the process:
  1. Download the web page with a list of product recalls in the last 30 days.
  2. Locate and extract the individual recalls using regular expressions.
  3. Download each recalled item's information page.
  4. Extract the details of the recall.
  5. Construct an RSS feed from the scraped content.
Here's the code, and the resulting RSS feed.
A cron job runs every day, and updates the feed.
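
For anyone who wants the shape of the five steps without reading the real script, here is a minimal sketch of the same idea. It is not the code linked above: the HTML patterns, the example.invalid URLs and the function names are all made up for illustration (the recall site's actual markup isn't shown in this post), and it builds the feed with the standard library's xml.etree rather than PyRSS2Gen so it runs without installing anything.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: these patterns are assumptions, not the real site's HTML.
LINK_RE = re.compile(r'<a href="(?P<url>[^"]+)" class="recall">(?P<title>[^<]+)</a>')
DETAIL_RE = re.compile(r'<div id="details">(?P<details>.*?)</div>', re.DOTALL)


def extract_recalls(listing_html):
    """Step 2: return (url, title) pairs found in the listing page."""
    return [(m.group("url"), m.group("title").strip())
            for m in LINK_RE.finditer(listing_html)]


def extract_details(item_html):
    """Step 4: return the detail text of one recall page, or '' if absent."""
    m = DETAIL_RE.search(item_html)
    return m.group("details").strip() if m else ""


def build_feed(listing_url, fetch):
    """Steps 1-5. `fetch` is any callable that maps a URL to its HTML text."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Australian product recalls"
    ET.SubElement(channel, "link").text = listing_url
    ET.SubElement(channel, "description").text = "Recalls from the last 30 days"
    # Steps 1 and 2: download the listing and pull out each recall link.
    for url, title in extract_recalls(fetch(listing_url)):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        # Steps 3 and 4: download the recall's own page and extract its details.
        ET.SubElement(item, "description").text = extract_details(fetch(url))
        ET.SubElement(item, "guid").text = url
    # Step 5: serialise the finished feed to a string.
    return ET.tostring(rss, encoding="unicode")


# Canned pages stand in for the network so the sketch runs offline.
PAGES = {
    "http://example.invalid/recalls":
        '<a href="http://example.invalid/r/1" class="recall">Toy car</a>',
    "http://example.invalid/r/1":
        '<div id="details">Small parts may detach.</div>',
}
feed_xml = build_feed("http://example.invalid/recalls", PAGES.__getitem__)
```

To run it against a live site you would pass a fetch function that wraps urllib.request.urlopen instead of the canned dictionary, and write feed_xml to wherever the web server can see it. Putting the full details in each item's description is what makes the feed useful in an offline reader.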

1 comment:

Anonymous said...

Request get_similar1.py code be posted. And/or sample pik files
to be able to learn pydot.

thanks.
t. jr.