Web 2.0 Feed Parsing Fail


Synopsis

My goal is to take a variety of feeds, combine them in a Planet-style aggregate, and apply some filtering and manipulation to them.

To see how feasible it is, I’m also trying to do this without writing “code” in the traditional sense, using tools such as Yahoo Pipes and Google Spreadsheets instead.

This is a work in progress and this post is mostly about documenting what I’ve found.

To date, we FAIL.

Yahoo Pipes

Creating a pipe that combines and filters a bunch of different (Perl related) feeds proves very easy.

Combine a bunch of blog feeds

If you’re a Perl person and would like to be included, or can suggest a blog/perl-person/feed that is appropriate, let me know via the comments.

Apply business logic

  • Sort them by publication date/time
  • De-duplicate them based on their title
  • Optionally apply filters
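For comparison, here is roughly what those three steps look like if you do give in and write code. This is only a sketch of the logic, not what Yahoo Pipes does internally. It assumes each feed has already been parsed into a list of dicts with `title` and `published` keys; those names, and the `combine_feeds` function itself, are made up for illustration.

```python
def combine_feeds(feeds, keyword=None):
    """Merge several lists of feed items into one Planet-style list.

    Each item is assumed to be a dict with a "title" (str) and a
    "published" (datetime) key. These names are hypothetical.
    """
    # Flatten every feed into a single list of items.
    items = [item for feed in feeds for item in feed]

    # Sort by publication date/time, newest first.
    items.sort(key=lambda item: item["published"], reverse=True)

    seen_titles = set()
    combined = []
    for item in items:
        # De-duplicate on a normalised title; the newest copy wins
        # because the list is already sorted newest first.
        title_key = item["title"].strip().lower()
        if title_key in seen_titles:
            continue
        seen_titles.add(title_key)

        # Optional filter: keep only titles containing the keyword.
        if keyword and keyword.lower() not in title_key:
            continue
        combined.append(item)
    return combined
```

Calling `combine_feeds([feed_a, feed_b], keyword="perl")` would return one merged list containing only items whose title mentions "perl".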

Done

View the pipe summary page, then you can clone it and do your own tweaks.

Or subscribe to the RSS feed.

More Advanced Content Manipulation (bit.ly shorten each url)

What I wanted to do next was use the bit.ly URL-shortening API to replace the link in each feed item with a bit.ly shortened version. bit.ly also provides click stats for such links, so we could then produce a “Most clicked” feed, or something similar. That said, I think bit.ly only provides aggregated data, which means:

  • “Most clicked ever” - possible
  • “Most clicked this week” - not possible, without further custom code.
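The “further custom code” would not need to be much. Here is a sketch, assuming (as I suspect) that only cumulative click totals per link are available, and that we save a snapshot of those totals ourselves once a week. No bit.ly API calls are shown; the snapshot dicts and both function names are hypothetical.

```python
def clicks_in_window(snapshot_then, snapshot_now):
    """Return clicks gained per link between two cumulative snapshots.

    Both arguments map a link to its all-time click total at the
    moment the snapshot was taken.
    """
    gained = {}
    for link, total_now in snapshot_now.items():
        # A link missing from the older snapshot did not exist yet.
        total_then = snapshot_then.get(link, 0)
        # Guard against a total that went down (e.g. a reset counter).
        gained[link] = max(total_now - total_then, 0)
    return gained


def most_clicked(gained, top_n=3):
    """Return the top_n (link, clicks) pairs, most clicked first."""
    # Ties are broken by link so the order is deterministic.
    ranked = sorted(gained.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

“Most clicked this week” is then `most_clicked(clicks_in_window(last_week, today))`, where `last_week` and `today` are the two saved snapshots.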

The idea was not so much bit.ly-specific as a crude PageRank-like thing. So, stuff along the lines of:

  • A filtered feed that picks the most popular/clicked items
  • Combining that with Google’s PageRank score for the relevant domain(s)
  • Something …

In any event, I don’t think any of this is possible via Yahoo Pipes.

Language Filtering

Yahoo Pipes lets you translate a string from a predetermined language to another.

Aside: It’s often struck me as odd that there isn’t a “guess language” module. Google Translate lacks this too: once I’ve copied in the text I want to translate, how about having a guess at what the source language is? A half-decent default target language would be what my browser is reporting as my preferred language, or failing that just the target language I used last time.

Anyway - what I want is to be able to filter out non-English posts. Is that bad manners? It strikes me that the majority of readers will be English speaking, and having other languages only lowers the signal-to-noise ratio in the general case. The logical extension would be providing a feed for each possible language, and a “global” one.
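To give an idea of what a “guess language” step could look like in code, here is a deliberately crude sketch: it counts how many words in a post are very common English words and calls the post English if the share is high enough. The word list, the 0.15 threshold and the `looks_english` name are all my own assumptions, and a real language detector would do far better than this.

```python
import re

# A tiny, hand-picked set of common English words (an assumption,
# not a standard list).
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "is", "in", "to", "of", "a", "that",
    "it", "for", "with", "this", "on", "are", "was",
})


def looks_english(text, threshold=0.15):
    """Guess whether text is English from its share of common words."""
    # Split into lowercase runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```

Filtering a feed is then a matter of keeping only the items whose title or body passes `looks_english`.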

Google Spreadsheets

I know - it seems bonkers, but why not? They support the importing of external feeds via the ImportFeed formula.

Sadly

=ImportFeed("http://feeds.feedburner.com/PlanetPerl", 'items')

returns

#ERROR!

with a cell comment saying:

error: Parse error

Not the most helpful. I tried

=ImportXML("http://feeds.feedburner.com/PlanetPerl", '//entry/link@href')

as well, with similar results.

Then I noticed that the feedburner feed fails to parse.
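Outside a spreadsheet, the same check is a few lines. The sketch below uses Python’s standard `xml.etree.ElementTree` to test whether a feed is well-formed XML and, if it is, to pull out each entry’s link. It assumes the feed is Atom and has already been downloaded into a string; the `entry_links` name is made up. ElementTree’s limited XPath support cannot select attributes directly, so the `href` is read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def entry_links(feed_xml):
    """Return (links, error) for an Atom feed held in a string.

    error is None when the XML is well-formed; otherwise it is the
    parser's message and links is empty.
    """
    try:
        root = ET.fromstring(feed_xml)
    except ET.ParseError as exc:
        # The feed is not well-formed XML; report where it broke.
        return [], str(exc)

    links = [
        link.get("href")
        for link in root.findall(".//atom:entry/atom:link", ATOM_NS)
        if link.get("href")
    ]
    return links, None
```

Unlike the bare “Parse error” in the cell comment, the error string from `ET.ParseError` includes the line and column where parsing stopped, which is a better starting point for working out who to blame.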

There are a variety of people one could blame here:

  • Google: just parse the damn thing as best you can. To be fair, that is not really a good idea.
  • FeedBurner: either you are producing invalid markup, or you are copying through invalid markup from the source. Either way, you could try to fix it. Again, probably not their fault.
  • PlanetPerl or the Planet source code. Dunno, without digging further.

Next steps

Possible next steps

  • Push the feeds through Yahoo! Pipes first, as that seems to tolerate invalid feeds, then use Google Spreadsheets to do funky things?
  • Except that trying the combined feed mentioned above in Google Spreadsheets also fails, and trying to validate it results in a timeout error.
  • Google App Engine? This fails the “don’t write code” constraint.

About this Entry

This page contains a single entry by snork published on May 2, 2009 1:40 AM.
