I was playing around over the weekend screen-scraping and analyzing word frequencies for various sites (don’t ask), and was getting some slow responses (and accidentally got my IP blocked from one site when I hit them a few too many times).
Eventually I hit upon the idea of requesting each URL from Google Cache instead (the pages I was scraping had sequential ?id=xxx URLs so it was easy to automate), with the aim of speeding things up a bit and taking some load off the target sites.
With this in mind, I spent a few hours Saturday and Sunday developing fromthecache.com - it’s built on Rails, and designed to provide transparent access to the Google cache, while fetching the original page as a fallback if necessary.
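For the curious, the cache-first-then-fallback idea can be sketched roughly like this. This is not the actual fromthecache.com code, just a minimal illustration: the cache URL format, the `google_cache_url` and `fetch_from_cache` names, and the idea of passing in a `fetcher` callable are all my assumptions for the sake of the example.

```ruby
require "cgi"

# Assumed Google cache URL format; not taken from the real app.
GOOGLE_CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

# Build the (assumed) cache lookup URL for a page.
def google_cache_url(url)
  GOOGLE_CACHE_PREFIX + CGI.escape(url)
end

# Try the cached copy first, then fall back to the original page.
# `fetcher` is any callable that takes a URL and returns the body,
# or nil when the request fails (hypothetical interface).
def fetch_from_cache(url, fetcher)
  fetcher.call(google_cache_url(url)) || fetcher.call(url)
end
```

Passing the fetcher in keeps the sketch independent of any particular HTTP library; in practice it would wrap something like `Net::HTTP` and return nil on a non-200 response.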
It occurred to me halfway through that it’s also useful for providing mirror links if a site gets slashdotted - just put fromthecache.com/ in front of the URL and you have an instant cache link.
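As a quick illustration, building those links for a run of sequential pages might look something like the following. The `mirror_link` helper, the `http://` scheme on the prefix, and the example.com URLs are made up for the example.

```ruby
# Prefix a URL with fromthecache.com/ to get a cache link.
# The http:// scheme here is an assumption.
def mirror_link(url)
  "http://fromthecache.com/" + url
end

# Hypothetical target site with sequential ?id=N pages.
links = (1..3).map { |id| mirror_link("http://example.com/page?id=#{id}") }
puts links
```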
There’s a fairly good chance that the server’s IP will get blocked from Google for looking like a bot, but I’m hoping requests out of Heroku might come from a few different IPs and mix things up a bit.