A weekend project - fromthecache.com

November 3, 2010 · 2 min · Dave Perrett

Over the weekend I was playing around with screen-scraping and analyzing word frequencies for various sites (don’t ask). Responses were slow, and I accidentally got my IP blocked from one site after hitting it a few too many times.

Eventually I hit upon the idea of fetching each URL from Google Cache instead (the pages I was scraping had sequential ?id=xxx URLs, so it was easy to automate), with the aim of speeding things up a bit and taking some load off the target sites.
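As a rough illustration of why sequential IDs make this easy to automate, here is a minimal Ruby sketch that builds the list of URLs to fetch. The base URL and the `sequential_urls` helper are made up for the example; the post doesn't name the sites involved.

```ruby
# Hypothetical base URL; the sites being scraped are not named in the post.
BASE_URL = "http://example.com/page"

# Build one URL per id, e.g. http://example.com/page?id=1
def sequential_urls(ids)
  ids.map { |id| "#{BASE_URL}?id=#{id}" }
end

sequential_urls(1..3).each { |url| puts url }
```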

With this in mind, I spent a few hours on Saturday and Sunday developing fromthecache.com - it’s built on Rails, and designed to provide transparent access to the Google cache, while fetching the original page as a fallback if necessary.
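The cache-first, fallback-second idea can be sketched in a few lines of Ruby. This is not the actual fromthecache.com code: the cache endpoint format is an assumption, and `cache_url` and `fetch_with_fallback` are hypothetical helpers. The fetcher is passed in as a callable so the logic can be tried without touching the network.

```ruby
require "uri"

# Assumed Google cache endpoint; not taken from the fromthecache.com source.
CACHE_ENDPOINT = "http://webcache.googleusercontent.com/search?q=cache:"

# Build the cache lookup URL for a page (hypothetical helper).
def cache_url(url)
  CACHE_ENDPOINT + URI.encode_www_form_component(url)
end

# Try the cached copy first, then fall back to the original page.
# `fetcher` is any callable that takes a URL and returns the page body,
# or nil when the page could not be retrieved.
def fetch_with_fallback(url, fetcher)
  fetcher.call(cache_url(url)) || fetcher.call(url)
end

# Stand-in fetcher that pretends the cache has no copy of the page.
cache_miss = lambda { |u| u.start_with?(CACHE_ENDPOINT) ? nil : "original page" }
puts fetch_with_fallback("http://example.com/?id=1", cache_miss)
```

In a real app the fetcher would wrap an HTTP client and return nil on a non-success response, so a cache miss falls through to the original site.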

Screenshot

It occurred to me halfway through that it’s also useful for providing mirror links if a site gets slashdotted - just put fromthecache.com/ in front of the URL and you have an instant cache link.
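Building such a link is plain string prefixing. A minimal sketch, assuming the target URL is passed without its scheme and that the site is served over http; `mirror_link` is a hypothetical helper, not part of the project:

```ruby
# Host from the post; the http:// scheme is an assumption.
MIRROR_HOST = "fromthecache.com"

# Prefix a URL with the mirror host to get a cache link (hypothetical helper).
def mirror_link(url)
  "http://#{MIRROR_HOST}/#{url}"
end

puts mirror_link("example.com/article?id=42")
# prints http://fromthecache.com/example.com/article?id=42
```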

There’s a fairly good chance that the server’s IP will get blocked from Google for looking like a bot, but I’m hoping requests out of Heroku might come from a few different IPs and mix things up a bit.

You can view the demo at fromthecache.com, browse the source, or download it from the project page and try it out for yourself.
