I wrote my own screen scraping module built on PhantomJS, but unfortunately it's too slow for most scraping tasks that don't need browser-side JavaScript. For those tasks, an easy way to scrape pages with Node.js is to use Request and Cheerio.
Here is an example of scraping Bing to get all of the search results:
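var request = require('request');
var cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function(err, resp, body){
  // Bail out on network errors before trying to parse anything
  if (err) return console.error(err);
  // Load the HTML into Cheerio to get a jQuery-like $ function
  var $ = cheerio.load(body);
  var links = $('.sb_tlst h3 a'); //use your CSS selector here
  // Print the title and URL of each matched link
  links.each(function(i, link){
    console.log($(link).text() + ':\n ' + $(link).attr('href'));
  });
});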
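The hard-coded `screen+scraping` above only works because the space was encoded by hand. As a rough sketch, assuming you want to pass in arbitrary search phrases, Node's built-in `URL` class can do the encoding for you. The `buildSearchUrl` helper is a made-up name for illustration, not part of Request or Cheerio.

```javascript
// Hypothetical helper: builds a Bing search URL from a raw search phrase.
// Uses Node's built-in WHATWG URL class, so no extra modules are needed.
function buildSearchUrl(term) {
  var url = new URL('http://www.bing.com/search');
  // searchParams.set() form-encodes the value (spaces become '+')
  url.searchParams.set('q', term);
  return url.toString();
}

console.log(buildSearchUrl('screen scraping'));
```

The returned string can be passed to `request()` in place of the concatenated `url`.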
Cheerio acts as a jQuery replacement for a lot of jQuery tasks. It doesn't replicate jQuery in every way, and most importantly it's meant for the server, not the browser. But it beats the pants off of the jsdom/jQuery combo for screen scraping.
-JP