Web scraping with Node.js

31 Dec 2013 | CPOL | 2 min read
An introduction to retrieving web page data with jQuery-style selectors in Node.js

Introduction 

In this article, we will review how to scrape web page data with the help of Node.js and some helpful npm modules.

I have also built a site that does web scraping, and I would like to introduce it to you as well.

Background 

Web scraping has long carried a negative connotation, because most popular services provide APIs, and those should be used to retrieve data instead. But we have to fall back on scraping when we are interested in web page data that no API provides, or when the API has license or quota limitations.

Scrapers can be written in any language, but in my view the essence of a web page is its DOM, so retrieving data through the DOM should be the best approach. jQuery is arguably the most popular library for DOM operations, so I hoped to use jQuery selectors to retrieve web page data.

After some research, Node.js was the only option I found that met my requirement. Let's get started!

Scraping with Cheerio 

To be honest, we cannot actually use all of jQuery's syntax.

As the cheerio project describes itself, it is a "Fast, flexible, and lean implementation of core jQuery designed specifically for the server", packaged as a Node.js module.

We can install the module using npm:  

Shell
npm install cheerio  

We also need the "request" module, which will be used to retrieve the web page data.

Shell
npm install request 
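Note that the `request` package has since been deprecated. On Node.js 18 or later, the built-in `fetch` can play the same role. The following is only a minimal sketch under that assumption: `fetchHtml` is a hypothetical helper name, and the optional `fetchImpl` parameter exists just so the function can be exercised without network access.

```javascript
// Sketch only: fetchHtml is a hypothetical helper, not part of the article's code.
// Assumes Node.js 18+, where fetch is available as a global.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const resp = await fetchImpl(url);
  // fetch resolves even for HTTP error statuses, so check resp.ok explicitly.
  if (!resp.ok) {
    throw new Error('Request failed with status ' + resp.status);
  }
  return resp.text();
}

// Example use (needs network access):
// fetchHtml('http://www.imdb.com/chart/').then(function (html) {
//   console.log(html.length);
// });
```

The returned string can be passed to `cheerio.load` in the same way as the `body` argument that `request` hands to its callback.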

Cheerio is extremely simple: we can use it just like jQuery. Let us scrape a real-world web page:

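JavaScript
var request = require('request');
var cheerio = require('cheerio');

var url = "http://www.imdb.com/chart/";

request({ uri: url }, function (err, resp, body) {
    // Stop early if the request itself failed (DNS error, timeout, and so on).
    if (err) {
        console.error(err);
        return;
    }

    // Parse the HTML so it can be queried with jQuery-style selectors.
    var $ = cheerio.load(body);

    var strContent = "";
    // Locate the table whose header contains "Gross", then walk its rows.
    $('th:contains(Gross)').parents('table').find('tr').each(function (index, item) {
        // Row 0 is the header row, so skip it.
        if (index > 0) {
            var tds = $(item).find('td');
            // Columns 1 to 3 hold the values printed below, one movie per line.
            strContent += tds.eq(1).find('a').text().trim() + ","
                + tds.eq(2).text().trim() + ","
                + tds.eq(3).text().trim() + "\r\n";
        }
    });

    console.log(strContent);
});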

As you can see, we can scrape the data with jQuery-like syntax. The output is shown below:

The Hobbit: The Desolation of Smaug,$29.8M,$190.3M
Frozen,$28.8M,$248.4M
Anchorman 2: The Legend Continues,$20.2M,$83.7M
American Hustle,$19.6M,$60M
The Wolf of Wall Street,$18.5M,$34.3M
Saving Mr. Banks,$14M,$37.8M
The Secret Life of Walter Mitty,$13M,$25.6M
The Hunger Games: Catching Fire,$10.2M,$391.1M
47 Ronin,$9.9M,$20.6M
A Madea Christmas,$7.4M,$43.7M
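One caveat with joining fields by commas: a movie title that itself contains a comma would shift the columns. A minimal sketch of a quoting helper follows, using the common CSV convention described in RFC 4180. The names `toCsvField` and `toCsvLine` are hypothetical, and the second sample row is invented for illustration.

```javascript
// Sketch only: these helpers are hypothetical, not part of the article's code.
// Quotes a field when it contains a comma, a double quote, or a line break,
// and doubles any embedded double quotes.
function toCsvField(value) {
  var text = String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Joins already-extracted cell values into one CSV line.
function toCsvLine(fields) {
  return fields.map(toCsvField).join(',');
}

console.log(toCsvLine(['Frozen', '$28.8M', '$248.4M']));
// prints: Frozen,$28.8M,$248.4M
console.log(toCsvLine(['Title, with comma', '$1M', '$2M']));
// prints: "Title, with comma",$1M,$2M
```

In the scraping callback, the three trimmed cell texts could be passed to `toCsvLine` as an array instead of being concatenated by hand.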

 

Points of Interest

I believe you could get the same result with XPath or regular expressions, but I think that code would not be as clear as the JavaScript above.
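To illustrate the comparison, here is a rough regex-based sketch. The HTML fragment, its column names, and the `extractRows` helper are all invented for illustration; the real page's markup differs, and a pattern like this would have to be rewritten whenever the markup changes.

```javascript
// Hypothetical fragment, made up for this example.
var html =
  '<table>' +
  '<tr><th>Rank</th><th>Title</th><th>Weekend</th><th>Gross</th></tr>' +
  '<tr><td>1</td><td><a href="/t/1">Frozen</a></td><td>$28.8M</td><td>$248.4M</td></tr>' +
  '<tr><td>2</td><td><a href="/t/2">American Hustle</a></td><td>$19.6M</td><td>$60M</td></tr>' +
  '</table>';

// One pattern has to encode the whole row layout: skip the first cell,
// capture the link text, then capture the next two cells.
var rowPattern = /<tr>\s*<td>[^<]*<\/td>\s*<td>\s*<a[^>]*>([^<]*)<\/a>\s*<\/td>\s*<td>([^<]*)<\/td>\s*<td>([^<]*)<\/td>\s*<\/tr>/g;

function extractRows(source) {
  var rows = [];
  var match;
  rowPattern.lastIndex = 0; // reset the global regex before each scan
  while ((match = rowPattern.exec(source)) !== null) {
    rows.push([match[1].trim(), match[2].trim(), match[3].trim()]);
  }
  return rows;
}

console.log(extractRows(html));
```

The header row is skipped only because it uses `th` cells, which the pattern does not match. The selector version states that intent far more directly.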

Performance is also good: the script takes only about 160 ms on my PC, which is acceptable, isn't it?
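A measurement like this can be taken with Node's high-resolution timer. The sketch below is only an illustration: `timeIt` is a hypothetical helper name, and it times a synchronous function. For the asynchronous request above, the start time would be taken before calling `request` and the end time inside the callback.

```javascript
// Sketch only: timeIt is a hypothetical helper, not part of the article's code.
// Runs a synchronous function and reports how long it took in milliseconds.
function timeIt(fn) {
  var start = process.hrtime.bigint(); // nanoseconds, as a BigInt
  var result = fn();
  var elapsedNs = process.hrtime.bigint() - start;
  return { result: result, ms: Number(elapsedNs) / 1e6 };
}

// Placeholder workload; in practice this would be the parsing step.
var timed = timeIt(function () {
  var total = 0;
  for (var i = 0; i < 1000000; i++) {
    total += i;
  }
  return total;
});

console.log('result=' + timed.result + ', took ' + timed.ms.toFixed(2) + ' ms');
```

Timings vary from run to run and depend heavily on network latency, so a single figure is best read as a rough indication.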

History  

2013-12-21: Created by Nathan Xu (owner of the online scraping site www.datafiddle.net)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
China
egreyeherttd

Comments and Discussions

 
My vote of 5 (fdogo, 1-Jan-14 14:38): Persuasive argument that node.js is the best tool for webscraping.