Hello,

I am scraping a web page, but my XPath queries return null.
I have been trying since yesterday. When I ran my code this morning it returned an unformatted result once, and I am not sure why, because since then it has been returning null again. I do not know what the problem is and would highly appreciate your help. My code is below.

C#
private async Task<List<NameAndScore>> WebDateFromPage(int pagenum)
{
    string url = "http://www.realtor.com/realestateagents/New-York_NY/photo-1";

    if (pagenum != 0)
        url = "http://www.realtor.com/realestateagents/New-York_NY/photo-1/pg-" + pagenum.ToString();

    var doc = await Task.Factory.StartNew(() => web.Load(url));

    //var nameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"agent_list_wrapper\"]/div[2]/div[2]/div/div[1]/a");
    //var scoreNodes = doc.DocumentNode.SelectNodes("//*[@id=\"agent_list_wrapper\"]//div//div//div//div//span");

    var nameNodes = doc.DocumentNode.SelectNodes("//*[@id=\"agent_list_wrapper\"]//div//div//div/div//a");
    var scoreNodes = doc.DocumentNode.SelectNodes("//*[@id=\"agent_list_wrapper\"]//div//div//div//div");

    if (nameNodes == null || scoreNodes == null)
        return new List<NameAndScore>();

    var names = nameNodes.Select(node => node.InnerText);
    var scores = scoreNodes.Select(node => node.InnerText);

    return names.Zip(scores, (name, score) => new NameAndScore() { Name = name, Score = score }).ToList();
}

private async void Form1_Load(object sender, EventArgs e)
{
    int pagenum = 0;
    var rankings = await WebDateFromPage(0);
    while (rankings.Count > 0)
    {
        foreach (var ranking in rankings)
            table.Rows.Add(ranking.Name, ranking.Score);
        pagenum = pagenum + 1;
        rankings = await WebDateFromPage(pagenum);
    }
}


What I have tried:

I have tried every XPath combination I could think of, including copying the XPath of different tags from the website, but it returns null every time. I do not know what the problem is, since it returned values only once.
Updated 11-Oct-16 1:47am
Comments
José Amílcar Casimiro 11-Oct-16 6:04am    
I looked at the page and could not find elements matching the XPath in your code. I suggest looking for a browser extension that gives you the XPath of a given element.
Faran Saleem 11-Oct-16 6:06am    
Can you please guide me on how? The XPath in the code is for the name of each retailer.
José Amílcar Casimiro 11-Oct-16 10:09am    
If you are using Firefox or Chrome, you can download an extension.
That extension will give you the XPath for any given element on the page.

1 solution

Maybe the document being returned contains malformed HTML. Try putting your code inside a try/catch block to see what happens.

Also, try re-instantiating the web client INSIDE your WebDateFromPage method.
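
The two suggestions above could be combined roughly as follows. This is only a sketch, assuming the code uses HtmlAgilityPack (which the `web.Load` and `SelectNodes` calls in the question suggest). The `PageProbe` class and `LoadWithDiagnostics` method are hypothetical names, not part of the original code.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original code.
static class PageProbe
{
    // Loads a page with a fresh HtmlWeb instance and prints what came back,
    // so an exception, an error status, or unexpected markup becomes visible
    // instead of silently turning into a null node list.
    public static HtmlDocument LoadWithDiagnostics(string url)
    {
        try
        {
            var web = new HtmlWeb();           // created per call, as suggested
            HtmlDocument doc = web.Load(url);

            Console.WriteLine("HTTP status: " + web.StatusCode);
            Console.WriteLine("Parse errors: " + doc.ParseErrors.Count());
            Console.WriteLine("HTML length: " + doc.DocumentNode.OuterHtml.Length);

            // If this is null, the element the XPath starts from is not in
            // the HTML that was actually downloaded.
            HtmlNode wrapper = doc.GetElementbyId("agent_list_wrapper");
            Console.WriteLine("agent_list_wrapper found: " + (wrapper != null));

            return doc;
        }
        catch (Exception ex)
        {
            Console.WriteLine("Load failed: " + ex);
            return null;
        }
    }
}
```

If the wrapper element is reported as missing, the downloaded HTML probably differs from what the browser shows, and no variation of the XPath will match until that is resolved.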

Finally, what's the point of using async code when you're waiting for the code to return anyway? I'm not sure there's any tangible benefit there.
 
Comments
Faran Saleem 11-Oct-16 7:53am    
Hello John,

I believe the URL is correct; if you look at their site, this is how it is written. Also, even if the URL you mentioned were the problem, it should still return the first page of results for http://www.realtor.com/realestateagents/New-York_NY/photo-1/, but it is returning null.

It returned results only once, when I ran it this morning, and I do not know how. I have not made any changes to my code, but I can't seem to find the problem.
#realJSOP 11-Oct-16 8:28am    
FYI, I went to the URL you're using, and the following URLs return the same content:

http://www.realtor.com/realestateagents/New-York_NY/photo-1
http://www.realtor.com/realestateagents/New-York_NY/photo-1/pg-1

That means your if block should probably be "if pagenum >= 2".
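
The paging rule described above could be isolated in a small helper. This is a minimal sketch: `AgentPages` and `BuildPageUrl` are hypothetical names, and treating page numbers 0 and 1 as the same first page is an assumption based on the observation that both URLs return the same content.

```csharp
using System;

// Hypothetical helper, not part of the original code.
static class AgentPages
{
    const string BaseUrl = "http://www.realtor.com/realestateagents/New-York_NY/photo-1";

    // Pages 0 and 1 both map to the base URL (assumption: ".../pg-1" serves
    // the same content); only page 2 and later get a "/pg-N" suffix.
    public static string BuildPageUrl(int pagenum)
    {
        return pagenum >= 2 ? BaseUrl + "/pg-" + pagenum : BaseUrl;
    }
}
```

With this rule, the loop in `Form1_Load` would also need to start at 1 (or jump from 0 to 2), otherwise the first page is added to the table twice.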

See my updated solution message.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


