Hello,
I'm trying to parse the page linked below:
The webpage I'm trying to parse
When I download the HTML string using WebRequest, it doesn't contain the whole HTML,
so I can't parse the content part of the page.
Can anybody help me?

C#
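// Requires: using System.Net; and the HtmlAgilityPack NuGet package.
private void get_cotents(string contents_url)
{
    string title = "";
    string contents = "";
    string html;

    // Download the raw HTML; the using block disposes the WebClient afterwards.
    using (WebClient client = new WebClient())
    {
        html = client.DownloadString(contents_url);
    }

    HtmlAgilityPack.HtmlDocument mydoc = new HtmlAgilityPack.HtmlDocument();
    mydoc.LoadHtml(html);

    // XPath needs a node test before the predicate: "//*[@id=...]", not "//[@id=...]".
    var titleHeadline = mydoc.DocumentNode.SelectSingleNode("//*[@id='writeContents']");

    // SelectSingleNode returns null when nothing matches, so check before reading InnerText.
    if (titleHeadline != null)
    {
        title = titleHeadline.InnerText;
    }

    contents = "I can't find the html code that has content";
}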


What I have tried:

I have tried getting the HTML string using WebClient and HtmlWeb.
Updated 4-Apr-16 2:53am
Comments
RickZeeland 3-Apr-16 13:10pm    
The terminating semicolon is missing in:
contents = "I can't find the html code that has content"

Please mention that the HtmlAgilityPack is needed:
https://www.nuget.org/packages/HtmlAgilityPack

I think your problem lies in getting the data stream. Here is an example adapted from a CodeProject article:
C#
/// <summary>
/// http://www.codeproject.com/Articles/18034/HttpWebRequest-Response-in-a-Nutshell-Part
/// Requires: using System; using System.IO; using System.Net;
/// </summary>
/// <param name="contents_url">The URL string.</param>
private static void get_cotents(string contents_url)
{
    // Placeholder POST body, as in the original article. To fetch a page without
    // submitting form data, set Method = "GET" and skip the request-stream block.
    byte[] buffer = new byte[1024];
    HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create(contents_url);
    WebReq.Method = "POST";
    WebReq.ContentType = "application/x-www-form-urlencoded";
    WebReq.ContentLength = buffer.Length;

    // Now we write, and afterwards, we close. Closing is always important!
    // The using block closes the request stream for us.
    using (Stream PostData = WebReq.GetRequestStream())
    {
        PostData.Write(buffer, 0, buffer.Length);
    }

    // Get the response handle; the body has not been read yet.
    using (HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse())
    {
        // Let's show some information about the response.
        Console.WriteLine(WebResp.StatusCode);
        Console.WriteLine(WebResp.Server);

        // Now we read the response (the string) and output it.
        using (Stream datastream = WebResp.GetResponseStream())
        using (StreamReader answer = new StreamReader(datastream))
        {
            Console.WriteLine(answer.ReadToEnd());
        }
    }
}


I think you can finish the rest of the code yourself ...
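
One possible way to finish it: the example above prints the response to the console, whereas HtmlAgilityPack's LoadHtml needs the text as a string. The helper below is only a sketch of that read step; the names ResponseText and ReadAll are made up for illustration, and the caller has to decide which encoding the page uses.

```csharp
using System.IO;
using System.Text;

public static class ResponseText
{
    // Reads the whole stream as text and returns it.
    // Disposing the StreamReader also closes the underlying stream.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Inside the method above, this could be called as ResponseText.ReadAll(WebResp.GetResponseStream(), Encoding.UTF8), assuming the page is UTF-8, and the result passed to mydoc.LoadHtml.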
Comments
hapiten 3-Apr-16 20:18pm    
I changed the source from HttpWebRequest to WebClient but still couldn't get the whole HTML source.
The problem was in searching for the id of the content div.
It seems the website hides the id of the content area.
I solved the problem using the XPath below:

HtmlNode node = mydoc.DocumentNode.SelectSingleNode("//@id[.='sub_wkb_layout']");

Thank you guys and CodeProject.
I love this site :)
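
For anyone hitting the same problem, here is a minimal sketch of an id lookup with a null check, assuming the HtmlAgilityPack NuGet package is referenced. ContentFinder and FindTextById are hypothetical names; only the id sub_wkb_layout comes from the comment above. It uses the more common element form of the XPath, //*[@id='...'], and assumes the id value contains no single quote.

```csharp
using HtmlAgilityPack;

public static class ContentFinder
{
    // Returns the trimmed inner text of the first element with the given id,
    // or null when the downloaded HTML contains no such element.
    public static string FindTextById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//*[@id='...']" matches any element whose id attribute equals the value.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//*[@id='" + id + "']");

        // SelectSingleNode returns null when nothing matches.
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

A null result from ContentFinder.FindTextById(html, "sub_wkb_layout") means the id is not in the HTML that was actually downloaded, which points to the markup rather than the download code.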

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)