HtmlAgilityPack doc.LoadHtml can't get the whole HTML string

Question

Hello,
I'm trying to parse the page linked below.
The webpage I'm trying to parse[^]
When I download the HTML string with a web request, it doesn't contain the whole HTML,
so I can't parse the content part of the page.
Can anybody help me?

C#

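// Requires: using System.Net; and the HtmlAgilityPack NuGet package.
private void get_cotents(string contents_url)
{
    string title = "";
    string contents = "";

    WebClient client = new WebClient();
    // DownloadString returns the HTML exactly as the server sent it.
    string html = client.DownloadString(contents_url);
    HtmlAgilityPack.HtmlDocument mydoc = new HtmlAgilityPack.HtmlDocument();
    mydoc.LoadHtml(html);

    string str = mydoc.DocumentNode.InnerHtml;

    // "//*[@id='...']" matches any element with that id.
    // The earlier "//[@id='...']" had no node test after "//", which is not valid XPath.
    var titleHeadline = mydoc.DocumentNode.SelectSingleNode("//*[@id='writeContents']");
    if (titleHeadline != null)
    {
        // SelectSingleNode returns null when nothing matches, so check before use.
        title = titleHeadline.InnerText;
    }

    contents = "I can't find the html code that has content";
}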

What I have tried:

I have tried getting the HTML string using both WebClient and HtmlWeb.
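
Before changing the XPath, it can help to check whether the downloaded string mentions the target id at all. The sketch below is only a rough diagnostic: `HtmlProbe` and `MentionsId` are hypothetical names, and the check assumes the id attribute is written with single or double quotes.

```csharp
using System;

public static class HtmlProbe
{
    // Returns true if the raw markup contains id="..." or id='...' for the given id.
    // The comparison ignores case; unquoted attribute values are not detected.
    public static bool MentionsId(string html, string id)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrEmpty(id))
        {
            return false;
        }

        return html.IndexOf("id=\"" + id + "\"", StringComparison.OrdinalIgnoreCase) >= 0
            || html.IndexOf("id='" + id + "'", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling `HtmlProbe.MentionsId(html, "writeContents")` on the downloaded string gives a quick hint: if it returns false, the element is probably not in the served markup under that id, and no XPath will find it there.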

Posted 3-Apr-16 6:31am

hapiten

Updated 4-Apr-16 2:53am

Comments

RickZeeland 3-Apr-16 13:10pm

The terminating semicolon is missing in:
contents = "I can't find the html code that has content"

Please mention that the HtmlAgilityPack is needed:
https://www.nuget.org/packages/HtmlAgilityPack

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

RickZeeland · Answer 1 · 2016-04-03T07:59:00

I think your problem lies in how you read the data stream. Here is an example adapted from a CodeProject article:

C#

/// <summary>
/// Adapted from http://www.codeproject.com/Articles/18034/HttpWebRequest-Response-in-a-Nutshell-Part
/// Requires: using System; using System.IO; using System.Net;
/// </summary>
/// <param name="contents_url">The URL string.</param>
private static void get_cotents(string contents_url)
{
    // Placeholder POST body: 1024 zero bytes. Replace with real form data if the page expects any.
    byte[] buffer = new byte[1024];
    HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create(contents_url);
    WebReq.Method = "POST";
    WebReq.ContentType = "application/x-www-form-urlencoded";
    WebReq.ContentLength = buffer.Length;

    // Write the request body; the using block closes the stream afterwards.
    using (Stream PostData = WebReq.GetRequestStream())
    {
        PostData.Write(buffer, 0, buffer.Length);
    }

    // Send the request and get the response object; the body is read from its stream below.
    using (HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse())
    {
        // Show some information about the response.
        Console.WriteLine(WebResp.StatusCode);
        Console.WriteLine(WebResp.Server);

        // Read the response body as a string and output it.
        using (Stream datastream = WebResp.GetResponseStream())
        using (StreamReader answer = new StreamReader(datastream))
        {
            Console.WriteLine(answer.ReadToEnd());
        }
    }
}

I think you can finish the rest of the code yourself ...
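
If the page does not need a POST body, a plain GET may be enough. This is a minimal sketch, not the method from the linked article: `PageFetcher`, `ReadAll` and `FetchHtml` are hypothetical names, and UTF-8 is an assumption, since the page may declare a different charset.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Reads a stream to the end and decodes it as text with the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Fetches a page with a GET request (no request body) and returns the body as a string.
    // Assumption: the response is UTF-8 encoded.
    public static string FetchHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            return ReadAll(stream, Encoding.UTF8);
        }
    }
}
```

The returned string can then be passed to `mydoc.LoadHtml(...)` as in the question. On newer .NET versions, `HttpClient` is the recommended API and `WebRequest.Create` produces an obsolete warning.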

hapiten · Answer 2 · 2016-04-04T02:53:00

Solution 2

The problem was finding the id of the content div.
It seems the website hides the content area's id.
I solved it with the following XPath:

HtmlNode node = mydoc.DocumentNode.SelectSingleNode("//@id[.='sub_wkb_layout']");

Thank you guys and codeproject
I love this site :)
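
For readers comparing XPath forms, here is a small sketch using the standard `System.Xml.XPath` classes on a made-up, well-formed snippet (the markup and the `XPathDemo` name are assumptions for illustration only). It shows that `//*[@id='...']` selects the element, while `//@id[.='...']` selects the attribute node itself. HtmlAgilityPack returns `HtmlNode` objects, so its result for the attribute form may differ from what this standard evaluator returns.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class XPathDemo
{
    // Hypothetical markup; only the id value is taken from the thread.
    private const string Markup =
        "<html><body><div id='sub_wkb_layout'><p>content</p></div></body></html>";

    // Returns the name of the first node matched by the expression, or null if nothing matches.
    public static string FirstMatchName(string xpath)
    {
        var doc = new XPathDocument(new StringReader(Markup));
        XPathNavigator hit = doc.CreateNavigator().SelectSingleNode(xpath);
        return hit == null ? null : hit.Name;
    }
}
```

`XPathDemo.FirstMatchName("//*[@id='sub_wkb_layout']")` returns `div` (the element), and `XPathDemo.FirstMatchName("//@id[.='sub_wkb_layout']")` returns `id` (the attribute). An id that does not occur in the markup gives `null`.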

Posted 4-Apr-16 2:53am