I already have the HTML source of the page, so I am loading it like this:

C#
string html_page_source="some page source crawled before";
HtmlDocument hdMyDoc = new HtmlDocument();
hdMyDoc.LoadHtml(html_page_source);


However, I see characters that have not been decoded, such as:

HTML
  
içerisinde 
göründüğünden çok
.
.


So how can I make HtmlDocument decode these automatically?

How can I set a default encoding to solve this problem?

And would the method below be good practice?

C#
hdMyDoc.LoadHtml(HttpUtility.HtmlDecode(html_page_source));


C#, .NET 4.5 (latest), WPF application

1 solution


The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:
C#
/// <summary>
/// Replace known entities by characters.
/// </summary>
/// <param name="text">The source text.</param>
/// <returns>The result text.</returns>
public static string DeEntitize(string text)

It supports well-known entities (like &nbsp;) and encoded characters such as &#039; as well.

Once you've extracted the string from the document, use this method to convert the HTML-encoded entities back to text characters.
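As a rough illustration of that order of operations (parse first, decode the extracted text afterwards), here is a minimal sketch. The class name, the helper `GetDecodedText`, the sample markup, and the `//p` XPath are made up for the example; it assumes the HtmlAgilityPack assembly is referenced and that `InnerText` returns the text with its entities still encoded, which is the behaviour described in the question.

C#
```csharp
using System;
using HtmlAgilityPack;

public static class DeEntitizeExample
{
    // Hypothetical helper: parses the markup unchanged, then decodes only
    // the text of the first node matched by the XPath expression.
    public static string GetDecodedText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // load the raw source, entities still encoded

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null; // nothing matched the XPath
        }

        // Decode named and numeric entities in the extracted text only.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string html = "<p>i&ccedil;erisinde g&ouml;r&uuml;nd&uuml;&#287;&uuml;nden &ccedil;ok</p>";
        Console.WriteLine(GetDecodedText(html, "//p"));
    }
}
```

Because the decoding happens per extracted string, the markup that the parser sees is never altered.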

Don't HTML-decode the source before trying to load the document; you'll completely change the meaning of the markup.
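To see why, consider a small sketch with made-up markup in which the page shows a tag name as literal text. It uses `System.Net.WebUtility.HtmlDecode` rather than the `HttpUtility.HtmlDecode` from the question, only so that the example does not need a reference to System.Web; the class and method names are hypothetical.

C#
```csharp
using System;
using System.Net;

public static class DecodeFirstPitfall
{
    // Hypothetical helper: decodes the whole page source up front,
    // which is the approach the answer warns against.
    public static string DecodeWholeSource(string source)
    {
        return WebUtility.HtmlDecode(source);
    }

    public static void Main()
    {
        // The escaped brackets are meant to be displayed as text, not parsed as a tag.
        string source = "<p>Use &lt;b&gt; for bold text</p>";
        string decoded = DecodeWholeSource(source);

        Console.WriteLine(source);  // <p>Use &lt;b&gt; for bold text</p>
        Console.WriteLine(decoded); // <p>Use <b> for bold text</p>
    }
}
```

After decoding, the parser would see a real, unclosed `<b>` element instead of the text the author intended, so the resulting document tree no longer matches the original page.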
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


