EfTidyNet: .NET Wrapper for Tidy library

6 Sep 2013, GPL3, 8 min read
Free component for parsing HTML, .NET version of EfTidyCom

Introduction

Before I go into details, let me explain what EfTidy actually is. EfTidy is a wrapper component for the Tidy library. If you don't know what Tidy is, here is a short description:

"TidyLib is an open source utility for tidying up HTML. Tidy is composed from an HTML parser and an HTML pretty printer. The parser goes to considerable lengths to correct common markup errors. It also provides advice on how to make your pages more accessible to people with disabilities, and can be used to convert HTML content into XML as XHTML. Tidy is W3C open source and available free. It has been successfully compiled on a large number of platforms, and is being integrated into many HTML authoring tools."

- By Mr. Dave Raggett

This is the .NET version of the EfTidyCom component (also present on The Code Project). Before moving further: this library is dedicated to the memory of my mother, the late Mrs. Saroj Gupta, whom I lost recently (29th January, 2008). I just want to say: Mummy, I love you.

I have had many requests for a .NET version of the EfTidyCom library, as COM is losing focus and .NET seems to be the future. This library is written in VC++.NET (mixing managed and unmanaged code). A reference and test cases are given in this article. Thank you, and please pray for my mother, that she may be happy wherever she is.

This is also an updated version of EfTidyCom. Some features (Node and Attribute classes) have been removed as I think they are not of much use!

Library Reference

EfTidy contains two classes:

  • TidyNet [under EfTidyNet namespace]
  • TidyNetOpt [under EfTidyNet::EfTidyOpt namespace]

EfTidy also contains four enumerations:

  • ECharEncodingType
  • EOutputType
  • EIndentScheme
  • EDoctypeModes

Now, let's take the two classes one by one.

1. TidyNet

First, here are the methods and properties of this class and what they do:

| Property/Method name | Parameters | Get/Put | Description |
|---|---|---|---|
| TidyFiletoMem | const String^ SFileName, String^ % SResult | n/a | Write output to memory. |
| TidyFileToFile | const String^ SsourceFileName, const String^ SDestFile | n/a | Write output in file. |
| TidyMemToMem | String^ SsourceData, String^ % SResult | n/a | Write output to memory. |
| TidyMemtoFile | String^ SBuffer, String^ SDestFile | n/a | Take input as buffer and output in file. |
| TotalWarnings | long %pVal | Get | Return the total number of warnings after the above four operations. |
| TotalErrors | long %pVal | Get | Return the total number of errors after the above four operations. |
| ErrorWarning | void | String^ | Return the buffer, which contains human readable errors/warnings. |
| Option | void | EfTidyOpt::TidyNetOpt^ | Set the Option for the Tidy library. |

2. TidyNetOpt

Here is a list of properties and methods for the TidyNetOpt class:

| Property/Method name | Parameter | Get/Put | Description |
|---|---|---|---|
| LoadConfigFile | String^ | n/a | Load option settings from a configuration file. |
| ResetToDefaultValue | Void | n/a | Reset options to default settings. |
| Doctype | String^ | Both | Doctype declaration generated by Tidy. |
| TidyMark | BOOL | Both | For meta element indicating tidied doc. |
| HideEndTag | BOOL | Both | Suppress optional end tags. |
| EncloseText | BOOL | Both | If yes, text in the body is wrapped in <p>. |
| EncloseBlockText | BOOL | Both | If yes, text in blocks is wrapped in <p>. |
| LogicalEmphasis | BOOL | Both | Replace i by em and b by strong. |
| DefaultAltText | String^ | Both | Default text for alt attribute. |
| Clean | BOOL | Both | Replace presentational clutter by style rules. |
| DropFontTags | BOOL | Both | Discard presentation tags. |
| DropEmptyParas | BOOL | Both | Discard empty p elements. |
| Word2000 | BOOL | Both | Draconian cleaning for Word2000. |
| FixBadComment | BOOL | Both | Fix comments with adjacent hyphens. |
| FixBackslash | BOOL | Both | Fix URLs by replacing \ with /. |
| NewEmptyTags | String^ | Both | Declared empty tags. |
| NewInlineTags | String^ | Both | Declared inline tags. |
| NewBlockLevelTags | String^ | Both | Declared block tags. |
| NewPreTags | String^ | Both | Declared pre tags. |
| OutputType | EOutputType | Both | Sets the output type: XML, XHTML or pure HTML. |
| InputAsXML | BOOL | Both | Treat input as XML. |
| ADDXmlDecl | BOOL | Both | Add <?xml ?> for XML docs. |
| AddXmlSpace | BOOL | Both | If set to yes, adds xml:space attr as needed. |
| Bare | BOOL | Both | Make bare HTML. |
| AssumeXmlProcins | BOOL | Both | If set to yes, PIs must end with ?>. |
| CharEncoding | ECharEncodingType | Both | Set/Get in/out character encoding. |
| InCharEncoding | ECharEncodingType | Both | Input character encoding (if different). |
| OutCharEncoding | ECharEncodingType | Both | Output character encoding (if different). |
| NumericsEntities | BOOL | Both | Use numeric entities for symbols. |
| QuoteMarks | BOOL | Both | Output " marks as &quot;. |
| QuoteNBSP | BOOL | Both | Output non-breaking space as entity. |
| QuoteAmpersand | BOOL | Both | Output naked ampersand as &amp;. |
| OutputTagInUpperCase | BOOL | Both | Output tags in upper not lower case. |
| OutputAttrInUpperCase | BOOL | Both | Output attributes in upper not lower case. |
| WrapScriptlets | BOOL | Both | Wrap within JavaScript string literals. |
| WrapAttVals | BOOL | Both | Wrap within attribute values. |
| WrapSection | BOOL | Both | Wrap within section tags. |
| WrapAsp | BOOL | Both | Wrap within ASP pseudo elements. |
| WrapJste | BOOL | Both | Wrap within JSTE pseudo elements. |
| WrapPhp | BOOL | Both | Wrap within PHP pseudo elements. |
| Indent | EIndentScheme | Both | Indent the content of appropriate tags. |
| IndentSpace | long | Both | Indentation of n spaces. |
| WrapLen | long | Both | Set wrap margin for output. |
| TabSize | long | Both | Expand tabs to n spaces. |
| IndentAttributes | long | Both | New-line + indent before each attribute. |
| BreakBeforeBR | BOOL | Both | Output new-line before <br> or not. |
| LiteralAttribs | BOOL | Both | If true, attributes may use new-lines. |
| MarkUp | BOOL | Both | |
| ShowWarnings | BOOL | Both | On/Off |
| Quiet | BOOL | Both | No 'Parsing X', guessed DTD or summary. |
| KeepTime | BOOL | Both | If yes, last modified time is preserved. |
| ErrorFile | String^ | Both | File name to write errors to. |
| GnuEmacs | BOOL | Both | If true, format error output for GNU Emacs. |
| FixUrl | BOOL | Both | Applies URI encoding if necessary. |
| BodyOnly | BOOL | Both | Output BODY content only. |
| HideComments | BOOL | Both | Hides all (real) comments in output. |
| DoctypeMode | EDoctypeModes | Both | Sets the doctype mode for output. |

Using the Code

I have used Test.htm (included with the project) to test the output of EfTidyNet. Here is what Test.htm contains:

HTML
<html>
    <head><title>tidy Library</title></head>
    <body>
      <blockquote>
        <p> </p> --(1)

        <p><font size="5" color=
      "#FF00FF">Tidy Library</font></p>
      </blockquote>
      <P><p><font size="5" color="#FF00FF"></font></p>

      <table border="1" cellpadding="0" cellspacing="0"
         style="border-collapse: collapse"
         bordercolor="#111111" width="100%" id="AutoNumber1">

       <tr>
         <td width="50%" style="border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style:
           none; border-bottom-width: medium"> --(2)
         </td>
         <td width="50%" style="border-left-style: none;
           border-left-width: medium; border-right-style:solid;
           border-right-width: 1; border-top-style: solid;
           border-top-width: 1;border-bottom-style: none;
           border-bottom-width: medium">

         </td>
       </tr>
      </table>
      <b>Tidy  --- (3)
      </h1> <tidy> ---(4)

    </body>
</html>

In Test.htm, I have added the following mistakes:

  • A dummy <tidy> tag at (4)
  • A stray </h1> closing tag with no opening <h1> at (4)
  • An empty <p> paragraph at (1)
  • An unclosed <b> tag at (3)

Test Case # 1 using TidyNet

First, create an object of our component. Here is a listing of how to achieve that:

C#
TidyNet objTidyNet = new TidyNet();

Now, clean the test.htm file using this object. The code listing for that is given below:

C#
private void button1_Click(object sender, EventArgs e)
{
    int iTotalWarn = 0, iTotalErrs = 0;
    String SReturnData = "";
    String SError = "";

    // Tidy the input file; the cleaned markup is returned in SReturnData.
    TidyNet objTidyNet = new TidyNet();
    objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm",
        ref SReturnData);

    // Collect the diagnostics produced by the run above.
    objTidyNet.TotalWarnings(ref iTotalWarn);
    SError = objTidyNet.ErrorWarning();
    objTidyNet.TotalErrors(ref iTotalErrs);
}

And here is the result produced by Tidy, showing what test1.htm (created by EfTidyNet) contains:

HTML
<html>
<head>
 <meta name="generator"
       content="HTML Tidy for Windows (vers 1st September 2004),
                see www.w3.org">
    <title>tidy Library</title>

</head>
<body>
    <blockquote>
        <p> </p>
        <p><font size="5" color="#FF00FF">Tidy Library</font>

        </p>
    </blockquote>
    <p><font size="5" color= "#FF00FF"> </font></p>

    <table border="1" cellpadding="0" cellspacing="0"
         style= "border-collapse: collapse" bordercolor="#111111"

         width="100%" id= "AutoNumber1">
     <tr>
        <td width="50%" style= "border-left-style: solid;
           border-left-width: 1; border-right-style: none;
           border-right-width: medium; border-top-style: solid;
           border-top-width: 1; border-bottom-style: none;
           border-bottom-width: medium">

        </td>
        <td width="50%"
           style= "border-left-style: none;border-left-width: medium;
           border-right-style: solid; border-right-width: 1;
           border-top-style: solid; border-top-width: 1;
           border-bottom-style: none;border-bottom-width: medium">
        </td>
     </tr>

    </table>
    <b>Tidy</b> --(1)
</body>
</html>

In the cleaned HTML page above, the dummy <tidy> tag and the stray </h1> have been removed near (1), and a closing </b> has been added after Tidy at (1).

Here is a summary of the errors/warnings produced by EfTidyNet, showing you the details of each action it has performed:

line 1 column 1   - Warning: missing <!DOCTYPE> declaration
line 22 column 10 - Warning: discarding unexpected </h1>
line 23 column 1  - Error: <tidy> is not recognized!
line 23 column 1  - Warning: discarding unexpected <tidy>

line 15 column 1  - Warning: <table> proprietary attribute
                    "bordercolor"
line 15 column 1  - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

5 warnings, 1 error were found!
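
The report above follows a regular "line N column M - Severity: message" layout, so it can be turned into structured data when a program needs to act on individual entries. The following is only a minimal sketch: TidyDiagnostic and TidyReportParser are hypothetical helper names, not part of EfTidyNet, and the sketch assumes that the text returned by ErrorWarning() has the layout shown in this article. Wrapped continuation lines (such as "bordercolor"), the Info line and the summary line are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper types (not part of EfTidyNet).
public sealed class TidyDiagnostic
{
    public int Line { get; set; }
    public int Column { get; set; }
    public string Severity { get; set; }   // "Warning" or "Error"
    public string Message { get; set; }
}

public static class TidyReportParser
{
    // Matches entries such as: line 22 column 10 - Warning: discarding unexpected </h1>
    private static readonly Regex EntryPattern = new Regex(
        @"^line\s+(\d+)\s+column\s+(\d+)\s*-\s*(Warning|Error):\s*(.*)$");

    // Returns one TidyDiagnostic per matching report line; other lines are ignored.
    public static List<TidyDiagnostic> Parse(string report)
    {
        var entries = new List<TidyDiagnostic>();
        foreach (string rawLine in report.Split('\n'))
        {
            Match m = EntryPattern.Match(rawLine.Trim());
            if (!m.Success)
            {
                continue; // blank, continuation, Info or summary line
            }
            entries.Add(new TidyDiagnostic
            {
                Line = int.Parse(m.Groups[1].Value),
                Column = int.Parse(m.Groups[2].Value),
                Severity = m.Groups[3].Value,
                Message = m.Groups[4].Value
            });
        }
        return entries;
    }

    // Counts the entries with the given severity ("Warning" or "Error").
    public static int CountBySeverity(List<TidyDiagnostic> entries, string severity)
    {
        int count = 0;
        foreach (TidyDiagnostic entry in entries)
        {
            if (entry.Severity == severity)
            {
                count++;
            }
        }
        return count;
    }
}
```

Passing the SError string from the listing above to TidyReportParser.Parse should give counts that agree with the values reported by TotalWarnings and TotalErrors.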

Test Case # 2 using TidyNet with TidyNetOpt

Now, let's apply some options to Test.htm to customize the output. I am using these options:

  • Clean = TRUE (moves inline styles into separate style classes)
  • DoctypeMode = DoctypeUser (enables a user-defined doctype string)
  • Doctype = "Ef Tidy library" (the doctype string to display)
  • OutputType = XhtmlOut (output type)
  • NewInlineTags = "tidy" (makes our dummy <tidy> tag legal)

Here is the code listing to achieve the above:

C#
private void TestCase2_Click(object sender, EventArgs e)
{
  int iTotalWarn = 0, iTotalErrs = 0;
  String SReturnData = "";
  String SError = "";

  TidyNet objTidyNet = new TidyNet();

  // Configure Tidy before running it.
  objTidyNet.Option.Clean(true);
  objTidyNet.Option.NewInlineTags("tidy");
  objTidyNet.Option.OutputType(EfTidyNet.EfTidyOpt.EOutputType.XhtmlOut);
  objTidyNet.Option.DoctypeMode(EfTidyNet.EfTidyOpt.EDoctypeModes.DoctypeUser);
  objTidyNet.Option.Doctype("Ef Tidy Library");

  // Tidy the input file, then collect the diagnostics.
  objTidyNet.TidyFiletoMem("C:\\MyProjects\\Test\\hello.htm", ref SReturnData);
  objTidyNet.TotalWarnings(ref iTotalWarn);
  SError = objTidyNet.ErrorWarning();
  objTidyNet.TotalErrors(ref iTotalErrs);
}

And here is the result produced by Tidy, showing what test1.htm (created by EfTidyNet) contains after applying our options:

HTML
<!DOCTYPE html PUBLIC "Ef Tidy library" ""> --(1)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="generator"

    content="HTML Tidy for Windows (vers 1st September 2004),
            see www.w3.org" />

  <title>tidy Library</title>
  <style type="text/css">  --(2)

     /*<![CDATA[*/
       table.c4 {border-collapse: collapse}
       td.c3 {border-left-style: none;
          border-left-width: medium; border-right-style: solid;
          border-right-width: 1; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       td.c2 {border-left-style: solid; border-left-width: 1;
          border-right-style: none;
          border-right-width: medium; border-top-style: solid;
          border-top-width: 1;
          border-bottom-style: none; border-bottom-width: medium}
       h2.c1 {color: #FF00FF}
     /*]]>*/
  </style>

  </head>
  <body>
    <blockquote>
      <p> </p>

      <h2 class="c1">Tidy Library</h2>

    </blockquote>
    <h2 class="c1">
    </h2>
    <table border="1" cellpadding="0" cellspacing="0" class="c4"

           bordercolor="#111111" width="100%" id="AutoNumber1">
        <tr>
            <td width="50%" class="c2"> </td> ----(3)

            <td width="50%" class="c3"> </td>
        </tr>
    </table>
    <b>Tidy <tidy></tidy></b> ----(4)

  </body>
</html>

Now, let us see what Tidy has cleaned for us:

  • In (1), our custom string "Ef Tidy Library" is visible.
  • In (2) and (3), the inline styles have been moved into style classes (c1 to c4) that the elements now reference.
  • In (4), our <tidy> tag is now accepted as legal, though it has no effect on the actual HTML page.

Here is a summary of all the errors/warnings:

line 1 column 1  - Warning: missing <!DOCTYPE> declaration
line 22 column 10- Warning: discarding unexpected </h1>
line 23 column 1 - Warning: <tidy> is not approved by W3C
line 23 column 1 - Warning: missing </tidy> before </body>

line 22 column 2 - Warning: missing </b> before </body>

line 15 column 1 - Warning: <table> proprietary attribute
                   "bordercolor"
line 15 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML Proprietary

7 warnings, 0 errors were found!
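
To see exactly what a set of options changed, the reports of two runs can be compared message by message. This is a rough sketch under the same assumption about the report layout as shown in this article; TidyReportDiff is a hypothetical helper name and is not part of EfTidyNet. It keeps only the "Severity: message" part of each entry, so line and column numbers are ignored in the comparison.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of EfTidyNet): compares two Tidy reports.
public static class TidyReportDiff
{
    // Collects the "Severity: message" text of every "line N column M - ..." entry.
    public static HashSet<string> Messages(string report)
    {
        var messages = new HashSet<string>();
        foreach (string rawLine in report.Split('\n'))
        {
            string line = rawLine.Trim();
            int dash = line.IndexOf("- ", StringComparison.Ordinal);
            if (line.StartsWith("line ", StringComparison.Ordinal) && dash >= 0)
            {
                messages.Add(line.Substring(dash + 2).Trim());
            }
        }
        return messages;
    }

    // Returns the messages found in 'first' but not in 'second', sorted for stable output.
    public static List<string> OnlyIn(HashSet<string> first, HashSet<string> second)
    {
        var result = new List<string>();
        foreach (string message in first)
        {
            if (!second.Contains(message))
            {
                result.Add(message);
            }
        }
        result.Sort(StringComparer.Ordinal);
        return result;
    }
}
```

With the report of Test Case # 1 as the first argument and the report of Test Case # 2 as the second, OnlyIn lists the diagnostics that disappeared (including the <tidy> error); swapping the arguments lists the ones that the new options introduced.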

This article gives only a small overview of the Tidy library and EfTidyNet. For more information on the Tidy library, visit the Tidy home page.

Author Comment

I know there is much scope for improvement in this component. I promise these improvements will be there in the next version/update of the library. If you encounter any bugs, please let me know so that I can improve the code further.

Files Listed with the Project

EfTidy Version 1.0.2.0 (Latest)

  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilingual support
    • Source code updated for Visual Studio 2010
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.3

  • Source zip contains:
    • TidyLib (original Tidy library) March 2009 release source code
    • EfTidyNet source code with multilanguage support
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0.1.2

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • Thanks to Wingogo and megger83 for bug reporting!
  • Project zip contains:
    • Release version of EfTidyNet Library

EfTidy Version 1.0.1.1

  • Source zip contains:
    • TidyLib (original Tidy library) 2008 release source code
    • EfTidyNet source code with multilanguage support
    • EfTidyNetx64 version by Spike!
  • Project zip contains:
    • Release version of EfTidyNet Library
    • C# test project (with source)
    • Test.htm

EfTidy Version 1.0

  • Source zip contains:
    • TidyLib (original Tidy library) source code
    • EfTidyNet source code
  • Project zip contains:
    • Release version of EfTidyNet library
    • C# Test project (with source)
    • Test.htm

Special Thanks

  • Mr. Saurabh Gupta [Director Efextra eSolutions Pvt. Ltd.]
  • Mr Spike! for creating X64 version of EfTidyNet
  • Tidy SourceForge group for Tidy library

Update History

  • 6th September, 2013: EfTidyNet version 1.0.2.0
  • 20th July, 2009: EfTidyNet version 1.0.1.3
  • 23rd June, 2008: EfTidyNet version 1.0.1.2
  • 5th March, 2008: EfTidyNet version 1.0.1.1
  • 15th February, 2008: EfTidyNet version 1.0

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)