
Auto-TOC Generation and Header Numbering - Revision

22 Aug 2017 · CPOL · 16 min read
This article revises the HTML authoring tool, HTML TOC Generator, that generates a Table of Contents for an HTML document. Optionally, the tool will number the HTML headers.

1. Introduction

This article is a revision that:

  • Removes test cases from this article, tool, and the downloads
  • Adds a feature that recalls the last directory visited
  • Removes the generated "name" attribute
  • Removes the generated "px" units on <img> width and height attributes

The last two revisions were made to adhere to the HTML5 standard.

2. Purpose

The HTML TOC Generator tool performs modifications to a source HTML document as directed by the contents of a <div> element with the class "toc". There are three distinct modes of operation: generation, removal, and numbering.

In generation mode, the tool

  • Generates a Table of Contents (TOC) for the HTML <h2> to <h6> tags appearing in an HTML document.
  • Allows specifying which tags are to be included in the TOC.
  • Allows the specified tags to be non-contiguous (e.g., "h2,h3,h5" would generate a table of contents for HTML headings at levels 2, 3, and 5, skipping h4).
  • Allows specifying a heading for the TOC and the HTML heading level to be used for the TOC heading.
  • Allows specifying whether or not a link back to the TOC is desired (such a link allows a reader to return to the TOC from each HTML heading included in the TOC). If a return link is desired, allows specifying an image to be placed in the link.
  • Places the TOC within the HTML document where the TOC <div> element is located.

In removal mode, the tool

  • Removes all HTML previously generated by the HTML TOC Generator from the HTML document.
  • Removes the TOC <div> element from the HTML document.

In numbering mode, the tool performs the same actions as in generation mode but, in addition, generates heading numbering.

3. TOC-div Element

The TOC-div element specifies the desired contents of the TOC as well as the placement of the TOC within the HTML document. The TOC-div element affects the HTML output in both the generation and numbering modes. It is ignored in the removal mode.

In its simplest form, the TOC-div element takes the form:

<div class="toc"></div>

The generated TOC will be placed within the HTML document where the TOC-div element is placed. The generated TOC replaces the TOC-div element.

The choice of the "toc" class was intentional. With this class, the style associated with the TOC can be specified in CSS. In the event that the CSS does not define such a class, the following can be placed in the <head> of the HTML document.

<style type="text/css">
  .toc
    {
    }
  .toc-generated
    {
    }
</style>

The "toc-generated" class is discussed below. Note that neither class needs to be defined for the HTML TOC Generator to execute.

The format of the TOC-div element, in a modified BNF, is:

TOC-div           ::= <div class="toc" 
                          [style="[toc-headers[:<heading-tags-list>];]
                                  [toc-return[:(true|false)];]
                                  [toc-title[:<title-of-toc>];]
                                  [toc-image[:<path-to-image>];]
                                  [toc-image-width[:<width-in-pixels>];]
                                  [toc-image-height[:<height-in-pixels>];]
                                  [toc-header-level[:<heading-tag>];]
                                  [toc-numbering[:<level-list>];]"]>
                      </div> .

heading-tags-list ::= heading-tag 
                  ::= heading-tag, heading-tags-list .

heading-tag       ::= "h2"
                  ::= "h3"
                  ::= "h4"
                  ::= "h5"
                  ::= "h6" .

level-list        ::= level-value
                  ::= level-value, level-list .

level-value       ::= [heading-tag] digit .

digit             ::= "0"
                  ::= "1"
                  ::= "2"
                  ::= "3"
                  ::= "4"
                  ::= "5"
                  ::= "6"
                  ::= "7"
                  ::= "8"
                  ::= "9" .

3.1. TOC-div Attributes

The two attributes of the TOC-div element are "class" and "style".

3.1.1. class

The class attribute is required and must have the attribute value of "toc". Although the attribute value is case-insensitive, the value should be lowercase (as recommended by W3C).

3.1.2. style

The style attribute is optional and contains, as its properties, the desired contents of the TOC. If the attribute is omitted, the following default TOC-div element will be used:

<div class="toc"
     style="toc-headers:h2,h3,h4,h5,h6;
            toc-return:true;
            toc-title:Table of Contents;
            toc-image:/app_themes/codeproject/img/gototop16.png;
            toc-image-width:16;
            toc-image-height:16;
            toc-header-level:h2;">
</div>

Note that property names are separated from their property values by a colon (":") and that a semicolon (";") separates properties from one another. When multiple property values are supplied (as in toc-headers, above), they are separated from one another by a comma (",").
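The property syntax just described can be sketched as a small parser. This is only an illustration, not the code in HTMLTOCGenerator.cs: the class TocStyleSketch and the method ParseStyle are hypothetical names, and the sketch assumes that property values contain no semicolons.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public static class TocStyleSketch
    {
    // Splits "name:value;name:value;" into a name -> value map.
    // Assumption: a value never contains a semicolon.
    public static Dictionary<string, string> ParseStyle ( string style )
        {
        Dictionary<string, string> properties =
            new Dictionary<string, string> ( StringComparer.Ordinal );

        foreach ( string part in ( style ?? "" ).Split ( ';' ) )
            {
            string property = part.Trim ( );
            if ( property.Length == 0 )
                {
                continue;               // skip empty segments
                }
                                // the first colon separates the
                                // name from the value
            int    colon = property.IndexOf ( ':' );
            string name = ( colon < 0 ) ?
                          property :
                          property.Substring ( 0, colon );
            string value = ( colon < 0 ) ?
                           "" :
                           property.Substring ( colon + 1 );

                                // names are stored in lowercase;
                                // a later duplicate replaces an
                                // earlier one
            properties [ name.Trim ( ).ToLowerInvariant ( ) ] =
                value.Trim ( );
            }

        return ( properties );
        }
    }
```

A property supplied without a value, such as "toc-numbering;", is stored with an empty string, which leaves the caller free to substitute a default.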

3.2. TOC-div style Properties

The TOC-div element style properties control what is generated by the HTML TOC Generator. As shown above, the style attribute and its properties can be omitted. However, the style properties give significant control over the contents of the generated TOC.

3.2.1. toc-headers

toc-headers specifies which HTML heading tags will generate an entry in the TOC. Note that HTML <h1> tags are never processed by the HTML TOC Generator.

The toc-headers property may be omitted. If it is omitted, or if it is present but the heading-tags-list is omitted, entries for all HTML heading tags appearing in the HTML document will be placed in the TOC.

The heading-tags-list is composed of one or more of "h2", "h3", "h4", "h5", or "h6", in any order, in any case, separated by commas. White-space within the heading-tags-list is ignored. An empty heading-tags-list is treated as if the heading-tags-list was omitted. Unrecognized or duplicate values within the heading-tags-list are ignored.

An example of a heading-tags-list is "h3,H5,h 2,foo,h4bar,h8,h2". For this example, TOC entries will be generated for the HTML heading tags <h2>, <h3>, and <h5>. "foo", "h4bar", and "h8" will be ignored. The entry "h 2" will be recognized as "h2" and the duplicate "h2" will be ignored. The entry "H5" will be modified to "h5". Even though the entries in the heading-tags-list are unordered, the TOC entries will be ordered.
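The normalization rules above can be sketched as follows. The names HeadingTagsSketch and Normalize are hypothetical, and the sketch assumes that a list containing no recognized value is treated like an empty list; the actual tool may behave differently in that case.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public static class HeadingTagsSketch
    {
    static readonly string [ ] Recognized =
        { "h2", "h3", "h4", "h5", "h6" };

    // Returns the recognized heading tags, lowercase, without
    // duplicates, in ascending order.
    public static List<string> Normalize ( string headingTagsList )
        {
        List<string> result = new List<string> ( );

        foreach ( string raw in
                  ( headingTagsList ?? "" ).Split ( ',' ) )
            {
                                // white-space is ignored, so
                                // "h 2" becomes "h2"
            StringBuilder sb = new StringBuilder ( );
            foreach ( char c in raw )
                {
                if ( !char.IsWhiteSpace ( c ) )
                    {
                    sb.Append ( char.ToLowerInvariant ( c ) );
                    }
                }
            string tag = sb.ToString ( );
                                // unrecognized and duplicate
                                // values are ignored
            if ( ( Array.IndexOf ( Recognized, tag ) >= 0 ) &&
                 !result.Contains ( tag ) )
                {
                result.Add ( tag );
                }
            }
                                // assumption: nothing usable
                                // means "all heading tags"
        if ( result.Count == 0 )
            {
            result.AddRange ( Recognized );
            }
        result.Sort ( StringComparer.Ordinal );

        return ( result );
        }
    }
```

Applied to the example list "h3,H5,h 2,foo,h4bar,h8,h2", the sketch keeps h2, h3, and h5.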

During the processing of HTML headings, HTML text formatting tags are retained. These include:

Tag        Description
<b>        Bold text
<del>      Deleted text
<em>       Emphasized text
<i>        Italicized text
<ins>      Inserted text
<mark>     Marked/highlighted text
<small>    Smaller text
<strong>   Important text
<sub>      Subscripted text
<sup>      Superscripted text

To more fully understand the HTML TOC Generation processing, some basic terms used in this article need to be defined.

[Figure: Element Definition]

An HTML element is considered to start at the opening "<" of its opening tag and to end at the closing ">" of its closing tag.

The content of an HTML element is considered to start immediately following the closing ">" of its opening tag and to end immediately before the opening "<" of its closing tag.

During its initial processing of <h?> and <div> elements, HTML TOC Generator removes any previously generated elements with the class "toc-generated". For example, if the previous example header had been processed, it might have the following form.

[Figure: Expanded Element]

The "toc_bookmark_1" bookmark is the target of the entry in the TOC for this heading. The "toc-generated" class is the signal to the HTML TOC Generator that this element is to be removed during any removal process.

The href in the second <a> element points to the TOC. The <img> element is present because the user did not specify a toc-image and so the default was used. Again, the "toc-generated" class is the signal to remove this element.

When all initial processing is completed, the heading element will appear as the original heading, above.

3.2.2. toc-return

toc-return specifies whether or not a return link to the TOC will be placed in the content of the HTML heading tag. Such a return link allows a reader to return to the TOC from locations within the document. The recognized property values are "true" and "false".

If the toc-return property is omitted, is present without a property value, or is present with an unrecognized property value, return links to the TOC will be placed in the InnerHtml of the HTML heading tags and a bookmark named "toc_return_to_toc" will be placed in the TOC.

3.2.3. toc-title

toc-title specifies the title for the TOC. If toc-title is missing, the title "Table of Contents" will prefix the TOC.

The value of toc-title may contain any alphanumeric character plus any of the following characters:

Tilde (~)
Exclamation mark (!)
Number sign (#)
Dollar sign ($)
Percent sign (%)
Circumflex accent (^)
Ampersand (&)
Asterisk (*)
Left parenthesis (()
Right parenthesis ())
Underscore (_)
Plus sign (+)
Grave accent (`)
Hyphen (-)
Equals sign (=)
Left bracket ([)
Right bracket (])
Vertical line (|)
Semicolon (;)
Colon (:)
Greater-than symbol (>)
Question mark (?)
Comma (,)
Period (.)
Space ( )

Any other character will be removed from toc-title. If, after processing, an empty string results, no TOC title will be generated.
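The filtering rule can be sketched as a character whitelist. The names TocTitleSketch and Sanitize are hypothetical, and the sketch assumes that "alphanumeric" means the ASCII letters and digits; the tool itself may accept a wider set.

```csharp
using System.Text;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public static class TocTitleSketch
    {
    // the punctuation listed above, including the space
    const string AllowedPunctuation = "~!#$%^&*()_+`-=[]|;:>?,. ";

    // Returns the title with every disallowed character removed.
    // An empty result means that no TOC title is generated.
    public static string Sanitize ( string title )
        {
        StringBuilder sb = new StringBuilder ( );

        foreach ( char c in ( title ?? "" ) )
            {
                                // assumption: ASCII letters and
                                // digits only
            bool alphanumeric = ( ( c >= 'a' ) && ( c <= 'z' ) ) ||
                                ( ( c >= 'A' ) && ( c <= 'Z' ) ) ||
                                ( ( c >= '0' ) && ( c <= '9' ) );

            if ( alphanumeric ||
                 ( AllowedPunctuation.IndexOf ( c ) >= 0 ) )
                {
                sb.Append ( c );
                }
            }

        return ( sb.ToString ( ) );
        }
    }
```

Note that "<", the quotation marks, and "@" are not in the list, so a title such as My <TOC> loses its "<" but keeps its ">".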

3.2.4. toc-image

toc-image specifies the path to an image that will be placed in the return link to the TOC in the text of the HTML heading tag. If toc-image or the toc-image property value is missing, the path

"/app_themes/codeproject/img/gototop16.png"

will be used. The value of the property may contain any valid path character. No test is made to ensure that a valid path is provided.

The default path is defined specifically for Code Project articles. Documents for which a TOC is generated, but will not be published at Code Project, should have a toc-image specified. The image path must be "visible" to the HTML document. See the discussion, below.

3.2.5. toc-image-width

toc-image-width specifies the width of the image that will be placed in the return link to the TOC in the text of the HTML heading tag. If toc-image-width or the toc-image-width property value is missing, the toc-image-width defaults to 16 pixels. Note that units are not supplied. In HTML5, the width attribute specifies the width of the image, in pixels.

3.2.6. toc-image-height

toc-image-height specifies the height of the image that will be placed in the return link to the TOC in the text of the HTML heading tag. If toc-image-height or the toc-image-height property value is missing, the toc-image-height defaults to 16 pixels. Note that units are not supplied. In HTML5, the height attribute specifies the height of the image, in pixels.

3.2.7. toc-header-level

toc-header-level specifies the HTML header level that will be used to display the TOC title. If the toc-header-level property is missing, the HTML header level "h2" will be used for the TOC title element.

The value of the property may be any of the HTML header levels "h2", "h3", "h4", "h5", or "h6". Any other value will be ignored and the toc-header-level value will become "h2".

3.2.8. toc-numbering

toc-numbering provides for the insertion of heading numbering within the HTML document. If toc-numbering is missing, heading numbering will not be inserted into the HTML document.

If toc-numbering is present but the toc-numbering property value is missing, heading numbering of all headers will be inserted into the HTML document using a level-list of "h21,h31,h41,h51,h61". This level-list will produce the following heading numbering:

1. H2 heading
1.1. H3 heading
1.1.1. H4 heading
1.1.1.1. H5 heading
1.1.1.1.1. H6 heading

If a large HTML document is broken into separate HTML documents, by using a level-list that differs from one HTML document to the next, heading numbering can be made continuous across the separate HTML documents.

For example, a portion of a large HTML document is:

<div class="toc"
     style="toc-numbering;"></div>
<h2>Heading 1</h2>
    :
    large amount of HTML
    :
<h2>Heading 2</h2>
    :
    large amount of HTML
    :
<h2>Heading 3</h2>
    :
    large amount of HTML
    :

Because the HTML between the individual h2 elements is too large to fit into the desired page size, the HTML document will be broken into smaller HTML documents at the h2 header levels. However, heading numbering should be continuous across all pieces of the document. By modifying the level-list of the toc-numbering property for each of the smaller HTML documents, continuous heading numbering can be achieved.

<div class="toc"
     style="toc-numbering:h21;"></div>
<h2>Heading 1</h2>
    :
    large amount of HTML
    :

<div class="toc"
     style="toc-numbering:h22;"></div>
<h2>Heading 2</h2>
    :
    large amount of HTML
    :

<div class="toc"
     style="toc-numbering:h23;"></div>
<h2>Heading 3</h2>
    :
    large amount of HTML
    :

Any level (h2 through h6) can have its starting level number specified.
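One way to picture how a level-list drives the numbering is a set of per-level counters. The sketch below is not the tool's TOCNumbering class; HeadingNumbererSketch and Next are hypothetical names. It assumes that every level-value is written in the full form (heading tag followed by one digit, such as "h23") and that a skipped intermediate level contributes its starting value minus one.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public class HeadingNumbererSketch
    {
                                // index 0..4 stands for h2..h6
    readonly int [ ] start = { 1, 1, 1, 1, 1 };
    readonly int [ ] counter = new int [ 5 ];

    public HeadingNumbererSketch ( string levelList )
        {
        foreach ( string raw in ( levelList ?? "" ).Split ( ',' ) )
            {
            string value = raw.Trim ( ).ToLowerInvariant ( );
                                // accept only "h", a level of
                                // 2..6, and one start digit
            if ( ( value.Length == 3 ) && ( value [ 0 ] == 'h' ) &&
                 ( value [ 1 ] >= '2' ) && ( value [ 1 ] <= '6' ) &&
                 ( value [ 2 ] >= '0' ) && ( value [ 2 ] <= '9' ) )
                {
                start [ value [ 1 ] - '2' ] = value [ 2 ] - '0';
                }
            }
        for ( int i = 0; ( i < 5 ); i++ )
            {
            counter [ i ] = start [ i ] - 1;
            }
        }

    // Returns the numbering prefix, e.g. "1.1.2. ", for the next
    // heading at the given level (2..6).
    public string Next ( int level )
        {
        if ( ( level < 2 ) || ( level > 6 ) )
            {
            throw new ArgumentOutOfRangeException ( "level" );
            }
        int index = level - 2;

        counter [ index ]++;
                                // deeper levels restart
        for ( int i = index + 1; ( i < 5 ); i++ )
            {
            counter [ i ] = start [ i ] - 1;
            }
        StringBuilder sb = new StringBuilder ( );
        for ( int i = 0; ( i <= index ); i++ )
            {
            sb.Append ( counter [ i ] ).Append ( '.' );
            }
        sb.Append ( ' ' );

        return ( sb.ToString ( ) );
        }
    }
```

Constructing the sketch with "h23" makes the first h2 heading in that document receive the prefix "3. ", which is how the third of the split documents above continues the numbering.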

4. Generated TOC

4.1. Generation

The following discussion assumes that the following TOC-div element is found in an HTML document being submitted to the HTML TOC Generator:

<div class="toc">
</div>

The TOC-div element will be rewritten to display the properties used during the processing of the HTML document. The toc-title is placed in the default header level entry (in this case <h2>) in the contents of the rewritten TOC-div element. The toc-title is assigned to the class "toc-generated". Following the title will appear a generated <div> that will contain the actual TOC. This <div> is assigned to the class "toc-generated". The first entry in this <div> will be the TOC bookmark ("toc_return_to_toc") used to return to the TOC from locations within the HTML document. Immediately following will be the opening tag for the unordered list that comprises the actual TOC.

So far, the generated TOC-div element would appear as

<div class="toc"
     style="toc-headers:h2,h3,h4,h5,h6;
            toc-return:true;
            toc-title:Table of Contents;
            toc-return-image:/app_themes/codeproject/img/gototop16.png;
            toc-image_width:16;
            toc-image_height:16;
            toc-header-level:h2;" >
  <h2 class="toc-generated">Table of Contents</h2>
  <div class="toc-generated">
    <a id="toc_return_to_toc"></a>
    <ul>

Depending on the contents of the document, each generated TOC entry is an <li> element within a (possibly nested) <ul> element. Nesting occurs when a subordinate heading level is encountered. Given the following <h2> tag:

<h2>Introduction</h2>

the following entry will be generated in the TOC

<li><a href="#toc_bookmark_1">Introduction</a></li>

and the <h2> element will be modified to

<h2>Introduction
  <a id="toc_bookmark_1"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a>
</h2>

Any existing bookmark or link with a class of "toc-generated" will be removed. In the preceding example, all that would remain before regeneration would be:

<h2>Introduction</h2>

The bookmark ID attribute value (i.e., "toc_bookmark_1") is generated by the HTML TOC Generation process and will be unique within the HTML document (assuming that no pathological case exists wherein the input HTML already contains the generated value). In this example, an image link is also generated that, when clicked, will return the reader to the top of the TOC.
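The nesting of <li> entries within <ul> elements can be sketched with a list of open heading levels. This is an illustration only; TocListSketch and Build are hypothetical names, bookmarks are numbered from zero in document order, and the output is emitted without the indentation that the tool produces.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public static class TocListSketch
    {
    // levels [ i ] is the heading level (2..6) and titles [ i ]
    // the heading content of the i-th heading, in document order
    public static string Build ( IList<int>    levels,
                                 IList<string> titles )
        {
        StringBuilder sb = new StringBuilder ( "<ul>" );
                                // level of each open <ul>,
                                // outermost first
        List<int>     open = new List<int> ( );

        for ( int i = 0; ( i < levels.Count ); i++ )
            {
            int level = levels [ i ];

            if ( open.Count == 0 )
                {
                open.Add ( level );
                }
            else if ( level > open [ open.Count - 1 ] )
                {
                                // subordinate heading: nest a
                                // list inside the open <li>
                sb.Append ( "<ul>" );
                open.Add ( level );
                }
            else
                {
                sb.Append ( "</li>" );
                                // close each nested list whose
                                // parent is at or below this
                                // heading's level
                while ( ( open.Count > 1 ) &&
                        ( level <= open [ open.Count - 2 ] ) )
                    {
                    sb.Append ( "</ul></li>" );
                    open.RemoveAt ( open.Count - 1 );
                    }
                open [ open.Count - 1 ] = level;
                }
            sb.AppendFormat (
                "<li><a href=\"#toc_bookmark_{0}\">{1}</a>",
                i,
                titles [ i ] );
            }
                                // close whatever is still open
        if ( levels.Count > 0 )
            {
            sb.Append ( "</li>" );
            }
        for ( int n = open.Count; ( n > 1 ); n-- )
            {
            sb.Append ( "</ul></li>" );
            }
        sb.Append ( "</ul>" );

        return ( sb.ToString ( ) );
        }
    }
```

An <li> is left open until the next heading shows whether a nested <ul> belongs inside it, which is why the closing tags are emitted one step late.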

4.2. Numbering

If toc-numbering is specified, as in:

<div class="toc"
     style="toc-numbering;">
</div>

then the initial part of the TOC-div element will be generated as:

<div class="toc"
     style="toc-headers:h2,h3,h4,h5,h6;
            toc-return:true;
            toc-title:Table of Contents;
            toc-return-image:/app_themes/codeproject/img/gototop16.png;
            toc-image_width:16;
            toc-image_height:16;
            toc-header-level:h2;
            toc-numbering:h21,h31,h41,h51,h61;" >
  <h2 class="toc-generated">Table of Contents</h2>
  <div class="toc-generated">
    <a id="toc_return_to_toc"></a>
    <ul>

and if the second <h2> element is

<h2>Introduction</h2>

the following entry will be generated in the TOC

<li><a href="#toc_bookmark_1">2. Introduction</a></li>

and the <h2> element contents will be modified to

<h2><span class="toc-generated" >2. </span>Introduction
  <a id="toc_bookmark_1"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a>
</h2>

4.3. Example

If the following HTML document is submitted to the HTML TOC Generator:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="Content-type" content="text/html;charset=UTF-8" /> 
  <title>Test Auto TOC Generation</title>
  <link type="text/css" 
        rel="stylesheet" 
        href="http://s.codeproject.com/App_Themes/CodeProject/Css/Main.min.css?dt=2.6.130426.1" />
  <style type="text/css">
    .toc
      {
      }
    .toc-generated
      {
      }
  </style>
</head>
<body style="margin: 20px;">
<div class="toc">
</div>
<h2>Header Level <b>2</b> - <i>Number 1</i></h2>
<p>H2 1</p>
<h3>Header Level 3 - Number 1</h3>
<p>H3 1</p>
<h4>Header Level 4 - Number 1</h4>
<p>H4 1</p>
<h4>Header Level 4 - Number 2</h4>
<p>H4 2</p>
<h5>Header Level 5 - Number 1</h5>
<p>H5 1</p>
<h6>Header Level 6 - Number 1</h6>
<p>H6 1</p>
<h4>Header Level 4 - Number 3</h4>
<p>H4 3</p>
<h3>Header Level 3 - Number 2</h3>
<p>H3 2</p>
<h2>Header Level 2 - Number 2</h2>
<p>H2 2</p>
</body>
</html>

the HTML document that would be generated is:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="Content-type" content="text/html;charset=UTF-8" /> 
  <title>Test Auto TOC Generation</title>
  <link type="text/css" 
        rel="stylesheet" 
        href="http://s.codeproject.com/App_Themes/CodeProject/Css/Main.min.css?dt=2.6.130426.1" />
  <style type="text/css">
    .toc
      {
      }
    .toc-generated
      {
      }
  </style>
</head>
<body style="margin: 20px;">
<div class="toc"
     style="toc-headers:h2,h3,h4,h5,h6;
            toc-return:true;
            toc-title:Table of Contents;
            toc-return-image:/app_themes/codeproject/img/gototop16.png;
            toc-image_width:16;
            toc-image_height:16;
            toc-header-level:h2;" >
  <h2 class="toc-generated">Table of Contents</h2>
  <div class="toc-generated">
    <a id="toc_return_to_toc"></a>
    <ul>
      <li><a href="#toc_bookmark_0">Header Level <b>2</b> - <i>Number 1</i></a>
        <ul>
          <li><a href="#toc_bookmark_1">Header Level 3 - Number 1</a>
            <ul>
              <li><a href="#toc_bookmark_2">Header Level 4 - Number 1</a></li>
              <li><a href="#toc_bookmark_3">Header Level 4 - Number 2</a>
                <ul>
                  <li><a href="#toc_bookmark_4">Header Level 5 - Number 1</a>
                    <ul>
                      <li><a href="#toc_bookmark_5">Header Level 6 - Number 1</a></li>
                    </ul>
                  </li>
                </ul>
              </li>
              <li><a href="#toc_bookmark_6">Header Level 4 - Number 3</a></li>
            </ul>
          </li>
          <li><a href="#toc_bookmark_7">Header Level 3 - Number 2</a></li>
        </ul>
      </li>
      <li><a href="#toc_bookmark_8">Header Level 2 - Number 2</a></li>
    </ul>
    <p>
    The symbol 
    <a href="#toc_return_to_toc">
      <img alt="Table of Contents"
           title="Table of Contents"
           src="/app_themes/codeproject/img/gototop16.png"
           width="16"
           height="16" />
    </a> 
    returns the reader to the top of the Table of Contents.
    </p>
  </div>
</div>

<h2>Header Level <b>2</b> - <i>Number 1</i>
  <a id="toc_bookmark_0"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h2>
<p>H2 1</p>
<h3>Header Level 3 - Number 1
  <a id="toc_bookmark_1"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h3>
<p>H3 1</p>
<h4>Header Level 4 - Number 1
  <a id="toc_bookmark_2"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h4>
<p>H4 1</p>
<h4>Header Level 4 - Number 2
  <a id="toc_bookmark_3"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h4>
<p>H4 2</p>
<h5>Header Level 5 - Number 1
  <a id="toc_bookmark_4"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h5>
<p>H5 1</p>
<h6>Header Level 6 - Number 1
  <a id="toc_bookmark_5"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h6>
<p>H6 1</p>
<h4>Header Level 4 - Number 3
  <a id="toc_bookmark_6"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h4>
<p>H4 3</p>
<h3>Header Level 3 - Number 2
  <a id="toc_bookmark_7"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h3>
<p>H3 2</p>
<h2>Header Level 2 - Number 2
  <a id="toc_bookmark_8"
     class="toc-generated" >
  </a>
  <a href="#toc_return_to_toc"
     class="toc-generated" >
    <img alt="Table of Contents"
         title="Table of Contents"
         src="/app_themes/codeproject/img/gototop16.png"
         width="16"
         height="16" />
  </a> 
</h2>
<p>H2 2</p>
</body>
</html>

5. TOC Removal

When the TOC removal process is invoked against an HTML document that was processed by the HTML TOC Generator, all generated HTML will be removed. This includes the TOC-div element as well as all elements with the class "toc-generated".

6. Implementation

The HTML TOC Generator is encapsulated in the HTMLTOCGenerator.cs file. Two entry points provide the services: add_TOC_to_html and remove_TOC_from_html. Both take a single string argument that is the HTML upon which to execute. Both return a string that contains the possibly revised HTML.

The using directives for HTMLTOCGenerator.cs are:

C#
using System;
using System.Collections.Generic;
using System.Text;

using CONST = HTMLTOCGenerator.Constants;
using DATA = HTMLTOCGenerator.Data;
using ELEMENT = HTMLTOCGenerator.Element;
using HTMLPARSER = HTMLTOCGenerator.HTMLParser;
using NUMBERING = HTMLTOCGenerator.TOCNumbering;
using TOC = HTMLTOCGenerator.TOCDIV;
using TYPE = HTMLTOCGenerator.Constants.Element_Type;

The two methods are:

C#
// ********************************************* add_TOC_to_html

/// <summary>
/// returns the html that was revised by applying the TOC-div
/// found in the supplied html
///
/// if a TOC-div is not found, returns the source html
/// </summary>
public static string add_TOC_to_html ( string html )
    {
    HTMLPARSER     HTML_parser = new HTMLPARSER ( );
    string         rewriten_html = html;

    HTML_parser.collect_all_desired_elements ( html );
    if ( TOC.HaveTOCDIV )
        {
        HTML_parser.revise_element_content ( );
        HTML_parser.eliminate_unwanted_elements ( );
        rewriten_html = rewrite_html ( html );
        }

    return ( rewriten_html );
    }

// ************************************** remove_TOC_from_html

/// <summary>
/// returns the html that has all auto-generated TOC entries
/// removed from the source html
/// </summary>
public static string remove_TOC_from_html ( string html )
    {
    HTMLPARSER      HTML_parser = new HTMLPARSER ( );
    int             html_start = 0;
    int             html_to_copy = 0;
    StringBuilder   sb = new StringBuilder ( );

    HTML_parser.collect_all_desired_elements ( html );
    HTML_parser.revise_element_content ( );
    foreach ( ELEMENT element in DATA.Elements )
        {
                                // copy html up to the next
                                // header or TOC-div element
        html_to_copy = element.ElementStartsAt -
                       html_start - 1;
        if ( html_to_copy > 0 )
            {
            sb.Append ( html, html_start, html_to_copy );
            }
                                // copy in the rewritten
                                // contents of the element
        if ( element.ElementType == TYPE.TOCDIV )
            {
                                // the TOC-div element is
                                // dropped; nothing is copied
            }
        else
            {
            sb.AppendFormat ( "\n<{0}>{1}</{0}>\n",
                      element.TagName,
                      element.Content );
            }
        html_start = element.ElementEndsAt + 1;
        }
    html_to_copy = html.Length - html_start;
    sb.Append ( html, html_start, html_to_copy );

    return ( sb.ToString ( ) );
    }

The HTML parser is encapsulated in the class HTMLParser. The parser was originally developed by Jeff Heaton and is available as a C# Parser. Major modifications were made to the parser so that it limits itself to <h2>, <h3>, <h4>, <h5>, <h6>, and <div> elements.

The generation, numbering, and removal processes make a single pass through the HTML. When all of the desired headers and <div>s have been identified, the HTML is copied to the output.

For the generation and numbering modes to modify source HTML, a TOC-div element must be located within the source HTML. If that element is not found, then the source HTML is returned unmodified. An invoking program can determine if this occurred by testing the lengths of the source and the returned HTML. If the lengths are the same, a TOC-div element was not found and no modifications were made.

The removal process does not require that a TOC-div element exist. It seeks all header elements containing <a> or <span> elements with the class of "toc-generated". It then removes these elements. It also removes any existing TOC-div element that it finds.
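The effect of the removal on the content of a single heading can be approximated with a regular expression. This is a different technique from the one the tool uses (the tool relies on its HTML parser), and the names TocRemovalSketch and StripGenerated are hypothetical. The sketch assumes that generated elements are not nested within one another and that the class attribute value is written exactly as "toc-generated" in double quotes.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HTMLTOCGenerator.cs.
public static class TocRemovalSketch
    {
    // matches an <a> or <span> element, from its opening tag to
    // its closing tag, whose opening tag carries
    // class="toc-generated"
    static readonly Regex Generated = new Regex (
        @"<(a|span)\b[^>]*\bclass\s*=\s*""toc-generated""[^>]*>" +
        @".*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline );

    // Returns the heading content with all generated elements
    // removed and the surrounding white-space trimmed.
    public static string StripGenerated ( string headingContent )
        {
        return ( Generated.Replace ( headingContent ?? "", "" )
                          .Trim ( ) );
        }
    }
```

For the numbered "Introduction" heading shown in section 4.2, the sketch leaves only the text Introduction.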

The copy process jumps through the HTML, guided by the collected header and <div> data. This is demonstrated by the remove_TOC_from_html source code, above.

I am considering replacing the HTML parser, which uses an indexed buffer, with one that uses StringReader. The advantage of StringReader is its look-ahead capability (i.e., the Peek method). The disadvantage is the time needed to implement the revision.

7. HTML TOC Generator Tool

Although the HTML TOC Generator Tool was designed to test the two methods add_TOC_to_html and remove_TOC_from_html, because it produces useful HTML, it is included in the downloads for this article.

The images supplied in this section are thumbnails. Clicking on an image displays an expanded version.

7.1. HTML TOC Generator Tool Startup

Input to the tool is made through the RichTextBox in the HTML Input tab. There are two ways in which input can be provided:

1. Directly copying HTML into the RichTextBox.
2. Choosing an HTML file by using the Browse button.

[Figure: HTML TOC Start]

7.2. HTML TOC Generator Tool Input Phase

When the HTML Input tab RichTextBox contents have been supplied, the Generate button appears. In the event that the HTML contains the string "toc-generated", a Remove button will also appear.

[Figure: HTML TOC Input]

In the example above, the Browse button was used to obtain the HTML for this article. Note that near the bottom, a TOC-div element is defined. Note too that the Remove button is visible even though TOC generation has not occurred. This happened because this article contains the string "toc-generated".

The contents of the HTML Input tab RichTextBox may be modified before the Generate button is clicked.

7.3. HTML TOC Generator Tool Creating TOC

The tool is not re-entrant: once the Generate button is clicked, the button is hidden. To apply the tool against another HTML file, it is necessary to re-execute the tool.

[Figure: HTML TOC Revised]

When the Generate button is clicked, the add_TOC_to_html method is invoked against the contents of the HTML Input tab RichTextBox. The results of its execution are placed in the Revised HTML tab RichTextBox.

In the example above, all headers have been modified. In addition (although not visible), the TOC-div element has been replaced as described above.

Navigation between the two tabs is supported.

If desired, the revised HTML may be saved. This is achieved by clicking on the Save button and completing the Save File Dialog. For ease of use, a filename is proposed for the save operation. It is constructed from the original input filename, with ".TOC" inserted after the input filename and before the extension.
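The proposed filename can be sketched with the System.IO.Path methods. The names SaveNameSketch and Propose are hypothetical, and the tool's own code may construct the name differently.

```csharp
using System.IO;

// Hypothetical helper, not part of the tool's source.
public static class SaveNameSketch
    {
    // Inserts ".TOC" after the filename and before the
    // extension, e.g. "article.html" -> "article.TOC.html".
    public static string Propose ( string inputPath )
        {
                                // ".html", or "" when the input
                                // has no extension
        string extension = Path.GetExtension ( inputPath );

                                // ChangeExtension replaces the
                                // old extension with the
                                // combined one
        return ( Path.ChangeExtension ( inputPath,
                                        ".TOC" + extension ) );
        }
    }
```

Any directory portion of the input path is carried through unchanged, so the proposal points at the same folder as the original file.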

7.4. HTML TOC Generator Tool TOC Removal

The removal process operates in much the same way as generation and numbering. An HTML file is chosen and, if "toc-generated" is found in the document, a Remove button is displayed. When it is clicked, all elements generated by the HTML TOC Generator are removed. The TOC-div element is also removed.

8. Return to TOC Image

[Figure: Return To Toc]

For an HTML document that will not be published by Code Project, a graphic, named ReturnToToc.png, that could be used for the toc-image is included in the download, in the HTMLTOCGeneratorDialogProject ZIP. The image is 31 x 31 pixels. I recommend that the width and height be set to 16 (as was done for the image to the left). There are no copyright restrictions on the image.

9. Conclusion

This article has presented revisions to an HTML authoring tool that generates a Table of Contents for an HTML document. Additionally, the tool can be directed to produce numbered HTML headers.

10. References

11. Development Environment

The HTML TOC Generator was developed in the following environment:

Microsoft Windows 7 Professional Service Pack 1
Microsoft Visual Studio 2008 Professional
Microsoft .NET Framework Version 3.5 SP1
Microsoft Visual C# 2008

12. History

08/22/2017 HTML TOC Generator V4.1
04/10/2015 Original article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


