Parsing an HTML document by using a recursive function

14 Apr 2010, CPOL
This is an example of one way to parse an HTML document by using a recursive function. In this example, the HTML document is loaded from a text file, but the code also shows (in commented-out lines) how to use a web page as the source.

When the recursive function is called, a conditional statement checks each HTML element for child elements. If the element has children, the function calls itself on those children, so each child is checked for children in turn. Eventually the function reaches an element with no children, and that element's inner text (and its href attribute, if it has one) is appended to a textbox. If this description is confusing, stepping through the program in the debugger might help.
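
To try the recursion on its own, without a form or a WebBrowser control, here is a minimal console sketch. The Node class, the Walk routine and the sample tree are hypothetical stand-ins for HtmlElement and the real document; only the branching logic mirrors the description above.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical stand-in for HtmlElement: a tag name, optional text, and child nodes.
Public Class Node
    Public Property TagName As String
    Public Property InnerText As String
    Public Property Children As New List(Of Node)

    Public Sub New(ByVal name As String, Optional ByVal content As String = "")
        TagName = name
        InnerText = content
    End Sub
End Class

Module RecursionSketch

    ' Walks the tree depth-first and collects one output line per step.
    Sub Walk(ByVal nodes As IEnumerable(Of Node), ByVal output As List(Of String))
        For Each current As Node In nodes
            If current.Children.Count > 0 Then
                ' The node has children: the routine calls itself on them.
                output.Add("<" & current.TagName & ">")
                Walk(current.Children, output)
                output.Add("</" & current.TagName & ">")
            Else
                ' Leaf node: the recursion stops here and the text is recorded.
                output.Add("     " & current.InnerText)
            End If
        Next
    End Sub

    Sub Main()
        ' Sample tree: <body><p><a>a link</a></p><span>plain text</span></body>
        Dim body As New Node("body")
        Dim paragraph As New Node("p")
        paragraph.Children.Add(New Node("a", "a link"))
        body.Children.Add(paragraph)
        body.Children.Add(New Node("span", "plain text"))

        Dim lines As New List(Of String)
        Walk(New Node() {body}, lines)
        For Each entry As String In lines
            Console.WriteLine(entry)
        Next
    End Sub

End Module
```

The output lists the opening label for body and p, the indented text of the two leaf nodes, and the matching closing labels. Note that leaf nodes contribute only their text, not their tag name, which is also how the article's routine behaves.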

Import the namespace as shown:
'HtmlDocument, HtmlElement and HtmlElementCollection are classes in this namespace
Imports System.Windows.Forms



Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

    Try
        'to parse a live web page instead, uncomment these lines and the DownloadString call
        'Dim webclient As System.Net.WebClient = New System.Net.WebClient
        'Dim url As String = "http://www.somewebsite.com"
        Dim myHTML As String '= webclient.DownloadString(url)

        'instead of downloading the html, let's get it from a file
        Dim filePath As String = "C:\htmlsourcefile.txt"
        'Using closes the file even if ReadToEnd throws
        Using myStreamReader As New System.IO.StreamReader(filePath)
            myHTML = myStreamReader.ReadToEnd()
        End Using

        'suppress script error dialogs before any HTML is loaded
        WebBrowser1.ScriptErrorsSuppressed = True
        'navigate to about:blank first so that there is a document to write into
        WebBrowser1.Navigate("about:blank")
        WebBrowser1.Document.Write(myHTML)
        Dim doc As HtmlDocument = WebBrowser1.Document

        append("The document title is: " & doc.Title)

        Dim headElementCollection As HtmlElementCollection = _
            doc.GetElementsByTagName("head")

        'call the function (no value is returned)
        getChildren(headElementCollection)
        append(vbCrLf)
        Dim bodyElementCollection As HtmlElementCollection = _
            doc.GetElementsByTagName("body")
        'same function again, just for the body this time
        getChildren(bodyElementCollection)

    Catch ex As Exception

        append(ex.ToString)

    End Try

End Sub
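
The commented-out WebClient lines in the handler above hint at the web-page variant. One possible way to support both sources is a small helper that decides from the argument whether to download or to read from disk. This is only a sketch: LoadHtml, HtmlSourceSketch and the sample markup are hypothetical names and data, not part of the original code. WebClient.DownloadString and File.ReadAllText are the only library calls involved.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module HtmlSourceSketch

    ' Hypothetical helper: returns the HTML text for either a web address or a local file path.
    Function LoadHtml(ByVal source As String) As String
        If source.StartsWith("http://", StringComparison.OrdinalIgnoreCase) _
           OrElse source.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            ' WebClient implements IDisposable, so release it when finished.
            Using client As New WebClient()
                Return client.DownloadString(source)
            End Using
        End If
        ' Anything else is treated as a path on disk.
        Return File.ReadAllText(source)
    End Function

    Sub Main()
        ' Demonstrates the file branch only, using a temporary file so no network is needed.
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllText(tempFile, "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>")
        Console.WriteLine(LoadHtml(tempFile))
        File.Delete(tempFile)
    End Sub

End Module
```

In the button handler, the result of such a helper would take the place of the StreamReader code, for example myHTML = LoadHtml(filePath) or myHTML = LoadHtml(url).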


This is the recursive function (declared as a Sub, because it returns no value):
Private Sub getChildren(ByVal xElementCollection As HtmlElementCollection)
    Dim xLabel As String

    For Each parentElement As HtmlElement In xElementCollection
        If parentElement.Children.Count > 0 Then

            'map the tag name to a readable label
            Select Case parentElement.TagName.ToLower
                Case "tr" : xLabel = "Row"
                Case "td" : xLabel = "Cell"
                Case "th" : xLabel = "Header"
                Case "a" : xLabel = "Anchor"
                Case "tbody" : xLabel = "T-Body"
                Case "div" : xLabel = "Division"
                Case "head" : xLabel = "Head"
                Case "body" : xLabel = "Body"
                Case "table" : xLabel = "Table"
                Case "p" : xLabel = "Paragraph"
                Case Else : xLabel = "element not specified"

            End Select

            append("<" & xLabel & ">")
            'recursion: process this element's children the same way
            getChildren(parentElement.Children)
            append("</" & xLabel & ">")

        Else

            'leaf element: output its text, or the placeholder "Null" when it has none
            If Not String.IsNullOrEmpty(parentElement.InnerText) Then
                append("     " & parentElement.InnerText)
            Else
                append("     Null")
            End If

            'output the link target when the element has an href attribute
            Dim href As String = parentElement.GetAttribute("href")
            If Not String.IsNullOrEmpty(href) Then
                append("     " & href)
            End If

        End If
    Next

End Sub


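One last thing: I prefer not to call TextBox1.AppendText("one two three") directly everywhere, so I route all output through a small helper.

'outputXL is not declared in the original listing; a form-level String field is assumed here
Private outputXL As String = ""

Private Sub append(ByVal myTextToAppend As String)
    TextBox1.AppendText(myTextToAppend & vbCrLf)
    'let the form process pending messages while a long document is being parsed
    Application.DoEvents()
    'keep a copy of everything written to the textbox
    outputXL = outputXL & myTextToAppend & vbCrLf
End Sub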
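
For large documents, building outputXL by repeated string concatenation copies the whole string on every call. A possible alternative, shown here as a console sketch with hypothetical names (OutputBufferSketch, outputBuffer, AppendLineToBuffer), is to accumulate the text in a System.Text.StringBuilder and convert it to a string once at the end.

```vbnet
Imports System
Imports System.Text

Module OutputBufferSketch

    ' Hypothetical replacement for the outputXL string field.
    Private ReadOnly outputBuffer As New StringBuilder()

    ' Adds one line of text to the buffer, followed by a line break.
    Sub AppendLineToBuffer(ByVal myTextToAppend As String)
        outputBuffer.AppendLine(myTextToAppend)
    End Sub

    Sub Main()
        AppendLineToBuffer("<Body>")
        AppendLineToBuffer("     some text")
        AppendLineToBuffer("</Body>")
        ' Convert to a single string only when the full output is needed.
        Console.Write(outputBuffer.ToString())
    End Sub

End Module
```

In the form, the same idea would mean replacing the outputXL concatenation inside append with a call to outputBuffer.AppendLine and reading outputBuffer.ToString() wherever outputXL is used.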

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) Jacobs Technology
United States