Wrap your HTML parser to exclude scripting

25 Dec 2003
Processing complex HTML pages will require sectional or content exclusion

Introduction

Most parser-enabled internet applications need to exclude scripts. This wrapper keeps script elements out of the parser's tests and guards against script tainting. After the file is read, it is split into an array for line-by-line processing. If you are trying to suppress anomalies caused by IE, clear line 2 of a saved document to keep IE from reasserting the original document object model. It WILL do that on fresh documents; clearing the line forces it to create a new model. Note that this is done in preparation for subsequent browser navigations, NOT for this parsing session.

VB.NET
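    ' Declare each variable's type explicitly; in classic VB,
    ' "Dim loc, z As Long" leaves loc as a Variant
    Dim loc As Long, z As Long
    Elements = Split(s, vbCrLf)  ' one array entry per line

    Elements(1) = ""             ' clear line 2 (the array from Split is zero-based)
    in_script = False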
    
    For i = 2 To UBound(Elements)
        z = 1
        If in_script = False Then
            loc = InStr(z, UCase(Elements(i)), "<SCRIPT ", vbBinaryCompare)
            If loc > 0 Then
                ' loc > 0 already proves the opening tag is present,
                ' so only the closing tag needs testing here
                If InStr(loc, UCase(Elements(i)), _
                    "</SCRIPT>", vbBinaryCompare) > 0 Then
                    ' The script opens and closes on this line
                    in_script = False
                    Elements(i) = Replace(Elements(i), _
                      Page & "_files/", myscriptsfolder)
                    z = loc + 8
                Else
                    ' The script continues on the following lines
                    Elements(i) = Replace(Elements(i), _
                      Page & "_files/", myscriptsfolder)
                    in_script = True
                End If
            End If
                      
            '/////////////////////////////////////////////////
            '  ADD MORE PARSER METHODS HERE

            ' Base-tag method: calls InsertBaseElement when the
            ' document has no BASE element yet. UCase makes the match
            ' case-insensitive, as in the script tests.
            loc = InStr(z, UCase(Elements(i)), "<HEAD>", vbBinaryCompare)
            If loc > 0 Then
                If (objDocument.getElementsByTagName("BASE").length = 0) Then
                    Elements(i) = InsertBaseElement(Elements(i), loc)
                Else
                    Elements(i) = Replace(Elements(i), s, ARCRoot)
                End If
            End If

            '/////////////////////////////////////////////////
            DoEvents
            '/////////////////////////////////////////////////

            ' This code can be modified to suit special requirements.
            ' It is useful for chopping off a document with dynamic
            ' footer content written by script methods. (Coders may be
            ' trying to make it difficult to get a clean archive
            ' document from their service.) This code attempts to
            ' clean up the non-compliant HTML footer.
           
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
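            If loc > 0 Then
                in_script = False
                ' Only write the closing tags if a following line exists
                If i < UBound(Elements) Then
                    i = i + 1
                    Elements(i) = "</BODY></HTML>"
                    i = i + 1
                    ' Blank every remaining line, including the last one
                    Do While i <= UBound(Elements)
                        Elements(i) = ""
                        i = i + 1
                    Loop
                End If
            End If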
        Else
            'in_script = true so look for endtag
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
            End If
        End If

    Next
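
The listing calls InsertBaseElement but does not show it. As a rough guess at what such a helper might do, here is a minimal sketch in Python (not the author's VB code). The name insert_base_element, the href parameter, and the example URL are all assumptions made for illustration; the only thing taken from the listing is that the helper receives the line and the 1-based position of the <HEAD> tag.

```python
def insert_base_element(line, loc, href):
    """Return line with a BASE element placed right after the <HEAD> tag.

    loc is the 1-based position of "<HEAD>" in line, as VB's InStr reports it.
    href is a hypothetical base URL; the article does not say where it comes from.
    """
    # Convert to a 0-based index that points just past the <HEAD> tag.
    head_end = (loc - 1) + len("<HEAD>")
    return line[:head_end] + '<BASE href="' + href + '">' + line[head_end:]


line = "<HTML><HEAD><TITLE>Saved page</TITLE>"
loc = line.upper().find("<HEAD>") + 1  # mimic InStr: 1-based, 0 when not found
if loc > 0:
    line = insert_base_element(line, loc, "http://example.com/")
print(line)
```

Placing the element immediately after the opening tag keeps it ahead of any relative links later in the head, which is presumably why the listing passes the tag position along.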

Using the Code

Insert your own methods to replace links or image tags, insert a table or footer, and so on. Leaving this wrapper intact protects the script sections and keeps your parser methods from misbehaving. I added the z value so the parser can process HTML that follows a </SCRIPT> tag on the same line (as can happen with NYTimes pages).
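
The wrapper idea can be sketched compactly in Python. This is an illustration under stated assumptions, not a port of the VB listing: process_lines and handler are hypothetical names, the tag matching uses regular expressions rather than InStr, and the pos variable plays the role that z plays above, marking where scanning resumes within a line.

```python
import re

SCRIPT_OPEN = re.compile(r"<script\b", re.IGNORECASE)
SCRIPT_CLOSE = re.compile(r"</script\s*>", re.IGNORECASE)


def process_lines(lines, handler):
    """Apply handler to text outside script elements; copy script text unchanged."""
    out = []
    in_script = False  # carried across lines, like in_script in the listing
    for line in lines:
        pos = 0        # resume point within the line, like z in the listing
        pieces = []
        while pos < len(line):
            if in_script:
                m = SCRIPT_CLOSE.search(line, pos)
                if m is None:
                    pieces.append(line[pos:])          # rest of line is script
                    pos = len(line)
                else:
                    pieces.append(line[pos:m.end()])   # up to and including </script>
                    pos = m.end()
                    in_script = False
            else:
                m = SCRIPT_OPEN.search(line, pos)
                if m is None:
                    pieces.append(handler(line[pos:]))  # rest of line is plain HTML
                    pos = len(line)
                else:
                    pieces.append(handler(line[pos:m.start()]))
                    pos = m.start()
                    in_script = True
        out.append("".join(pieces))
    return out


page = ['<p>a</p><script src="x.js">', 'var s = "<p>";', '</script><p>b</p>']
result = process_lines(page, str.upper)
```

Here str.upper stands in for a real parser method. Only the HTML outside the script element is changed, including the markup that follows the closing tag on the last line.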

I need to apologize for not providing a working demonstration. It's difficult to put out a useful demonstration at this time without disclosing too much about the BOWSER parse method. The wrapper is used in my BOWSER demonstration.

Interesting Points

It seems that providers are using complex structures to prevent commercial-quality archiving of their content. I have no problem handling the content of the average HTML website, but the NYT, with its dynamic content insertions, plays havoc with it, using techniques that cause my parser to either skip content or otherwise misbehave. Presently I'm adding code to process HTML found after the </SCRIPT> tags, in what feels like a cat-and-mouse game. The more sophisticated the parser becomes, the easier it will be to break.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
United States
Hubris is like armor. He is afraid to take on a Coder project for what it may do to him, both financially and to his health.

WEB Bowser is a hack that makes it easy to acquire internet content for archival. It's also possible to PROXY it, serving it to the internet as well. Given this capability, people will change the internet again.

I envision a programmable server that will acquire content for proxy. Users will then add proxy content to their own, making it available almost as if it were their own. We are already "information collectors." It's the opinions that are getting really hard to find.
