Wrap your HTML parser to exclude scripting
Dec 26, 2003
8 min read
Win95, Win98, WinME, Win2K, WinXP, Win2003, VB6, VB, Javascript, Windows, .NET, Visual-Studio, Dev, Intermediate

by QUIETTA
Contributor
Introduction
Most parser-enabled Internet applications need to exclude script. This wrapper keeps script elements out of the parser's tests, which also protects the script from being tainted by the parser's edits. After the file is read, it is split into an array for line-by-line processing. If you are trying to suppress anomalies caused by IE, clear line 2 of a saved document (Elements(1) in the zero-based array) to keep IE from reasserting the original document object model. It WILL do that on fresh documents. Clearing the line forces IE to create a new model. Note that this is done in preparation for subsequent browser navigations, NOT for this parsing session.
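The listing that follows starts from a string s that already holds the saved page. As a minimal sketch of that loading step, the file can be read in one pass and its line endings normalized so that Split(s, vbCrLf) yields one element per line. ReadPage and NormalizeLineEnds are hypothetical helper names, not part of the original code, and the sketch assumes a single-byte (ANSI) file.

```vb
' Hypothetical helpers; the names are assumptions made for this sketch.

' Read a whole file into a string (assumes a single-byte ANSI file).
Public Function ReadPage(ByVal sPath As String) As String
    Dim f As Integer
    f = FreeFile
    Open sPath For Binary Access Read As #f
    ReadPage = Input$(LOF(f), f)
    Close #f
End Function

' Convert lone CR or LF line ends to CRLF so that Split(s, vbCrLf)
' also works on pages saved with other line-ending conventions.
Public Function NormalizeLineEnds(ByVal sText As String) As String
    sText = Replace(sText, vbCrLf, vbLf)
    sText = Replace(sText, vbCr, vbLf)
    NormalizeLineEnds = Replace(sText, vbLf, vbCrLf)
End Function
```

With these in place, the wrapper's input could be prepared as s = NormalizeLineEnds(ReadPage(sPath)), where sPath is whatever saved page you are archiving.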
    ' s, Page, myscriptsfolder, objDocument and ARCRoot are defined elsewhere.
    ' s holds the text of the saved page.
    Dim loc As Long, z As Long
    Elements = Split(s, vbCrLf)
    Elements(1) = ""            ' clear line 2 of the saved document
    in_script = False
    For i = 2 To UBound(Elements)
        z = 1
        If in_script = False Then
            loc = InStr(z, UCase(Elements(i)), "<SCRIPT ", vbBinaryCompare)
            If loc > 0 Then
                If InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare) > 0 Then
                    ' script opens and closes on this line
                    in_script = False
                    Elements(i) = Replace(Elements(i), Page & "_files/", myscriptsfolder)
                    z = loc + 8     ' continue searching after "<SCRIPT "
                Else
                    ' script continues on the following lines
                    Elements(i) = Replace(Elements(i), Page & "_files/", myscriptsfolder)
                    in_script = True
                End If
            End If
            '/////////////////////////////////////////////////
            ' ADD MORE PARSER METHODS HERE
            'insert basetag method calls InsertBaseElement method
            loc = InStr(z, Elements(i), "<HEAD>", vbBinaryCompare)
            If loc > 0 Then
                If (objDocument.getElementsByTagName("BASE").length = 0) Then
                    Elements(i) = InsertBaseElement(Elements(i), loc)
                Else
                    Elements(i) = Replace(Elements(i), s, ARCRoot)
                End If
            End If
            '/////////////////////////////////////////////////
            DoEvents
            '/////////////////////////////////////////////////
            'This code can be modified to suit special
            'requirements.
            'It is useful for chopping off a
            'document with dynamic footer content
            'written by script methods.
            '(Coders may be trying to ensure some kind of difficulty getting a
            ' clean archive document from their service.) This code attempts
            ' to clean up the non-compliant HTML footer.
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
                ' guard added: avoids a subscript error when the tag is on the last line
                If i < UBound(Elements) Then
                    i = i + 1
                    Elements(i) = "</BODY></HTML>"
                    i = i + 1
                    ' blank the remaining lines (the final element is left as it was)
                    Do While i < UBound(Elements)
                        Elements(i) = ""
                        i = i + 1
                    Loop
                End If
            End If
        Else
            'in_script = true so look for endtag
            loc = InStr(z, UCase(Elements(i)), "</SCRIPT>", vbBinaryCompare)
            If loc > 0 Then
                in_script = False
            End If
        End If
    Next
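The listing calls InsertBaseElement, but the article does not show it. What follows is only a guess at what such a helper might look like: it inserts a BASE element immediately after the <HEAD> tag found at position loc. BaseHref is an assumed module-level string holding the base URL for the archived page; neither it nor this function body comes from the original code.

```vb
' Hypothetical sketch; the original InsertBaseElement is not published.
' BaseHref is an assumed module-level variable set by the caller.
Private BaseHref As String

' sLine : the line that contains the <HEAD> tag
' loc   : 1-based position of "<HEAD>" within sLine
Private Function InsertBaseElement(ByVal sLine As String, ByVal loc As Long) As String
    Const HEAD_TAG As String = "<HEAD>"
    Dim insertAt As Long
    insertAt = loc + Len(HEAD_TAG)      ' first character after <HEAD>
    InsertBaseElement = Left$(sLine, insertAt - 1) & _
                        "<BASE href=""" & BaseHref & """>" & _
                        Mid$(sLine, insertAt)
End Function
```

Because the wrapper searches for "<HEAD>" with a binary compare and without UCase, this sketch would only be reached for pages that write the tag in upper case with no attributes.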
Using the Code
Insert your own methods to replace links or image tags, insert a table or footer, and so on. Leaving this wrapper intact protects the script sections and also prevents the parser method from misbehaving. I added the z value so that the parser can process HTML that appears on the same line after the found </SCRIPT> tag (as is possible with NYTimes pages).
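As an illustration of a method that respects z, here is a minimal sketch of two helpers: one finds the position just past a </SCRIPT> tag, and the other applies a replacement only from a given position onward. Both names are hypothetical and not part of the original wrapper. Note that VB6's Replace function has a Start argument, but the string it returns begins at Start, so the untouched prefix has to be re-attached by hand.

```vb
' Hypothetical helpers for custom parser methods; names are assumptions.

' Position of the first character after the first </SCRIPT> found at or
' after startPos, or 0 if the line has no end tag from there on.
Private Function AfterScriptEnd(ByVal sLine As String, ByVal startPos As Long) As Long
    Const END_TAG As String = "</SCRIPT>"
    Dim p As Long
    If startPos < 1 Then startPos = 1
    p = InStr(startPos, UCase$(sLine), END_TAG, vbBinaryCompare)
    If p > 0 Then AfterScriptEnd = p + Len(END_TAG)
End Function

' Replace sFind with sNew only in the part of sLine that starts at z.
' Everything before z (for example, script text) is copied unchanged.
Private Function ReplaceFrom(ByVal sLine As String, ByVal z As Long, _
                             ByVal sFind As String, ByVal sNew As String) As String
    If z < 1 Then z = 1
    ReplaceFrom = Left$(sLine, z - 1) & Replace(Mid$(sLine, z), sFind, sNew)
End Function
```

A custom method placed in the "ADD MORE PARSER METHODS HERE" section could then call ReplaceFrom(Elements(i), z, ...) so that its edits never reach back into text before z.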
I need to apologize for not providing a working demonstration. It's difficult to put together a useful demonstration at this time without revealing too much about the BOWSER parse method. The wrapper is used in my BOWSER demonstration.
Interesting Points
It seems that providers are using complex structures to prevent commercial-quality archiving of their content. I have no problem handling the content of the average HTML website, but the NYT, with its dynamic content insertions, plays havoc with my parser, using techniques that cause it to either skip content or otherwise misbehave. Presently I'm adding code to process HTML found after the </SCRIPT> tags, in what amounts to a cat-and-mouse game. The more sophisticated the parser becomes, the easier it will be to break.
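One possible way to handle HTML that follows a </SCRIPT> tag on the same line is to walk each line segment by segment rather than testing it once. The sketch below is not the BOWSER parser; FilterLine and ProcessHtml are hypothetical names, and ProcessHtml is a stub standing in for whatever link or image rewriting you need. It matches "<SCRIPT" without the trailing space used in the wrapper, which is an assumption made so that a bare <SCRIPT> tag is also caught.

```vb
' Hypothetical stub: replace the body with your own HTML rewriting.
Private Function ProcessHtml(ByVal sHtml As String) As String
    ProcessHtml = sHtml
End Function

' Returns sLine with every stretch of non-script HTML passed through
' ProcessHtml; script text is copied unchanged. inScript carries the
' open/closed state from one line to the next.
Private Function FilterLine(ByVal sLine As String, ByRef inScript As Boolean) As String
    Const END_TAG As String = "</SCRIPT>"
    Dim pos As Long, hit As Long
    Dim upperLine As String, result As String

    upperLine = UCase$(sLine)
    pos = 1
    Do While pos <= Len(sLine)
        If inScript Then
            hit = InStr(pos, upperLine, END_TAG, vbBinaryCompare)
            If hit = 0 Then
                result = result & Mid$(sLine, pos)        ' rest of line is script
                Exit Do
            End If
            hit = hit + Len(END_TAG)                      ' keep the end tag with the script
            result = result & Mid$(sLine, pos, hit - pos)
            inScript = False
        Else
            hit = InStr(pos, upperLine, "<SCRIPT", vbBinaryCompare)
            If hit = 0 Then
                result = result & ProcessHtml(Mid$(sLine, pos))
                Exit Do
            End If
            result = result & ProcessHtml(Mid$(sLine, pos, hit - pos))
            inScript = True
        End If
        pos = hit
    Loop
    FilterLine = result
End Function
```

Called as Elements(i) = FilterLine(Elements(i), in_script) inside a plain For loop, this would process any number of script blocks per line, at the cost of giving up the wrapper's site-specific footer chopping.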
License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).