Remove all the HTML tags and display a plain text only inside (in case XML is not well formed)

18 Jan 2011 | CPOL | 2 min read
Sorry, but I have to vote this way down. Your regular expression (or @Chris's) is not robust enough for what I would consider "real world" data. If it is used on any kind of public website, I would be concerned about JavaScript injection attacks and similar problems, depending on how the output is used. Here is a quick example of completely valid HTML where your regular expression fails:
<b title="test > fail"><i>The tag is about to be removed</i></b>

Applying your regular expression results in:
fail">The tag is about to be removed
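
The failure can be reproduced with a short sketch. The pattern being criticized is not quoted in this tip, so the example assumes a typical naive pattern, `<[^>]*>`, as a stand-in; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class NaiveStripDemo
{
    // Assumed stand-in for the criticized pattern: everything from '<' to the next '>'.
    const string NaiveTagPattern = "<[^>]*>";

    public static string StripNaive(string html)
    {
        return Regex.Replace(html, NaiveTagPattern, string.Empty);
    }

    static void Main()
    {
        string html = "<b title=\"test > fail\"><i>The tag is about to be removed</i></b>";
        // The '>' inside the quoted title value ends the first match too early,
        // so the tail of the attribute leaks into the output.
        Console.WriteLine(StripNaive(html).Trim());
        // prints: fail">The tag is about to be removed
    }
}
```

The `Trim()` call only removes the leading space that is left behind after the first, too-short match.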

You may argue that it did in fact remove the tags, but the leftover attribute text is exactly why I don't think it is safe to use in most cases. Here are the comment and the two regular expressions that we use.


Taken from http://haacked.com/archive/2005/04/22/Matching_HTML_With_Regex.aspx and slightly modified. I changed the first "\w" to "\S" so that tags like <namespace:tagname> will be found, because the colon character is not part of "\w". While "\S" may be a bit overboard, I'm fine with that. I did the same thing with the second "\w", which matches an attribute name, but put the "\S" inside a character class subtraction, "[\S-[=]]", so that the "=" that separates the attribute name from its value is not eaten up. Each of these is matched one or more times, as few times as possible, which is what the "+?" after them means.
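
As a rough illustration of why the switch from "\w" to "\S" matters, the sketch below tests both name patterns against a namespaced tag and pulls out an attribute name with the subtraction class. The helper names, the anchors, and the `(?==)` lookahead are assumptions added for the demonstration; they are not part of the patterns used in this tip.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagNameDemo
{
    // True when the whole string is "<name>" and the name matches namePattern.
    public static bool IsWholeTag(string tag, string namePattern)
    {
        return Regex.IsMatch(tag, "^<" + namePattern + ">$");
    }

    // Returns the shortest run of non-space, non-'=' characters that is followed by '='.
    public static string AttributeName(string attribute)
    {
        return Regex.Match(attribute, @"^[\S-[=]]+?(?==)").Value;
    }

    static void Main()
    {
        const string tag = "<namespace:tagname>";
        Console.WriteLine(IsWholeTag(tag, @"\w+"));   // prints: False (':' is not a word character)
        Console.WriteLine(IsWholeTag(tag, @"\S+?"));  // prints: True
        Console.WriteLine(AttributeName("xmlns:ns=\"urn:example\"")); // prints: xmlns:ns
    }
}
```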

Also, when pasting markup from Microsoft Word, it will include funny comment sections of the form:


<!--[if !mso]> st1\:*{behavior:url(#ieooui) } <![endif]-->

Because there is a '>' soon after the opening comment delimiter, the AllHtmlTags pattern will match just the "<!--[if !mso]>" part. That part is treated as an unrecognized tag and removed, as is the closing "<![endif]-->", but the content in the middle is left alone. After submitting, the user is then left with meaningless text scattered throughout their description. To prevent this, we could parse the entire HTML without regular expressions, or we can use another regular expression to match and remove all comments first.

You can use the "Expresso" regular expression tool to help analyze and explain these expressions. Remember, some of the quotation marks are doubled to escape them inside the C# verbatim string; the doubling is not part of the regular expression itself.
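
To make the quoting point concrete, here is a minimal sketch that writes one fragment of the pattern both as a verbatim string and as a regular escaped string, then shows that the two are identical. The class and constant names are hypothetical.

```csharp
using System;

static class VerbatimQuoteDemo
{
    // In a verbatim (@"...") string, "" stands for one literal quote character.
    public const string Verbatim = @""".*?""|'.*?'";

    // The same text written as a regular C# string with backslash escapes.
    public const string Regular = "\".*?\"|'.*?'";

    static void Main()
    {
        Console.WriteLine(Verbatim);            // prints: ".*?"|'.*?'
        Console.WriteLine(Verbatim == Regular); // prints: True
    }
}
```

The printed text is what should be pasted into a regular expression tool, not the doubled form from the C# source.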


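// Matches one opening, closing, or self-closing tag, including quoted attribute values that contain '>'.
const string AllHtmlTagsPattern = @"</?\S+?((\s+[\S-[=]]+?(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
// Matches comment sections from "<!" through "-->", such as the conditional comments pasted from Word.
const string CommentTagsPattern = @"<![\s\S]*?--[\s]*>";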

Hopefully the above code section is readable enough.
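
One way the two patterns might be combined is sketched below: comments are removed first, then the remaining tags. The `HtmlStripper` class and `StripHtml` method are hypothetical names chosen for illustration, and the `<p>Hello</p>` input is made up; only the two pattern strings come from this tip.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlStripper
{
    const string AllHtmlTagsPattern = @"</?\S+?((\s+[\S-[=]]+?(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
    const string CommentTagsPattern = @"<![\s\S]*?--[\s]*>";

    // Removes comment sections first so their inner text does not survive, then removes tags.
    public static string StripHtml(string html)
    {
        string withoutComments = Regex.Replace(html, CommentTagsPattern, string.Empty);
        return Regex.Replace(withoutComments, AllHtmlTagsPattern, string.Empty);
    }

    static void Main()
    {
        Console.WriteLine(StripHtml("<b title=\"test > fail\"><i>The tag is about to be removed</i></b>"));
        // prints: The tag is about to be removed

        string word = @"<!--[if !mso]> st1\:*{behavior:url(#ieooui) } <![endif]--><p>Hello</p>";
        Console.WriteLine(StripHtml(word));
        // prints: Hello
    }
}
```

Running only the tag pattern on the second input would leave the `st1\:*{behavior:url(#ieooui) }` text behind, which is why the comment pattern runs first. This is still pattern matching, not sanitization; output destined for a web page should also be encoded.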

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
Re: Oops. Didn't read the text properly, I thought that really _... (Indivara, 18-Jan-11 13:57)
Good alternative (Nigam Patel, 19-Dec-11 17:34)
Updated alternate, check whether it is what you intended to ... (Indivara, 17-Jan-11 15:46)
Re: Thanks, Indivara. That is the correct string for the AllHtm... (KevinAG, 18-Jan-11 6:26)
Damn, OK, I will try escaping out the whole string myself. ... (KevinAG, 17-Jan-11 14:48)
Well, apparently CodeProject's parsing ate my code. The reg... (KevinAG, 17-Jan-11 14:46)
