Trying to extract data from HTML with Regular Expressions is very common among beginners and, in most cases, is a methodological mistake. First of all, the most usual case is when the HTML is well-formed XML. In this case, the .NET XML parsers should be used, and they are always available. Here is my short review of them:
- Use the System.Xml.XmlDocument class. It implements the DOM interface; this way is the easiest and is good enough if the size of the document is not too big.
See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
- Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data.
See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
- Use the class System.Xml.Linq.XDocument; this is the most adequate way, similar to that of XmlDocument, but supporting LINQ to XML programming.
See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
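The three approaches above can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the class name LinkExtractor, the Sample string and the task itself (collecting the href values of all "a" elements) are my assumptions, chosen just to have something concrete to parse. For the reader-based variant, the sketch uses XmlReader.Create, which Microsoft recommends over instantiating XmlTextReader directly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class LinkExtractor
{
    // Hypothetical sample input: a small, well-formed fragment without namespaces.
    public const string Sample =
        "<html><body>" +
        "<p>See <a href=\"http://example.com/a\">A</a> and " +
        "<a href=\"http://example.com/b\">B</a>.</p>" +
        "</body></html>";

    // 1. DOM: load the whole document into memory, then query it with XPath.
    public static List<string> WithXmlDocument(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//a/@href"))
            result.Add(node.Value);
        return result;
    }

    // 2. Forward-only reader: nothing is kept in memory except what we collect.
    public static List<string> WithXmlReader(string xhtml)
    {
        var result = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "a")
                {
                    string href = reader.GetAttribute("href");
                    if (href != null)
                        result.Add(href);
                }
            }
        }
        return result;
    }

    // 3. LINQ to XML: in-memory tree, queried with LINQ operators.
    public static List<string> WithXDocument(string xhtml)
    {
        return XDocument.Parse(xhtml)
            .Descendants("a")
            .Attributes("href")
            .Select(a => a.Value)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", WithXmlDocument(Sample)));
        Console.WriteLine(string.Join(", ", WithXmlReader(Sample)));
        Console.WriteLine(string.Join(", ", WithXDocument(Sample)));
    }
}
```

Note that real XHTML usually declares the namespace http://www.w3.org/1999/xhtml on the root element. In that case the plain name "a" will not match in the XPath and LINQ variants, and you would need an XmlNamespaceManager or an XNamespace respectively.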
In rarer cases, well-formed XML cannot be assumed. Even though such cases, so to speak, simply have no right to exist, in real life it happens. Then you still need to use some HTML parser which can deal with such cases. I would advise you to try this one:
http://www.majestic12.co.uk/projects/html_parser.php
You can try to find some more:
http://bit.ly/15ZhBKr
Good luck,
—SA