Regex a html

Question

Hello everyone,

I have a small problem using a regular expression to get the text from an HTML <textarea>.
This is the HTML source that contains the information I would like to get.

HTML

<div id="description-parent" class="msg">
  <textarea id="description" class="text meninges" cols="43" rows="8" name="description" type="text" required="required">The Information starts here and continues through to the end of the end.


  Bla Bla Bla
  
Bla Bla Bla

  
 As you can see this informtion is not stored in any formatting.
 
 it is all just plaining text.

 
      Bla Bla Bla

  Bla Bla Bla

      
Lots and lots of information and this is the end.</textarea>
</div>

I can use regex to get values on one line but not the whole paragraph.
The text I need to get is all between:

HTML

name="description" type="text" required="required">

And

HTML

</textarea>

This is the current VB.NET code that I am experimenting with to try to get the text from the HTML source.

VB

Dim regex As New System.Text.RegularExpressions.Regex("<div id=""description-parent"" class=""msg"">.*") ' I cannot figure out what to place here
Dim matches As MatchCollection = regex.Matches(My.Computer.FileSystem.ReadAllText("D:\temp\source.html").ToString) ' This is the html source

For Each items In matches
    Try
        MessageBox.Show(items.ToString) ' Once i can place the information into a variable then i can work with it
    Catch ex As Exception
        MessageBox.Show("Error: " & ex.Message)
    End Try
Next

Any help or advice is much appreciated. I am sure I am just overlooking one thing.

Posted 19-Mar-13 14:41pm

tm9333

Comments

ZurdoDev 19-Mar-13 21:36pm

Are you saying that all you want is the text within the textarea? If so, just put a runat="server" on it and then access it via id.

tm9333 20-Mar-13 0:51am

Yes, I am trying to get the human readable text that resides within the “textarea” of the html source.

Unfortunately, I am not using ASP.NET.

ZurdoDev 20-Mar-13 7:03am

So what happens if you use jQuery, for example $('#mytextarea').val();

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2013-03-19T15:57:00

Trying to apply regular expressions to extract data from HTML is very common among beginners and, in most cases, is a methodological mistake. First of all, the most usual case is that the HTML is well-formed XML. In that case, the .NET XML parsers should be used, and they are always available. Here is my short review of them:

Use the System.Xml.XmlDocument class. It implements the DOM interface; this is the easiest way and is good enough if the document is not too big.
See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Use the System.Xml.XmlTextReader class; this is the fastest way of reading, especially if you need to skip some data.
See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx
Use the System.Xml.Linq.XDocument class; this is the most adequate way, similar to that of XmlDocument, and it supports LINQ to XML programming.
See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx
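
As a rough illustration of the XDocument route, here is a minimal sketch. It assumes the markup really is well-formed XML with no default namespace (an XHTML page that declares one would need an XNamespace in the element lookup), and the module name, the function name GetTextAreaText and the shortened sample markup are illustrative only, not taken from the question.

```vb
Imports System
Imports System.Xml.Linq

Module TextAreaFromXml
    ' Returns the text of the first <textarea> whose id attribute equals the
    ' given id, or Nothing when there is no such element.
    ' XDocument.Parse throws an XmlException if the markup is not well-formed XML.
    Function GetTextAreaText(markup As String, id As String) As String
        Dim doc As XDocument = XDocument.Parse(markup)
        For Each area As XElement In doc.Descendants("textarea")
            Dim idAttr As XAttribute = area.Attribute("id")
            If idAttr IsNot Nothing AndAlso idAttr.Value = id Then
                ' Value is the element's text content, line breaks included.
                Return area.Value
            End If
        Next
        Return Nothing
    End Function

    Sub Main()
        ' Shortened stand-in for the markup in the question (assumption).
        Dim markup As String =
            "<div id=""description-parent"" class=""msg"">" & vbLf &
            "  <textarea id=""description"" name=""description"">First line" & vbLf &
            vbLf &
            "  Bla Bla Bla" & vbLf &
            "Last line.</textarea>" & vbLf &
            "</div>"
        Console.WriteLine(GetTextAreaText(markup, "description"))
    End Sub
End Module
```

If XDocument.Parse throws on the real page (an unclosed tag, or an entity such as &nbsp; that XML does not define), the markup is not well-formed XML and this approach does not apply as written.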

In rarer cases, well-formed XML cannot be assumed. Even though such cases, so to speak, simply have no right to exist, in real life it happens. Then you still need to use an HTML parser that can deal with such cases. I would advise trying this one: http://www.majestic12.co.uk/projects/html_parser.php

You can try to find some more: http://bit.ly/15ZhBKr

Good luck,

—SA

PIEBALDconsult · Answer 2 · 2013-03-19T16:13:00

Please don't do that.

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Comments

tm9333 20-Mar-13 0:53am

That made an excellent read, but I still have not found a solution.
I am starting to think regex is not the way to go, but I do not know of any other way to pull apart the HTML source that I need.
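
Both answers advise against regex here, and a parser is the more robust choice. For the narrow case in the question, though, the likely reason the original pattern matches only one line is that `.` does not match a line break unless RegexOptions.Singleline is set. The following is a minimal sketch under that assumption; the module name, the function name GetDescriptionText and the shortened sample markup are illustrative only, and the pattern assumes a single textarea with id="description" written with double quotes.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TextAreaFromRegex
    ' Returns the raw text between the opening <textarea ... id="description" ...>
    ' tag and the next </textarea>, or Nothing when the pattern does not match.
    Function GetDescriptionText(html As String) As String
        ' [^>]* skips the other attributes; (.*?) lazily captures the content.
        Dim pattern As String = "<textarea\b[^>]*\sid=""description""[^>]*>(.*?)</textarea>"
        ' Singleline makes "." match line breaks as well, so the capture can span lines.
        Dim options As RegexOptions = RegexOptions.Singleline Or RegexOptions.IgnoreCase
        Dim m As Match = Regex.Match(html, pattern, options)
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        ' Shortened stand-in for the markup in the question (assumption).
        Dim html As String =
            "<div id=""description-parent"" class=""msg"">" & vbLf &
            "  <textarea id=""description"" class=""text"" name=""description"">First line" & vbLf &
            vbLf &
            "  Bla Bla Bla" & vbLf &
            "Last line.</textarea>" & vbLf &
            "</div>"
        Console.WriteLine(GetDescriptionText(html))
    End Sub
End Module
```

The captured text is still HTML-encoded (for example &amp;amp; for an ampersand), so it may need a pass through System.Net.WebUtility.HtmlDecode. The pattern will also break if the attribute quoting or the tag layout changes, which is the fragility the answers above warn about.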
