Hello everyone,

I have a small problem using a regular expression to get the text from an HTML <textarea>.
This is the HTML source that contains the information I would like to get.

HTML
<div id="description-parent" class="msg">
  <textarea id="description" class="text meninges" cols="43" rows="8" name="description" type="text" required="required">The Information starts here and continues through to the end of the end.


  Bla Bla Bla
  
Bla Bla Bla

  
 As you can see this informtion is not stored in any formatting.
 
 it is all just plaining text.

 
      Bla Bla Bla

  Bla Bla Bla

      
Lots and lots of information and this is the end.</textarea>
</div>


I can use regex to get values on one line but not the whole paragraph.
The text I need to get is all between:

HTML
name="description" type="text" required="required">

And
HTML
</textarea>


This is the current VB.NET code that I am experimenting with to try to get the text from the HTML source.

VB
Dim regex As New System.Text.RegularExpressions.Regex("<div id=""description-parent"" class=""msg"">.*") ' I cannot figure out what to place here
Dim matches As MatchCollection = regex.Matches(My.Computer.FileSystem.ReadAllText("D:\temp\source.html").ToString) ' This is the html source

For Each items In matches
    Try
        MessageBox.Show(items.ToString) ' Once i can place the information into a variable then i can work with it
    Catch ex As Exception
        MessageBox.Show("Error: " & ex.Message)
    End Try
Next


Any help or advice is much appreciated. I am sure I am just overlooking one thing.
Comments
ZurdoDev 19-Mar-13 21:36pm    
Are you saying that all you want is the text within the textarea? If so, just put a runat="server" on it and then access it via id.
tm9333 20-Mar-13 0:51am    
Yes, I am trying to get the human readable text that resides within the “textarea” of the html source.

Unfortunately, I am not using ASP.NET.
ZurdoDev 20-Mar-13 7:03am    
So, what happens if you use jQuery, for example, $('#mytextarea').val();

Trying to extract data from HTML with regular expressions is very common among beginners and, in most cases, is a methodological mistake. First of all, in the most usual case the HTML is well-formed XML. In that case, the .NET XML parsers should be used, and they are always available. Here is my short overview of them:

  1. Use the System.Xml.XmlDocument class. It implements the DOM interface; this is the easiest way, and it is good enough if the document is not too big.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx.
  2. Use the class System.Xml.XmlTextReader; this is the fastest way of reading, especially if you need to skip some data.
    See http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.aspx.
  3. Use the class System.Xml.Linq.XDocument; this approach is similar to XmlDocument but more suitable, and it supports LINQ to XML programming.
    See http://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.aspx, http://msdn.microsoft.com/en-us/library/bb387063.aspx.
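
As a rough illustration of the third option, the sketch below reads the text of the <textarea> with XDocument. It assumes the fragment is well-formed XML, as the one in the question appears to be. The module name and the shortened sample string are made up for the example, and a real page would be loaded from the file instead.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq
Imports Microsoft.VisualBasic

Module TextAreaXmlSketch
    Sub Main()
        ' Assumption: a shortened, well-formed copy of the markup from the question.
        Dim html As String =
            "<div id=""description-parent"" class=""msg"">" &
            "<textarea id=""description"" name=""description"">The Information starts here." & vbLf &
            "  Bla Bla Bla" & vbLf &
            "this is the end.</textarea></div>"

        ' Parse throws an XmlException if the markup is not well-formed XML.
        Dim doc As XDocument = XDocument.Parse(html)

        ' Find the first <textarea> element whose id attribute is "description".
        Dim area As XElement = doc.Descendants("textarea").
            FirstOrDefault(Function(e) CStr(e.Attribute("id")) = "description")

        If area IsNot Nothing Then
            ' Value returns the element's text, including the line breaks.
            Console.WriteLine(area.Value)
        End If
    End Sub
End Module
```

If the whole page is not well-formed, XDocument.Parse will fail, and a tolerant HTML parser is the better fit.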


In rarer cases, well-formed XML cannot be assumed. Even though such cases, so to speak, simply have no right to exist, in real life it happens. Then you still need to use an HTML parser that can deal with such input. I would advise trying this one: http://www.majestic12.co.uk/projects/html_parser.php.

You can try to find some more: http://bit.ly/15ZhBKr.
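
For completeness, here is why the pattern in the question stops at the end of the first line: by default "." in a .NET regular expression does not match a newline. The sketch below is not the parser approach recommended above. It is a narrow regex fallback, shown only under the assumption that the page contains exactly one <textarea> with id "description" and no nested markup. The module name and sample string are made up for the example.

```vb
Imports System
Imports System.Net
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module TextAreaRegexSketch
    Sub Main()
        ' Assumption: a shortened copy of the markup from the question.
        Dim html As String =
            "<textarea id=""description"" name=""description"">The Information starts here." & vbLf &
            "  Bla Bla Bla" & vbLf &
            "this is the end.</textarea>"

        ' (.*?) is lazy, so the capture stops at the first closing tag.
        Dim pattern As String = "<textarea\b[^>]*\bid=""description""[^>]*>(.*?)</textarea>"

        ' RegexOptions.Singleline makes "." match newline characters as well.
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.Singleline Or RegexOptions.IgnoreCase)

        If m.Success Then
            ' HtmlDecode turns entities such as &amp; back into plain characters.
            Dim text As String = WebUtility.HtmlDecode(m.Groups(1).Value)
            Console.WriteLine(text)
        End If
    End Sub
End Module
```

This only works while the markup keeps the exact shape the pattern expects, which is the main reason a parser is preferred.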

Good luck,
—SA
 
Comments
tm9333 20-Mar-13 0:45am    
Sorry I have still not got it yet. I do like to learn new things but I have been at this for a couple of days now with no results to show for it. I guess I must be having one of those weeks.

If you could spend a little time writing some VB.NET source code to help provide a solution, I would much appreciate it.
Sergey Alexandrovich Kryukov 20-Mar-13 1:26am    
Sorry, I suggest that you, as the more interested person, try to parse your XML and ask follow-up questions if you run into a problem.
You see, you already have all you need. The remaining part is just to do some work, and this is your work.
—SA
tm9333 20-Mar-13 1:50am    
Yes and I agree with you. People cannot learn by going to the back of the book.
I think I will set this part of the project aside for now and come back to it when I have a clear head.
Thank you for your replies and enjoy the rest of your day.
Sergey Alexandrovich Kryukov 20-Mar-13 2:31am    
Excellent decision.
Good luck,
—SA
 
Comments
tm9333 20-Mar-13 0:53am    
That made an excellent read. I still have not found a solution yet.
I am starting to think Regex is not the way to go. I do not know of any other ways to pull apart the html source that I need.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


