OK, got it.
Instead of trying to match the tags, match the pattern of the data itself.
Input string:
<div class="floated-field-value">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>
Regex 1 used:
((\d{3}):(\w*)\+(\w*):(\w*))
Regex 2 used:
((\d{3}):(\w*):(\w*)|(\d{3}):(\w*)\+(\w*):(\w*))
Regex 3 used:
((\d*):(\w*):(\w*)|(\d*):(\w*)\+(\w*):(\w*))
Output:
768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo
The outer pair of "()" captures the whole match as one group. Not sure if it is needed when parsing a site or not, since most regex engines return the whole match anyway.
"(\d{3})" looks for exactly three digits
":" a literal colon next
"(\w*)" a run of word characters (letters, digits, underscore) of any length
"\+" escapes the plus, so it looks for a literal plus sign next
"(\w*)" another run of word characters of any length
":" a literal colon next
"(\w*)" the last word to extract
That's it. Like I said, I'm not sure how it would work on a real site.
It should work as long as all data values contain a "+". Otherwise it would need to be modified to handle that case, for example with an "or" alternative that doesn't use the "+" but keeps almost everything else the same.
It does work in a small test app.
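For anyone who wants to reproduce that, here is a minimal sketch of such a test in Python. The language and the variable names are my own choice for illustration, not the test app I actually used; the input string and the first regex are the ones shown above.

```python
import re

# Input string from above: the hash wrapped in its <div> tag.
html = ('<div class="floated-field-value">'
        '768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>')

# First regex: three digits, ":", word, "+", word, ":", word.
regex1 = re.compile(r'((\d{3}):(\w*)\+(\w*):(\w*))')

# search() scans the whole string, so the surrounding tags are skipped.
m = regex1.search(html)
if m:
    print(m.group(1))  # whole pattern (outer parentheses)
    print(m.group(2))  # the three leading digits
    print(m.group(3))  # word before the "+"
    print(m.group(4))  # word after the "+"
    print(m.group(5))  # last word
```

Here m.group(0) and m.group(1) hold the same text, because the outer parentheses wrap the whole pattern.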
I hope this is not your homework :)
EDIT:
After looking up what SSDEEP is, I tested the other two regexes I added.
The second one matches whether the "+" is there or not.
The third one: after a review of SSDEEP, the first section could be longer than three characters, so I changed it to match digits of any length.
As best I can tell, the outer "()" are there to capture the entire pattern as a single group.
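As a rough check of the third regex, this Python sketch runs it against the input from above and against a value without a "+". The second value and the helper name extract_hash are made up for illustration; they are not real SSDEEP output or part of any library.

```python
import re

# Third regex: either "digits:word:word" or "digits:word+word:word".
regex3 = re.compile(r'((\d*):(\w*):(\w*)|(\d*):(\w*)\+(\w*):(\w*))')

def extract_hash(text):
    """Return the first match of the third regex in text, or None."""
    m = regex3.search(text)
    # Group 1 is the outer pair of parentheses, i.e. the whole pattern.
    return m.group(1) if m else None

with_plus = ('<div class="floated-field-value">'
             '768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>')
# Made-up value without a "+", only to exercise the first alternative.
without_plus = '<div class="floated-field-value">96:abcDEF123:xyz789</div>'

print(extract_hash(with_plus))
print(extract_hash(without_plus))
print(extract_hash('no hash here'))
```

One thing to watch: "\d*" and "\w*" also accept empty strings, so a stray "::" somewhere in the page would match too. Using "\d+" and "\w+" would be stricter if that turns out to be a problem.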