|
I thought this would be simple, but I haven't been able to come up with a regular expression for the following.
I want to extract the words that do not match a particular pattern.
Take the sentence below:
clicking the (Run Match) button (or F5) to see what (happens).
I want to extract all the words that are not inside parentheses (), so the output will be:
clicking
the
button
to
see
what
Below is the expression I defined, but it is not working. Can anyone point out the mistake in the expression?
(?!\(\w+\))
modified 1-Mar-15 0:33am.
|
|
|
|
|
You haven't mentioned which language you're using. Different languages have different implementations of Regular Expressions, which support different features.
The following should work in .NET:
(?<!\()\b\w+\b(?!\))
This uses zero-width negative assertions to ensure that the word is not immediately preceded by an opening parenthesis, or immediately followed by a closing parenthesis.
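For anyone who wants to try the pattern outside .NET, here is a minimal Python sketch. It assumes Python's built-in `re` module, which also supports these lookarounds; the function name `words_outside_parens` is made up for illustration. It also shows a limitation worth knowing about: only words that touch a parenthesis are rejected, so the middle word of a three-word group still gets through.

```python
import re

# Same pattern as above: a whole word that is not directly preceded
# by "(" and not directly followed by ")".
PATTERN = re.compile(r"(?<!\()\b\w+\b(?!\))")

def words_outside_parens(text):
    """Return the words the pattern accepts (hypothetical helper)."""
    return PATTERN.findall(text)

sentence = "clicking the (Run Match) button (or F5) to see what (happens)."
print(words_outside_parens(sentence))

# Limitation: "two" touches neither parenthesis, so it is still matched.
print(words_outside_parens("press (one two three) now"))
```

So the pattern is fine for the sample sentence, but it may not be enough if a parenthesized group can hold more than two words.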
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
How about if we just drop all the words in (parens) and display what's left?

#!/usr/bin/perl
use strict;
use warnings;

my ($ins, $outs);

$ins = "clicking the (Run Match) button (or F5) to see what (happens). ";
print "\n";
$outs = $ins;
$outs =~ s/\(.+?\)//g;    # Drop every parenthesized group (non-greedy).
$outs = undupespace($outs);
print "$outs\n";

exit; # Exit main pgm.
###################################################################
sub undupespace
# Remove dupe spaces. Max 1 consecutive space.
{
    my ($l) = @_;

    $l =~ s/ {2,}/ /g;
    return $l;
} # undupespace

Output:
clicking the button to see what .
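The same "drop the parens first" idea can be taken one step further to get the word list the original question asked for. This is only a sketch in Python rather than Perl, and the helper name `words_after_dropping_parens` is hypothetical:

```python
import re

def words_after_dropping_parens(text):
    """Remove every "(...)" group, then return the remaining words."""
    # Replace each group with a space so the neighbouring words do not fuse.
    stripped = re.sub(r"\([^)]*\)", " ", text)
    return re.findall(r"\w+", stripped)

sentence = "clicking the (Run Match) button (or F5) to see what (happens). "
print(words_after_dropping_parens(sentence))
```

Because the whole group is removed before the words are collected, it does not matter how many words sit inside the parentheses. Nested parentheses are not handled.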
|
|
|
|
|
Hi,
Now I need a pattern to detect last name possibilities. I think this pattern will be slightly more complicated. Names that I see in the database are like:
Jones
Jones-Smith
Jones Smith (no hyphen)
O'Leary
Van Allen (no hyphen)
Vander Ark (no hyphen)
I think that this pattern will work but I would like public opinion to make sure I am getting this right:
^[a-zA-Z\-\s']+$
Can you think of any last names where this will not work? In testing it seems to work out alright.
Thanks,
Rob
|
|
|
|
|
Don't forget the characters that include diacritical marks.
E.g., ö Å ç
A positive attitude may not solve every problem, but it will annoy enough people to be worth the effort.
|
|
|
|
|
Is there a way to check for that without having to list every Unicode character? I didn't see any accented names in our database but that certainly doesn't mean it can't happen in the future.
I'd prefer to not include all Unicode characters. Just the ones with a high likelihood of showing up. I imagine that it could only be characters that would be accepted by Active Directory.
|
|
|
|
|
At least with the .NET Regex (I don't know about others), you can specify the Unicode character category (for "Letter"):
http://msdn.microsoft.com/en-us/library/20bw873z(v=vs.110).aspx#CategoryOrBlock
So your regex would be:
^[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\-\s']+$
possibly even just
^[\p{L}\-\s']+$
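For comparison, here is a rough Python sketch of the same idea. Note the assumption: Python's built-in `re` module does not support `\p{L}`, so the class `[^\W\d_]` (a word character that is neither a digit nor an underscore) is swapped in as an approximation of "any Unicode letter". The names `LAST_NAME` and `is_valid_last_name` are invented for this example.

```python
import re

# Approximation of ^[\p{L}\-\s']+$ for Python's re module:
# [^\W\d_] stands in for \p{L}, which re does not understand.
LAST_NAME = re.compile(r"^(?:[^\W\d_]|[-\s'])+$")

def is_valid_last_name(name):
    """True if the name holds only letters, hyphens, whitespace, or apostrophes."""
    return LAST_NAME.match(name) is not None

for candidate in ["Jones-Smith", "O'Leary", "Van Allen", "M\u00fcller", "Jones1"]:
    print(candidate, is_valid_last_name(candidate))
```

This treats precomposed accented letters such as ü as letters. Names typed with separate combining accent marks would probably need extra handling.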
|
|
|
|
|
After looking at that link, a person could go crazy trying to catch every possibility. Looks like regex can be very thorough!
Thanks for the help!
|
|
|
|
|
Yes!!
There's a reason the "Mastering Regular Expressions" book is 496 pages!!!
|
|
|
|
|
How would you allow a period only at the end of the string, for the case where a name ends in Jr. or Sr.? A period wouldn't normally appear in any other position in a last name. I'm going with the pattern below so far. I'm double checking names in Active Directory but I'm reasonably sure you can't use diacritical characters. I need to research that to be certain.
^[a-zA-Z\-\s']+$
|
|
|
|
|
^[a-zA-Z\-\s']+\.$
Add the \. right before the $
|
|
|
|
|
That works perfect. I'm really starting to get the hang of this.
|
|
|
|
|
Check out the Expresso tool (free) to explore regular expressions!
|
|
|
|
|
Right, that is actually the tool I'm using. I bumped into it a couple of years ago but this is the first time I ever used regex.
|
|
|
|
|
Well, this pattern was working yesterday on a different computer at work. I installed Expresso on my personal computer so I could work on my project over the weekend and now the pattern is not working.
^[a-zA-Z\-\s']+\.$
john1 = no matches
The pattern should match the number one because numbers are not allowed but the results are blank when I run this pattern. I could have sworn that this was working yesterday.
EDIT:
I did some further testing and discovered that the \. is breaking the pattern. If there is no period at the end, then count = 0. This pattern seems to require the period at the end, and then it works correctly. The period should be allowed 0 or 1 times at the end of the string.
So the pattern below is working the way I want it to in Expresso but not when I use it in an HTA using vbscript to do the pattern matching. Vbscript is throwing an error at the line where the pattern is executed.
^[a-zA-Z\-\s']+?\.$
Not sure how to make a pattern that works in Expresso to also work with vbscript.
SOLUTION:
^[a-zA-Z\-\s']+?\.$ This pattern works when testing in Expresso but doesn't work with vbscript although this may work when used with other languages.
^[a-zA-Z\-\s']+\.{0,1}$ This is the pattern that behaves the same way as the pattern above but also works with vbscript.
MATCHES:
Jones
Jones-Smith
Jones Smith (no hyphen)
O'Leary
Van Allen (no hyphen)
Vander Ark (no hyphen)
Jones Sr.
Although this doesn't address diacritical characters, a few conversations with colleagues resulted in the decision that the risk is very low that they will be used in Active Directory. We currently have only 3 techs making entries into AD so informing them of how this pattern works will reduce the risk even further. I have worked for my organization for 14 years and no diacritical characters have been used until now so I feel pretty safe in not testing for them. It may not be the ultimate approach such as selling a product to the public but it does meet the needs of the specifications that were given to me.
Thank you! - I'd like to give a shout out to everyone who helped me out with this project! I really appreciate all of you taking the time to steer me in the right direction! I would go as far as to say that CodeProject could be just as valuable as sitting in any classroom. You may not get a certification here but the knowledge gained is invaluable. I was able to gain a solid understanding of regex in a matter of a few hours. I watched several videos but I would say this forum helped out the most because it specifically dealt with the solution that I was attempting to resolve.
modified 12-Oct-14 10:25am.
|
|
|
|
|
robwm1 wrote: ^[a-zA-Z\-\s']+?\.$
This was so.... close.
When I suggested the \. I forgot the conditional aspect of the dot at the end. (Sorry.)
Just move the ? to be after the \.
^[a-zA-Z\-\s']+\.?$
The ? means exactly the same thing as {0,1}
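To see why the position of the ? matters, here is a small sketch that runs the variants from this thread side by side. It uses Python's `re` module as a stand-in, so Expresso or VBScript may behave differently in edge cases; the constant names and the `matches` helper are made up for the example.

```python
import re

REQUIRED_DOT = re.compile(r"^[a-zA-Z\-\s']+\.$")       # period is mandatory
LAZY_PLUS = re.compile(r"^[a-zA-Z\-\s']+?\.$")         # +? only makes + lazy; period still mandatory
OPTIONAL_DOT = re.compile(r"^[a-zA-Z\-\s']+\.?$")      # period allowed 0 or 1 times
BRACES = re.compile(r"^[a-zA-Z\-\s']+\.{0,1}$")        # same meaning as \.?

def matches(pattern, name):
    """True if the compiled pattern matches the name (hypothetical helper)."""
    return pattern.search(name) is not None

for name in ["Jones", "Jones Sr.", "john1"]:
    print(name,
          matches(REQUIRED_DOT, name),
          matches(LAZY_PLUS, name),
          matches(OPTIONAL_DOT, name),
          matches(BRACES, name))
```

A ? placed directly after + changes how greedy the + is, not whether the following \. is required. Only a ? (or {0,1}) placed after the \. makes the period optional.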
|
|
|
|
|
I never thought to move the ? to the end. You're right though, it is the same result as {0,1}.
Thanks again!
|
|
|
|
|
I'd be awfully surprised if the only characters allowed in Active Directory worldwide are the basic ASCII-ish letters.
|
|
|
|
|
I know we have at least one person who has an accented 'e' in their last name, but it's not that way in Active Directory. I don't know if that is because the person making the entry didn't know how to type the accented character or because it was disallowed. I'll definitely research to be sure before I make a final decision to leave it out. I will post my findings here.
|
|
|
|
|
Hi,
I created an HTA that requires First Name, Last Name, and username to be entered. I am working on the First Name validation first.
The First Name should only be alpha characters but may include a hyphen. No numbers or symbols (besides hyphen) should be found in any position of the string being tested. I did find one user with a hyphen in the first name though so I need to allow that symbol. My approach has been to look for matches that are not alpha characters. If there is a match, I display a warning that tells the user to enter only alpha characters. Here is the regex pattern that I am testing:
[^a-zA-Z]+$
When I test this pattern, it is unable to detect a number or symbol (including hyphen) if it is in any position other than the end of the string. The pattern I posted here doesn't allow for a hyphen so I need to fix that as well.
What should this regex pattern look like if I want to detect anything other than an alpha character regardless of where it occurs in the string being tested?
Thanks,
Rob
modified 9-Oct-14 19:04pm.
|
|
|
|
|
You can try this
^[^a-zA-Z\-]+$
The hyphen has a special meaning inside a character class (it defines a range such as a-z), so you need to escape it with \-
This is a pretty good site to learn about regex: Regular-Expressions.info
|
|
|
|
|
That's only going to detect a string that contains nothing but the disallowed characters.
For example, "1.2" will match, but "1.2a" will not.
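A quick way to check this is sketched below, using Python's `re` module as an assumed stand-in for whatever engine the HTA uses. The names `ONLY_DISALLOWED` and `is_all_disallowed` are made up for the example.

```python
import re

# Anchored, negated class: every character must be a disallowed one.
ONLY_DISALLOWED = re.compile(r"^[^a-zA-Z\-]+$")

def is_all_disallowed(value):
    """True only when the whole string consists of disallowed characters."""
    return ONLY_DISALLOWED.search(value) is not None

for value in ["1.2", "1.2a", "john1"]:
    print(value, is_all_disallowed(value))
```

The anchors force the negated class to cover the entire string, so a single allowed letter anywhere is enough to stop the match, and the bad input goes undetected.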
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
You are right. I forgot to check that.
I usually stay away from negations like that. They usually contain traps.
Your solution is probably better.
|
|
|
|
|
What would be considered the best approach then? I thought it made sense to look for what is disallowed and look at the count property. If the count property is > 0, then the data entered needs to be corrected.
This is my first time using regex, so I am unaware of what would be considered best practice. I spent all day yesterday studying regex and used Expresso to play around with possibilities. Like most programmers, I would prefer to follow best practices.
|
|
|
|
|
To match the characters that aren't allowed, try:
[^a-zA-Z\-]
To validate that the string doesn't contain any disallowed characters, use:
^[a-zA-Z\-]+$
For the HTML5 pattern attribute, use:
<input type="text" name="FirstName" required pattern="[a-zA-Z\-]+" />
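The two approaches (detecting disallowed characters versus validating the whole string) can be compared with a short sketch. It is written in Python only for illustration, on the assumption that VBScript's RegExp treats these simple patterns the same way; `find_disallowed` and `is_valid_first_name` are hypothetical names.

```python
import re

DISALLOWED = re.compile(r"[^a-zA-Z\-]")     # any one character that is not a letter or hyphen
VALID_NAME = re.compile(r"^[a-zA-Z\-]+$")   # the whole string is letters and hyphens

def find_disallowed(value):
    """Return every offending character, wherever it occurs in the string."""
    return DISALLOWED.findall(value)

def is_valid_first_name(value):
    """True if the value contains only letters and hyphens (and is not empty)."""
    return VALID_NAME.match(value) is not None

for value in ["Mary-Ann", "jo1hn!"]:
    print(value, find_disallowed(value), is_valid_first_name(value))
```

The detection pattern is handy when the warning should name the offending characters. The validation pattern gives a single yes/no answer and also rejects an empty string.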
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|