|
Doing this in a language such as C# would have been a trivial task.
It would have given you a lot more flexibility in handling e.g. standard name parts that are not capitalized, such as Ludvigvan Beethoven, Charlesde Gaulle or Bengtaf Klintberg. Lots of other special cases and variations could be handled in a much more maintainable way.
I have linked to this several times earlier, but it cannot be repeated too often: Geek & Poke: Yesterday's regex[^]
You cannot expect your name matching to be perfect on the first try. Or second. Or third. E.g. a list of prepositions such as "van", "de", "af" ... will grow and grow. Adding them to a C# list is far easier than updating your regex.
|
|
|
|
|
There are also English (and Welsh) surnames that start with "ff": it indicated "son of" in Middle Age English and was a single letter - literally an uppercase "F" was written as "ff" Until the end of the Middle Ages the initial capitalization of any name wasn't a thing - names were all written in lowercase. Some rich people* kept the lowercase starter to this day (and can get very shirty if you use uppercase!)
* Who mostly were the only ones with surnames anyway, they didn't become common practice until the aftermath of the Black Death.
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
hi all,
so, I'm new to regex, trying to understand and i admit i'm lost.
here's what i need right now;
i have a list of string where i wish to extract the email address of users, each line looks like this:
DisplayName;Surname;Givenname;Mail;Company
which gives me something like:
$line = 'jsmith;john;smith;john.smith@someemail.com;acme'
since I'm new and not sure how this work, i do these to test and learn, and the results. now i'm trying to understand why the last 2 shown here are failing.
$line -match '\w+' = True
$line -match '\w+;' = true
$line -math '\w+;\w+;' = true
$line -match '\w+;\w+;\w+' = true
$line -match '\w+;\w+;\w+;' = false
$line -match '\w+;\w+;\w+;\.*' = false
at first i thought that this regex would give me the email but it fails.
$regex = '\w+;\w+;\w+;(\w+@\w+);\w+'
thanks for helping me.
|
|
|
|
|
Based on your description, you want to extract the fourth field from each line:
^([^;]*;){3}([^;]+); Demo[^]
However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values.
For example, given a display name of j;smith , would that end up as j\;smith ? j;;smith ? Something else? Or would it just corrupt the entire line?
Once you start having to account for "escaped" separators, parsing the line becomes much harder.
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
If you know that you do not have any embedded semi-colons in your input text, then maybe a simple split would work for you instead of a regex. e.g fields[] = split(line, ';') (or however your base language does that). This is far simpler, and should be much quicker that applying a regex and extracting a match. However, as Richard points out, if you do have embedded semi-colons you'll need to know how they're escaped in the string. In which case it is probably still faster to write a parser that will extract the fields to an array or as struct or class of some sort.
In a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straight forward as it might seem. See this discussion from stack-overflow : https://stackoverflow.com/a/201378 The next response on that SO page may also be useful, if you're using C#, which refers to the MailAddress class.
Keep Calm and Carry On
|
|
|
|
|
Kardock wrote: each line looks like this:
Which suggests that it is CSV data. Although 'CSV' stands for 'comma separated value' in general usage the separator can be other types including a semi-colon.
So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should look to see how the library handles bad data (ill formed CSV).
|
|
|
|
|
you're right but that gave me a chance to try to understand regex.
|
|
|
|
|
Make sure to learn one lesson about regex: Don't overuse it.
I've seen numerous regex problems where solving the task using an algorithmic language (such as C#) would be straightforward and simple - and flexible enough to handle with ease all the exceptions and special cases that really can give you a headache trying to do it as a regex.
And there is Geek&Poke: Yesterdays regex[^]
Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it is use for a few decades now. SNOBOL is (/was) sort of a crossover between predicates and algorithmic programming - you could see it as a different kind of bool expression evaluation, in an otherwise algorithmic programming language. Especially the predicates were written in a way more readable format than in traditional regex. (I am not holding my breath waiting for SNOBOL to raise to a new stardom, though!)
|
|
|
|
|
So given that you just want to mess around with regex.
Kardock wrote: but it fails.
Presumably you mean it runs but it does not successfully match.
The problem is '\w' is not an expression that could ever match an email. So you need to look up what it does match.
The other problem that you will find is that attempting to actually match a valid email is very difficult. The regex to do it is about 1000 characters long. You can google that both to see what a long regex looks like and to educate yourself what a 'valid' email actually is. (I do it every couple of years to remind myself especially when someone says they want to 'validate' an email.)
However you don't need to match an email. What you need to match is the fourth value in the list. So the way to match that is the following
[^;]+
You should probably in fact match all of the columns that way.
So you should study that expression to figure out what it does. And then answer for yourself why the other posters comment about embedded semi-colons being a problem.
|
|
|
|
|
I have a MySQL column with underscores and between those underscores a domain name (which is anything with a dot, except number.number), and I like to use REGEXP_SUBSTR to extract that domain, but only if there are at least 2 preceding underscores before it.
Examples:
1. Something_Else_Type_XY_Z_26.04.23_website.com => website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having something dot something that isn't number.number)
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU => website.com
Which regex could do so instead of ... in SELECT REGEXP_SUBSTR(full_name, ...) as domain_name?
|
|
|
|
|
One of these then:
\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
\b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Footnote: these make no sense of leading BLANK as your requirement states but I throw them out there anyway for your testing pleasure.
modified 14-Sep-23 14:11pm.
|
|
|
|
|
Unfortunately none of them worked at all.
REGEXP_SUBSTR(field_name, '^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$') immediately timed out (#3699 - Timeout exceeded in regular expression match ), and the rest just returned NULL for everything.
Any idea why?
|
|
|
|
|
Well, first of all those regex were std::regx not MySQL.
What exactly have timing out and error #3699 to do with this "SELECT" query? I'm not sure why none of them work because I'm not sure how you're leveraging this database. REGEX_SUBSTR? Is that a function?
Probably Oracle, he said.
I know nothing about Oracle, sorry.
|
|
|
|
|
|
I just tried a quick test at RegExr: Learn, Build, & Test RegEx[^] and the following regex finds all the web addresses in the set below (shown in bold). The only thing I cannot correct is that it includes the leading underscore. However it may be a starting point to help you (I have not actually tried it with MySQL).
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z].
Something_Else_Type_XY_Z_26.04.23_website.com
Something_Else_Type_XY_Z_website.co.uk_26.04.23
Something_Else_Type_XY_Z_26.04.23_website.com_Comment
Something_website.com_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU
|
|
|
|
|
Unfortunately REGEXP_SUBSTR(field_name, '_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]') just gives me the first 2 letters of the field_name ("So" in this case).
|
|
|
|
|
Yes, I got similar results, which does not make a lot of sense. However, on reading the documentation (https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^]) closely I discovered why. Hidden down near the bottom of the page is the following:
Quote: To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
So I tried with that, and a slight modification to my original pattern, and successfully extracted the fields as shown in the following results:
mysql> SELECT REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') from Books;
+---------------------------------------------------+
| REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') |
+---------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| _website.com |
| NULL |
| _website.com |
| _website.com |
| _website.co.uk |
| _website.com |
| NULL |
| NULL |
+---------------------------------------------------+
The NULL entries are records that do not contain a web address.
[EDIT]
I just tried without the leading underscore in the regular expression and added some numbers into the website names and got the following:
mysql> SELECT REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') from Books;
+-------------------------------------------------------------+
| REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') |
+-------------------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| website.com |
| NULL |
| google.co.it |
| website99.com |
| Bet365.co.uk |
| website.com |
| NULL |
| NULL |
+-------------------------------------------------------------+
12 rows in set (0.01 sec)
So it would appear to be simpler than I thought.
[/EDIT]
modified 20-Sep-23 4:18am.
|
|
|
|
|
Did you test according to my examples? What about:
Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
Your regex returns website.com there instead of NULL.
|
|
|
|
|
No, I am afraid I did not work that one out; I leave that as an exercise for you.
|
|
|
|
|
Concur here, using RegexBuddy and this string:
[a-zA-Z][[:alnum:]]*\.[a-zA-Z]*|\.[a-zA-Z]
Nothing but net (colloquial expression eliding formal disdain without turning away in disgust from self-realized talentlessness). Thanks for this.
modified 21-Sep-23 17:54pm.
|
|
|
|
|
MySQL's RE engine has one or two quirks of its own(after all, why make it easy?). The only part I could not figure out is how to ensure at least two underscores in the earlier part of the string. I bet @OriginalGriff or @RichardDeeming could do it with their eyes closed.
|
|
|
|
|
Using RegexBuddy (yeah, I guess I am falling into a hole here) the POSIX notation is perfectly acceptable, hence your [:alnum:]. I can substitute [:alpha:] there as well and I suspect others in the list are good actors. My brain infarct occurs though when I use Herr Gevertz' table of translations and try to replace the POSIX with his ASCII ([a-zA-Z0-9]) or for that matter UNICODE ([\p{L}\p{Nl}\p{Nd}]), neither of which work in his interface. Grand tool, RegexBuddy ... it's one of those feature-rich interfaces which swim with example/sample in the help volume.
My only complaint is that searching through documentation for words has no highlight of the found set so I've got to print out the page and use my pdf search to locate all instances!
|
|
|
|
|
Ok, (didn't read the doc link but thanks for that). Using your new regex:
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]
on the sample of your OP:
1. Something_Else_Type_XY_Z_26.04.23_website.com => _website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => _website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => _website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU = _website.com
Which isn't quite what you wanted (because of that leading "_"). And as for RM's take on it, as-written (without any _[a-zA-Z]*\.) THAT string returns no matches here either (I use a tool called RegexBuddy and after yesterday's experience, with MySql selected as the input language not std::regex).
So although I can confirm today's discovery is close it'll take me some more sleuthing to run down that leading character ascii 95 ...
{EDIT]
I'm seeing a warning when I try to tackle that underscore using "shorthand character classes" \w to the tune of "MySql doesn't support blah blah blah" so ... this could be a while.
[END EDIT]
modified 19-Sep-23 14:52pm.
|
|
|
|
|
See my update to yesterdays answer above.
|
|
|
|
|
You should keep in mind that this is going to be really slow when compared to other types of searches.
So if you are using large volumes of data and/or many searches you should keep that in mind.
You should also keep in mind that there are valid domain names that will have more than one dot (period).
|
|
|
|