|
I see your concern but this RegEx is to be implemented with a OCR engine which sometimes (unfortunately) adds whitespaces where there are none.
So in my example "Rog er" has actually entered "Roger" and the OCR interprets it as two parts. I want to eliminate those cases, if an actual "Rog er" appears then my customer is informed that hey have to look for "Roger".
"Rog-er / Rog-Er" is accounted for and allowed!
|
|
|
|
|
Norwegian names frequently omit the hyphen: In my school class there were both Per Erik, Hans Petter, Gunn Marit and Marit Irene (all first names, not first+family names). Others had double first names, but used only one of them except at formal occasions. Both my parents had double, un-hyphenated first names (and my father's second first name was used so rarely that I didn't know of it until my mid-teens!).
(In Sweden, you are quite likely to get in contact with people with names Norwegian style.)
The list of false assumptions in the article linked by Richard Dennings is great!
|
|
|
|
|
trønderen wrote: Richard Dennings
Who?
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
Sorry. My mistake. When you open a reply window, the list of messages is by default hidden, and I took the name from my (incorrect) memory of the author of the message no longer in view.
I hope you were not terribly offended. I am sorry for my mistake anyway.
|
|
|
|
|
So the real problem is the OCR solution.
The reality is that you are unlikely to be able to deal with all of the error cases. Your solution might introduce more errors.
So it is a trade off.
If error reduction is considered a significant issue, then perhaps better to look into getting a different OCR solution and use both of them. Then compare the output from both and only apply fixes when there is a difference.
If as I said it is a significant problem then any additional cost should not be a problem. But if the cost is a problem then perhaps it isn't as significant as thought.
|
|
|
|
|
Doing this in a language such as C# would have been a trivial task.
It would have given you a lot more flexibility in handling e.g. standard name parts that are not capitalized, such as Ludvigvan Beethoven, Charlesde Gaulle or Bengtaf Klintberg. Lots of other special cases and variations could be handled in a much more maintainable way.
I have linked to this several times earlier, but it cannot be repeated too often: Geek & Poke: Yesterday's regex[^]
You cannot expect your name matching to be perfect on the first try. Or second. Or third. E.g. a list of prepositions such as "van", "de", "af" ... will grow and grow. Adding them to a C# list is far easier than updating your regex.
|
|
|
|
|
There are also English (and Welsh) surnames that start with "ff": it indicated "son of" in Middle Age English and was a single letter - literally an uppercase "F" was written as "ff" Until the end of the Middle Ages the initial capitalization of any name wasn't a thing - names were all written in lowercase. Some rich people* kept the lowercase starter to this day (and can get very shirty if you use uppercase!)
* Who mostly were the only ones with surnames anyway, they didn't become common practice until the aftermath of the Black Death.
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
hi all,
so, I'm new to regex, trying to understand and i admit i'm lost.
here's what i need right now;
i have a list of string where i wish to extract the email address of users, each line looks like this:
DisplayName;Surname;Givenname;Mail;Company
which gives me something like:
$line = 'jsmith;john;smith;john.smith@someemail.com;acme'
since I'm new and not sure how this work, i do these to test and learn, and the results. now i'm trying to understand why the last 2 shown here are failing.
$line -match '\w+' = True
$line -match '\w+;' = true
$line -math '\w+;\w+;' = true
$line -match '\w+;\w+;\w+' = true
$line -match '\w+;\w+;\w+;' = false
$line -match '\w+;\w+;\w+;\.*' = false
at first i thought that this regex would give me the email but it fails.
$regex = '\w+;\w+;\w+;(\w+@\w+);\w+'
thanks for helping me.
|
|
|
|
|
Based on your description, you want to extract the fourth field from each line:
^([^;]*;){3}([^;]+); Demo[^]
However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values.
For example, given a display name of j;smith , would that end up as j\;smith ? j;;smith ? Something else? Or would it just corrupt the entire line?
Once you start having to account for "escaped" separators, parsing the line becomes much harder.
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
If you know that you do not have any embedded semi-colons in your input text, then maybe a simple split would work for you instead of a regex. e.g fields[] = split(line, ';') (or however your base language does that). This is far simpler, and should be much quicker that applying a regex and extracting a match. However, as Richard points out, if you do have embedded semi-colons you'll need to know how they're escaped in the string. In which case it is probably still faster to write a parser that will extract the fields to an array or as struct or class of some sort.
In a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straight forward as it might seem. See this discussion from stack-overflow : https://stackoverflow.com/a/201378 The next response on that SO page may also be useful, if you're using C#, which refers to the MailAddress class.
Keep Calm and Carry On
|
|
|
|
|
Kardock wrote: each line looks like this:
Which suggests that it is CSV data. Although 'CSV' stands for 'comma separated value' in general usage the separator can be other types including a semi-colon.
So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should look to see how the library handles bad data (ill formed CSV).
|
|
|
|
|
you're right but that gave me a chance to try to understand regex.
|
|
|
|
|
Make sure to learn one lesson about regex: Don't overuse it.
I've seen numerous regex problems where solving the task using an algorithmic language (such as C#) would be straightforward and simple - and flexible enough to handle with ease all the exceptions and special cases that really can give you a headache trying to do it as a regex.
And there is Geek&Poke: Yesterdays regex[^]
Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it is use for a few decades now. SNOBOL is (/was) sort of a crossover between predicates and algorithmic programming - you could see it as a different kind of bool expression evaluation, in an otherwise algorithmic programming language. Especially the predicates were written in a way more readable format than in traditional regex. (I am not holding my breath waiting for SNOBOL to raise to a new stardom, though!)
|
|
|
|
|
So given that you just want to mess around with regex.
Kardock wrote: but it fails.
Presumably you mean it runs but it does not successfully match.
The problem is '\w' is not an expression that could ever match an email. So you need to look up what it does match.
The other problem that you will find is that attempting to actually match a valid email is very difficult. The regex to do it is about 1000 characters long. You can google that both to see what a long regex looks like and to educate yourself what a 'valid' email actually is. (I do it every couple of years to remind myself especially when someone says they want to 'validate' an email.)
However you don't need to match an email. What you need to match is the fourth value in the list. So the way to match that is the following
[^;]+
You should probably in fact match all of the columns that way.
So you should study that expression to figure out what it does. And then answer for yourself why the other posters comment about embedded semi-colons being a problem.
|
|
|
|
|
I have a MySQL column with underscores and between those underscores a domain name (which is anything with a dot, except number.number), and I like to use REGEXP_SUBSTR to extract that domain, but only if there are at least 2 preceding underscores before it.
Examples:
1. Something_Else_Type_XY_Z_26.04.23_website.com => website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having something dot something that isn't number.number)
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU => website.com
Which regex could do so instead of ... in SELECT REGEXP_SUBSTR(full_name, ...) as domain_name?
|
|
|
|
|
One of these then:
\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
\b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Footnote: these make no sense of leading BLANK as your requirement states but I throw them out there anyway for your testing pleasure.
modified 14-Sep-23 14:11pm.
|
|
|
|
|
Unfortunately none of them worked at all.
REGEXP_SUBSTR(field_name, '^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$') immediately timed out (#3699 - Timeout exceeded in regular expression match ), and the rest just returned NULL for everything.
Any idea why?
|
|
|
|
|
Well, first of all those regex were std::regx not MySQL.
What exactly have timing out and error #3699 to do with this "SELECT" query? I'm not sure why none of them work because I'm not sure how you're leveraging this database. REGEX_SUBSTR? Is that a function?
Probably Oracle, he said.
I know nothing about Oracle, sorry.
|
|
|
|
|
|
I just tried a quick test at RegExr: Learn, Build, & Test RegEx[^] and the following regex finds all the web addresses in the set below (shown in bold). The only thing I cannot correct is that it includes the leading underscore. However it may be a starting point to help you (I have not actually tried it with MySQL).
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z].
Something_Else_Type_XY_Z_26.04.23_website.com
Something_Else_Type_XY_Z_website.co.uk_26.04.23
Something_Else_Type_XY_Z_26.04.23_website.com_Comment
Something_website.com_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU
|
|
|
|
|
Unfortunately REGEXP_SUBSTR(field_name, '_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]') just gives me the first 2 letters of the field_name ("So" in this case).
|
|
|
|
|
Yes, I got similar results, which does not make a lot of sense. However, on reading the documentation (https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^]) closely I discovered why. Hidden down near the bottom of the page is the following:
Quote: To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
So I tried with that, and a slight modification to my original pattern, and successfully extracted the fields as shown in the following results:
mysql> SELECT REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') from Books;
+---------------------------------------------------+
| REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') |
+---------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| _website.com |
| NULL |
| _website.com |
| _website.com |
| _website.co.uk |
| _website.com |
| NULL |
| NULL |
+---------------------------------------------------+
The NULL entries are records that do not contain a web address.
[EDIT]
I just tried without the leading underscore in the regular expression and added some numbers into the website names and got the following:
mysql> SELECT REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') from Books;
+-------------------------------------------------------------+
| REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') |
+-------------------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| website.com |
| NULL |
| google.co.it |
| website99.com |
| Bet365.co.uk |
| website.com |
| NULL |
| NULL |
+-------------------------------------------------------------+
12 rows in set (0.01 sec)
So it would appear to be simpler than I thought.
[/EDIT]
modified 20-Sep-23 4:18am.
|
|
|
|
|
Did you test according to my examples? What about:
Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
Your regex returns website.com there instead of NULL.
|
|
|
|
|
No, I am afraid I did not work that one out; I leave that as an exercise for you.
|
|
|
|
|
Concur here, using RegexBuddy and this string:
[a-zA-Z][[:alnum:]]*\.[a-zA-Z]*|\.[a-zA-Z]
Nothing but net (colloquial expression eliding formal disdain without turning away in disgust from self-realized talentlessness). Thanks for this.
modified 21-Sep-23 17:54pm.
|
|
|
|