RegEx - Complex Regex Function, Ignore Spaces, negate only certain letters

Question

Hello everyone,

I am writing a very specific and complex RegEx function and I can't figure out how to complete certain parts of it. I have more than one question about it.

The string I am searching for is similar to this one: 39SWB20002000. It is an MGRS coordinate.
The string can be written in a few different ways: 39SWB20002000, 39S WB 2000 2000, 39S WB 20002000, etc.

I am confused about how to write the RegEx for the following parameters:

The first two characters are a number, 01-60 or 1-60 (the leading zero is optional).

The third character can only be a letter C-X or c-x, but not the letters I, i, O, o.

The 4th and 5th characters can each be a letter A-Z or a-z, again excluding I, i, O, o.

The last portion of the coordinate is two numbers (2000 & 2000 in the example above). They can be written in several different ways: each number can be 1 to 6 digits long, but both must have the same number of digits.

Here is the RegEx I have so far:

[0-6][0-9][C-HJ-NP-Xc-hj-np-x][A-HJ-NP-Za-hj-np-z]{2}

What is the best way to do this?

-Kyle

Posted 11-Oct-13 10:44am

Kyle A.B.

2 solutions

Solution 1

I would split the whole thing into scanning and parsing.

According to your spec, the overall pattern looks as follows:

begin, 1-to-2-digits, 1-char, 2-chars, 2-evenly-split-digit-groups-of-up-to-4-digits-each, end

Between any two tokens there may be zero or more spaces.

Let's define each token with its respective regex, with each token forming a regex group, and then parse the tokens by group:

C#

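using System;
using System.Text.RegularExpressions;
...
string input = ...;
...
string[] tokens =
{ @"(\d\d?)"         // group 1: zone number, 1 or 2 digits
, @"([a-zA-Z])"      // group 2: band letter
, @"([a-zA-Z]{2})"   // group 3: two square letters
, @"(?:(\d{4})\s*(\d{4})|(\d{3})\s*(\d{3})|(\d{2})\s*(\d{2})|(\d{1})\s*(\d{1}))" // groups  4/5, 6/7, 8/9, 10/11
};
// Tokens are joined with optional whitespace and anchored to the whole input.
string pattern = @"^\s*" + string.Join(@"\s*", tokens) + @"\s*$";
Match match = Regex.Match(input, pattern);

if (!match.Success) Error("...");

// Range check done in code: the regex only guarantees 1 or 2 digits.
int n = int.Parse(match.Groups[1].Value);
if (n < 1 || n > 60) Error("...");

// Band letter must be C-X without I and O.
string a = match.Groups[2].Value;
if (Regex.IsMatch(a, @"[abioyzABIOYZ]")) Error("...");

// Square letters must be A-Z without I and O.
string b = match.Groups[3].Value;
if (Regex.IsMatch(b, @"[ioIO]")) Error("...");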

int u = int.Parse(match.Groups[4].Success
                ? match.Groups[4].Value
                : match.Groups[6].Success
                ? match.Groups[6].Value
                : match.Groups[8].Success
                ? match.Groups[8].Value
                : match.Groups[10].Value);
int v = int.Parse(match.Groups[5].Success
                ? match.Groups[5].Value
                : match.Groups[7].Success
                ? match.Groups[7].Value
                : match.Groups[9].Success
                ? match.Groups[9].Value
                : match.Groups[11].Value);
...
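
The tokenize-then-validate idea can also be sketched end to end. This is only a sketch under stated assumptions: the class name MgrsParser and the TryParse signature are made up for illustration, the grid part accepts 1 to 6 digits per half as asked in the question (rather than up to 4), an unsplit even-length digit run is accepted as well, and [0-9] is used instead of \d so that int.Parse only ever sees ASCII digits.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class MgrsParser
{
    // Tokenizer only: zone digits, band letter, two square letters, then either
    // two whitespace-separated digit groups or one unbroken digit run.
    private static readonly Regex Tokens = new Regex(
        @"^\s*([0-9]{1,2})\s*([A-Za-z])\s*([A-Za-z]{2})\s*" +
        @"(?:([0-9]{1,6})\s+([0-9]{1,6})|([0-9]{2,12}))\s*$");

    // Returns true and fills the out values only when every check passes.
    public static bool TryParse(string input, out int zone,
                                out string easting, out string northing)
    {
        zone = 0;
        easting = string.Empty;
        northing = string.Empty;

        Match m = Tokens.Match(input);
        if (!m.Success) return false;

        // Value checks happen in code, not in the regex.
        int z = int.Parse(m.Groups[1].Value);
        if (z < 1 || z > 60) return false;

        // Band letter: C-X without I and O (the rule from the question).
        if (Regex.IsMatch(m.Groups[2].Value, "[abioyzABIOYZ]")) return false;
        // Square letters: A-Z without I and O.
        if (Regex.IsMatch(m.Groups[3].Value, "[ioIO]")) return false;

        string e, n;
        if (m.Groups[4].Success)
        {
            e = m.Groups[4].Value;
            n = m.Groups[5].Value;
        }
        else
        {
            // Unbroken run: it must split into two halves of equal length.
            string run = m.Groups[6].Value;
            if (run.Length % 2 != 0) return false;
            e = run.Substring(0, run.Length / 2);
            n = run.Substring(run.Length / 2);
        }
        if (e.Length != n.Length) return false;

        zone = z;
        easting = e;
        northing = n;
        return true;
    }
}
```

With this sketch, MgrsParser.TryParse("39S WB 2000 2000", out int zone, out string e, out string n) succeeds with zone 39 and both halves "2000", while "13TEF 100100 5" is rejected because the two halves differ in length.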

Cheers
Andi

Posted 11-Oct-13 12:10pm

Andreas Gieriet

Updated 11-Oct-13 12:50pm

v5

Comments

Kyle A.B. 11-Oct-13 23:53pm

I believe I might have solved this myself in a manner similar to the one you suggested. (Also of note: I figured out that the third character can actually take a, b, y & z.) So the new regex might look like this one:

[0-6]?[\d]\s*[A-HJ-NP-Za-hj-np-z]\s*[A-HJ-NP-Za-hj-np-z]{2}\s*([\d]+\s+[\d]+|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))

Breakdown:
[0-6]?
The first digit - can be 0 through 6 or absent

[\d]\s*
The second digit - any digit and an optional space

[A-HJ-NP-Za-hj-np-z]\s*
The third character - A through Z excluding I & O and an optional space

[A-HJ-NP-Za-hj-np-z]{2}\s*
The fourth and fifth characters - two letters A through Z excluding I & O and an optional space

([\d]+\s+[\d]+|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))
Here's where it gets tricky: most of the time this matches an MGRS string fine, but it will also match '13TEF 100100 5' as an MGRS, so I think the below would be a better solution:
(([\d]{1}\s+[\d]{1}|[\d]{2}\s+[\d]{2}|[\d]{3}\s+[\d]{3}|[\d]{4}\s+[\d]{4}|[\d]{5}\s+[\d]{5}|[\d]{6}\s+[\d]{6}|[\d]{7}\s+[\d]{7})|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))

What are your thoughts on performance with this expression? Or am I missing anything?

Andreas Gieriet 12-Oct-13 2:23am

I think you expect too much from Regex. Regex is suitable for searching for text patterns (e.g. to split into tokens) but not for detailed value evaluation.
E.g. your [0-6]?[\d] still allows 61, etc. You could of course list all possible values (e.g. 1|2|3|4|5|6|7|8|9|10|...|60), but this is clearly not useful (even if it is technically possible).
That's why I suggest using Regex to tokenize and then "parsing" the tokens and checking for valid ranges, etc.
As for performance: measure it! If it's fast enough, leave it as is. If it's too slow, you may think of writing a proper scanner and parser without Regex.
Cheers
Andi
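
One way to "measure it" is a small timing harness built on System.Diagnostics.Stopwatch. The following is a rough sketch, not a rigorous benchmark: the class name RegexTiming and the Measure signature are invented for illustration, and the numbers it produces depend on the machine, the input text and the RegexOptions used.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RegexTiming
{
    // Scans 'text' with 'pattern' the given number of times.
    // Returns the elapsed milliseconds for the timed passes;
    // 'matchCount' receives the number of matches found in one pass.
    public static long Measure(string pattern, string text, int iterations,
                               RegexOptions options, out int matchCount)
    {
        var regex = new Regex(pattern, options);

        // Warm-up pass (not timed) so one-time setup cost is excluded.
        matchCount = regex.Matches(text).Count;

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // Reading Count forces the lazily filled MatchCollection
            // to scan the whole text.
            matchCount = regex.Matches(text).Count;
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call such as RegexTiming.Measure(pattern, documentText, 1000, RegexOptions.Compiled, out int count), where documentText stands for a representative input, can then be compared against the same call with RegexOptions.None or against a different pattern.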

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Kyle A.B. · Accepted Answer · 2013-10-12T04:03:00

This is my final expression:

[^\d]\s*([6][0]|[1-5][0-9]|[0]?[1-9])\s*[A-HJ-NP-Za-hj-np-z]\s*[A-HJ-NP-Za-hj-np-z]{2}\s*([\d]{8}\s+[\d]{8}|[\d]{7}\s+[\d]{7}|[\d]{6}\s+[\d]{6}|[\d]{5}\s+[\d]{5}|[\d]{4}\s+[\d]{4}|[\d]{3}\s+[\d]{3}|[\d]{2}\s+[\d]{2}|[\d]{1}\s+[\d]{1}|\d{16}|\d{14}|\d{12}|\d{10}|\d{8}|\d{6}|\d{4}|\d{2})\s*[^\d]

It's lengthy but it works rather well: it will parse through documents and pick out coordinates without very much extra coding.

Here's the breakdown. The expression first searches for '60'; if it can't find that, it searches for 10-59, and then for 01-09 or 1-9:

[^\d]\s*
(
    [6][0]|
    [1-5][0-9]|
    [0]?[1-9]
)

The regex then searches for three letters (but not I & O), with an optional space between the first and second letter:

\s*
[A-HJ-NP-Za-hj-np-z]
\s*
[A-HJ-NP-Za-hj-np-z]{2}
\s*

Then it searches for the grid numbers at the end. They need to be either a single run with an even number of digits (2 through 16) or two groups of equal length with whitespace in between. So I look for the space-separated forms first (largest numbers first) and for the unbroken even-length runs last.

(
    [\d]{8}\s+[\d]{8}|
    [\d]{7}\s+[\d]{7}|
    [\d]{6}\s+[\d]{6}|
    [\d]{5}\s+[\d]{5}|
    [\d]{4}\s+[\d]{4}|
    [\d]{3}\s+[\d]{3}|
    [\d]{2}\s+[\d]{2}|
    [\d]{1}\s+[\d]{1}|
    \d{16}|
    \d{14}|
    \d{12}|
    \d{10}|
    \d{8}|
    \d{6}|
    \d{4}|
    \d{2}
)
\s*[^\d]
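
Since the grid alternation is entirely regular, it does not have to be typed by hand. As a sketch only (the class name GridPattern and its Build method are hypothetical), the same list of alternatives can be generated from the maximum number of digits per half; it writes \d where the expression above writes [\d], which is equivalent.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of any library.
public static class GridPattern
{
    // Builds the grid alternation for halves of 1..maxDigitsPerHalf digits:
    // first the whitespace-separated pairs (longest first), then the
    // unbroken even-length runs (longest first), wrapped in one group.
    public static string Build(int maxDigitsPerHalf)
    {
        var lengths = Enumerable.Range(1, maxDigitsPerHalf).Reverse().ToList();

        // e.g. \d{4}\s+\d{4} for two separated 4-digit halves
        var separated = lengths.Select(k => $@"\d{{{k}}}\s+\d{{{k}}}");

        // e.g. \d{8} for two 4-digit halves written without a space
        var unbroken = lengths.Select(k => $@"\d{{{2 * k}}}");

        return "(" + string.Join("|", separated.Concat(unbroken)) + ")";
    }
}
```

GridPattern.Build(8) yields the sixteen alternatives listed above in the same order, and changing the limit becomes a one-argument change.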

Finally, the whole expression is surrounded with [^\d]\s* ... \s*[^\d] to require a non-digit on each side, with optional whitespace in between. If the MGRS is surrounded by more digits, it is not likely an MGRS.
Now all I have to do is trim the extra character off the beginning and end of each match.
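
One consequence of the [^\d] guards is that they consume a character on each side, so a coordinate at the very start or very end of the text cannot match. A possible variant, sketched here under assumptions, replaces them with the .NET lookarounds (?<!\d) and (?!\d), which check for a neighbouring digit without consuming anything, so no trimming is needed. The class name MgrsFinder and the method FindAll are hypothetical, and the core pattern follows the expression above with the zone written as 0?[1-9] and both letter classes covering A-Z without I and O.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class MgrsFinder
{
    // Zone, three letters and grid digits, without any surrounding guards.
    private const string Core =
        @"(?:60|[1-5][0-9]|0?[1-9])\s*" +
        @"[A-HJ-NP-Za-hj-np-z]\s*[A-HJ-NP-Za-hj-np-z]{2}\s*" +
        @"(?:\d{8}\s+\d{8}|\d{7}\s+\d{7}|\d{6}\s+\d{6}|\d{5}\s+\d{5}|" +
        @"\d{4}\s+\d{4}|\d{3}\s+\d{3}|\d{2}\s+\d{2}|\d\s+\d|" +
        @"\d{16}|\d{14}|\d{12}|\d{10}|\d{8}|\d{6}|\d{4}|\d{2})";

    // (?<!\d) = not preceded by a digit, (?!\d) = not followed by a digit.
    // Both are zero-width, so the match contains only the coordinate.
    private static readonly Regex Candidate =
        new Regex(@"(?<!\d)" + Core + @"(?!\d)");

    // Returns every candidate coordinate found in the text, in order.
    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For example, MgrsFinder.FindAll("39SWB20002000") returns the single entry "39SWB20002000" even though nothing surrounds it, while "12345SWB2000" yields no match because the would-be zone is preceded by more digits.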

-Kyle