Hello everyone,

I am writing a very specific and complex regex and can't figure out how to complete certain parts of it. I have more than one question about it.

The string I am searching for looks like this one: 39SWB20002000. It is an MGRS coordinate.
The string can be written in a few different ways: 39SWB20002000, 39S WB 2000 2000, 39S WB 20002000, etc.

I am confused about how to write the regex for the following parameters:

The first one or two characters are a number: 01-60 or 1-60.

The third character can only be a letter C-X or c-x, but not I, i, O or o.

The 4th and 5th characters can each be a letter A-Z or a-z, again excluding I, i, O and o.

The last portion of the coordinate is a pair of numbers (2000 & 2000 in the example above). They can be written in several different ways: each number can be 1 to 6 digits long, but both must have the same number of digits.

Here is the RegEx I have so far:

[0-6][0-9][C-HJ-NP-Xc-hj-np-x][A-HJ-NP-Za-hj-np-z]{2}


What is the best way to do this?

-Kyle

I would split the whole thing into scanning and parsing.

According to your spec, the overall pattern looks as follows:
begin, 1-to-2-digits, 1-char, 2-chars, 2-evenly-split-digit-groups-of-up-to-4-digits-each, end

Where between all tokens, there may be zero or more spaces.

Let's define each token as a regex group, and then parse the tokens group by group:
C#
string input = ...;
...
string[] tokens =
{ @"(\d\d?)"         // group 1
, @"([a-zA-Z])"      // group 2
, @"([a-zA-Z]{2})"   // group 3
, @"(?:(\d{4})\s*(\d{4})|(\d{3})\s*(\d{3})|(\d{2})\s*(\d{2})|(\d{1})\s*(\d{1}))" // groups  4/5, 6/7, 8/9, 10/11
};
string pattern = @"^\s*" + string.Join(@"\s*", tokens) + @"\s*$";
Match match = Regex.Match(input, pattern);

if (!match.Success) Error("...");

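// Zone number: must be in the range 1-60.
int n = int.Parse(match.Groups[1].Value);
if (n < 1 || n > 60) Error("...");

// Third character: C-X only, without I and O.
string a = match.Groups[2].Value;
if (Regex.IsMatch(a, @"[abioyzABIOYZ]")) Error("...");

// Fourth and fifth characters: A-Z without I and O.
string b = match.Groups[3].Value;
if (Regex.IsMatch(b, @"[ioIO]")) Error("...");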

int u = int.Parse(match.Groups[4].Success
                ? match.Groups[4].Value
                : match.Groups[6].Success
                ? match.Groups[6].Value
                : match.Groups[8].Success
                ? match.Groups[8].Value
                : match.Groups[10].Value);
int v = int.Parse(match.Groups[5].Success
                ? match.Groups[5].Value
                : match.Groups[7].Success
                ? match.Groups[7].Value
                : match.Groups[9].Success
                ? match.Groups[9].Value
                : match.Groups[11].Value);
...
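
The alternation in the last token grows with every additional digit length. As a rough sketch in the same scan-then-parse spirit, the numeric tail could instead be captured loosely and the "same number of digits" rule checked in code. The helper name `MgrsDigits.TrySplit` and its `maxDigits` parameter are made up for illustration and are not part of the answer above:

```csharp
using System;
using System.Text.RegularExpressions;

static class MgrsDigits
{
    // Hypothetical helper: splits the numeric tail of a coordinate into
    // two digit strings of equal length. Returns false if that is not possible.
    public static bool TrySplit(string tail, int maxDigits,
                                out string easting, out string northing)
    {
        easting = northing = "";

        // Either two digit runs separated by whitespace, or one contiguous run.
        Match m = Regex.Match(tail, @"^\s*(\d+)(?:\s+(\d+))?\s*$");
        if (!m.Success) return false;

        string first = m.Groups[1].Value;
        if (m.Groups[2].Success)
        {
            // Two runs: both must have the same length.
            string second = m.Groups[2].Value;
            if (first.Length != second.Length || first.Length > maxDigits) return false;
            easting = first;
            northing = second;
            return true;
        }

        // One run: it must split into two equal halves.
        if (first.Length % 2 != 0 || first.Length / 2 > maxDigits) return false;
        int half = first.Length / 2;
        easting = first.Substring(0, half);
        northing = first.Substring(half);
        return true;
    }
}
```

With this sketch, `MgrsDigits.TrySplit("2000 2000", 6, out var e, out var n)` and `MgrsDigits.TrySplit("20002000", 6, out e, out n)` both succeed with "2000" and "2000", while "100100 5" is rejected because the two runs differ in length.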

Cheers
Andi
 
Comments
Kyle A.B. 11-Oct-13 23:53pm    
I believe I might have solved this myself in a manner similar to the one you suggested. (Also of note: I figured out the third character can actually take a, b, y & z.) So the new regex might look like this one:

[0-6]?[\d]\s*[A-HJ-NP-Za-hj-np-z]\s*[A-HJ-NP-Za-hj-np-z]{2}\s*([\d]+\s+[\d]+|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))

Breakdown:
[0-6]?
The first digit: 0 through 6, or absent

[\d]\s*
The second digit: any digit, followed by an optional space

[A-HJ-NP-Za-hj-np-z]\s*
The third character: A through Z excluding I & O, followed by an optional space

[A-HJ-NP-Za-hj-np-z]{2}\s*
The fourth and fifth characters: two letters A through Z excluding I & O, followed by an optional space

([\d]+\s+[\d]+|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))
Here's where it gets tricky. Most of the time this matches an MGRS string fine, but it will also match '13TEF 100100 5' as an MGRS, so I think the below would be a better solution:
(([\d]{1}\s+[\d]{1}|[\d]{2}\s+[\d]{2}|[\d]{3}\s+[\d]{3}|[\d]{4}\s+[\d]{4}|[\d]{5}\s+[\d]{5}|[\d]{6}\s+[\d]{6}|[\d]{7}\s+[\d]{7})|(\d{2}|\d{4}|\d{6}|\d{8}|\d{10}|\d{12}|\d{14}))

What are your thoughts on performance with this expression? Or am I missing anything?
Andreas Gieriet 12-Oct-13 2:23am    
I think you expect too much from Regex. Regex is suitable for searching for text patterns (e.g. to split into tokens), but not for detailed value evaluation.
E.g. your [0-6]?[\d] still allows 61, etc. You could of course list all possible values (e.g. 1|2|3|4|5|6|7|8|9|10|...|60), but this is clearly not useful, even though it is technically possible.
That's why I suggest using Regex to tokenize, and then "parsing" the tokens and checking for valid ranges, etc.
As for performance: measure it! If it's fast enough, leave it as is. If it's too slow, you may think of writing a proper scanner and parser without Regex.
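
A minimal sketch of such a measurement, using `System.Diagnostics.Stopwatch`. The pattern and the sample text here are simplified placeholders chosen only for illustration; substitute the real expression and a representative document:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

class RegexTiming
{
    static void Main()
    {
        // Placeholder pattern (assumption): replace with the expression under test.
        var regex = new Regex(@"\d{1,2}\s*[A-Za-z]\s*[A-Za-z]{2}\s*\d+",
                              RegexOptions.Compiled);

        // Placeholder input (assumption): a sample sentence repeated many times.
        string text = string.Join(" ",
            Enumerable.Repeat("grid 39SWB20002000 and 13TEF 100100", 10000));

        // Warm-up run so that one-time compilation cost is not measured.
        _ = regex.Matches(text).Count;

        var sw = Stopwatch.StartNew();
        int count = regex.Matches(text).Count; // Count forces all matches to be found
        sw.Stop();

        Console.WriteLine($"{count} matches in {sw.ElapsedMilliseconds} ms");
    }
}
```

Timings depend heavily on the input, so it is worth running this against documents of realistic size rather than a synthetic string.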
Cheers
Andi
This is my final:
[^\d]\s*([6][0]|[1-5][0-9]|[0]?[1-9])\s*[A-HJ-NP-Za-hj-np-z]\s*[A-HJ-NP-Za-hj-np-z]{2}\s*([\d]{8}\s+[\d]{8}|[\d]{7}\s+[\d]{7}|[\d]{6}\s+[\d]{6}|[\d]{5}\s+[\d]{5}|[\d]{4}\s+[\d]{4}|[\d]{3}\s+[\d]{3}|[\d]{2}\s+[\d]{2}|[\d]{1}\s+[\d]{1}|\d{16}|\d{14}|\d{12}|\d{10}|\d{8}|\d{6}|\d{4}|\d{2})\s*[^\d]

It's lengthy but it works rather well. It will parse through documents and pick out coordinates without much extra coding.

Here's the breakdown. The expression first tries '60'; if that fails it tries 10-59, and then 01-09 or 1-9:
[^\d]\s*
(
    [6][0]|
    [1-5][0-9]|
    [0]?[1-9]
)

The regex then looks for three letters (excluding I & O), with an optional space between the first and second letter:
\s*
[A-HJ-NP-Za-hj-np-z]
\s*
[A-HJ-NP-Za-hj-np-z]{2}
\s*

Then it looks for the grid numbers at the end. They must be either a single run with an even number of digits (2 through 16), or two groups of equal length with whitespace in between. So I start by looking for the two-group form (largest numbers first) and try the even-length runs last.
(
    [\d]{8}\s+[\d]{8}|
    [\d]{7}\s+[\d]{7}|
    [\d]{6}\s+[\d]{6}|
    [\d]{5}\s+[\d]{5}|
    [\d]{4}\s+[\d]{4}|
    [\d]{3}\s+[\d]{3}|
    [\d]{2}\s+[\d]{2}|
    [\d]{1}\s+[\d]{1}|
    \d{16}|
    \d{14}|
    \d{12}|
    \d{10}|
    \d{8}|
    \d{6}|
    \d{4}|
    \d{2}
)
\s*[^\d]

Finally, the whole expression is wrapped in [^\d]\s* ... \s*[^\d], which requires a non-digit on each side, with optional whitespace in between. If the MGRS is surrounded by more digits, it is not likely an MGRS.
Now all I have to do is trim the extra non-digit character (and any whitespace) off the beginning and end of each match.
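
One possible way to avoid the trimming step, sketched here on the assumption that the .NET regex engine is used (it supports lookbehind): replace the consuming [^\d] guards with the zero-width assertions (?<!\d) and (?!\d). These check the neighbouring characters without including them in the match, and they also succeed at the very start or end of the text. The class `MgrsScan`, its `Find` method and the `maxDigits` parameter are hypothetical names; the letter classes follow the prose description (A-Z without I and O):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MgrsScan
{
    // Builds the digit alternation, longest first, e.g. for maxDigits = 2:
    // \d{2}\s+\d{2}|\d{4}|\d{1}\s+\d{1}|\d{2}
    static string DigitTail(int maxDigits)
    {
        var parts = new List<string>();
        for (int n = maxDigits; n >= 1; n--)
        {
            parts.Add(@"\d{" + n + @"}\s+\d{" + n + "}"); // two groups of n digits
            parts.Add(@"\d{" + (2 * n) + "}");            // one run of 2n digits
        }
        return string.Join("|", parts);
    }

    // Hypothetical helper: returns every candidate coordinate found in the text.
    public static IEnumerable<string> Find(string text, int maxDigits)
    {
        string pattern =
            @"(?<!\d)" +                        // no digit immediately before
            @"(?:60|[1-5][0-9]|0?[1-9])\s*" +   // zone number 1-60
            @"[A-HJ-NP-Za-hj-np-z]\s*" +        // third character
            @"[A-HJ-NP-Za-hj-np-z]{2}\s*" +     // fourth and fifth characters
            "(?:" + DigitTail(maxDigits) + ")" +
            @"(?!\d)";                          // no digit immediately after

        foreach (Match m in Regex.Matches(text, pattern))
            yield return m.Value;
    }
}
```

For example, `MgrsScan.Find("at 39S WB 2000 2000.", 8)` yields the single string "39S WB 2000 2000", with nothing to trim from either end.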

-Kyle
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
