Posted 3 Oct 2016

String parsing with custom patterns

5 Oct 2016 | CPOL | 5 min read
Simple library that parses a string according to custom patterns.

Introduction

StringPatternizer is a simple library that lets you define custom patterns (similar to DateTime patterns) and use them to parse strings.

Background

Recently I integrated with a web service that provides location information. The location is sent as XML and is represented by two string fields: Latitude and Longitude. The challenge was that those coordinates may arrive in any format, but I needed them in decimal format. Here are a few examples:

  • -22.856944
  • 25 15 30
  • 25° 15' 30"
  • -22.856944 (22 51' 25.0" S)

It turned out that the location information is entered by a human and there are no restrictions on the input; it is just a text box. In this situation hardcoded parsing is not an option.

Naturally I arrived at an XML configuration file that is supposed to hold a list of format patterns, so that the end user can add missing patterns in the future. But what kind of patterns should it use?

My first idea was RegEx. But I realized that RegEx is too complex for an end user. For instance, to parse a string like "25° 15' 30"", you need to define the following RegEx:

C#
(?<degrees>\d*[,.]?\d*)° (?<minutes>\d*[,.]?\d*)' (?<seconds>\d*[,.]?\d*)"

Of course, that is totally unacceptable for an end user. The perfect solution would be something similar to DateTime parsing patterns. Instead of a complex RegEx, it would be nice to write a pattern like this:

C#
d° m' s"

where 'd' is the placeholder for degrees, 'm' is the placeholder for minutes, and 's' is the placeholder for seconds. With this approach the end user can take the original coordinate string, replace the values with placeholder characters, and get a ready-to-use pattern. Much simpler than RegEx.

After some googling I didn't find any library that provides such capabilities. So I wrote my own and would like to share my solution with the community.

Using the code

First of all, you need to compile the source code (attached) and add a reference to "StringPatternizer.dll" to your project.

The main class, "StringPatternizer", should be created first:

C#
StringPatternizer sp = new StringPatternizer();

The next step is to define the character markers that will be used as placeholders in the pattern:

C#
sp.Markers.Add('d', typeof(int));//marker for degrees
sp.Markers.Add('m', typeof(int));//marker for minutes
sp.Markers.Add('s', typeof(double));//marker for seconds
sp.Markers.Add('D', typeof(double));//marker for decimal value (coordinate may come in decimal format)
sp.Markers.Add('S', typeof(string));//marker for side of the world (North, South, East, West)

UPD:

With "StringPatternizer2" it is possible to use string markers instead of character markers, so the code above may look like this:

C#
sp.Markers.Add("degrees", typeof(int));//marker for degrees
sp.Markers.Add("minutes", typeof(int));//marker for minutes
sp.Markers.Add("seconds", typeof(double));//marker for seconds
sp.Markers.Add("Decimal", typeof(double));//marker for decimal value (coordinate may come in decimal format)
sp.Markers.Add("Side", typeof(string));//marker for side of the world (North, South, East, West)

The code above defines 5 markers with their expected types. During parsing, StringPatternizer uses the specified types to verify that each extracted value has the correct format. Of course you can register all markers with the 'string' type, but the parsing will be less accurate.
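
To illustrate why typed markers make matching more accurate, here is a minimal sketch of a type check on an extracted value. It is not the library's code: the class and method names are hypothetical, and only int, double and string are covered. The idea is that a value such as "25.5" fails the check for an int marker, so a pattern that put it there can be rejected.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class MarkerTypeCheckSketch
{
    // Returns true and the converted value when 'raw' fits the marker type.
    public static bool TryConvert(string raw, Type type, out object value)
    {
        value = null;

        if (type == typeof(string))
        {
            // Any text fits a string marker, which is why it is the least strict.
            value = raw;
            return true;
        }

        if (type == typeof(int))
        {
            int i;
            if (!int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out i))
                return false;
            value = i;
            return true;
        }

        if (type == typeof(double))
        {
            double d;
            if (!double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out d))
                return false;
            value = d;
            return true;
        }

        // Other types are left out of this sketch.
        return false;
    }
}
```

With this check, "25" is accepted for an int marker while "25.5" is accepted only for a double or string marker.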

The next step is to define a list of patterns:

C#
sp.Patterns.Add("d° m' s\"");
sp.Patterns.Add("d m s");
sp.Patterns.Add("d°m's\"");

UPD:

With "StringPatternizer2" a pattern may look like this:

C#
sp.Patterns.Add("degrees° minutes' seconds\"");

Another approach is to build a list of patterns and register the entire list at once:

C#
var patterns = new List<string>() 
            { 
                "D",
                "d m s S",
                "d m s",
                "d° m' s\" So",
                "d° m' s\" Se",
                "d° m' s\" S",
                "d° m' s\"",
                "d°m's\"S",
                "d°m's\"",
                "d?m's\"S",
                "d?m's\"",

                "d? m' s\"",

                "D (d m' s\" S)",
                "(d m' s\" S)",
                "d m' s\" S\"",
                
                "m' d°  s\"",
                "d m' s'' S",
                "dº m' s\" S",
                "dºm's"
            };

sp.Patterns.AddRange(patterns);
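
Since the Background section mentions keeping the patterns in an XML configuration file, here is a rough sketch of how such a list could be loaded. The XML layout (a "patterns" root with "pattern" child elements) and the helper name are assumptions made for illustration; the article does not describe the actual file format.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class PatternConfigSketch
{
    // Assumed layout: <patterns><pattern>d m s</pattern>...</patterns>
    public static List<string> LoadPatterns(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("pattern")                      // every <pattern> element
            .Select(e => e.Value)                        // its text is the pattern itself
            .Where(p => !string.IsNullOrWhiteSpace(p))   // skip empty entries
            .ToList();
    }
}
```

The returned list could then be passed to sp.Patterns.AddRange, in the same way as the hardcoded list above.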

There are two rules for defining a pattern:

  • The order of the markers should reflect the order of the values in the incoming string.
  • You need to specify the neighboring characters: one character to the left of each expected value and one character to the right. For instance, in the coordinate "25° 15' 30"" the minutes value has a space on its left and an apostrophe on its right, so the pattern should include them as well: "d° m' s"".

At this point initialization is complete, and StringPatternizer has all the data it needs for parsing. Two methods perform the parsing. The first, 'Match', finds the first pattern that matches and uses it to parse the string. The second, 'MatchAll', returns all matching patterns together with their parsed data. Here is an example:

C#
PatternizationResult pResult = sp.Match("25° 15' 30\"");
...
List<PatternizationResult> pResults = sp.MatchAll("25° 15' 30\"");

Parsing returns a PatternizationResult object. It has the following properties:

  • Exception - 'null' if one of the patterns matched and parsing completed successfully; a 'FormatException' if no pattern was found for the specified string value.
  • Pattern - 'string.Empty' if no pattern matched; otherwise the pattern that matched.
  • Result - a Dictionary<char, object>, where the key is the marker character and the value is the extracted value. UPD: for "StringPatternizer2" it is a Dictionary<string, object>, where the key is the marker string.

The PatternizationResult class also has the following methods:

  • bool MarkerHasValue(char marker) - useful for checking whether a specific marker has a value.
  • TValue GetMarkerValue<TValue>(char marker) - useful for extracting a specific marker's value as the desired type.

UPD: for "StringPatternizer2":

  • bool MarkerHasValue(string marker)
  • TValue GetMarkerValue<TValue>(string marker)

Here is an example of how to handle a PatternizationResult:

C#
string location = "25° 15' 30\"";

var pResult = sp.Match(location);

if (pResult.Exception == null)
{
    Log.DebugFormat("Value '{0}' matched with pattern '{1}'.", location, pResult.Pattern);

    if (pResult.MarkerHasValue('D'))
    {
        return pResult.GetMarkerValue<double>('D');
    }
    else
    {
        var degrees = pResult.GetMarkerValue<int>('d');
        var minutes = pResult.GetMarkerValue<int>('m');
        var seconds = pResult.GetMarkerValue<double>('s');

        return ConvertToDecimalCoordinate(degrees, minutes, seconds);
    }
}
else
{
    throw pResult.Exception;
}

That's it! Simple enough, I guess.
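
The example above calls ConvertToDecimalCoordinate, which the article does not show. A minimal sketch of such a helper could look like the one below. It assumes the usual degrees/minutes/seconds formula (degrees + minutes / 60 + seconds / 3600) and takes the sign from the degrees value; the class name is hypothetical, and the 'S' marker (side of the world) is ignored here.

```csharp
using System;

// Hypothetical stand-in for the ConvertToDecimalCoordinate call above.
public static class CoordinateSketch
{
    // Converts degrees/minutes/seconds to decimal degrees.
    public static double ConvertToDecimalCoordinate(int degrees, int minutes, double seconds)
    {
        // Work with the magnitude first so that negative degrees do not
        // cancel out the minutes and seconds.
        double magnitude = Math.Abs(degrees) + minutes / 60.0 + seconds / 3600.0;

        // Limitation of this sketch: an int cannot carry the sign of "-0",
        // so a coordinate such as -0° 30' 0" comes out positive.
        return degrees < 0 ? -magnitude : magnitude;
    }
}
```

For the sample value from the Background section, -22 degrees 51 minutes 25.0 seconds converts to approximately -22.856944.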

Points of Interest

Currently the library supports the following data types for parsing: int, double, decimal, float, bool, and string. You can easily add a missing type in the StringPatternizer.ConvertToType method.

The library handles the localization issue in decimal values: either a dot or a comma is accepted as the decimal separator.
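
One common way to get this behavior is to normalize the separator before parsing with the invariant culture. The sketch below shows that idea only; it is an assumption about how this could be done, not the library's actual code, and the names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class DecimalSeparatorSketch
{
    // Parses "12.4" and "12,4" to the same value, regardless of the current culture.
    public static bool TryParseDouble(string raw, out double value)
    {
        // Treat a comma as a decimal point, then parse culture-independently.
        string normalized = raw.Trim().Replace(',', '.');

        // NumberStyles.Float does not allow thousands separators, so input
        // such as "1,234.5" is rejected instead of being misread.
        return double.TryParse(normalized, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value);
    }
}
```

The trade-off is that grouped numbers such as "1,234.5" are not supported, which seems acceptable for coordinate values.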

Sometimes it is hard to type a non-English character into a pattern. For such cases the library supports an 'inline character code'. Suppose we need to parse a string like this: "43�10�12,4\"". The character '�' has the code 65533. Instead of putting this character into the pattern, we can use its code: "d{65533}m{65533}s\"". The library converts the code '{65533}' into the character '�'.
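
The expansion of inline character codes can be sketched with a single regular expression replace. This is an illustration under assumptions rather than the library's implementation: the helper name is hypothetical, and codes are assumed to be decimal values that fit into a single UTF-16 char (0 to 65535).

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class InlineCharCodeSketch
{
    // Replaces every "{NNN}" in the pattern with the character whose code is NNN.
    public static string Expand(string pattern)
    {
        return Regex.Replace(pattern, @"\{([0-9]+)\}", m =>
        {
            ushort code;
            // Codes that do not fit into a char are left in the pattern untouched.
            return ushort.TryParse(m.Groups[1].Value, out code)
                ? ((char)code).ToString()
                : m.Value;
        });
    }
}
```

For example, Expand("d{176}m") yields "d°m", because 176 is the code of the degree sign.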

One possible improvement would be to use Parallel.ForEach to check the patterns in parallel and increase the speed. For now I decided to keep the code simple so that even a C# newbie can understand it.
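
As a rough idea of what the parallel variant might look like, the sketch below checks a list of regular expressions against one input with Parallel.ForEach. It is not taken from the library: the helper name is hypothetical, and it works on plain regex strings rather than on StringPatternizer patterns.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class ParallelMatchSketch
{
    // Returns the patterns that match 'input', in their original order.
    public static List<string> MatchingPatterns(IList<string> regexPatterns, string input)
    {
        // Collect indexes in a thread-safe bag; iterations may finish in any order.
        var hits = new ConcurrentBag<long>();

        Parallel.ForEach(regexPatterns, (pattern, state, index) =>
        {
            if (Regex.IsMatch(input, pattern))
                hits.Add(index);
        });

        // Restore the registration order so that "first match" keeps its meaning.
        return hits.OrderBy(i => i)
                   .Select(i => regexPatterns[(int)i])
                   .ToList();
    }
}
```

Sorting by index matters because a method like 'Match' is defined in terms of the first matching pattern, and parallel iterations complete in no particular order. For a short pattern list the threading overhead may well outweigh the gain.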

 

UPD:

Thanks to Emily Heiner's comment, I improved the algorithm by using RegEx internally to extract the values. This made it simple to implement markers as strings instead of characters, and the code became much simpler. So now the library is a kind of wrapper on top of RegEx.
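
To give a feel for the idea, here is a minimal sketch of how a pattern could be turned into a RegEx with named groups. It is a simplified illustration, not the library's actual code: the class and method names are hypothetical, every marker becomes a ".+" group (the form mentioned in the History section below), and any text that is not a marker is escaped as a literal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of StringPatternizer.
public static class PatternToRegexSketch
{
    // Builds an anchored regex from a pattern such as "degrees° minutes' seconds\"".
    public static Regex Build(string pattern, IEnumerable<string> markers)
    {
        // Try longer marker names first so that a long name wins over its prefix.
        var ordered = markers.OrderByDescending(m => m.Length).ToList();
        var sb = new StringBuilder("^");
        int i = 0;

        while (i < pattern.Length)
        {
            // Is there a marker starting at position i?
            string hit = ordered.FirstOrDefault(
                m => string.CompareOrdinal(pattern, i, m, 0, m.Length) == 0);

            if (hit != null)
            {
                // A marker becomes a named capture group.
                sb.Append("(?<").Append(hit).Append(">.+)");
                i += hit.Length;
            }
            else
            {
                // Everything else must appear literally in the input.
                sb.Append(Regex.Escape(pattern[i].ToString()));
                i++;
            }
        }

        return new Regex(sb.Append("$").ToString());
    }
}
```

Building the regex for the pattern "d° m' s\"" with the markers "d", "m" and "s" and matching it against "25° 15' 30\"" gives the groups d = "25", m = "15" and s = "30". One limitation of this sketch is that a marker name that also occurs inside the literal text of a pattern would be mistaken for a marker.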

History

03.10.2016 - first version

04.10.2016 - version 2

  • use RegEx internally 
  • "marker" is a string now instead of character

05.10.2016 - fixed a bug with empty RegEx groups by replacing ".*" with ".+"

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Vladyslav Chernysh
Software Developer
Ukraine

Comments and Discussions

 
My vote of 5
E. Scott McFadden, 6-Oct-16 5:18
Excellent article. I did not know about markers. They look useful. Thanks for sharing.

I can see the applications of this
Emily Heiner, 3-Oct-16 14:01

Re: I can see the applications of this
Vladyslav Chernysh, 4-Oct-16 8:36

Re: I can see the applications of this
Vladyslav Chernysh, 4-Oct-16 22:26
