
I don't like Regex...

17 Jan 2013
This article introduces a set of 3 simple extension methods that can help you get rid of Regex in many situations.

Introduction

In fact I do like regular expressions: they do the job well. Even too well, as all developers have to use them and there is no way to get rid of them.

Unfortunately, whenever I need a new one I face the same issue: I have forgotten almost everything about their damned syntax... If I were to write one every day I would probably remember it easily, but that's not the case as I barely need to write a couple of them a year...

Being fed up with reading and learning that documentation again and again, I decided to implement the following String extension methods...

Background

Regular expressions are a powerful and concise means of processing large amounts of text in order to validate, extract, edit, replace or delete parts of a text given a predefined pattern (e.g. an email address).

In order to make proper use of Regex you need:

  • a text to analyse
  • a regular expression engine
  • a regular expression (the pattern to look for in the text to analyse)

The regular expression syntax varies depending on the regular expression engine you use. In the Microsoft world the class that serves as the regular expression engine is System.Text.RegularExpressions.Regex and its syntax is described here: http://msdn.microsoft.com/en-us/library/az24scfc.aspx

If you are looking for an introduction to regular expression syntax, please read this excellent article: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial

The problem with regular expressions

They have the drawbacks of their advantages: the syntax (concise and powerful) is intended to be friendly to regular expression engines but not really to human beings.

When you are not familiar with the syntax, you can spend a long time writing a valid expression.

You can then spend another long time testing that expression and making it bulletproof. It is one thing to make sure your regular expression matches what you expect, but it is another thing to make sure it matches ONLY what you expect.
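To illustrate the difference between matching what you expect and matching only what you expect, here is a minimal sketch using the standard Regex class. The names AnchoringDemo, ContainsDigits and IsOnlyDigits are made up for this illustration and are not part of the article's source code.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AnchoringDemo
{
    // True when the text contains at least one digit anywhere.
    public static bool ContainsDigits(string text)
    {
        return Regex.IsMatch(text, "[0-9]+");
    }

    // True only when the whole text is made of digits:
    // ^ and \z anchor the match to the start and to the very end of the text.
    public static bool IsOnlyDigits(string text)
    {
        return Regex.IsMatch(text, @"^[0-9]+\z");
    }
}
```

Both methods accept "123", but only ContainsDigits accepts "abc123def": the unanchored expression matches what you expect and a lot more besides.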

The idea

If you are familiar with SQL you know the LIKE operator. Why not bring that operator to C#?

Why not have a simplified syntax for the most frequent operations you would ask your Regex engine to perform?

A simplified syntax

... means fewer operators. Here is the list that I have, very arbitrarily, come up with:

  • ? = Any single character
  • % = Zero or more characters
  • * = Zero or more word characters (letters, digits or underscore, so no white space: basically a word)
  • # = Any single digit (0-9)

Examples of simple expressions:

  • a Guid can be expressed as: ????????-????-????-????-????????????
  • an email address could be: *?@?*.?*
  • as for a date: ##/##/####

Regular expression aficionados are already jumping on their chairs: obviously nothing guarantees that the last expression matches a valid date, and they are right (that expression would match 99/99/9999). But in no way is this syntax meant to replace that of regular expressions. It is far from offering the same level of capabilities, especially in terms of validation.
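When a real date is required, one possible approach is to use the pattern only as a shape check and leave the validation to the framework. The sketch below is an illustration under assumptions, not part of the article's source: the class and method names are hypothetical, the regular expression is the native equivalent of ##/##/####, and the day/month/year order (dd/MM/yyyy) is assumed.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class DateCheckDemo
{
    // Shape check only: the native regex equivalent of ##/##/####.
    public static bool LooksLikeDate(string text)
    {
        return Regex.IsMatch(text, @"^[0-9]{2}/[0-9]{2}/[0-9]{4}\z");
    }

    // Real validation, assuming the day/month/year order (dd/MM/yyyy).
    public static bool IsValidDate(string text)
    {
        DateTime parsed;
        return DateTime.TryParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture,
                                      DateTimeStyles.None, out parsed);
    }
}
```

With these two methods, "99/99/9999" passes LooksLikeDate but is rejected by IsValidDate, while "17/01/2013" passes both.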

Frequent operations

What are the frequent operations you need a regular expression engine for?

  1. determining if the text to analyse matches a given pattern: Like
  2. finding an occurrence of a given pattern in the text to analyse: Search
  3. retrieving string(s) from the text to analyse: Extract

These 3 operations, 'Like', 'Search' and 'Extract', have been implemented as string extension methods, as an alternative to a regular expression engine.

Let's start by describing their usage; the code will follow...

1. Determining if a string is 'like' a given pattern  

If you know SQL, you know what I am talking about...

The Like extension simply returns true when the input string matches the given pattern.

All the following examples return true, meaning the input strings are like their patterns.

example: a string is a guid

C#
var result0 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("????????-????-????-????-????????????");

example: a string ends with a guid

C#
var result1 = "This is a guid A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");

example: a string starts with a guid 

C#
var result2 = "A0E02391-A0DF-4772-B39A-C11F7D63C495 is a guid".Like("????????-????-????-????-????????????%");

example: a string contains a guid  

C#
var result3 = "this string A0E02391-A0DF-4772-B39A-C11F7D63C495 contains a guid".Like("%????????-????-????-????-????????????%");

example: % also matches zero characters, so a bare guid still 'ends with' a guid

C#
var result4 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");

2. 'Searching' for a particular pattern in a string  

The Search extension method retrieves the first occurrence of the given pattern inside the provided text. It returns null when nothing matches.

example: Search for a guid inside a text

C#
var result5 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Search("[????????-????-????-????-????????????]");
Console.WriteLine(result5); // output: [A0E02391-A0DF-4772-B39A-C11F7D63C495]

3. 'Extracting' values out of a string given a known pattern

This is almost like searching, except that it does not bring back the whole string that matches the pattern but a list of the substrings matched by the wildcard groups of the pattern, i.e. the text found between its literal parts.

example: retrieving the constituents of a guid inside a text

C#
var result6 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Extract("[????????-????-????-????-????????????]");
// result is a list containing each part of the pattern: {"A0E02391", "A0DF", "4772", "B39A", "C11F7D63C495"}

example: retrieving the constituents of an email inside a text

C#
var result7 = "this string contains an email: toto@domain.com".Extract("*?@?*.?*");
// result is a list containing each part of the pattern: {"toto", "domain", "com"}
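For comparison, here is a rough sketch of how the same extraction could be written directly against the native Regex class, with one capture group per run of wildcards. The names NativeExtractDemo and ExtractEmailParts are hypothetical, and the expression is my own translation of *?@?*.?* rather than something taken from the article's source.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class NativeExtractDemo
{
    // Native counterpart of Extract("*?@?*.?*"):
    // each run of wildcards becomes a capture group.
    public static List<string> ExtractEmailParts(string text)
    {
        var match = Regex.Match(text, @"(\w*.)@(.\w*)\.(.\w*)", RegexOptions.IgnoreCase);
        if (!match.Success) return null;

        var parts = new List<string>();
        // Groups[0] is the whole match, so the captured parts start at index 1.
        for (var i = 1; i < match.Groups.Count; i++)
        {
            parts.Add(match.Groups[i].Value);
        }
        return parts;
    }
}
```

Called on "this string contains an email: toto@domain.com", this returns the same three parts, "toto", "domain" and "com", at the cost of a pattern that is noticeably harder to read.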

Here's the code

The simple trick here is that the 3 public methods rely on GetRegex, which transforms the simplified expression into a valid .NET one.

C#
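using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StringExt
{
    // returns true when the whole input string matches the pattern (as SQL LIKE does)
    public static bool Like(this string item, string searchPattern)
    {
        return GetRegex(searchPattern, wholeString: true).IsMatch(item);
    }

    // returns the first occurrence of the pattern in the input string, or null when there is none
    public static string Search(this string item, string searchPattern)
    {
        var match = GetRegex(searchPattern).Match(item);
        return match.Success ? match.Value : null;
    }

    // returns the parts of the first occurrence that were matched by wildcards,
    // i.e. the text found between the literal parts of the pattern, or null when there is no match
    public static List<string> Extract(this string item, string searchPattern)
    {
        var result = item.Search(searchPattern);
        if (!string.IsNullOrWhiteSpace(result))
        {
            // the literal parts of the pattern act as separators
            var splitted = searchPattern.Split(new[] { '?', '%', '*', '#' }, StringSplitOptions.RemoveEmptyEntries);
            var temp = result;
            var final = new List<string>();
            foreach (var x in splitted)
            {
                // the regex ignores case, so the literal lookup has to ignore it as well
                var pos = temp.IndexOf(x, StringComparison.OrdinalIgnoreCase);
                // a literal may not be found verbatim (e.g. a space matched by a tab): skip it
                if (pos < 0) continue;
                if (pos > 0) final.Add(temp.Substring(0, pos));
                temp = temp.Substring(pos + x.Length);
            }
            if (temp.Length > 0) final.Add(temp);
            return final;
        }
        return null;
    }

    // private method which accepts the simplified pattern and transforms it into a valid .NET regex pattern:
    // it escapes the reserved characters of the standard regex syntax
    // and transforms the simplified syntax into the native Regex one
    static Regex GetRegex(string searchPattern, bool wholeString = false)
    {
        var pattern = searchPattern
                .Replace("\\", "\\\\")
                .Replace(".", "\\.")
                .Replace("{", "\\{")
                .Replace("}", "\\}")
                .Replace("[", "\\[")
                .Replace("]", "\\]")
                .Replace("(", "\\(")
                .Replace(")", "\\)")
                .Replace("|", "\\|")
                .Replace("^", "\\^")
                .Replace("+", "\\+")
                .Replace("$", "\\$")
                .Replace(" ", "\\s")
                .Replace("#", "[0-9]")
                .Replace("?", ".")
                .Replace("*", "\\w*")
                .Replace("%", ".*");
        // the anchors are added after escaping so that they keep their regex meaning
        if (wholeString) pattern = "^" + pattern + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}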

Conclusion

As stated above, the intent is not to replace Regex but to provide a very simple approach for solving about 80% of the cases for which I previously needed Regex. This approach keeps basic tasks very simple and makes the client code very easy to write and easy to understand for anyone who is not an expert in Regex syntax.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect
Switzerland

Comments and Discussions

My Vote of 2
robocodeboy, 18-Jan-13 0:28
The article is well written, but I find that the whole thing you did is to give a really small subset of regular expressions some non-standard aliases.

I mean, why in the world a '?' should be more intuitive or easy to remember than a '.'?

Nice try, but I think it's a completely wrong approach.

I learned regexes and I can use them in all the languages I write software with.

Anyone would gain a lot more in learning that the world is using '.' to define any char and * to define "match what's before the star, repeated zero or more times".

Maybe a fluent syntax exposed to Intellisense could be better suited to what you wanted to accomplish.