Click here to Skip to main content
15,888,527 members
Articles / Programming Languages / C#

ReplaceOmatic

Rate me:
Please Sign up or sign in to vote.
4.95/5 (12 votes)
30 Nov 2016CPOL5 min read 12.7K   195   7   1
Performs replacements within strings

Background

This code is in response to the Coding challenge: bad word filter[^] posed by Chris.

Introduction

This code performs replacements within strings as specified via configuration string pairs. You may review the specifications on the Code Challenge's page. Most entries so far have used Regular Expressions and this one does too, but in a fairly limited capacity. Because Regular Expressions can be inefficient when presented with such a non-regular request, rather than use Regular Expressions to scan the whole string (perhaps several times), this code splits the string into parts and then uses Regular Expressions on the parts. In the interest of performance, it avoids Replace, Substring, Split, and ToString as much as possible.

Using the code

A ReplaceOmatic can be instantiated and used like this:

C#
PIEBALD.Type.ReplaceOmatic replacer = 
new PIEBALD.Type.ReplaceOmatic
(
  new System.Tuple<string,string> ( "PHB!"        , "boss!"  ) 
,
  new System.Tuple<string,string> ( "gotten"      , "become" ) 
,
  new System.Tuple<string,string> ( "*p[o0]{2}p*" , "p**p"   ) 
) ;

string intext = "My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!" ;
 
string outtext = intext ;

int total = replacer.Replace ( ref outtext ) ;

System.Console.WriteLine ( "{0}\n{1}\n{2}" , intext , total , outtext ) ;

This class is not intended to be instantiated, used once, and then disposed. I recommend keeping one long-term. This allows the class to gather statistics about which Regular Expressions encounter the most matches. With that data, the List of ReplacementSpecs can be sorted so the ones with the most hits can be tried first.

StringSlice

This class is used to define a section of a string. The goal is to avoid using Substring unless truly necessary. ToString will call Substring on the parent string, so we want to do that as seldom as possible.

C#
private sealed partial class StringSlice
{
  private string source ;

  public int Start  { get ; private set ; }
  public int Length { get ; private set ; }

  private StringSlice
  (
    string Source
  ,
    int    Start
  ,
    int    Length
  )
  {
    this.source = Source ;

    this.Start  = Start  ;
    this.Length = Length ;

    return ;
  }

  public override string
  ToString
  (
  )
  {
    return ( this.source.Substring ( this.Start , this.Length ) ) ;
  }

TrimPunctuation

This method assists with defining a section of a string that has neither leading nor trailing punctuation characters. For the current Challenge, I must allow punctuation within a section, such as sh!t and b(.)(.)bs.

C#
private static void
TrimPunctuation
(
  string  Source
,
  ref int Start
,
  ref int Length
)
{
  /* Trim leading punctuation. */
  while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start ] ) )
  {
    Start++  ;
    Length-- ;
  }

  /* Trim trailing punctuation. */
  while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start + Length - 1 ] ) )
  {
    Length-- ;
  }

  return ;
}

Tokenize

This method is the only way to create instances of StringSlice. Each StringSlice will define a non-empty section of the input string. Sections are delimited by whitespace and may or may not include leading and trailing punctuation. For the current Challenge, I include neither leading nor trailing punctuation characters. Similarly, when punctuation trimming is performed, a StringSlice will never contain only punctuation.

The method is an Iterator to reduce resource usage. Rather than scanning the entire string at once and producing many StringSlices, one section is produced at a time, so the caller can process it and throw it out.

C#
  public static System.Collections.Generic.IEnumerable<StringSlice>
  Tokenize
  (
    string Source
  ,
    int    Start
  ,
    int    Length
  ,
    bool   Trim
  )
  {
    int i   = 0 ;
    int len ;

    while ( i < Length )
    {
      if ( System.Char.IsWhiteSpace ( Source [ i ] ) )
      {
        len = i - Start ;

        if ( Trim )
        {
          TrimPunctuation ( Source , ref Start , ref len ) ;
        }

        if ( len > 0 )
        {
          yield return ( new StringSlice ( Source , Start , len ) ) ;
        }

        Start = i + 1 ;
      }

      i++ ;
    }

    len = i - Start ;

    if ( Trim )
    {
      TrimPunctuation ( Source , ref Start , ref len ) ;
    }

    if ( len > 0 )
    {
      yield return ( new StringSlice ( Source , Start , len ) ) ;
    }

    yield break ;
  }
}

The tokens from my test string are:

My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!

String.Split can't do this reliably, plus you lose the information about where the token was, and I need that later.

ReplacementSpec

This class holds the information required for each desired replacement.

C#
private sealed partial class ReplacementSpec : System.IComparable<ReplacementSpec>
{
  public static System.Text.RegularExpressions.RegexOptions DefaultRegexOptions =
    System.Text.RegularExpressions.RegexOptions.Compiled
    |
    System.Text.RegularExpressions.RegexOptions.ExplicitCapture ;

  private System.Text.RegularExpressions.Regex regex ;

  private string replacement ;

  private bool duplicatecasing ;

  /* How many replacements have been made. Can be used to sort a list of these. */
  public int HitCount { get ; private set ; }

  private                                       ReplacementSpec
  (
    string                                      Target
  ,
    string                                      Replacement
  ,
    System.Text.RegularExpressions.RegexOptions Option
  ,
    bool                                        DuplicateCasing
  )
  {
    this.regex = new System.Text.RegularExpressions.Regex
    (
      Target
    ,
      Option
    ) ;

    this.replacement = Replacement ;

    this.duplicatecasing = DuplicateCasing ;

    this.HitCount = 0 ;

    return ;
  }

  public int
  CompareTo
  (
    ReplacementSpec Op
  )
  {
    return ( this.HitCount.CompareTo ( Op.HitCount ) ) ;
  }

Create

This method is used to process the specifications as defined in the Challenge to produce a ReplacementSpec instance.

The Target may include one or more prefix or suffix characters that specify some details of the search and replace operation. These characters will be removed once they have been interpreted as instructions.

  • If the Target does not end with an exclamation point (!), then add the Regex Option to IgnoreCase.
  • If the Target does not end with an asterisk (*), then add a dollar sign ($) at the end of the Regex.
  • If the Target does not start with an asterisk (*), then add a caret (^) at the beginning of the Regex.
  • If the Replacement does not end with an exclamation point (!), then the replace operation needs to duplicate the casing of the source. (This is my own addition to the spec.)

Other than the specified prefix and suffix characters above, the Target must be a valid Regular Expression. When specifying the Target, remember that it will be matched against strings that contain no whitespace and neither leading nor trailing punctuation. Try to keep the Regular Expressions simple and efficient; avoid expressions that may lead to backtracking.

Furthermore, beware of unintentionally including "special" characters in the Target, as it might break the Regular Expression. For instance, to add a Target for b(.)(.)bs, you must remember to escape -- prefix with a backslash (\) -- the dots and parentheses, for example: @"b\(\.\)\(\.\)bs".

C#
public static ReplacementSpec
Create
(
  string Target
,
  string Replacement
)
{
  System.Text.RegularExpressions.RegexOptions opt = DefaultRegexOptions ;

  if ( Target.EndsWith ( "!" ) )
  {
    Target = Target.TrimEnd ( new char[] { '!' } ) ;
  }
  else
  {
    opt |= System.Text.RegularExpressions.RegexOptions.IgnoreCase ;
  }

  if ( Target.EndsWith ( "*" ) )
  {
    Target = Target.TrimEnd ( new char[] { '*' } ) ;
  }
  else
  {
    Target += "$" ;
  }

  if ( Target.StartsWith ( "*" ) )
  {
    Target = Target.TrimStart ( new char[] { '*' } ) ;
  }
  else
  {
    Target = "^" + Target ;
  }

  bool dup ;

  if ( Replacement.EndsWith ( "!" ) )
  {
    Replacement = Replacement.TrimEnd ( new char[] { '!' } ) ;

    dup = false ;
  }
  else
  {
    dup = true ;
  }

  return ( new ReplacementSpec ( Target , Replacement , opt , dup ) ) ;
}

Replace

This is where the magic happens. Given a Candidate string, which is a section of the string being searched, use the Regular Expression to find matches.

If any matches are found, then replace each with the specified replacement text in accordance with whether or not to duplicate the source string's casing. A match may be preceded by non-matched text which must be copied. The final match may be followed by non-matched text which must be copied.

If no matches are found, then no other processing occurs.

C#
  public int
  Replace
  (
    ref string Candidate
  )
  {
    int result = 0 ;

    System.Text.RegularExpressions.MatchCollection mat = this.regex.Matches ( Candidate ) ;

    if ( mat.Count > 0 )
    {
      int offset = 0 ;

      System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;

      for ( int i = 0 ; i < mat.Count ; i++ )
      {
        /* Copy characters after the previous match and before the current match. */
        sb.Append ( Candidate , offset , mat [ i ].Index - offset ) ;

        /* Copy the replacement text. */
        if ( this.duplicatecasing )
        {
          int j = 0 ;

          for ( ; ( j < mat [ i ].Length ) && ( j < this.replacement.Length ) ; j++ )
          {
            if ( System.Char.IsUpper ( mat [ i ].Value , j ) )
            {
              sb.Append ( System.Char.ToUpper ( this.replacement [ j ] ) ) ;
            }
            else
            {
              sb.Append ( System.Char.ToLower ( this.replacement [ j ] ) ) ;
            }
          }

          /* If the replacement is longer than the matched text... */
          if ( j < this.replacement.Length )
          {
            /* Use the casing of the last character. */
            CasingDelegate casing =
              System.Char.IsUpper ( mat [ i ].Value , j - 1 )
              ? (CasingDelegate) System.Char.ToUpper
              : (CasingDelegate) System.Char.ToLower ;

            for ( ; j < this.replacement.Length ; j++ )
            {
              sb.Append ( casing ( this.replacement [ j ] ) ) ;
            }
          }
        }
        else
        {
          sb.Append ( this.replacement , 0 , this.replacement.Length ) ;
        }

        offset = mat [ i ].Index + mat [ i ].Length ;

         /* Track how many characters were replaced. */
        result += mat [ i ].Length ;
      }

      /* Copy any remaining characters. */
      sb.Append ( Candidate , offset , Candidate.Length - offset ) ;

      /* Call StringBuilder.ToString only if there were replacements. */
      Candidate = sb.ToString() ;

      this.HitCount += mat.Count ;
    }

    /* Result is the total number of characters replaced. */
    return ( result ) ;
  }
}

ReplaceOmatic

This class holds a List of ReplacementSpec instances to apply to a string. The constructor accepts pairs of strings and simply passes each to ReplacementSpec.Create .

C#
namespace PIEBALD.Type
{
  public sealed partial class ReplaceOmatic : System.IDisposable
  {
    private System.Collections.Generic.List<ReplacementSpec> rep ;

    public ReplaceOmatic
    (
      params System.Tuple<string,string>[] Context
    )
    {
      this.rep = new System.Collections.Generic.List<ReplacementSpec> ( Context.Length ) ;

      for ( int i = 0 ; i < Context.Length ; i++ )
      {
        this.rep.Add ( ReplacementSpec.Create ( Context [ i ].Item1 , Context [ i ].Item2 ) ) ;
      }

      return ;
    }

    public void
    Dispose
    (
    )
    {
      this.rep.Clear() ;

      return ;
    }

    public void
    Sort
    (
    )
    {
      /* This will sort in descending order. */
      this.rep.Sort ( ( x , y ) => y.CompareTo ( x ) ) ;

      return ;
    }

Replace

This method iterates the sections of the input string, applies each ReplacementSpec to each section, and produces a new string if (and only if) any replacements were made.

C#
    public int
    Replace
    (
      ref string Candidate
    )
    {
      int result = 0 ;

      int offset = 0 ;

      System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;

      foreach 
      ( 
        StringSlice sub 
      in 
        StringSlice.Tokenize ( Candidate , 0 , Candidate.Length , true ) 
      )
      {
        string temp = sub.ToString() ; /* Performs Substring on the source string. */

        int r = 0 ; 

        /* Apply ReplacementSpecs to the section until we run out of them or we've replaced the entire section. */
        for ( int i = 0 ; ( i < this.rep.Count ) && ( r < Candidate.Length ) ; i++ )
        {
          /* Track how many characters have been replaced so far. */
          r += this.rep [ i ].Replace ( ref temp ) ;
        }

        /* If at least one character was replaced... */
        if ( r > 0 )
        {
          /* Copy characters after the previous replacement and before the current slice. */
          sb.Append ( Candidate , offset , sub.Start - offset ) ;
            
          sb.Append ( temp ) ; 

          offset = sub.Start + sub.Length ;

          /* Track how many characters were replaced. */
          result += r ;
        }
      }

      if ( result > 0 )
      {
        /* Copy any remaining characters. */
        sb.Append ( Candidate , offset , Candidate.Length - offset ) ;
        
        /* Call StringBuilder.ToString only if there were replacements. */
        Candidate = sb.ToString() ;
      }

      /* Result is the total number of characters replaced. */
      return ( result ) ;
    }
  }
}

Conclusion

Testing with the provided input and expected output values, plus some other tests of my own, seems to validate the correctness of the algorithm.

My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!
5
My boss is such a p**phead. It's become worse since his promotion. My boss has started his new blog phblog.com. He's SUCH A MISBEGOTTEN P**PHEAD!

Points of Interest

At first I was upset about the absence of an overload for RegEx.Matches ( string , int , int ), but it turned out not to matter and the algorithm is actually better without it.

History

2016-11-27 First published.

2016-11-29 Updates: No longer uses a separate method to copy to the StringBuilder. Stops trying to apply edits if the entire section has already been replaced. Returns the number of characters replaced.

2016-11-30 Updates: Made ReplacementSpec IComparable and added Sort to ReplaceOmatic.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
United States United States
BSCS 1992 Wentworth Institute of Technology

Originally from the Boston (MA) area. Lived in SoCal for a while. Now in the Phoenix (AZ) area.

OpenVMS enthusiast, ISO 8601 evangelist, photographer, opinionated SOB, acknowledged pedant and contrarian

---------------

"I would be looking for better tekkies, too. Yours are broken." -- Paul Pedant

"Using fewer technologies is better than using more." -- Rico Mariani

"Good code is its own best documentation. As you’re about to add a comment, ask yourself, ‘How can I improve the code so that this comment isn’t needed?’" -- Steve McConnell

"Every time you write a comment, you should grimace and feel the failure of your ability of expression." -- Unknown

"If you need help knowing what to think, let me know and I'll tell you." -- Jeffrey Snover [MSFT]

"Typing is no substitute for thinking." -- R.W. Hamming

"I find it appalling that you can become a programmer with less training than it takes to become a plumber." -- Bjarne Stroustrup

ZagNut’s Law: Arrogance is inversely proportional to ability.

"Well blow me sideways with a plastic marionette. I've just learned something new - and if I could award you a 100 for that post I would. Way to go you keyboard lovegod you." -- Pete O'Hanlon

"linq'ish" sounds like "inept" in German -- Andreas Gieriet

"Things would be different if I ran the zoo." -- Dr. Seuss

"Wrong is evil, and it must be defeated." –- Jeff Ello

"A good designer must rely on experience, on precise, logical thinking, and on pedantic exactness." -- Nigel Shaw

“It’s always easier to do it the hard way.” -- Blackhart

“If Unix wasn’t so bad that you can’t give it away, Bill Gates would never have succeeded in selling Windows.” -- Blackhart

"Use vertical and horizontal whitespace generously. Generally, all binary operators except '.' and '->' should be separated from their operands by blanks."

"Omit needless local variables." -- Strunk... had he taught programming

Comments and Discussions

 
QuestionYou are right Pin
Mycroft Holmes27-Nov-16 21:41
professionalMycroft Holmes27-Nov-16 21:41 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.